1 Introduction
In August 2017, the New Zealand Government approved amendments to the National Policy Statement (NPS) for Freshwater Management—the ‘Clean Water’ package. These amendments, in particular the criteria for a river to be deemed ‘swimmable’, had been the subject of extensive background research and were also of great public interest. Their adoption followed a period of some decades of refinements. Historical, public health, international comparison, and statistical and freshwater science issues related to the NPS are discussed by McBride [8]. The subject continues to be of national importance, as each regional council in New Zealand seeks to implement the NPS and to manage its freshwater monitoring and improvement programme. A new NPS, finalising the criteria, came into force on 3 September 2020 [9].
The swimmability criteria place rivers into 5 categories (A–E, also called Blue, Green, Yellow, Orange, and Red, of which A, B, and C are deemed ‘swimmable’), based on four E. coli criteria, shown in Table 1. (The cabinet paper clarifies that “all four measures must be met and should be determined using a minimum of 60 samples over a maximum of five years collected on a regular basis regardless of weather and flow conditions. The only exception to this is if there is insufficient monitoring data to establish the 95th percentile, in which case that measure/statistic does not apply.”)
Two of the four criteria are based on percentiles (P50 and P95, i.e., the 50th and 95th percentiles of the E. coli counts), and two are based on ‘percentage exceedances’: G260 and G540 refer to the percentage of samples that should not exceed bacteria counts of 260 and 540, respectively. There is a further classification for sites that are likely to be used for swimming, based solely on 95th percentile thresholds. In practice this classifies B, C, D, and E sites as Poor, and subdivides A sites into Excellent, Good, and Fair. We call the standard method based on P50, P95, G260, and G540 the percentile classification.

Regional councils are responsible for measuring and reporting against all targets. The data are generally collected monthly, from a set of predefined locations across the river systems of the region. These locations may be chosen for a variety of reasons, including upstream and downstream points close to potential polluters, convenience of sampling, and proximity to human settlements. New measurement points may be added if pollution is detected in a particular place.
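The percentile classification can be computed directly from a site's record of counts. A minimal sketch (illustrative only; it uses NumPy's default percentile interpolation rather than the Hazen method used officially, and the synthetic lognormal example is ours):

```python
import numpy as np

def summary_statistics(counts):
    """The four NPS summary statistics for a record of E. coli counts
    (bacteria per 100 mL).  Uses NumPy's default percentile
    interpolation, not the official Hazen method."""
    counts = np.asarray(counts, dtype=float)
    return {
        "P50": np.percentile(counts, 50),      # median count
        "P95": np.percentile(counts, 95),      # 95th percentile
        "G260": 100 * np.mean(counts > 260),   # % of samples above 260
        "G540": 100 * np.mean(counts > 540),   # % of samples above 540
    }

# Example: 60 synthetic monthly readings from a lognormal model
rng = np.random.default_rng(0)
stats = summary_statistics(rng.lognormal(mean=np.log(130), sigma=1.2, size=60))
```

The category then follows by comparing each statistic to the thresholds in Table 1.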
Escherichia coli (E. coli) are bacteria found in the digestive tract of warm-blooded animals, some of which can cause sickness in humans. E. coli readings in freshwater, usually given as the number of bacteria present in a 100 mL sample, are highly variable. E. coli generally gets into rivers from human sewage outlets or from farm runoff, and then flows downstream. If its presence in the river is due to a contamination episode such as heavy rain or another adverse weather event, it will be present in the river system for only a few days, but if it comes from a persistent polluter it will be present continuously. Monthly readings are usually essentially uncorrelated in time. A widely-used approach models E. coli readings as independent and lognormally-distributed random variables. For example, Sørensen et al. [12] use the lognormal parametric model to develop a sampling protocol for swimmability.

Moreover, E. coli concentrations are only a proxy for freshwater quality. The guidelines were developed by modelling the risk of Campylobacter infection using the lognormal parametric model of E. coli (see, e.g., [9, p. 48]). The thresholds were determined from a combination of this risk, historical precedent, and international comparisons. Percentile criteria (and categories) were adopted for easier comprehension by the public and to be clear and simple for councils to measure and report. This process of turning a multidimensional state into discrete ordinal categories (sometimes called dichotomization or ordinalization) necessarily loses information; the ordinal data should almost never be used for subsequent statistical analysis [4]. In addition, it should be remembered that the goal of this and any other clean-water strategy is not the specific targets (e.g., a fraction of rivers by length to be ‘A’ quality by a certain date), but an improvement in the underlying multidimensional state. A further complication is that the water quality criteria are used for several different purposes, namely (i) determining the state of a particular site; (ii) determining the trend in state of a particular site; (iii) determining a broad measure of the state or trend of all measured sites; (iv) predicting the state or trend of all rivers in the region.
Category     G540 (% samples   P50 (median)   P95 (95th        G260 (% samples
             above 540)                       percentile)      above 260)

A — Blue     <5%               ≤130           ≤540             <20%
B — Green    5–10%             ≤130           ≤1000            20–30%
C — Yellow   10–20%            ≤130           ≤1200            20–34%
D — Orange   20–30%            >130           >1200            >34%
E — Red      >30%              >260           >1200            >50%

Table 1: E. coli criteria for the swimmability categories (counts per 100 mL).
The Manawatū–Whanganui Regional Council commissioned reports into the state and trend of regional rivers, and these were used to develop swimmability targets for 2030 and 2040 [11]. These studies examined each E. coli threshold separately, and sampling errors figured prominently in them. Notably, the 95th percentile criterion (P95) was discarded because it is estimated with lower precision than the other three criteria (P50, G260, and G540). Trends in G260 and G540 were estimated from the trends in their annual values, which lowers their precision.
In this paper we compare the parametric lognormal model to the percentile classification and show that the parametric model can reduce uncertainty and can incorporate more useful information, especially from very high E. coli readings, and is suitable for censored data. We apply the parametric model for state and trend to 135 sites in the Manawatū–Whanganui region and make recommendations for points that are worth further consideration in the analysis of swimmability.
2 State determination in the parametric model
2.1 Interpretation of the NPS criteria in the parametric model
For state estimation, we consider a river in a steady state whose E. coli measurements $E_1, E_2, \ldots$ at times $t_1, t_2, \ldots$ are independent random variables drawn from a lognormal distribution, i.e.,

$$\log E_i \sim N(\mu, \sigma^2).$$

Here $\log$ is the natural logarithm.
The water quality criteria have a simple interpretation in terms of this parametric model. Each individual criterion corresponds to a half-plane in the $(\mu, \sigma)$ parameter space, and each category A–E is an intersection of such half-planes, which corresponds to a polygon in parameter space.
For example, the criterion that the median be less than or equal to 130 becomes the half-plane $\mu \le \log 130$, and the criterion that the 95th percentile be less than or equal to 1200 becomes the half-plane $\mu + z_{0.95}\,\sigma \le \log 1200$. Here $z_q$ is the value corresponding to the quantile $q$ of the standard normal distribution, i.e., the inverse of the cumulative distribution function. From Table 1, the relevant values are the thresholds 130, 260, 540, 1000, and 1200, together with the quantiles $z_q$ corresponding to the percentage-exceedance criteria.

The individual criteria are shown in terms of the parametric model in Figure 1, and the resulting boundaries of these polygons for swimmability categories A–E in Figure 2. We summarize them as follows:

For the A–B boundary, only the P95 criterion (identical to G540 in this case) is relevant. All sites meeting P95 already greatly exceed the requirements for P50 and G260.

The B–C boundary is determined for most sites by the G540 criterion. (For very clean sites P95 becomes active, and there is also a B–D boundary that is relevant for dirty sites, determined by P50, although in our data no sites came near either of these two parts of the boundary.) As G540 here requires that no more than 10% of samples exceed 540, it is essentially equivalent to a 90th percentile criterion.

The C–D boundary—perhaps the most important, as it determines swimmability—is determined by P95 and P50. Sites meeting these criteria already satisfy G260 and G540.

The D–E boundary is determined by G540, in this case equivalent to the 70th percentile.
Thus, the four criteria are not equally important for all sites.
Remarks

G260 and G540, which have received a lot of attention, are likely to be irrelevant for determining whether a particular site is swimmable.

The criteria have the effect of making the category C region relatively small. For the sample data, far more sites are in category B than in category C, which diminishes the usefulness of the categories. In terms of this model, it would make more sense to place the B–C boundary at P95 = 800 instead of P95 = 1000.

The crucial C–D (pass–fail) boundary is mostly determined by P95. But P95 is the least reliably measured of the four criteria, and is often dropped.

Dropping the 95th percentile requirement for the D–E boundary in favour of the G540 requirement does make sense in this model.
2.2 Sampling errors in the parametric and percentile models
When $n$ samples are drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$, the sampling error in the sample mean is $\sigma/\sqrt{n}$, and the sampling error in the sample standard deviation is $\sigma/\sqrt{2n}$. Therefore, the sampling error in $\hat\mu + z_q \hat\sigma$, the natural estimate of the $q$th percentile under the normal distribution, is $c\sigma/\sqrt{n}$, where $c = \sqrt{1 + z_q^2/2}$.

There are different methods of estimating percentiles directly from data. In E. coli studies, this is often done by applying the Hazen method to the log of the bacteria counts [6], which is implemented in software provided by the Ministry for the Environment [8].

Sampling errors in percentiles are larger than sampling errors in $\hat\mu + z_q \hat\sigma$, as was pointed out in a study based on simulations [6]. The precise statement is that the sampling error in the $q$th percentile for a distribution with probability density function $f$ and cumulative distribution function $F$ is $\sqrt{q(1-q)}\,/\,(f(x_q)\sqrt{n})$, where $x_q = F^{-1}(q)$ [13]. For example, the sampling error in the median of a normally-distributed random variable is $1.253\,\sigma/\sqrt{n}$, a factor of 1.253 times larger than the sampling error in the mean: see Figure 3 for an illustration of the differences as they affect the swimmability categories. Comparisons for different percentiles are given in Table 2.

The above values are approximations that are valid in the limit of large $n$. For small $n$, the bias of these estimators may be significant as well. The 50th percentile estimators (the median and the mean, respectively) are unbiased. The Hazen percentiles are biased: for normal data, the Hazen estimate of the 95th percentile has a noticeable bias at sample sizes such as $n = 60$ and $n = 120$. In contrast, the sample mean and sample standard deviation easily provide the unbiased estimator $\hat\mu + z_q \hat\sigma / c_4(n)$, where $c_4(n) = \sqrt{2/(n-1)}\;\Gamma(n/2)\,/\,\Gamma((n-1)/2)$ is the standard correction factor satisfying $E[\hat\sigma] = c_4(n)\,\sigma$ [10].

q     percentile method   parametric method   sampling factor

0.5  1.253  1  1.570 
0.66  1.293  1.042  1.540 
0.7  1.318  1.067  1.526 
0.8  1.429  1.163  1.510 
0.9  1.709  1.350  1.603 
0.95  2.113  1.534  1.897 
Table 2: Coefficients $c$ of the standard error $c\sigma/\sqrt{n}$ in the $q$th percentile of normal data in the limit of large $n$, as calculated by the percentile method and by the parametric method, together with the sampling factor: the ratio between the numbers of samples required by the percentile and parametric methods for the same sampling error (so for the 95th percentile, the percentile method needs 1.897 times as many samples to achieve the same sampling error).

Figure 3: An illustration of 67% (1-sigma) confidence intervals when estimating the state of a single site with either 60 (large ellipses) or 120 (small ellipses) independent samples. Left: using the estimates of $(\mu, \sigma)$ from the parametric model. Right: using percentiles in the parametric model leads to larger confidence intervals and larger sampling errors.

An example is shown in Figure 4 for synthetic data generated for a site with mean E. coli count of 150 and P95 count of 1750. With 60 independent samples, the percentile method falsely reports P95 ≤ 1200 18% of the time; in contrast, the parametric method falsely reports P95 ≤ 1200 only 10% of the time.
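The sampling-factor comparison in Table 2 can be checked by simulation. A sketch, with a plain implementation of the Hazen percentile (plotting positions $(i - 0.5)/n$) and illustrative values of $\mu$ and $\sigma$ chosen by us:

```python
import numpy as np

def hazen_percentile(x, q):
    """Hazen percentile: order statistic i (1-based) has plotting
    position (i - 0.5)/n; interpolate linearly, clamping at the ends."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    r = q * n + 0.5                  # fractional 1-based rank
    if r <= 1:
        return x[0]
    if r >= n:
        return x[-1]
    i = int(r)
    return x[i - 1] + (r - i) * (x[i] - x[i - 1])

# Monte Carlo comparison of the two P95 estimators on normal data,
# in units of sigma/sqrt(n) to mirror the coefficients of Table 2.
rng = np.random.default_rng(1)
n, trials, z95 = 60, 20000, 1.645
mu, sigma = np.log(150), 1.2
samples = rng.normal(mu, sigma, size=(trials, n))

p95_hazen = np.array([hazen_percentile(row, 0.95) for row in samples])
p95_param = samples.mean(axis=1) + z95 * samples.std(axis=1, ddof=1)

scale = sigma / np.sqrt(n)
print("Hazen coefficient:     ", p95_hazen.std() / scale)
print("parametric coefficient:", p95_param.std() / scale)
```

At $n = 60$ the empirical coefficients land near the asymptotic values 2.113 and 1.534, with the percentile estimator clearly noisier.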
2.3 E. coli data for the Manawatū–Whanganui region
Data were obtained from LAWA [7] for 135 sites in Manawatū–Whanganui. The 18,567 measurements ranged from January 2005 to December 2019. The number of observations per site ranged from 15 (there were 4 sites with fewer than 60 data points) to 189, with a median of 145 (i.e., 12 years of monthly sampling data). Observations are generally monthly with some missing observations.
A plot of all the data points for all 135 sites is shown in Figure 5. Changes in the measurement (or reporting) system over time can be seen, particularly with regard to precision and censoring (that is, reporting only a range in which the value lies). The reported precision varies with both time and value. Prior to 2008, mid-range values (near 400) were rounded to the nearest 5 units (i.e., a precision of about 1%); values less than 60 were rounded to the nearest 10 units; large values were reported as 9,800, 13,000, 14,100, 17,300, 19,900, 24,200, or 51,700. Some readings were censored, i.e., reported as ‘<10’ or ‘>24,200’.

From 2008 to mid-2012, there was no rounding, and the only censoring involved values reported as ‘<1’ (presumably because no bacteria were observed in the 100 mL sample). However, from mid-2012, values up to 2300 were reported to 2 significant figures, and higher values with gradually decreasing precision. High values for most sites were reported as 6,200, 6,900, 7,900, and 9,700. Low values were censored and reported as ‘<4’. Most high values were censored and reported as ‘>9,700’, although some sites continued to report higher values (up to 120,000, and ‘>240,000’).
Rounding is presumably related in some way to the underlying accuracy of the measurements; analytic errors in E. coli measurement are typically around 20% [2]. However, the rounding itself adds further errors of 10–20%, especially for low and high values, which are reported with less precision (consider values near 130, the critical threshold for P50). Analytic error may be unavoidable, but whether it is necessary to add extra error by rounding deserves a closer look. In this study we ignore the effects of rounding and analytic errors.
Censoring is more important. For example, the value ‘>9,700’ occurs 7 times for site 113, all of them since 2015, whereas the previous maximum reading was 3,683. These high E. coli counts are important to understanding the underlying state and trend of the river. Changing them to 9,700 would bias the estimates of the mean, the standard deviation, and the trend downwards. The value 9,700 is often only 1 standard deviation above the mean; 12 of the 135 sites have 95th percentile values over 9,700.
We deal with the censored values in the parametric model with a simple, standard approach in which each censored value is replaced by its expected value under the model, conditioned on its known range. As the parameters then enter both the model and the data, they obey nonlinear equations, which we solve by an iterative procedure in which the imputed values of the censored data are updated at each step. (Of course, the procedure becomes unreliable if too high a proportion of the data is censored.)
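The iteration can be sketched as follows, for right-censored readings only (‘>9,700’-style values); the synthetic check and its parameters are ours, and a production version would also handle left-censored values such as ‘<4’:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def _phi(x):   # standard normal pdf
    return exp(-x * x / 2) / sqrt(2 * pi)

def _Phi(x):   # standard normal cdf
    return 0.5 * (1 + erf(x / sqrt(2)))

def fit_censored(logs, censored, iterations=50):
    """Fit (mu, sigma) of a normal model to log-counts where censored
    entries are right-censored: logs[i] is then a lower bound, not an
    observed value.  Each censored entry is replaced by its conditional
    expectation under the current fit, and the fit is re-estimated
    (a simple EM-style iteration)."""
    filled = np.asarray(logs, dtype=float).copy()
    censored = np.asarray(censored, dtype=bool)
    bounds = filled[censored].copy()
    for _ in range(iterations):
        mu, sigma = filled.mean(), filled.std(ddof=1)
        alpha = (bounds - mu) / sigma
        # E[X | X > bound] = mu + sigma * phi(a) / (1 - Phi(a))
        tails = np.array([_phi(a) / max(1.0 - _Phi(a), 1e-12) for a in alpha])
        filled[censored] = mu + sigma * tails
    return filled.mean(), filled.std(ddof=1)

# Synthetic check: censor everything above log(9700)
rng = np.random.default_rng(3)
x = rng.normal(np.log(800), 1.2, size=200)
c = np.log(9700)
cens = x > c
mu_hat, sigma_hat = fit_censored(np.where(cens, c, x), cens)
```

With only a few percent of the data censored, the fitted parameters stay close to the truth, whereas substituting the bound 9,700 itself would pull both the mean and the standard deviation down.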
In an earlier study [11], censored values were removed from the data set before analysing trends. Since they tend to occur later in time, removing them biases trends downwards. The effect can be noticeable. For site 124, in which 3 of 51 values were censored, deleting the censored values gives a trend of −0.012 (i.e., a 1.2% improvement per year) and a median E. coli count in 2020 of 209; replacing censored values with their bounds gives a trend of +0.077 and a median E. coli count in 2020 of 360; using imputed values gives a trend of +0.103 and a median E. coli count in 2020 of 421.
Collecting and analysing the water samples is time-consuming and expensive. A version of the data on LAWA that was available in 2018, but is no longer available, also listed the time of collection, which showed that nearby sites are sampled very nearly at the same time; this makes their readings correlated. We find that 434 pairs of sites have at least 30 measurements taken on the same day with a correlation between the same-day measurements greater than 0.5. The most highly correlated sites are 42 and 43, two sites on the Manawatū river in Palmerston North less than 3 km apart. Their correlation coefficient is 0.93 (see Figure 6), which means that their data cannot be pooled to reduce uncertainty in the trend. Sample collection takes place mostly in the middle of the month (Fig. 6), even more so from 2013, which increases the chances of same-day collection, even between distant sites. The correlation may be due to an underlying cause, such as E. coli outbreaks linked to the same source, or rainfall triggering outbreaks.
3 Trends in the parametric model
The lognormal parametric model is convenient for trend determination. We first performed linear regressions of $\log E$ against time for each site. The standard deviations of the residuals in the first half and in the second half of each site's record were compared; over the set of all sites, the differences were not significant. Therefore we adopted the model

$$\log E(t) \sim N(\mu + b t,\ \sigma^2),$$

where the parameter $b$ is the trend and $\sigma$ is the standard deviation of the residuals. A value of $b = -0.05$ means a 5% reduction in E. coli levels per year. In this model, the trends in all percentiles are equal, although the G260, P50, P95, and G540 criteria could be crossed at different times. The measured trends and their standard errors are shown in Figure 7.

The state A–E at time $t$ is then determined from the parameters $(\mu + bt, \sigma)$. Note that $\sigma$ here is the standard deviation of the residuals, which is smaller than the standard deviation of the data as used for state determination. The standard errors of the parameters are also obtained. We report these at time 2020.
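The trend and its standard error come from an ordinary least-squares fit of $\log E$ against time. A minimal sketch (the synthetic example and its parameters are ours):

```python
import numpy as np

def fit_trend(times, counts):
    """OLS fit of log E = mu + b*t + eps.  Returns the trend b
    (per year, in log units; b = -0.05 is roughly a 5% annual
    reduction) and its standard error."""
    t = np.asarray(times, dtype=float)
    y = np.log(np.asarray(counts, dtype=float))
    X = np.column_stack([np.ones_like(t), t])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma = resid.std(ddof=2)                 # residual sd, 2 fitted params
    se_b = sigma / (np.sqrt(len(t)) * t.std())
    return coef[1], se_b

# Synthetic site: 10 years of monthly data with a true trend of -0.05
rng = np.random.default_rng(7)
t = np.arange(120) / 12.0
counts = np.exp(np.log(200) - 0.05 * t + rng.normal(0, 1.0, size=120))
b, se = fit_trend(t, counts)
```

With residual standard deviation 1.0 and 120 monthly samples, the standard error of the trend is about 0.03, so even a true trend of −0.05 is barely resolved.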
The results are shown in Table 3 (numbers of sites in each category in 2020); Figure 9 (a visualisation of the impact of 10-year trends on the 2020 category); Figures 10–12 (each data set plotted separately along with its trend); and a full table of numerical state and trend results for each site.
Of the 135 sites, 17 are improving at least at the 1-sigma level, 9 at least at the 2-sigma level, and 6 at the 3-sigma level; 18 sites are deteriorating at least at the 1-sigma level, 13 at least at the 2-sigma level, and 5 at the 3-sigma level. At 67 sites (50%) the trend was not significant at the 1-sigma level. The average trend is indistinguishable from 0.
However, each measured trend is subject to a sampling error; these have a median value of 0.034. This has the effect of smearing out the distribution of observed trends. (If all true trends were equal, we would observe a normal distribution of trends with standard deviation 0.053.) If the true trends are normally distributed with variance $\sigma_{\mathrm{true}}^2$, and the sampling errors are independent and normally distributed with mean 0 and variance $\sigma_{\mathrm{err}}^2$, then the observed trends will be normally distributed with variance $\sigma_{\mathrm{true}}^2 + \sigma_{\mathrm{err}}^2$. Under these assumptions, the true distribution of trends is likely to be more concentrated than shown in the figure. However, the correlation between measurements at nearby sites makes it difficult to take this observation further.

Trends that are significant at less than the 1-sigma level should be treated with caution. (Sites with any improving trend have been called ‘more likely to be improving than not’ [11].) Site 47, Manawatū at Teacher’s College, showed a quite strong improving trend of about $b = -0.15$ (i.e., 15% per year) consistently from early 2013 to mid-2017. This was then reversed by several particularly high readings of up to 23,000, and the site is now showing a deteriorating trend. It is not possible to determine from the data whether the water quality was actually improving until mid-2017 or whether this was just an artefact of noise, as the pattern seen here is typical of independent random samples.
How many samples are needed to detect a trend?
For data drawn from the model with $m$ uniformly-spaced samples per year for $T$ years, the expected standard error of the trend is $\sigma \sqrt{12/(m T^3)}$ [1]. Adopting the (very minimal) requirement that to detect a trend $b$ the expected standard error must be less than $|b|$, we need $m T^3 > 12\,\sigma^2 / b^2$.
Consider a site with a typical value of $\sigma$. With 10 years of data, moving from 12 to 52 samples per year reduces the smallest detectable trend by a factor of $\sqrt{52/12} \approx 2$. However, with only 5 years of data, the detectable trends are larger by a factor of $(10/5)^{3/2} \approx 2.8$ at the same sampling rate. In our dataset, only 5 of the 135 sites had trends strong enough to be detected this easily (sites 9, 17, 19, 26, 29), and some of these are exceptional (e.g., site 17, significant due to the installation of a new water treatment plant, and site 19 due to highly acidic water in the Whangaehu river, as at site 21, at which scarcely any E. coli are ever detected).
Conversely, a number of sites have moderate improving trends—good enough to move from the D/E boundary to the C/D boundary in 10 years. To detect such a trend within 5 years, even at the 1-sigma level, needs weekly monitoring.
We conclude that obtaining useful trends within 5 years requires more frequent measurements than is current practice.
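The sampling requirement can be tabulated directly from the standard error formula; $\sigma = 1.2$ below is an illustrative residual standard deviation chosen by us, not a value from the data:

```python
from math import sqrt

def trend_se(sigma, m, T):
    """Expected standard error of the fitted trend for m uniformly
    spaced samples per year over T years with residual sd sigma:
    sigma * sqrt(12 / (m * T**3))."""
    return sigma * sqrt(12 / (m * T**3))

# Halving the record length raises the standard error by 2**1.5 ~ 2.8;
# weekly instead of monthly sampling lowers it by sqrt(52/12) ~ 2.1.
for T in (5, 10):
    for m in (12, 52):
        print(f"T={T:2d} years, m={m:2d}/year: se = {trend_se(1.2, m, T):.4f}")
```

The $T^3$ dependence is the key point: record length helps far more than sampling frequency, so short records must compensate with much denser sampling.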
                   A    B    C    D    E

Percentile state  12    8    2   60   53
Average state     12   10    2   44   67
2020 state        16    7    3   44   61

Table 3: Numbers of sites in each category in 2020.
4 Discussion
Observations

Looking at the data in Figure 8, the classification system—based on four different criteria of two different types (percentiles and percentage exceedances)—seems unnecessarily complex. Very similar classifications, with a very similar relationship to health risks, can be obtained from a single criterion based on a single percentile. The 67th percentile is easier to measure reliably than the 95th, and would classify the data just as well. It would also allow the river state to be summarized by a single number, which retains more information than a classification into five classes.

This is backed up by Hunter [6], who concluded, “the estimation of the 95th percentile using any of the methods examined here offers little advantage over the current percentage exceedence approach, other than offering a false sense of increased accuracy…It is concluded that a move to use of 95th percentile calculations in determining compliance with bathing water standards has no statistical validity and has a number of disadvantages.”

The definition of category C means that very few sites fall into it. Changing the category boundaries so that more nearly equal numbers of sites fall into categories B and C would allow the categorisation system to convey more information.

The system used in LAWA [7] for reporting states and trends is not transparent. For example, nearly all sites in the region are reported as showing either an improving or a deteriorating trend; most likely, many of these trends are not significant. In the parametric model, only 24% of the trends are significant at the 2-sigma level, without accounting for multiple testing or correlation between sites.

We have not tested the validity of the lognormal model in detail. 95% of the residuals are consistent with the normal distribution, but the tails are asymmetric: 1.2% of residuals lie beyond the threshold in one tail and 4% in the other (cf. 2.2% for the normal distribution). The discrepancy in the heavier tail could repay further examination.
Recommendations

Examine why the data were rounded and, if possible, avoid rounding.

Examine why the data were truncated and, if possible, avoid truncation (censoring). These processes may be losing useful information for no good reason.

If possible, use an E. coli analysis method sensitive down to 1 bacterium/100 mL.

Faster, more reliable detection of trends needs more frequent sampling. It is more cost-effective to sample more frequently from fewer sites, as sampling nearby sites at the same time yields almost no independent information.

Re-examine the criteria used for reporting water quality states and trends. In particular, the 95th percentile criterion is the deciding factor for swimmability for many sites, and should be examined carefully.
Acknowledgements
This project arose from a problem presented by Horizons Regional Council (HRC) at the 2018 meeting of the Mathematics in Industry Study Group of New Zealand. We thank HRC (particularly Stacey Binsted, Abby Matthews and Manas Chakraborty) for their advice and the other members of the working group (including Ali Abdul Adheem, Jamas Enright, Anton Good, Christian Offen, Cami Sawyer, Sunchai Tongsuksai, Alex White, and Matt Wilkins) who examined various aspects of freshwater quality monitoring in the Horizons region.
References
 [1] S Ahn and J A Fessler (2003). Standard errors of mean, variance, and standard deviation estimators. EECS Department, The University of Michigan, 12.
 [2] D E Angelescu, V Huynh, A Hausot, G Yalkin, V Plet, J M Mouchel, S Guérin-Rechdaoui, S Azimi, V Rocher, Autonomous system for rapid field quantification of Escherichia coli in surface waters, J. Appl. Microbiology 126 (2019), 332–343.
 [3] S Chatterjee and D L McLeish, Fitting linear regression models to censored data by least squares and maximum likelihood methods, Commun. Stat. Th. Meth. 15 (1986): 3227–3243.
 [4] V Fedorov, F Mannino, and R Zhang, Consequences of dichotomization, Pharm. Stat. 8 (1) (2009), 50–61.
 [5] C Fraser and T Snelder, State and Trends of River Water Quality in the Manawatū–Whanganui Region, Horizons Report 2018/EXT/1619, November 2018
 [6] P R Hunter, Does calculation of the 95th percentile of microbiological results offer any advantage over percentage exceedence in determining compliance with bathing water quality standards?, Lett. Appl. Microbiology 34 (2002), 283–286.
 [7] LAWA (Land Air Water Aotearoa), http://lawa.org.nz, accessed December 2020.
 [8] G McBride, National Objectives Framework: Statistical considerations for design and assessment, NIWA, September 2016.
 [9] Ministry for the Environment, National Policy Statement for Freshwater Management 2020, Publication ME 1518, 2020. Available online at https://www.mfe.govt.nz/publications/freshwater/nationalpolicystatementfreshwatermanagement2020
 [10] R S Parrish, Comparison of quantile estimators in normal sampling, Biometrics 46 (1990), 247–257.
 [11] T Snelder, Assessment of recent reductions in E. coli and sediment in rivers of the Manawatū–Whanganui Region, LWP Client Report 201706, January 2018.
 [12] P S Sørensen, J La Cour Jansen, and H Spliid, Statistical control of hygienic quality of bathing water, Environmental Monitoring and Assessment 17 (1991), 217–226.
 [13] A Stuart and K Ord, Kendall’s Advanced Theory of Statistics, Volume 1, Distribution Theory, 6th ed., Wiley 2010.