2.3 Methodology

2.3.2 Outlier Detection Test: Adjusted Tukey’s Method

Another concern that affects the quality of datasets is the presence of extreme values, which are called “outliers”. An outlier is a value that is extremely different from the rest of the data distribution (Ohio-EPA, 2012). Outliers may originate from several sources, such as point anthropogenic inputs and/or measurement errors in the water quality data used (Rousseeuw & Hubert, 2011). Outlier removal not only eliminates misleading values but also cleans the data of anthropogenic effects. However, US EPA (2009) pointed out that discarded outliers may be part of the background population, and that these outliers may have been generated from natural sources even if they are

outlier detection method should not be restrictive, in order to prevent the loss of real, naturally derived observations that belong to the dataset and form a meaningful part of the background concentration. Outliers distort the data distribution and generally cause overestimation or underestimation in statistical analyses. In this study, outliers were detected with a methodology that eliminates only values that are evidently high in magnitude, without being overly restrictive, because a high concentration alone is not sufficient to indicate anthropogenic inputs.

For the selection of the outlier detection method, the qualitative and quantitative characteristics of the data should be taken into account, based on criteria such as the number of data points and the type of data distribution. The Standard Deviation Method, Dixon’s Method, and the Rosner Method are some of the techniques developed for outlier detection. Most of these methods require a small number of data points and homogeneously distributed datasets. However, for the data obtained from the Yeşilırmak River Basin, it was found that these methods cannot be implemented, because the large number of data points and the non-homogeneous distribution of the available datasets do not satisfy their requirements.

Besides, the range of the individual data values is quite wide, which leads to sharp fluctuations throughout the dataset. In other words, the difference between the smallest and largest values is considerably high for the measurement results from the Yeşilırmak River Basin. This situation creates the need for an advanced and relatively strict method that does not classify a broad range of data as outliers. On this basis, the Adjusted Tukey’s Method, which is preferred for large, wide-ranging, and non-homogeneously distributed datasets, was implemented for outlier detection.

The conventional methods mentioned above classify a large portion of the dataset as outliers and exclude these values. In contrast, the Adjusted Tukey’s Method provides a substantially stricter approach by eliminating only data points of excessively high magnitude. In this study, upper and lower limits were determined for the detection of outliers in the datasets of each metal and metalloid by implementing the Adjusted Tukey’s Method. Since the below-LOD values were treated with the three different approaches explained in the previous section, the lower limits of the data were not treated further within the outlier detection method. Therefore, only upper limits were evaluated for outlier detection. One of the main advantages of this method is that it provides a reliably high upper limit for the exclusion of outliers, so that the risk of losing data that may belong to natural inputs is avoided, given the wide range of the existing data. By using Equation (1), data values that exceed the upper limit determined by the Adjusted Tukey’s Method were excluded from the datasets (European Commission Joint Research Center, 2015).

Upper Fence Limit = Q3 + K × IQR    (1)

where,

1st Quartile (Q1) = the value below which 25% of the data in the dataset fall
3rd Quartile (Q3) = the value below which 75% of the data in the dataset fall
Interquartile Range (IQR) = Q3 - Q1
K = 1000 (constant for non-normal distribution)

In principle, the upper fence limit given in Equation (1) defines the maximum allowable distance from the median value of the dataset. As illustrated in Figure 2.1, the distance between the upper fence limit and the lower fence limit represents the maximum allowable data spread. Values above the upper fence limit are defined as outliers.

This approach gives information about the extent of deviation from the common data distribution. Another component of Equation (1) is the constant K, which scales the interquartile range that is added to Q3. Different versions of Tukey’s Method use different values of this constant, from 10 to 1000, depending on the characteristics of the data. However, small K values bring about a quite narrow spread.
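The effect of the constant K on the upper fence limit of Equation (1) can be illustrated with a minimal Python sketch. The function names and the sample values below are hypothetical and are not taken from the study. The quartiles are computed here with linear interpolation (the "inclusive" method of the standard library), which is an assumption, since the quartile convention is not stated in the text.

```python
import statistics


def upper_fence(values, k=1000.0):
    """Upper fence limit = Q3 + k * IQR, as in Equation (1)."""
    # Quartile convention is an assumption: linear interpolation ("inclusive").
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def remove_upper_outliers(values, k=1000.0):
    """Keep only the values that do not exceed the upper fence limit."""
    fence = upper_fence(values, k)
    return [v for v in values if v <= fence]


# Hypothetical concentrations (illustrative only, not data from the study).
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0, 20000.0]

for k in (1.5, 10.0, 1000.0):
    kept = remove_upper_outliers(sample, k)
    print(f"K={k}: fence={upper_fence(sample, k)}, kept {len(kept)} of {len(sample)}")
```

For this sample, Q1 = 3.25 and Q3 = 7.75, so the fence rises from 14.5 (K = 1.5) to 52.75 (K = 10) and 4507.75 (K = 1000). The moderately elevated value of 40 is discarded only with the smallest K, while the extreme value of 20000 is discarded in all three cases.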

Therefore, the adjusted version of Tukey’s Method, which provides a rigorous and high-certainty practice, was preferred for outlier determination, with the use of a high-magnitude constant (K = 1000). The European Commission Joint Research Center (2015) reported that the use of K = 1000 allows datasets to retain the variation that is part of the data population by ensuring that a maximum of 31% of the data population is discarded as outliers. Moreover, the impact of elevated data on the dataset was decreased further by using low percentile analysis for the determination of the background concentration. Hence, a compensating approach was followed: only extremely elevated outliers were removed before the conservative low percentile analysis. In line with this principle, the outlier test aimed to obtain a distant upper limit, which is desirable for widely spread datasets, in order to eliminate outliers in a sound manner.
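The two-step procedure described above (removal of extreme values, followed by a low percentile analysis) can be sketched as follows. This is only an illustration under stated assumptions: the 10th percentile is a placeholder, since the percentile used for the background concentration is not given in this section, and the function name, the quartile convention, and the sample values are hypothetical.

```python
import statistics


def background_estimate(values, k=1000.0, percentile=10):
    """Remove values above Q3 + k * IQR, then take a low percentile of the rest.

    Returns (low-percentile value, fraction of the data discarded).
    The default percentile of 10 is a placeholder assumption.
    """
    # Quartile convention is an assumption: linear interpolation ("inclusive").
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    kept = [v for v in values if v <= fence]
    discarded_fraction = 1.0 - len(kept) / len(values)
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cut_points = statistics.quantiles(kept, n=100, method="inclusive")
    return cut_points[percentile - 1], discarded_fraction


# Hypothetical concentrations (illustrative only, not data from the study).
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0, 20000.0]

estimate, discarded = background_estimate(sample)
print(f"low-percentile estimate: {estimate}, discarded fraction: {discarded:.2f}")
```

In this sketch only the single extreme value is removed, and the low percentile of the remaining data stays close to the lower end of the distribution. This is consistent with the reasoning that a low percentile is only weakly affected by the elevated values that remain after a lenient outlier test.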

Figure 2.1. Visual Representation of Outlier Detection Methodology by Adjusted Tukey’s Method (European Commission Joint Research Center, 2015)