

Journal of Applied Statistics, Vol. 40, No. 7, 1506–1519, http://dx.doi.org/10.1080/02664763.2013.788617

Comparing the methods of measuring multi-rater agreement on an ordinal rating scale: a simulation study with an application to real data

Y. Sertdemir^a∗, H.R. Burgut^a, Z.N. Alparslan^a, I. Unal^a and S. Gunasti^b

^a Department of Biostatistics and Medical Informatics, Cukurova University School of Medicine, Adana, Turkey; ^b Department of Dermatology, Cukurova University School of Medicine, Adana, Turkey

(Received 24 July 2012; accepted 19 March 2013)

Agreement among raters is an important issue in medicine, as well as in education and psychology. The agreement between two raters on a nominal or ordinal rating scale has been investigated in many articles. The multi-rater case with normally distributed ratings has also been explored at length. However, there is a lack of research on multiple raters using an ordinal rating scale. In this simulation study, several methods for analyzing rater agreement were compared. The special case focused on was the multi-rater case using a bounded ordinal rating scale. The proposed methods for agreement were compared within different settings. Three main ordinal data simulation settings were used (normal, skewed and shifted data). In addition, the proposed methods were applied to a real data set from dermatology. The simulation results showed that Kendall's W and the mean gamma highly overestimated the agreement in data sets with shifts in the data. ICC4 for bounded data should be avoided in agreement studies with rating scales <5, where this method highly overestimated the simulated agreement. The difference in bias for all methods under study, except the mean gamma and Kendall's W, decreased as the rating scale increased. The bias of ICC3 was consistent and small for nearly all simulation settings except the low agreement setting in the shifted data set. Researchers should be careful in selecting agreement methods, especially if shifts in ratings between raters exist, and may need to apply more than one method before any conclusions are made.

Keywords: agreement; multi-rater; bounded ordinal scale; normal distribution; skewed distribution

1. Introduction

Agreement among raters is a well-known issue in medical statistics; for instance, agreement between two radiologists evaluating radiographic findings on a nominal scale (i.e. positive/negative). The first method used for this purpose was the kappa proposed by Cohen [4].

∗Corresponding author. Email: yasarser@cu.edu.tr


Variants of the kappa have been proposed by Scott [19] and Maxwell and Pilliner [18]. Some of these authors used the kappa statistic as an intraclass coefficient. Cohen [4] and Spitzer [25] generalized the kappa for inter-rater agreement to the case in which the relative importance of each possible disagreement could be quantified. Fleiss and Cohen [10] proposed squared weights based on the difference of ordinal categories; they showed that the weighted kappa is identical to the intraclass correlation coefficient (ICC), aside from a term involving the factor 1/n. The gamma coefficient, which measures association, can be used as a measure of inter-rater agreement for ordinal scale ratings. It should be noted that shifts in ratings may result in strong association without strong agreement. Therefore, measuring agreement and measuring association may not be the same. Detailed discussions of the gamma can be found in Goodman and Kruskal [11], Siegel [23], and Siegel and Castellan [24]. Krippendorff’s α is another measure of agreement which has the advantage that it can be applied to nominal, ordinal or continuous ratings, and to incomplete data sets where some raters do not rate all items or subjects [15].

Intraclass correlation is commonly used to quantify the degree to which individuals with a fixed degree of relatedness resemble each other in terms of a given quantitative trait; it is also used to assess the consistency or reproducibility of quantitative measurements by different observers measuring the same quantity. Beginning with Fisher [8], intraclass correlation has been regarded within the framework of analysis of variance (ANOVA) and, more recently, in the framework of random effect models. Fleiss [9] provided an overview of the ICC as a measure of agreement for two or more methods or raters.

De Mast and Van Wieringen [6] proposed two modified ICC approaches (bounded and unbounded) to evaluate the agreement in multi-rater ordinal data, and compared these approaches with previously proposed methods for agreement in ordinal ratings using one set of artificial data and one real data set. However, no statements were made as to advantages and/or disadvantages of the methods under comparison. Therefore, questions still remain regarding which approach to employ as a measure of agreement with more than two raters on a bounded or unbounded equidistant or close to equidistant ordinal scale.

In this study, we aimed to compare the bias of the methods under study using simulated ordinal data under different settings, and to observe under which simulated settings they perform better or worse. Another aim was to analyze rater agreement for a real data set. The data used were from a study conducted in a dermatology department in which several raters rated skin lesions on an ordinal scale from 0 to 6 and compared the agreement among raters before training and after training [12].

2. Methods for the evaluation of agreement

Besides the standard methods already proposed for the evaluation of agreement among multiple raters (i.e. the ICC, Kendall's W, Krippendorff's α, the ICC for unbounded ordinal data (ICC3) and the ICC for bounded ordinal data (ICC4)), we applied other approaches such as the weighted (squared) kappa and the gamma coefficient of association to all possible pair combinations of raters and then used the mean value of these combinations to evaluate reliability/agreement, keeping in mind that agreement implies correlation but not vice versa.

2.1 The gamma coefficient of association

The gamma measures the strength of association within cross-tabulated data when both variables are measured on an ordinal scale [20]. The value of the gamma test statistic, G, depends on two quantities: N_s, the number of pairs of cases ranked in the same order on both variables (i.e. the number of concordant pairs), and N_d, the number of pairs of cases ranked in the opposite order on both variables (i.e. the number of discordant pairs):

$$G = \frac{N_s - N_d}{N_s + N_d}. \qquad (1)$$

The test statistic can be understood as estimating the theoretical quantity γ

$$\gamma = \frac{P_s - P_d}{P_s + P_d}, \qquad (2)$$

where P_s and P_d are the probabilities that a randomly selected pair of observations will place in the same or opposite order, respectively, when ranked by both variables.

Critical values for the gamma test statistic are sometimes found by approximation, whereby a transformed value z of the test statistic has a normal or Student's t distribution:

$$z = G\sqrt{\frac{N_s + N_d}{N(1 - G^2)}}, \qquad (3)$$

where N denotes the number of observations rather than the number of pairs [20].
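As a concrete illustration, the sketch below computes G, its approximate z statistic from Equation (3), and the mean gamma over all rater pairs as used later in the paper. It is only a minimal Python sketch: the function names and the brute-force enumeration of pairs are illustrative choices, and tied pairs simply contribute to neither N_s nor N_d.

```python
import math
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma and its approximate z statistic for two ordinal rating vectors."""
    ns = nd = 0                               # concordant / discordant pair counts
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        prod = (x1 - x2) * (y1 - y2)
        if prod > 0:
            ns += 1                           # same order on both variables
        elif prod < 0:
            nd += 1                           # opposite order (ties are ignored)
    g = (ns - nd) / (ns + nd)                 # Equation (1)
    z = g * math.sqrt((ns + nd) / (len(x) * (1 - g ** 2)))   # Equation (3)
    return g, z

def mean_gamma(ratings):
    """Mean gamma over all rater pairs; ratings is a list of per-rater score lists."""
    pairs = list(combinations(range(len(ratings)), 2))
    return sum(goodman_kruskal_gamma(ratings[j], ratings[k])[0]
               for j, k in pairs) / len(pairs)
```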

2.2 The weighted kappa (kappa W)

Cohen [4] proposed the kappa measure, which is designed for nominal classifications. When categories are ordered, the degree of disagreement depends on the difference between the ratings; even under nominal classification, some disagreements may be considered more substantial than others. The weighted kappa [5] uses weights {w_ij} to describe the closeness of agreement. For weights satisfying 0 ≤ w_ij ≤ 1, with all w_ji = w_ij and w_ii = 1, the weighted agreement is Σ_i Σ_j w_ij π_ij, and the weighted kappa is

$$K_w = \frac{\sum_{i}\sum_{j} w_{ij}\pi_{ij} - \sum_{i}\sum_{j} w_{ij}\pi_{i+}\pi_{+j}}{1 - \sum_{i}\sum_{j} w_{ij}\pi_{i+}\pi_{+j}}. \qquad (4)$$

For the squared weights {w_ij = 1 − (i − j)²/(I − 1)²} suggested by Fleiss and Cohen [10], where i and j are the ratings given by two raters and I is the number of categories, agreement is greater for cells nearer the main diagonal [2,10,25].
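A minimal sketch of the squared-weight kappa of Equation (4) for a single pair of raters, assuming ratings coded 1, ..., I; the function name and coding convention are illustrative, not from the paper. The mean kappa W used later is simply the average of this quantity over all rater pairs.

```python
import numpy as np

def weighted_kappa_squared(r1, r2, n_categories):
    """Weighted kappa with the squared weights of Fleiss and Cohen [10]."""
    I = n_categories
    # Joint proportion table pi_ij of the two raters' ratings (coded 1..I)
    pi = np.zeros((I, I))
    for a, b in zip(r1, r2):
        pi[a - 1, b - 1] += 1.0 / len(r1)
    row, col = pi.sum(axis=1), pi.sum(axis=0)        # pi_{i+}, pi_{+j}

    i, j = np.indices((I, I))
    w = 1.0 - (i - j) ** 2 / (I - 1) ** 2            # squared weights

    observed = (w * pi).sum()                        # weighted agreement
    expected = (w * np.outer(row, col)).sum()        # chance-expected weighted agreement
    return (observed - expected) / (1.0 - expected)  # Equation (4)
```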

2.3 Kendall’s W

Kendall's coefficient of concordance is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used to assess agreement among multiple raters. Kendall's W ranges from 0 (no agreement) to 1 (complete agreement). There is a close relationship between Friedman's two-way ANOVA without replication by ranks and Kendall's coefficient of concordance: they can be used to address hypotheses concerning the same data table, and they use the same χ² statistic for testing. They differ only in the formulation of their respective null hypotheses [13]:

Friedman's H0: The n objects (sites) are drawn from the same statistical population.
Kendall's H0: The k judges (species) produced independent rankings of the objects.


S is computed first from the row-marginal sums of ranks R_i = Σ_{j=1}^{k} r_{i,j}, where object i is given the rank r_{i,j} by rater j, and the mean of these total ranks is R̄ = (1/2)k(n + 1) [12,15,24]:

$$S = \sum_{i=1}^{n} (R_i - \bar{R})^2. \qquad (5)$$

Note that S is a sum-of-squares statistic over the row sums of ranks R_i. Kendall's W statistic can be obtained from the following formula:

$$W = \frac{12S}{k^2(n^3 - n) - kT}, \qquad (6)$$

where n is the number of objects, k the number of raters and T a correction factor for tied ranks [13,23,24,26].
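A short sketch of Equations (5) and (6), assuming a complete k × n table of ratings; the use of mid-ranks and the usual tie correction T = Σ(t³ − t) over tie groups are implementation choices not spelled out above.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's W from a (k raters x n objects) array of ordinal ratings."""
    ratings = np.asarray(ratings, dtype=float)
    k, n = ratings.shape

    ranks = np.vstack([rankdata(row) for row in ratings])  # mid-ranks within each rater
    R = ranks.sum(axis=0)                                  # row-marginal rank sums R_i
    S = ((R - R.mean()) ** 2).sum()                        # Equation (5)

    # Tie correction: sum over raters of sum(t^3 - t) over groups of tied ratings
    T = 0.0
    for row in ratings:
        _, counts = np.unique(row, return_counts=True)
        T += (counts ** 3 - counts).sum()

    return 12.0 * S / (k ** 2 * (n ** 3 - n) - k * T)      # Equation (6)
```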

2.4 Krippendorff’s α

Krippendorff’s α is a reliability coefficient developed to measure agreement between observers, coders, judges, raters, or measuring instruments [15]. The general form of α is as follows:

$$\alpha = 1 - \frac{D_o}{D_e}, \qquad (7)$$

where D_o is the observed disagreement and D_e denotes the disagreement that would be expected when the rating of units is attributable to chance rather than to the properties of the rated objects. If the observed disagreement D_o = 0 and α = 1, the reliability is perfect. When observers agree only by chance, D_o = D_e and α = 0, which indicates the absence of reliability; α would be equal to 0 if observers simply fabricated their data, as if by throwing dice [15]. For ordinal metric differences, values indicate ranks, and the differences between ranks depend on how many ranks they are apart from each other [15]. For a more detailed description, the reader is referred to Krippendorff [15].
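For completeness, a rough sketch of α with the ordinal metric, assuming complete data (every rater rates every unit, m ≥ 2 raters); the coincidence-matrix construction follows Krippendorff's general description, but the helper names and loops are illustrative and the missing-data case is not handled.

```python
import numpy as np

def krippendorff_alpha_ordinal(ratings):
    """Krippendorff's alpha with the ordinal metric for an (n_units x m raters) array."""
    ratings = np.asarray(ratings)
    n_units, m = ratings.shape
    cats = np.unique(ratings)                 # observed categories in rank order
    idx = {c: i for i, c in enumerate(cats)}
    a = len(cats)

    # Coincidence matrix: each ordered pair of ratings within a unit adds 1/(m-1)
    o = np.zeros((a, a))
    for u in range(n_units):
        for j in range(m):
            for l in range(m):
                if j != l:
                    o[idx[ratings[u, j]], idx[ratings[u, l]]] += 1.0 / (m - 1)

    n_c = o.sum(axis=1)                       # marginal category frequencies
    n_total = n_c.sum()

    # Ordinal squared distances between categories c and k (c < k)
    delta2 = np.zeros((a, a))
    for c in range(a):
        for k in range(c + 1, a):
            s = n_c[c:k + 1].sum() - (n_c[c] + n_c[k]) / 2.0
            delta2[c, k] = delta2[k, c] = s ** 2

    D_o = (o * delta2).sum() / n_total                                   # observed
    D_e = (np.outer(n_c, n_c) * delta2).sum() / (n_total * (n_total - 1))  # expected
    return 1.0 - D_o / D_e                    # Equation (7)
```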

2.5 Intraclass correlation coefficient

The ICC measures the correlation between different measurements X_ij and X_ik of a single object i. The ICC is the ratio of object variation to the total observed variation or, equivalently, the correlation among multiple measurements of the same object. Reliability or agreement is often expressed in the form of an ICC [16,21,24]. In the past, intraclass correlation had been addressed within the framework of ANOVA; more recently, it has been considered in the framework of random effect models. Several ICC estimators have been proposed. Most of these can be defined in terms of random effect models, where the observations X_ij are generally assumed to follow the model X_ij = Z_i + ε_ij, where Z_i ∼ N(μ_p, σ_p²) is the true value of object i and ε_ij ∼ N(0, σ_e²) is the random measurement error [16,21]. According to this model, the distribution of the measurement error is symmetric around and independent of the object's true value.

The population ICC in this framework is

$$\mathrm{ICC} = \frac{\mathrm{Cov}(X_{ij}, X_{ik})}{\sqrt{\mathrm{Var}(X_{ij})\,\mathrm{Var}(X_{ik})}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2}, \qquad (8)$$

where X_ij and X_ik are the ratings of rater j and rater k for object i. The ICC can only assume values in the interval [0, 1]. A value of 1 corresponds to perfect reliability, and a value of 0 indicates a measurement system that is no more consistent than would be expected by chance [16,21].


One-way ANOVA gives estimates for the variance components. Using MS_w and MS_b to denote the within- and between-group mean squares, respectively, a biased but consistent estimator of the ICC is as follows [16,21]:

$$\mathrm{ICC}_1 = \frac{MS_b - MS_w}{MS_b + (k - 1)MS_w}. \qquad (9)$$

This estimate is only acceptable if the objects i = 1, ..., n are sampled randomly from the population. If this is not the case, σ_p² should be estimated from a historical sample. In practice, it should be easier to estimate σ_total² = σ_p² + σ_e², because it is generally not possible to obtain measurements without measurement spread [6,21,22].

In the case where a random sample of k raters is selected from a larger population, and each rater rates n objects together [21], the estimator of the ICC is

$$\mathrm{ICC}_2 = \frac{MS_b - MS_w}{MS_b + (k - 1)MS_w + k(MS_b - MS_w)/n}. \qquad (10)$$

Several different ICC statistics have been proposed, not all of which estimate the same population parameter. There has been considerable debate regarding which ICC statistics are appropriate for a given use, as they may produce markedly different results for the same data [14].

Shrout and Fleiss [22] have discussed ICC statistics in detail and listed them by measurement type (i.e. single or average) as well as whether they are one-way or two-way and random or mixed. The ICC and similar methods assume that errors follow a normal distribution. In the case of ordinal data, assuming that the ordinal ratings arise from an underlying normally distributed true value allows one to apply ICC-like methods.
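The sketch below computes ICC1 and ICC2 directly from the one-way mean squares as printed in Equations (9) and (10); it assumes a complete n × k matrix of ratings, and the function name is illustrative.

```python
import numpy as np

def icc_oneway(X):
    """ICC1 and ICC2 from one-way ANOVA mean squares on an (n objects x k raters) array."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)

    ms_b = k * ((row_means - grand) ** 2).sum() / (n - 1)         # between-object MS
    ms_w = ((X - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-object MS

    icc1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)                # Equation (9)
    icc2 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w
                            + k * (ms_b - ms_w) / n)              # Equation (10)
    return icc1, icc2
```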

2.6 Measurement system analysis for bounded ordinal data

The choice between the two main approaches (parametric/non-parametric) for application with bounded ordinal data relates to the distinction between the situation where one deals with a scale that is intrinsically bounded and ordinal, and the situation where one is in fact dealing with a continuous variable that is mapped by the measurement system onto a bounded ordinal scale. In the first situation one cannot use methods based on standard deviations and correlations, because these methods assume a distance metric on the measurement scale. One has to resort to non-parametric methods. In the second situation, the ordinal scale can be equipped with a distance metric, which it inherits via the map (formed by the measurement system) from the underlying continuous scale. This enables the use of methods based on standard deviations and correlations. The underlying continuous scale need not be known and the object’s true value is treated as a latent variable [6].

When trying to apply the ICC or ICC-like methods to ordinal data, we come across two serious problems, which relate to

(1) a distance metric for the measurement scale and (2) the distributional properties of the measurement error.

Problem 1: Ordinal scales only have an order defined, and not a distance metric. The ICC method, however, makes use of standard deviations and correlations, which are only defined for measurement scales with a well-defined distance metric. Not until the ordinal scale is extended with a metric can we apply ICC-type methods. In effect, this extension transforms an ordinal scale into a discrete scale.

Problem 2: The standard ICC method assumes that (a) the measurement error is symmetrically distributed around the object's true value and (b) the distribution of the measurement error is the same whatever the true value is. Both of these assumptions are natural in the study of measurement error, and we wish to introduce similar assumptions for the bounded ordinal case. Neither assumption can, however, be retained for bounded scales in a straightforward form: the measurement error for objects close to a bound will be skewed away from the bound. In order to apply ICC methods to bounded ordinal data, it is unavoidable to make bold assumptions on both issues. It appears possible to derive both a distance metric and a distribution for the measurement error if one is prepared to assume that underlying the measurements there is a continuous variable (the 'true' value of the object). If one is not willing to assume a continuous true value that underlies the measurements, one has to resort to non-parametric methods [6].

2.6.1 ICC for unbounded ordinal data (ICC3)

Repeated measurements X_i1, X_i2, ..., X_ik of objects i = 1, 2, ..., n are used to estimate the ICC/agreement. Following standard ICC methodology, the ICC can be estimated from a ratio of mean squares. However, for discrete data, mean squares have a bias. After correcting for this bias, the estimate is as follows [6]:

$$\mathrm{ICC}_3 = \frac{MS_b - MS_w + (k - 1)/(12k)}{MS_b + (k - 1)MS_w - (k^2 - k + 1)/(12k)}. \qquad (11)$$
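A brief sketch of Equation (11), reusing the same one-way mean squares as above; the discreteness correction terms are copied directly from the formula, and the function name is illustrative.

```python
import numpy as np

def icc3_unbounded(X):
    """ICC3: one-way ICC with the discreteness correction of Equation (11)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    row_means = X.mean(axis=1)
    ms_b = k * ((row_means - X.mean()) ** 2).sum() / (n - 1)       # between-object MS
    ms_w = ((X - row_means[:, None]) ** 2).sum() / (n * (k - 1))   # within-object MS

    num = ms_b - ms_w + (k - 1) / (12 * k)
    den = ms_b + (k - 1) * ms_w - (k ** 2 - k + 1) / (12 * k)
    return num / den
```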

2.6.2 ICC for bounded ordinal data (ICC4)

The ICC for bounded ordinal data proposed by De Mast and Van Wieringen [6] is derived as follows. It is assumed that D is a finite set whose categories are labeled 1, 2, ..., a. R denotes the domain of the true value Z and is used to define the map L_RD: R → D,

$$L_{RD}(Z) = \left\lceil \frac{a \exp(Z)}{1 + \exp(Z)} \right\rceil. \qquad (12)$$

The reverse map for k ∈ D is L_DR(k) = log((k − 0.5)/(a − k + 0.5)) = Z. L_RD is similar to the logistic transformation used in logistic regression, and Z_i ∼ N(μ_p, σ_p²) is used for Z. The measurement error is modeled as

$$P(X = k \mid Z) = p_k(Z) = \int_{L_{DR}(k-0.5)}^{L_{DR}(k+0.5)} f_{\mu=Z;\sigma_e}(t)\, dt. \qquad (13)$$

The measurement system's reliability is defined as in Equation (8). Due to the nonlinearity of L_DR, mean squares yield heavily biased estimators for the variances in Equation (8). To derive suitable estimators, De Mast and Van Wieringen [6] considered the statistics N_ik = #{X_ij, j = 1, ..., m : X_ij = k}, for i = 1, ..., n and k ∈ D. They regard the true values Z_i as fixed; given these, for a single product i the tuple (N_i1, ..., N_ia) has a multinomial distribution, which allows computation of the log-likelihood L:

$$L = \sum_{i=1}^{n} \log P(N_{i1} = n_{i1}, \ldots, N_{ia} = n_{ia}) = \sum_{i=1}^{n} \log \frac{m!}{n_{i1}! \cdots n_{ia}!} + \sum_{i=1}^{n} \sum_{k=1}^{a} n_{ik} \log\bigl(\phi(A(+)) - \phi(A(-))\bigr). \qquad (14)$$

Note that φ denotes the cumulative standard normal distribution function, and A(+) and A(−) are defined as

$$A(+) = \frac{L_{DR}(k + 0.5) - Z_i}{\sigma_e} \quad \text{and} \quad A(-) = \frac{L_{DR}(k - 0.5) - Z_i}{\sigma_e}.$$


Estimates for Z_1, ..., Z_n and σ_e² are derived from

$$\hat{Z}_1, \ldots, \hat{Z}_n, \sigma_{e,\mathrm{ml}}^2 = \arg\max \sum_{i=1}^{n} \sum_{k=1}^{a} n_{ik} \log\bigl(\phi(A(+)) - \phi(A(-))\bigr). \qquad (15)$$

The correction σ_e² = (m/(m − 1))σ_{e,ml}² is used to account for the bias that maximum likelihood estimators generally entail. σ_p² is estimated using formula (16):

$$\hat{\sigma}_p^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \hat{Z}_i - \frac{1}{n} \sum_{i=1}^{n} \hat{Z}_i \right)^2 - \frac{\sigma_e^2}{m}. \qquad (16)$$

The sample ICC is given by formula (17):

$$\mathrm{ICC}_4 = \frac{\hat{\sigma}_p^2}{\hat{\sigma}_p^2 + \hat{\sigma}_e^2}. \qquad (17)$$
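Because ICC4 has no closed form, the following is only a rough sketch of Equations (12)–(17): the log-likelihood of Equation (14) is maximised numerically over Z_1, ..., Z_n and σ_e. The optimiser (L-BFGS-B via SciPy), the log-parameterisation of σ_e and the starting values are illustrative choices, not specified by the authors; expect it to be slow for large n.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def icc4_bounded(X, a):
    """Sketch of the bounded-ordinal ICC (ICC4) for an (n objects x m raters) array of ratings 1..a."""
    X = np.asarray(X, dtype=int)
    n, m = X.shape

    def L_DR(k):                                    # inverse map D -> R
        return np.log((k - 0.5) / (a - k + 0.5))

    # n_ik: how often object i was placed in category k (notation of Eq. 14)
    counts = np.stack([np.bincount(X[i], minlength=a + 1)[1:] for i in range(n)])

    def negloglik(theta):
        Z, sigma = theta[:n], np.exp(theta[n])      # sigma_e kept positive via log scale
        ll = 0.0
        for k in range(1, a + 1):
            upper = L_DR(k + 0.5) if k < a else np.inf     # L_DR(k + 0.5)
            lower = L_DR(k - 0.5) if k > 1 else -np.inf    # L_DR(k - 0.5)
            p = norm.cdf((upper - Z) / sigma) - norm.cdf((lower - Z) / sigma)
            ll += np.sum(counts[:, k - 1] * np.log(np.clip(p, 1e-12, None)))
        return -ll

    z0 = L_DR(X.mean(axis=1))                       # crude starting values for Z_i
    res = minimize(negloglik, np.append(z0, 0.0), method="L-BFGS-B")

    Z_hat = res.x[:n]
    sigma_e2 = np.exp(res.x[n]) ** 2 * m / (m - 1)          # ML bias correction
    sigma_p2 = np.var(Z_hat, ddof=1) - sigma_e2 / m         # Equation (16)
    return sigma_p2 / (sigma_p2 + sigma_e2)                 # Equation (17)
```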

3. Simulation study

In this section, a simulation study comparing methods of measuring multi-rater agreement on an ordinal rating scale is presented. Three main simulation settings were constructed: the normal distribution setting, the skewed distribution setting and the setting with shifts in the ratings of two raters. The notation used is: i, objects; j, rating scale (number of rating categories); k, number of raters. Performance was evaluated for different numbers of raters (k = 3, 5, 7), rating scales (j = 3, 5, 7), sample sizes (n = 30, 60, 150) and agreement levels (.40, low; .60, medium; .80, high) within each of the three main simulation settings. The sample sizes 30, 60 and 150 are the smallest sample sizes needed for three or more raters in agreement studies with agreement values of .80, .60 and .40, respectively [3].

Data were simulated using formula (18):

$$Y_{ij} = Z_i + \varepsilon_{ij}, \qquad (18)$$

where Z_i denotes the true value of object i. The distribution of Z_i was allowed to change between a normal and a χ² distribution, whereas the stochastic measurement error ε_ij ∼ N(0, σ_e²) was simulated from a normal distribution for all simulation settings. The simulation parameters were μ = 0, σ_p² = 1 and σ_e² = (0.25, 0.6667, 1.5), which result in simulated agreement (SA) values of .8, .6 and .4, respectively. For the skewed data simulation, Z_i were simulated from a χ² distribution with three degrees of freedom (χ²(3)) and ε_ij ∼ N(0, σ_e²). In the shifted data simulation setting, Y_ij were calculated by applying Equation (18) to Z_i and ε_ij simulated from a normal distribution, but the values of rater 1 and rater 2 were shifted by adding 1 to their values. The last step for all three main simulation settings was the calculation of X_ij, where Y_ij were transformed to a bounded ordinal scale using Equation (19), where RND returns the nearest integer:

$$X_{ij} = \mathrm{RND}(L_{RD}(a, Y_{ij})) = \mathrm{RND}\!\left( \frac{(a + 0.5)\exp(Y_{ij}) + 0.5}{1 + \exp(Y_{ij})} \right). \qquad (19)$$
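The data-generating steps above can be sketched as follows. The random seed, the function signature and the clipping guard at the scale bounds are illustrative choices; the χ²(3) true values are used as drawn, since the paper does not state any standardisation.

```python
import numpy as np

rng = np.random.default_rng(2013)   # arbitrary seed, for reproducibility only

def simulate_ordinal_ratings(n=60, k=5, a=5, sigma_e2=0.6667,
                             skewed=False, shift_first_two=False):
    """One simulated data set following Equations (18) and (19).

    n objects, k raters, a rating categories; sigma_e2 in {0.25, 0.6667, 1.5}
    gives simulated agreement of about .8, .6 and .4 when sigma_p2 = 1.
    """
    if skewed:
        Z = rng.chisquare(df=3, size=n)            # chi-square(3) true values
    else:
        Z = rng.normal(0.0, 1.0, size=n)           # N(0, 1) true values
    eps = rng.normal(0.0, np.sqrt(sigma_e2), size=(n, k))
    Y = Z[:, None] + eps                           # Equation (18)

    if shift_first_two:                            # shifted-data setting
        Y[:, :2] += 1.0                            # raters 1 and 2 shifted by +1

    # Equation (19): map the continuous Y onto the bounded ordinal scale 1..a;
    # the clip only guards against floating-point edge cases at the bounds.
    X = np.rint(((a + 0.5) * np.exp(Y) + 0.5) / (1.0 + np.exp(Y)))
    return np.clip(X.astype(int), 1, a)            # shape (n objects, k raters)
```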


4. Results

We observed that the mean estimated agreement did not change significantly with sample size for any of the tests under study, but, as expected, their variance decreased with larger sample size. Since we are not interested in coverage probabilities, and in the interest of saving space, we present the results for only one sample size, n = 60, which is reasonable for agreement studies in medical research. The results for the methods under study are given for each rating scale within the normal, skewed and shifted simulation settings.

4.1 Normal distribution

Rating scale = 3: Comparable results were observed for mean kappa W, Krippendorff's α and ICC2. They underestimated the SA for SA = .40, .60, .80 and all values of k (raters). As expected, ICC2 is not a good choice for such short scales. The mean gamma method slightly overestimated the agreement for SA = .40; its bias increased with higher SA values but did not change with k. Kendall's W overestimated the agreement for all k with SA = .40 and had the smallest bias for k = 5 and SA = .60; it underestimated the agreement for SA = .80. ICC4 overestimated the agreement for all k and SA values. ICC3 had the smallest bias for all k and SA with j = 3 (Figure 1, first row).


Rating scale = 5: The difference in bias between the methods under study decreased. Comparable bias was observed for mean kappa W, mean gamma, Krippendorff's α and ICC2, except for SA = .80, where the mean gamma method resulted in smaller bias than the other methods. Kendall's W still highly overestimated the agreement, though with smaller bias than for j = 3; this bias still changed with j, k and SA. ICC3 yielded the smallest bias for SA = .60 and .80, independent of k. ICC4 yielded the smallest bias for SA = .4, but the magnitude of its bias changed with SA and k (Figure 1, second row).

Rating scale = 7: The difference in bias between the methods under study decreased, except for Kendall's W and mean gamma. Comparable results were observed for mean kappa W, Krippendorff's α, ICC3 and ICC4. The highest bias was observed for Kendall's W (Figure 1, last row).

4.2 Skewed distribution

Rating scale = 3: Comparable results were observed for mean kappa W, Krippendorff's α and ICC2. They underestimated the SA for SA = .40, .60, .80 and all numbers of raters (k). The lowest bias was observed for the mean gamma method. Kendall's W overestimated the agreement for SA = .40 and underestimated the agreement for SA = .80. ICC4 overestimated the agreement for all settings of SA and k. The ICC3 method slightly underestimated the agreement, but its bias did not change with k or SA (Figure 2, first row).


Rating scale = 5: Comparable underestimation was observed for mean kappa W, mean gamma, Krippendorff's α and ICC2. Slightly lower bias was observed for ICC3 compared with mean kappa W, mean gamma, Krippendorff's α and ICC2. The smallest bias was observed for ICC4, whereas Kendall's W highly overestimated the agreement for SA = .40 and .60 but highly underestimated the agreement for SA = .8, depending on the number of raters (Figure 2, second row).

Rating scale = 7: The difference in bias between the methods under study decreased overall compared with rating scale = 5, except for Kendall's W and mean gamma. Comparable results were observed for mean kappa W, Krippendorff's α, ICC3 and ICC4. The highest bias was observed for Kendall's W, where the bias varied with SA and k (Figure 2, last row).

4.3 Shift in ratings

It needs to be kept in mind that the data manipulation performed to obtain shifted data will result in lower agreement values than simulated. The zero-bias line is just a reference line for the simulated nominal agreement value; the true agreement value should always be below the simulated nominal agreement level. Even though we do not know the true level of agreement after shifting the data, we are still able to compare the results across methods.

Rating scale = 3: Comparable results were observed for mean kappa W, Krippendorff's α, ICC2 and ICC3, though ICC3 was slightly closer to the zero line. All methods were below the zero line except the mean gamma, Kendall's W and ICC4, indicating that these three overestimated the agreement for these settings (Figure 3, first row).

Rating scale = 5: Comparable results were observed for all methods except the mean gamma and Kendall's W, which highly overestimated the agreement (Figure 3, second row).

Rating scale = 7: The difference between the methods under study decreased to its minimum, except for Kendall's W and mean gamma. Kendall's W still highly overestimated the agreement, but the mean gamma was slightly below the zero-bias reference line (Figure 3, third row).

4.4 Case study: dermatology data set

Five professors, five residents, five nurses and 10 medical student interns from the department of dermatology at Cukurova University Hospital in Adana, Turkey were asked to evaluate pictures of skin lesions using ABCD (i.e. asymmetry, border irregularity, color variation and diameter greater than 6 mm) criteria for diagnosing malignant melanoma [1,17]. The criteria can take values from 0 to 6. Raters were asked to evaluate 54 out of 108 randomly chosen pictures before being trained regarding ABCD; the remaining 54 pictures were evaluated after this training [12]. A frequency polygon of ratings for each professor (before and after training) is given in Figure 4.

Rater 2 (R2) and rater 4 (R4) seem to rate similarly to each other but differently from the remaining raters. Rater 1 (R1), rater 3 (R3) and rater 5 (R5) avoid using the lowest rating value, zero (Figure 4, before training). This difference between raters seems to decrease after training, and only R1 and R3 still avoid using the lowest rating value zero (Figure 4, after training). This may cause shifts in ratings and needs to be tested. The Friedman test for differences between raters was significant before training (p < .001) as well as after training (p < .001), which indicated significant shifts in ratings across raters.

The weighted kappa results showed that the agreement before training among R1, R3 and R5 was relatively high, whereas the agreement of R2 and R4 with R1, R3 and R5 ranged from moderate to low, resulting in a mean kappa W value of .506. The kappa results between R1, R3 and R5 did not change substantially after training, but those of R2 and R4 with R1, R3 and R5 did, resulting in a higher mean kappa W value of .682 (Table 1). The gamma results showed a similar pattern, with higher association between R1, R3 and R5 and lower association of R2 and R4 with R1, R3 and R5 (Table 1).


Figure 3. Agreement bias for simulation with shift in ratings.

Figure 4. Frequency polygon of ratings before and after training for each rater.

Bootstrap estimates for the mean of all two-rater combinations of weighted kappa and gamma values before training were .451 ± .103 and .600 ± .072, respectively, with .668 ± .059 for Kendall's W, .420 ± .079 for Krippendorff's α, .463 ± .083 for ICC3 and .479 ± .082 for ICC4. After training, the bootstrap results for the mean weighted kappa and mean gamma were .659 ± .05 and .774 ± .039, respectively, with .800 ± .032 for Kendall's W and .665 ± .046 for Krippendorff's α; the full set of after-training estimates is shown in Figure 5.
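The bootstrap just described can be sketched as below. The paper does not spell out the resampling unit, so this sketch resamples objects (pictures) with replacement and recomputes the mean over all rater pairs, with 250 replications as in Figure 5; `pair_statistic` can be any of the pairwise measures sketched earlier (e.g. weighted kappa or gamma).

```python
import numpy as np
from itertools import combinations

def bootstrap_mean_pairwise(ratings, pair_statistic, n_boot=250, seed=1):
    """Bootstrap mean and SD of a pairwise agreement measure averaged over all rater pairs.

    ratings: (n objects x k raters) array; pair_statistic(x, y) returns the
    measure for two rating vectors. Objects are resampled with replacement.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    n, k = ratings.shape
    pairs = list(combinations(range(k), 2))

    estimates = []
    for _ in range(n_boot):
        sample = ratings[rng.integers(0, n, size=n)]          # resample objects
        estimates.append(np.mean([pair_statistic(sample[:, i], sample[:, j])
                                  for i, j in pairs]))
    return np.mean(estimates), np.std(estimates, ddof=1)      # reported as mean +/- SD
```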


Table 1. Values for kappa with squared weights (kappa W ) and gamma coefficient of association (gamma) before and after training for all rater combinations.

                Before training          After training
Rater           Kappa W     Gamma        Kappa W     Gamma
1,2             .391        .658         .675        .799
1,3             .760        .839         .682        .849
1,4             .280        .315         .649        .697
1,5             .791        .845         .741        .813
2,3             .491        .639         .497        .773
2,4             .653        .754         .713        .774
2,5             .374        .644         .683        .776
3,4             .412        .408         .585        .678
3,5             .673        .766         .756        .832
4,5             .240        .264         .842        .818
Mean            .506        .613         .682        .781

Figure 5. Bootstrap results for dermatology data set before and after training (250 replications).

5. Discussion

5.1 Simulation

The simulation results showed that the tests applied to evaluate agreement should be chosen carefully depending on the rating scale and the type of distribution of the ratings. Kendall's W should be used with caution; since this test relies on ranks, it tends to overestimate the agreement, increasingly so with the rating scale, when a continuous true value underlies the measurement and shifts in the data exist. The biases observed for the mean kappa W and Krippendorff's α were close to each other; these two methods tend to underestimate the agreement for almost all settings.

Even though the gamma coefficient of association was not proposed for use in agreement studies, the mean gamma method performed better than the other methods under study for simulations under normal and skewed distribution with lower rating scales (Figures 1 and 2). The mean gamma method should be one of the first choices for lower rating scales with skewed ratings (Figure 2).


ICC2 was comparable to the mean kappa W and Krippendorff's α, indicating that this method tends to underestimate the agreement in all settings with normally distributed and skewed ratings, especially in the case of a three-point scale, where it should be avoided.

The ICC4 method relies on the evaluation of σ_e² using the Z_i with a maximization procedure. A three-point scale might be too short to carry sufficient information for the accurate evaluation of σ_e² and σ_p². Similarly, applying ICC2 to three-point scaled data is not recommended. Thus, ICC4 should be avoided in agreement studies with rating scales <5, where it tends to highly overestimate the agreement. ICC4 should be one of the first choices in studies with rating scales of five or higher (Figures 1–3), but should be avoided if the sample size is large, for instance higher than 100 with a high number of raters, and bootstrapping is under consideration; in such cases it could take more than 24 h of computer time on an i7-2600 CPU at 3.4 GHz.

ICC3 seems to be the first method to apply in agreement studies with ordinal data, larger sample sizes and rating scales <5. The bias observed for this method did not change between lower and higher agreement settings, which is important in cases where the researcher does not have any information on the expected agreement.

5.2 Case study: dermatology dataset

The simulation study showed that Kendall's W and the mean gamma tend to overestimate the agreement in settings with shifts in ratings. Figure 4 and the Friedman test showed that there were shifts in the ratings before as well as after training. After training, the agreement among raters increased for all applied methods. This increase might be related to R5, who started using the lowest rating value zero after training, resulting in higher agreement of R5 with R2 and R4. Table 1 might be useful in deciding which raters agree better or who benefited from training. The bootstrap results given in Figure 5 indicated that comparable results were observed for the mean kappa W, Krippendorff's α, ICC3 and ICC4 methods, but relatively high estimates were observed for Kendall's W and the mean gamma, which is similar to the findings from our simulation study with a seven-point rating scale and shifts in ratings (Figure 5). This shift in ratings should be taken into account, and tests using ranks should be avoided.

The best-performing methods in this simulation study were ICC3 followed by ICC4 for the normal distribution case, and the reverse for the skewed distribution case. It should be kept in mind that ICC4 is difficult to implement and greatly overestimates the true agreement for rating scales of <5 points. Tables of agreement for all possible two-rater combinations might be useful in detecting raters who rate differently from the remaining group. Our simulation study showed that if the rating scale was five or larger, all methods were comparable except Kendall's W. Hence, researchers conducting multi-rater agreement studies should check their data for skewness and significant shifts in ratings, consider the nature of their ordinal data (i.e. strictly ordinal or underlying a continuous scale), and examine the agreement among all possible rater combinations.

Our study does not cover all available methods for evaluating agreement on an ordinal scale. There are other methods that could be used to evaluate agreement in the multi-rater case on an ordinal rating scale, such as latent trait models and Bayesian approaches. Latent trait models require a large number of subjects and ratings on a relatively short scale because of sparseness problems. De Mast and Van Wieringen [7] proposed an approach in which they used item response theory methods to model and evaluate the repeatability and reproducibility of ordinal classifications. Although this method has some interesting features which deserve attention in further studies, its aim is not directly comparable with that of this study.

Acknowledgement


References

[1] N.R. Abbasi, H.M. Shaw, D.S. Rigel, R.J. Friedman, W.H. McCarthy, I. Osman, A.W. Kopf, and D. Polsky, Early diagnosis of cutaneous melanoma: revisiting the ABCD criteria, Clin. Exp. Dermatol. 30 (2005), p. 707.
[2] A. Agresti, Categorical Data Analysis, Wiley, New York, 1990, p. 367.
[3] D.G. Bonett, Sample size requirements for estimating intraclass correlations with desired precision, Stat. Med. 21 (2002), pp. 1331–1335.
[4] J.A. Cohen, Coefficient of agreement for nominal scales, Educ. Psychol. Meas. 20 (1960), pp. 37–46.
[5] J.A. Cohen, Weighted kappa: nominal scale agreement with provisions for scaled disagreement or partial credit, Psychol. Bull. 70 (1968), pp. 213–220.
[6] J. De Mast and W. Van Wieringen, Measurement system analysis for bounded ordinal data, Qual. Reliab. Eng. Int. 20 (2004), pp. 383–395.
[7] J. De Mast and W. Van Wieringen, Modeling and evaluating repeatability and reproducibility of ordinal classifications, Technometrics 52 (2010), pp. 94–106.
[8] R.A. Fisher, Statistical Methods for Research Workers, 12th ed., Oliver and Boyd, Edinburgh, 1954.
[9] J.L. Fleiss, The Design and Analysis of Clinical Experiments, John Wiley & Sons, New York, 1986.
[10] J.L. Fleiss and J. Cohen, The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability, Educ. Psychol. Meas. 33 (1973), pp. 613–619.
[11] L.A. Goodman and W.H. Kruskal, Measures of association for cross-classifications, I, II, III and IV, J. Am. Stat. Assoc. 49 (1954), pp. 732–764; 54, pp. 123–163; 58, pp. 310–364; and 67, pp. 415–421, respectively.
[12] S. Gunasti, M.K. Mulayim, B. Fettahlioglu, A. Yucel, R. Burgut, Y. Sertdemir, and V.L. Aksungur, Interrater agreement in rating of pigmented skin lesions for border irregularity, Melanoma Res. 18 (2008), pp. 284–288.
[13] M.G. Kendall and B. Babington Smith, The problem of m rankings, Ann. Math. Stat. 10 (1939), pp. 275–287.
[14] K.O. McGraw and S.P. Wong, Forming inferences about some intraclass correlation coefficients, Psychol. Methods 1 (1996), pp. 30–46.
[15] K. Krippendorff, Content Analysis: An Introduction to Its Methodology, Sage, Beverly Hills, CA, 1980, pp. 129–154.
[16] F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores, Addison-Wesley, London, 1968.
[17] R.M. MacKie, Clinical recognition of early invasive malignant melanoma, BMJ 301 (1990), pp. 1005–1006.
[18] E. Maxwell and A.E.G. Pilliner, Deriving coefficients of reliability and agreement for ratings, Br. J. Math. Stat. Psychol. 21 (1968), pp. 105–116.
[19] W. Scott, Reliability of content analysis: the case of nominal scale coding, Public Opin. Q. 17 (1955), pp. 321–325.
[20] D.J. Sheskin, The Handbook of Parametric and Nonparametric Statistical Procedures, Chapman & Hall/CRC, Boca Raton, FL, 2007.
[21] P.E. Shrout, Measurement reliability and agreement in psychiatry, Stat. Methods Med. Res. 7 (1998), pp. 301–317.
[22] P.E. Shrout and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability, Psychol. Bull. 86 (1979), pp. 420–427.
[23] S. Siegel, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill International, Tokyo, 1956.
[24] S. Siegel and N.J. Castellan Jr., Nonparametric Statistics for the Behavioral Sciences, 2nd ed., McGraw-Hill, New York, 1988.
[25] R. Spitzer, J. Cohen, J.L. Fleiss, and J. Endicott, Quantification of agreement in psychiatric diagnosis, Arch. Gen. Psychiatry 17 (1967), pp. 83–87.

Figures

Figure 1. Agreement bias for simulation under normal distribution.
Figure 2. Agreement bias for simulation under skewed distribution.
