

Published online: 25 Nov 2013.

To cite this article: İlker Ünal & H. Refik Burgut (2014) Verification bias on sensitivity and specificity measurements in diagnostic medicine: a comparison of some approaches used for correction, Journal of Applied Statistics, 41:5, 1091-1104, DOI: 10.1080/02664763.2013.862217
To link to this article: http://dx.doi.org/10.1080/02664763.2013.862217


Vol. 41, No. 5, 1091–1104, http://dx.doi.org/10.1080/02664763.2013.862217

Verification bias on sensitivity and specificity measurements in diagnostic medicine: a comparison of some approaches used for correction

İlker Ünal a,b,* and H. Refik Burgut a

a Department of Biostatistics, School of Medicine, Çukurova University, Balcali 01330, Saricam, Adana, Turkey; b Department of Biostatistics, Faculty of Medicine, İzmir University, Gursel Aksel Bulv. No: 14, Uckuyular 35350, İzmir, Turkey

(Received 3 October 2012; accepted 31 October 2013)

Verification bias may occur when the test results of not all subjects are verified by a gold standard. The correction for this bias can be made using different approaches, depending on whether the missing gold standard test results are missing at random or not. Some of these approaches for binary test and gold standard results include the correction method by Begg and Greenes, lower and upper limits for diagnostic measurements by Zhou, the logistic regression method, the multiple imputation method, and neural networks. In this study, all these approaches are compared by employing real and simulated data under different conditions.

Keywords: verification bias; Begg and Greenes correction; multiple imputation; MAR; NMAR

1. Introduction

The diagnosis of a disease depends on many factors such as medical history, physical examination, existence of risk factors, and, most significantly, the result of diagnostic test(s). Using this information, together with experience, the doctor makes a decision about the status of the patient (disease either present or absent). This final clinical decision is generally based on the result of a perfect or an imperfect gold standard test.

In many clinical studies, the gold standard test, which verifies the true disease status, might be so expensive and/or invasive that many patients are reluctant to undergo it. Hence, new tests which are cost effective and easily applied need to be developed and their accuracy estimated. The diagnostic accuracy of a test is most commonly measured by its sensitivity and specificity [24] and can be obtained from information about the patients' status, which is generally the result of a gold standard test. However, in some studies, some patients undergoing a new test may not have

*Corresponding author. Email: ilkerunal@yahoo.com

© 2013 Taylor & Francis


their status verified by a gold standard. Usually, the patients whose status is not verified represent not a random sample but rather a selected group [26]. For example, if a gold standard test is based on invasive surgery, patients with negative test results would be less likely to receive the gold standard evaluation than patients with positive test results. When this occurs in studies designed to evaluate the accuracy of new diagnostic tests, the estimated sensitivity will often be higher and the estimated specificity lower than the true values; the resulting bias is called verification bias.

The verification bias can be corrected by using different approaches. These approaches differ according to whether the patients whose status is not verified by the gold standard test are randomly selected or not. For randomly selected patients with binary test and gold standard test results, the approach proposed by Begg and Greenes [1] and the so-called multiple imputation (MI) correction by Harel and Zhou [8] can be applied. However, the approach by Harel and Zhou [8] was challenged by Hanley et al. [7] and de Groot et al. [3,4]. For non-randomly selected patients, Zhou proposed a correction for the verification bias [25]. Regardless of the patient selection mechanism, Kosinski and Barnhart [9] proposed logistic regression (LR) models based on a likelihood approach for the estimation of the sensitivity and the specificity of a diagnostic test. Martinez et al. [10], Buzoianu and Kadane [2], and also Pennello [11] proposed a Bayesian approach to adjust for the verification bias in a diagnostic test evaluation.

Many approaches for correcting the verification bias have been compared with each other in order to specify their advantages and/or disadvantages. Harel and Zhou [8] compared five MI methods with the Begg and Greenes (BG) correction method and concluded that the BG correction method underestimates the sensitivity and overestimates the specificity compared with the MI methods. However, Hanley et al. [7] commented on the conclusions reached by Harel and Zhou, raising considerable doubts about the strong conclusions on the superiority of the MI methods for this type of application. Afterwards, de Groot et al. [3] showed that the BG method leads to results similar to those of MI, but they still recommend that additional research is needed to better understand which correction methods should be preferred in various missing-data scenarios for the gold standard test in diagnostic medicine. The aim of this study is to provide an answer to the need for better approaches to the problem of verification bias. The correction methods proposed recently have mainly focused on treating the verification bias problem as a missing data problem. In this study, the well-known method of neural networks (NNs) is used and compared with the other known methods for correcting the verification bias. The comparisons were made by using the BG correction, the lower and upper limits for diagnostic measurements by Zhou, and some imputation techniques, namely the MI, LR, and NNs approaches. Simulated data, obtained under a variety of experimental conditions (changing missing mechanism, sample size, varying coefficient of determination, and changing ratio of missing values), and real data from the field of nuclear medicine were both employed in the comparison of the approaches under consideration.

As a motivating example, a data set from the field of nuclear medicine imaging was used. The data, from a study by Sukan et al. [21], were used in the diagnosis of primary and secondary hyperparathyroidism. Primary hyperparathyroidism (pHPT) is a generalized disorder of calcium (Ca), phosphate (P), and bone metabolism caused by an increased secretion of parathyroid hormone. Secondary hyperparathyroidism (sHPT) is usually seen with renal failure owing to various causes, such as osteomalacia, familial hypocalciuric hypocalcaemia, and lithium therapy [21]. These two diseases can be diagnosed by using nuclear medicine imaging methods such as ultrasound (US) or 99mTc methoxyisobutylnitrile (MIBI) parathyroid scintigraphy. In that study, the operative procedure in combination with the results of the histopathological evaluation was considered the gold standard. Because the gold standard is an invasive test, not all patients had gold standard test results, which leads to verification bias in the accuracy evaluation.


The framework of this paper is as follows: the following section highlights methods for correcting the verification bias with respect to whether the verification process is missing at random (MAR) or not, and describes the imputation methods which have been proposed under the condition that significant covariates exist. The results based on simulated and real data are given in Sections 3 and 4, respectively. Section 5 contains some concluding remarks. Section 6 highlights the limitations of the study, and finally, the appendix describes the data simulation.

2. Considered methods

2.1 Correction for a single binary scale test

To handle verification bias problem, those cases that do not receive the gold standard test can be regarded as missing (disease status). The ‘missingness’ is characterized by the conditional distribution of the missing data given complete data and/or some unknown parameters. If the missingness (in our study, selecting a patient for disease verification) does not depend on the values of the observed and unobserved data, the data are called missing completely at random (MCAR). If the mechanism depends on the observed data but not on the unobserved data, then the data are called MAR. Finally, if the mechanism depends only on the unobserved data, the data are called not missing at random (NMAR) [14].

The bias-correction methods for estimating the sensitivity and specificity can be given with and without the MAR assumption. When the mechanism is MCAR, the 'complete case' sensitivity and specificity are unbiased. If the mechanism is MAR, that is, if the selection for verification depends on the test result or on other measured covariates, the MAR-corrected sensitivity and specificity proposed by Begg and Greenes [1] and MI may be used. However, if the selection for verification depends only on the unobserved data (i.e. the missing disease status), some imputation methods, including the LR approach, have been proposed in the literature. The details are given in the following sections.

2.2 Correction methods with the MAR assumption

In a study, if all patients undergo a new test but only some receive the gold standard test, the observed data will be as in Table 1.

Using the data in Table 1, Begg and Greenes [1] proposed a bias-correction method under the condition that the verification process is MAR, and gave the maximum likelihood (ML) estimators for the sensitivity and specificity, using the cell counts in Table 1, as

$$\hat{Se} = \frac{m_1 s_1/[N(s_1 + r_1)]}{m_0 s_0/[N(s_0 + r_0)] + m_1 s_1/[N(s_1 + r_1)]},$$

Table 1. Observed data for a single binary scale test.

                                               Result of new test
                                               T = 1      T = 0
  Result of gold standard test    D = 1         s1          s0
                                  D = 0         r1          r0
  Not verified patients                         u1          u0
  Total                                         m1          m0


$$\hat{Sp} = \frac{m_0 r_0/[N(s_0 + r_0)]}{m_0 r_0/[N(s_0 + r_0)] + m_1 r_1/[N(s_1 + r_1)]},$$

and their variance estimators as

$$\widehat{\mathrm{Var}}(\hat{Se}) = [\hat{Se}(1 - \hat{Se})]^2 \left\{ \frac{N}{m_0 m_1} + \frac{r_1}{s_1(s_1 + r_1)} + \frac{r_0}{s_0(s_0 + r_0)} \right\},$$

$$\widehat{\mathrm{Var}}(\hat{Sp}) = [\hat{Sp}(1 - \hat{Sp})]^2 \left\{ \frac{N}{m_0 m_1} + \frac{s_1}{r_1(s_1 + r_1)} + \frac{s_0}{r_0(s_0 + r_0)} \right\},$$

where N is the total number of patients.
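For concreteness, the following is a minimal R sketch of the Begg and Greenes correction computed directly from the Table 1 cell counts; the counts themselves are hypothetical and serve only to illustrate the arithmetic.

```r
# Begg-Greenes corrected sensitivity and specificity from Table 1 counts
# (hypothetical cell counts for illustration)
s1 <- 45; s0 <- 10    # verified, diseased, with positive / negative new test
r1 <- 15; r0 <- 80    # verified, non-diseased, with positive / negative new test
u1 <- 30; u0 <- 120   # unverified, with positive / negative new test
m1 <- s1 + r1 + u1    # all patients with a positive new test
m0 <- s0 + r0 + u0    # all patients with a negative new test
N  <- m1 + m0

se_hat <- (m1 * s1 / (N * (s1 + r1))) /
          (m0 * s0 / (N * (s0 + r0)) + m1 * s1 / (N * (s1 + r1)))
sp_hat <- (m0 * r0 / (N * (s0 + r0))) /
          (m0 * r0 / (N * (s0 + r0)) + m1 * r1 / (N * (s1 + r1)))

var_se <- (se_hat * (1 - se_hat))^2 *
          (N / (m0 * m1) + r1 / (s1 * (s1 + r1)) + r0 / (s0 * (s0 + r0)))
var_sp <- (sp_hat * (1 - sp_hat))^2 *
          (N / (m0 * m1) + s1 / (r1 * (s1 + r1)) + s0 / (r0 * (s0 + r0)))

round(c(Se = se_hat, Sp = sp_hat, sd_Se = sqrt(var_se), sd_Sp = sqrt(var_sp)), 3)
```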

2.2.1 Multiple imputation

Imputation, the practice of 'filling in' missing data with plausible values, is an attractive approach for analyzing incomplete data, since it apparently solves the missing data problem at the beginning of the analysis. However, a naive or unprincipled imputation method may create more problems than it solves, distorting estimates, standard errors, and hypothesis tests, as documented by Rubin [14] and others.

The MI is a Monte Carlo technique in which the missing data values are replaced by m simulated versions, where m is larger than one and typically small (e.g. m = 3–10). In Rubin's method for 'repeated imputation' inference, each of the simulated complete data sets is analyzed by standard methods, and the results are combined to produce estimates and confidence intervals that incorporate missing-data uncertainty. Rubin [14] addresses the potential uses of MI primarily for large public-use data files from sample surveys and censuses. With the advent of new computational methods and software for creating MIs, however, the technique has become increasingly attractive for researchers in the biomedical, behavioral, and social sciences, where investigations are often hindered by missing data.

The question of how to obtain valid inferences from imputed data was addressed in Rubin's book [14] on MI, which gives the full mathematical definition of a proper MI: let Q be a scalar quantity of interest to be estimated and suppose that, if there were no nonresponse, the inference for Q would be based on the statement that $Q - \hat{Q} \sim N(0, U)$, where $\hat{Q}$ and U are statistics giving the estimate of Q and the variance of $Q - \hat{Q}$, respectively. Supposing that the data can be separated into X, all observed covariates, and $Y = (Y_{obs}, Y_{mis})$, the observed and missing values, $\hat{Q}$ and U can be obtained by using the imputed $Y_{mis}$ together with $Y_{obs}$. For $j = 1, \ldots, m$ imputations, as m tends to infinity the averages satisfy $E(\bar{Q} \mid X, Y) = \hat{Q}$ and $E(\bar{U} \mid X, Y) = U$, while the between-imputation variance satisfies $E(B \mid X, Y) = \mathrm{Var}(\bar{Q} \mid X, Y)$ for large m. Rubin [14] used Bayesian arguments; however, it has been shown that well-calibrated inferences can also be obtained from a frequentist standpoint [15–18]. Therefore, it is a good idea to use a Bayesian procedure for the imputation stage and frequentist procedures for the analysis stage. In this study, we have followed the schema proposed by Harel and Zhou [8]. In that schema, the authors used the MI technique without any covariates. We have additionally tested the effect of covariates on the correction for the verification bias when MI is used. To accomplish this, three simulated continuous variables were converted separately into three dichotomous variables with equal frequency, giving five binary variables in total for this procedure. Data on the gold standard and the new test were split into two subgroups according to the value of each binary covariate. We then ran the MI procedure for all subgroups, ending up with five different values assigned to each missing gold standard result. Among these five assigned values, the most frequent one was taken as the final value for the missing gold standard.


2.3 Correction methods without the MAR assumption

If the verification process depends on unobserved variables, the verification process will not be MAR. This situation is most likely to occur in studies with one or more of the following features: a long lag time between the initial test and the verification, multiple investigators from various institutions, a very heterogeneous patient population, or a poorly understood disease process [27].

When the verification process is not MAR, the lower and upper bounds of the ML estimators for sensitivity and specificity can be obtained using the approach proposed by Zhou. However, the range between these bounds is usually wide, and consequently there is no clear information about the diagnostic accuracy.

If patients' characteristics and other information related to diagnosis are observed, one can use them to estimate the missing gold standard test results by employing imputation approaches such as LR, NNs, or MI.

2.3.1 ML estimates by Zhou

In 1993, Zhou proposed a general ML method for estimating sensitivity and specificity of a test without the MAR assumption. He tried to model the verification process to get inferences about the test’s sensitivity and specificity.

Without the MAR assumption, the likelihood function is more complex and contains conditional probabilities (λ) defined as follows:

- λ00 is the probability of selecting a patient with a negative test result who is not diseased.
- λ01 is the probability of selecting a patient with a positive test result who is not diseased.
- λ10 is the probability of selecting a patient with a negative test result and positive disease status.
- λ11 is the probability of selecting a patient with a positive test result and positive disease status.

By setting et = λ1t/λ0t and using the cell counts in Table 1, the log-likelihood function, l, can be given as follows:

$$l = \sum_{t=0}^{1} m_t \log \phi_{1t} + \sum_{t=0}^{1} \big\{ s_t \log(e_t \lambda_{0t} \phi_{2t}) + r_t \log[\lambda_{0t}(1 - \phi_{2t})] + u_t \log[(1 - e_t \lambda_{0t})\phi_{2t} + (1 - \lambda_{0t})(1 - \phi_{2t})] \big\},$$

where φ1t = P(T = t) and φ2t = P(D = 1 | T = t).

The ML estimators for sensitivity and specificity can be obtained under the assumption that e0 and e1 are known, and can be given as

$$\hat{Se}(e_0, e_1) = \frac{s_1 m_1/(s_1 + e_1 r_1)}{s_1 m_1/(s_1 + e_1 r_1) + s_0 m_0/(s_0 + e_0 r_0)} \quad \text{and} \quad \hat{Sp}(e_0, e_1) = \frac{e_0 r_0 m_0/(s_0 + e_0 r_0)}{e_1 r_1 m_1/(s_1 + e_1 r_1) + e_0 r_0 m_0/(s_0 + e_0 r_0)}.$$

If e0 = e1 = 1, the verification process is MAR and the previous ML estimators for sensitivity and specificity can be used (Section 2.2). In general, e0 and e1 cannot be estimated from the observed data; however, the observed data can be used to obtain lower and upper bounds for e0 and e1 as

$$\frac{s_1}{s_1 + u_1} \le e_1 \le \frac{r_1 + u_1}{r_1} \quad \text{and} \quad \frac{s_0}{s_0 + u_0} \le e_0 \le \frac{r_0 + u_0}{r_0}.$$

Using these bounds, one can calculate the bounds for the ML estimators of sensitivity and specificity [25].
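A minimal R sketch of these bounds is given below, reusing the hypothetical cell counts from the Begg and Greenes sketch in Section 2.2; because the ML estimators are monotone in e0 and e1, evaluating them at the corner values of the intervals traces out the lower and upper bounds.

```r
# Zhou-type ML estimators for given (e0, e1), using the earlier hypothetical counts
zhou_est <- function(e0, e1) {
  se <- (s1 * m1 / (s1 + e1 * r1)) /
        (s1 * m1 / (s1 + e1 * r1) + s0 * m0 / (s0 + e0 * r0))
  sp <- (e0 * r0 * m0 / (s0 + e0 * r0)) /
        (e1 * r1 * m1 / (s1 + e1 * r1) + e0 * r0 * m0 / (s0 + e0 * r0))
  c(Se = se, Sp = sp)
}

# bounds on e0 and e1 implied by the observed counts
e1_lo <- s1 / (s1 + u1); e1_hi <- (r1 + u1) / r1
e0_lo <- s0 / (s0 + u0); e0_hi <- (r0 + u0) / r0

# evaluate the estimators at all corner combinations and take the ranges
corners <- expand.grid(e0 = c(e0_lo, e0_hi), e1 = c(e1_lo, e1_hi))
est <- t(apply(corners, 1, function(e) zhou_est(e["e0"], e["e1"])))
apply(est, 2, range)   # lower and upper bounds for the corrected Se and Sp
```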

2.3.2 Logistic regression

In 2003, Kosinski and Barnhart proposed a general likelihood-based regression approach which can accommodate various forms of the missing data mechanism and allows the use of categorical and continuous covariates [9]. They assumed that p covariates are observed for all patients undergoing the diagnostic test. Their likelihood function based on the observed data is

$$L_{obs} = \prod_{i=1}^{N} P(V_i, T_i, D_i \mid x_i)^{V_i}\, P(V_i, T_i \mid x_i)^{1 - V_i},$$

where Vi shows whether the disease status of the ith patient is verified or not, Ti is the test result of the ith patient, Di the disease status of the ith patient, and xi the covariates of the ith patient.

Using this likelihood function they have modeled the missing data mechanism P(V|D, T, x) and parameterized the components of likelihood with LR models as

Disease component: logit P(Di = 1 | xi) = z0i α,
Diagnostic test component: logit P(Ti = 1 | Di, xi) = z1i β, and
Missing data mechanism component: logit P(Vi = 1 | Di, Ti, xi) = z2i γ,

where logit(p) = log[p/(1 − p)], θ = (α, β, γ) is the vector of parameters, and the vector zmi denotes the ith row (i = 1, . . . , N) of the design matrix formed for the mth logistic model (m = 0, 1, 2), with a choice of Di, Ti, xi (and possibly their interactions or transformations).

Finally, employing the Expectation–Maximization (EM) algorithm, they obtained the ML estimator of θ = (α, β, γ). After obtaining the parameter estimates, and using the probability P(Ti = 1 | Di, xi), they calculated the estimates for sensitivity and specificity. The details can be found in their article [9].
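To make the structure of this approach concrete, the sketch below implements a simplified EM-style version of it in R with weighted logistic regressions standing in for the three likelihood components; the data frame dat (columns D, Tnew, x, with D = NA for unverified patients), the model formulas, the fixed number of iterations, and the function name em_correct are illustrative assumptions rather than the authors' exact implementation.

```r
# Simplified EM-style sketch of the likelihood-based correction (illustrative only)
em_correct <- function(dat, n_iter = 50) {
  ver <- dat[!is.na(dat$D), ]; ver$V <- 1; ver$w <- 1
  unv <- dat[is.na(dat$D), ];  unv$V <- 0
  # augment: each unverified patient appears once with D = 0 and once with D = 1
  aug <- rbind(ver, transform(unv, D = 0, w = 0.5), transform(unv, D = 1, w = 0.5))
  for (it in 1:n_iter) {
    # M-step: weighted logistic regressions for the three likelihood components
    fit_D <- glm(D    ~ x,            family = quasibinomial, data = aug, weights = w)
    fit_T <- glm(Tnew ~ D + x,        family = quasibinomial, data = aug, weights = w)
    fit_V <- glm(V    ~ D + Tnew + x, family = quasibinomial, data = aug, weights = w)
    # E-step: P(V = 0, T, D = d | x) for each unverified patient and d = 0, 1
    lik <- function(d) {
      nd <- transform(unv, D = d)
      pD <- predict(fit_D, nd, type = "response")
      pD <- if (d == 1) pD else 1 - pD
      pT <- predict(fit_T, nd, type = "response")
      pT <- ifelse(nd$Tnew == 1, pT, 1 - pT)
      pV <- 1 - predict(fit_V, nd, type = "response")  # P(V = 0 | D = d, T, x)
      pD * pT * pV
    }
    post <- lik(1) / (lik(1) + lik(0))                 # posterior P(D = 1 | V = 0, T, x)
    aug$w <- c(rep(1, nrow(ver)), 1 - post, post)
  }
  # accuracy: average P(T | D, x) over the estimated disease distribution
  pD1 <- predict(fit_D, dat, type = "response")
  pT1 <- predict(fit_T, transform(dat, D = 1), type = "response")
  pT0 <- predict(fit_T, transform(dat, D = 0), type = "response")
  c(Se = sum(pT1 * pD1) / sum(pD1), Sp = sum((1 - pT0) * (1 - pD1)) / sum(1 - pD1))
}
```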

2.3.3 Neural networks

NNs, also known as 'parallel distributed processors', are computational methodologies performing multi-factorial analysis. These computational methods have particular properties such as the ability to adapt or learn, to generalize, to cluster, or to organize data, and their operations are based on parallel processing [13]. NNs are designed by researchers from many scientific disciplines to solve a variety of problems in pattern recognition, prediction, forecasting, optimization, clustering, and categorizing [12].

In many studies, NNs have been proposed as an alternative method to LR [5,19,20,23]. In some studies, one method has been found to be superior to the other, and in some both have given similar results; therefore, the two methods are still comparable in many areas where applicable. An artificial NN commonly has three layers: a layer of input neurons, a hidden layer, and a layer of output neurons. These layers are connected via synapses which store parameters called 'weights' that manipulate the data in the calculations. In the hidden and output layers, a network function f(x), a composition of other functions, converts the input to the output. A widely used type of composition is the nonlinear weighted sum, where the sigmoid (logistic) function is commonly used as the activation function [13].

For correcting the verification bias, Kosinski and Barnhart [9] proposed the LR approach. Based on the idea proposed by Kosinski and Barnhart, we have constructed three different networks that include all components (disease, diagnostic test, and missing mechanism) of the likelihood function. Using the sigmoid function as the activation function, yk can be given as

For the disease component (D): yk = φ(wDh φ(z0i wDi)),
For the diagnostic test component (T): yk = φ(wTh φ(z1i wTi)), and
For the missing data mechanism component (M): yk = φ(wMh φ(z2i wMi)),

where the yk are the outputs for each case, φ is the sigmoid function, zmi is the vector of covariates of the ith case and mth model with a choice of Di, Ti, xi, the weights wi are the network weights between the input and hidden layers, and the weights wh are the synaptic weights between the hidden and output layers. We have used the network weights wi as regression coefficients in a similar way as in Kosinski and Barnhart's study.

We have created a pseudo-data set formed by taking the initial N − U rows as observations for patients with verified disease status, the following U rows as observations for patients with unverified disease status with their status set to negative, and the last U rows as observations for patients with unverified disease status with their status set to positive. We have defined a new variable, V, assigning the first N − U rows the value 1 (verified case) and the last 2U rows the value 0 (unverified case).

Following the idea proposed by Kosinski and Barnhart, and using the network weights between the input and hidden layers, we have defined a parameter pk as

$$p_k = \prod_{m=0}^{2} \{p_{mk}\}^{y_{mk}} \{1 - p_{mk}\}^{1 - y_{mk}},$$

with p0k = {1 + exp(−vDk wDi)}−1, p1k = {1 + exp(−vTk wTi)}−1, and p2k = {1 + exp(−vMk wMi)}−1, and used it to get the case weights, wk, as follows:

$$w_k = \begin{cases} 1, & \text{for } k = 1, \ldots, N - U, \\ p_k/\{p_k + p_{k+U}\}, & \text{for } k = N - U + 1, \ldots, N, \\ 1 - w_{k-U}, & \text{for } k = N + 1, \ldots, N + U. \end{cases}$$

For each epoch, we compute the case weights and then use them in the next epoch to obtain the network weights. The procedure stops when a stopping criterion is fulfilled (i.e. reaching a predefined number of epochs or a small enough error rate). A sketch of this procedure is given below.
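The sketch below uses the nnet package to mirror this scheme: the pseudo-data are built as described, three single-hidden-layer networks are fitted with the current case weights, and the weights of the duplicated unverified rows are updated from the fitted component probabilities. The data frame dat (columns D, Tnew, x, with D = NA for unverified patients), the network size, the number of epochs, and the final weighted-count estimates are illustrative assumptions, and the update rule is a simplified reading of the scheme rather than the exact implementation used in the paper.

```r
# NN-based correction sketch with iteratively updated case weights (illustrative)
library(nnet)

nn_correct <- function(dat, hidden = 2, n_epoch = 20) {
  ver <- dat[!is.na(dat$D), ]; unv <- dat[is.na(dat$D), ]; U <- nrow(unv)
  # pseudo-data: verified rows, then unverified rows as D = 0, then the same rows as D = 1
  aug <- rbind(transform(ver, V = 1),
               transform(unv, D = 0, V = 0),
               transform(unv, D = 1, V = 0))
  aug$w <- c(rep(1, nrow(ver)), rep(0.5, 2 * U))
  idx0 <- nrow(ver) + seq_len(U)       # unverified rows coded D = 0
  idx1 <- nrow(ver) + U + seq_len(U)   # the same patients coded D = 1
  for (epoch in 1:n_epoch) {
    fit_D <- nnet(D    ~ x,            data = aug, weights = w, size = hidden, trace = FALSE)
    fit_T <- nnet(Tnew ~ D + x,        data = aug, weights = w, size = hidden, trace = FALSE)
    fit_V <- nnet(V    ~ D + Tnew + x, data = aug, weights = w, size = hidden, trace = FALSE)
    # contribution p_k of each pseudo-row, built from the three fitted components
    pD <- as.vector(predict(fit_D, aug)); pT <- as.vector(predict(fit_T, aug))
    pV <- as.vector(predict(fit_V, aug))
    pk <- (pD^aug$D * (1 - pD)^(1 - aug$D)) *
          (pT^aug$Tnew * (1 - pT)^(1 - aug$Tnew)) *
          (pV^aug$V * (1 - pV)^(1 - aug$V))
    aug$w[idx1] <- pk[idx1] / (pk[idx0] + pk[idx1])  # weight of the D = 1 copy
    aug$w[idx0] <- 1 - aug$w[idx1]                   # weight of the D = 0 copy
  }
  # one simple way to turn the final weights into corrected accuracy estimates
  c(Se = with(aug, sum(w * (Tnew == 1 & D == 1)) / sum(w * (D == 1))),
    Sp = with(aug, sum(w * (Tnew == 0 & D == 0)) / sum(w * (D == 0))))
}
```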

3. Simulation study

3.1 Study design

In this section, some of the different approaches for correcting the verification bias are compared by employing simulated data under a variety of experimental conditions: changing the sample size, varying the coefficient of determination, or changing the ratio of missing values. A gold standard test result (D), a new test result (T), two binary variables (B1 and B2) related to the gold standard test, two correlated continuous variables (C1 and C2) related to the gold standard test, and a continuous noise variable (C3), uncorrelated with any of the other variables, were generated using a multivariate normal distribution. We generated two overlapping distributions for the diseased and healthy populations. We changed the parameters of the multivariate normal distribution to provide different values of the coefficient of determination, i.e. the variation explained by the fitted model for the disease component in the LR approach. Two columns of the generated data were categorized into two groups to obtain the binary covariates named B1 and B2.


Simulated data were formed in two different sizes: a large sample (1000 cases) and a small sample (100 cases). In order to measure the effect of missing rates on the estimates, three different missing rates (30%, 50%, and 80%) were set. Coefficients of determination (R2) in the following ranges were considered: 0.6–0.7, 0.4–0.5, 0.2–0.3, and 0.07–0.15. The purpose of such a simulation was to measure the performance of the imputation methods as R2 changes. When R2 was not varied in the MAR settings, its value was set between 0.4 and 0.5. After simulating the data, the gold standard test results of some cases were assigned as missing; for example, in the MAR case, the gold standard test results of some cases with a negative test result and a zero value of the B1 covariate were deleted (this scenario is similar to scenario B in the article by de Groot et al. [4]). In the NMAR case, however, the conditional probabilities of selecting an individual with a positive/negative test outcome for disease verification were changed. We assigned the conditional probabilities as

p(V = 1 | T = 1, D = 1) = 0.15,  p(V = 1 | T = 0, D = 1) = 0.8,
p(V = 1 | T = 1, D = 0) = 0.1,   p(V = 1 | T = 0, D = 0) = 0.15.

Thus the missingness depended on the unobserved data (i.e. the gold standard test result) and hence the mechanism was not MAR. All correction methods were then run to get the estimates of the sensitivity and specificity, and the difference between these estimates and the real values of sensitivity and specificity was recorded. This procedure was repeated 500 times. The details of the procedure are given in Appendix 1.
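For illustration, the following short R sketch draws the verification indicator V from these conditional probabilities; the vectors D and Tnew are assumed to hold the simulated gold standard and new test results.

```r
# NMAR verification mechanism used in the simulations (sketch; D and Tnew assumed simulated)
p_verify <- function(Tnew, D) {
  ifelse(Tnew == 1 & D == 1, 0.15,
  ifelse(Tnew == 0 & D == 1, 0.80,
  ifelse(Tnew == 1 & D == 0, 0.10, 0.15)))
}
V     <- rbinom(length(D), 1, p_verify(Tnew, D))  # 1 = verified, 0 = unverified
D_obs <- ifelse(V == 1, D, NA)                    # gold standard observed only if verified
```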

In this work, for the correction of verification bias, the Kosinski and Barnhart [9] approach was applied by: (a) applying the full model containing the variables that were previously known to be related to the disease, and (b) using the significant model containing the variables shown to be significant in the analysis of the generated data. The Generalized Linear Model (GLM) procedure in R was used for all the LR analyses. For the NNs approach, three single-hidden-layer NNs were constructed by changing the number of nodes in the hidden layer (1 or 2 nodes) and changing the variables in the defined full and significant models. When the number of nodes in the hidden layer was 2, the network weights used for correcting the verification bias were obtained by averaging the weights of the different nodes. The NNET library was used for the NN analysis in R, and the 'Multiple Imputation by Chained Equations' (MICE) package, a library distributed for S-Plus and R [22], was used for MI.

Table 2. The absolute mean difference (×10−3) and standard deviation (×10−3) for estimations of the sensitivity and specificity assuming MAR.

                                              Sample size
                                      Small (n = 100)          Large (n = 1000)
Missing rate  Approaches              Sens        Spec         Sens        Spec
              Real value            739 ± 38    735 ± 37      743 ± 9     748 ± 8
              Naïve estimator       804 ± 22    652 ± 46      815 ± 5     647 ± 15
30%           BG                     21 ± 14     22 ± 14        6 ± 4       6 ± 5
              MI                     27 ± 20     26 ± 21        8 ± 7       8 ± 6
              MI with covariates     20 ± 19     21 ± 19        6 ± 6       7 ± 5
50%           BG                     41 ± 32     39 ± 30       14 ± 11     14 ± 11
              MI                     54 ± 36     52 ± 38       18 ± 12     17 ± 11
              MI with covariates     42 ± 33     41 ± 31       14 ± 11     15 ± 12
80%           BG                     90 ± 70     91 ± 80       24 ± 21     26 ± 22
              MI                     97 ± 75    101 ± 81       34 ± 26     33 ± 26
              MI with covariates     93 ± 72     94 ± 80       27 ± 22     28 ± 21

Table 3. The absolute mean difference (×10−3) and standard deviation (×10−3) for estimations of the sensitivity and specificity assuming NMAR for the large sample.

                                        Ranges of coefficient of determination
                              0.60–0.70          0.40–0.50          0.20–0.30          0.07–0.15
Missing rate  Approaches      Sens   Spec        Sens   Spec        Sens   Spec        Sens   Spec

Real value 855± 10 860± 8 747± 14 752± 12 645± 11 639± 13 580± 15 588± 14

Naïve estimator 924± 11 758± 12 814± 20 658± 18 713± 19 542± 16 655± 22 479± 26

30% Zhou lower bound 127± 2 127± 2 113± 2 113± 2 89± 2 89± 2 75± 2 75± 2

Zhou upper bound 119± 9 29± 3 187± 10 57± 5 292± 10 121± 7 347± 8 167± 8

LR – 1 64± 27 29± 20 80± 41 16± 12 45± 29 41± 11 215± 47 112± 14

LR – 2 11± 4 14± 1 72± 15 11± 3 67± 83 20± 16 92± 4 9± 7

NNs – 1 78± 33 45± 26 104± 62 21± 14 129± 105 42± 16 207± 110 91± 36

NNs – 2 85± 25 43± 26 116± 59 19± 10 165± 108 44± 15 199± 121 85± 38

NNs – 3 103± 21 58± 25 164± 31 26± 8 260± 39 55± 9 281± 55 111± 10

50% Zhou lower bound 313± 8 208± 8 362± 7 260± 8 430± 5 348± 8 463± 4 404± 7

Zhou upper bound 312± 8 208± 8 259± 9 160± 7 164± 8 89± 6 104± 7 52± 4

LR – 1 30± 39 18± 22 144± 68 133± 31 46± 54 159± 70 50± 55 223± 82

LR – 2 6± 5 1± 2 39± 17 12± 27 117± 6 1± 0 160± 10 2± 17

NNs – 1 24± 35 19± 22 53± 63 77± 43 63± 61 153± 78 72± 72 217± 96

NNs – 2 11± 18 12± 13 33± 33 60± 34 57± 54 123± 92 80± 73 168± 127

NNs – 3 10± 34 16± 23 4± 20 80± 18 2± 7 208± 17 2± 7 285± 16

80% Zhou lower bound 125± 6 187± 3 163± 7 205± 2 232± 6 214± 1 278± 6 206± 2

Zhou upper bound 50± 2 50± 2 40± 2 40± 2 25± 1 25± 1 15± 1 15± 1

LR – 1 61± 20 97± 36 114± 10 123± 13 171± 36 113± 18 189± 63 84± 13

LR – 2 49± 34 82± 61 2± 12 18± 12 186± 48 119± 19 235± 54 90± 8

NNs – 1 46± 31 75± 57 113± 16 124± 20 155± 62 104± 28 184± 78 83± 14

NNs – 2 42± 31 68± 58 106± 19 115± 27 119± 79 88± 34 150± 96 77± 14

NNs – 3 50± 34 86± 61 107± 15 114± 20 158± 78 108± 31 188± 103 85± 13

Note: LR – 1, full model; LR – 2, significant model; NNs – 1, full model with one hidden node; NNs – 2, full model with two hidden nodes; NNs – 3, significant model with two hidden nodes.


Table 4. The mean absolute error difference (×10−3) and standard deviation (×10−3) for estimations of the sensitivity and specificity assuming NMAR for small sample.

                                        Ranges of coefficient of determination
                              0.60–0.70          0.40–0.50          0.20–0.30          0.07–0.15
Missing rate  Approaches      Sens   Spec        Sens   Spec        Sens   Spec        Sens   Spec

Real value 847± 11 856± 9 751± 14 755± 12 650± 10 641± 13 590± 13 582± 18

Naïve estimator 918± 13 761± 14 816± 21 661± 19 715± 17 544± 15 661± 21 482± 29

30% Zhou lower bound 125± 9 125± 9 113± 10 113± 10 88± 10 88± 10 75± 9 75± 9

Zhou upper bound 117± 24 31± 10 192± 30 59± 16 290± 29 121± 19 344± 28 166± 26

LR – 1 22± 40 18± 15 34± 24 13± 7 70± 14 13± 5 91± 10 5± 5

LR – 2 15± 8 15± 6 32± 10 13± 4 69± 12 13± 5 91± 10 5± 5

NNs – 1 58± 36 40± 27 89± 64 24± 17 117± 98 37± 26 165± 109 80± 44

NNs – 2 70± 33 41± 26 101± 67 22± 15 144± 110 39± 27 174± 121 79± 46

NNs – 3 103± 30 56± 29 171± 40 26± 17 259± 43 55± 27 300± 41 110± 29

50% Zhou lower bound 316± 27 212± 25 360± 23 258± 27 429± 19 348± 26 461± 11 404± 21

Zhou upper bound 312± 27 208± 25 260± 29 163± 25 162± 26 89± 17 106± 22 53± 13

LR – 1 22± 23 11± 18 44± 26 8± 14 116± 2 8± 20 160± 20 8± 20

LR – 2 18± 13 5± 4 43± 23 6± 4 117± 20 6± 4 160± 17 7± 4

NNs – 1 40± 38 35± 28 62± 54 73± 47 76± 61 159± 74 90± 68 213± 95

NNs – 2 29± 28 31± 24 45± 37 64± 40 66± 51 132± 87 83± 66 185± 114

NNs – 3 12± 9 32± 25 15± 12 78± 39 14± 10 209± 36 17± 15 282± 35

80% Zhou lower bound 441± 41 346± 37 528± 48 418± 37 645± 27 519± 26 672± 14 542± 11

Zhou upper bound 476± 48 368± 40 379± 50 292± 39 229± 38 175± 32 155± 34 113± 26

LR – 1 38± 36 41± 24 74± 39 36± 19 197± 36 26± 18 214± 89 163± 129

LR – 2 34± 25 39± 16 74± 37 35± 12 197± 35 26± 16 268± 29 23± 16

NNs – 1 118± 92 57± 44 114± 73 80± 56 164± 92 178± 100 202± 104 246± 117

NNs – 2 117± 92 51± 42 110± 66 71± 50 138± 78 155± 105 201± 83 201± 139

NNs – 3 177± 74 40± 28 141± 53 85± 48 90± 66 224± 66 156± 59 288± 101

Note: LR – 1, full model; LR – 2, significant model; NNs – 1, full model with one hidden node; NNs – 2, full model with two hidden nodes; NNs – 3, significant model with two hidden nodes.


In the following sections, the simulation results under the MAR and NMAR assumptions are presented.

3.2 Simulation results with MAR assumption

Assuming the missing mechanism is MAR, for different sample sizes and missing rates, the absolute mean difference (bias) between the estimates of the correction methods and the real values of sensitivity and specificity, and their standard deviations, are given in Table 2. As can be seen from Table 2, for both the large and small samples, the biases of the sensitivities and specificities estimated by the different approaches increase with increasing missing rate. Including covariates in the MI models has an effect on the estimation; however, the BG approach gives the smallest biases at all missing rates and sample sizes. Although there is no uniformly superior approach, BG, having no additional requirements and less computational work, seems to be the first choice for correcting the verification bias.

3.3 Simulation results without MAR assumption

Assuming the missing mechanism is not MAR, with different missing rates and the coefficient of determination (R2) in different ranges, the absolute mean difference (bias) between the estimates obtained from the correction methods and the real values of sensitivity and specificity, and their standard deviations, are given in Tables 3 and 4 (for the large and small samples, respectively).

As can be seen from Table 3, for the large sample, Zhou's bounds are highly biased for all the different values of R2; however, they may still be useful when the missing rate is small or large. The biases in both accuracy measures obtained by the LR and NNs approaches increase slightly with decreasing R2. The smallest biases are obtained for the sample with a 50% missing rate and higher R2 values. For the LR approach, the correction with the significant model (LR-2) seems to be superior to the correction with the full model (LR-1). The NNs approaches with two hidden nodes (NN-2 or NN-3) give smaller biases than that with one hidden node (NN-1). Interestingly, the NNs approach using two hidden nodes with the full model (NN-2) gives almost the same results as the NNs approach using two hidden nodes with the significant model (NN-3).

As a result, the biases obtained by using the NNs approaches with two hidden nodes are smaller than those obtained by the LR approach with full model, but they are higher than those obtained by employing LR with the significant model.

The NNs approaches with both full model and significant model give similar results. Therefore, it may not be important which models are used for the NNs approach.

For small samples (Table 4), the levels of bias for all approaches remain unchanged, but the standard deviations increase. With a 50% missing rate, both the LR and NNs approaches have the smallest biases in both accuracy measures.

As a result, Zhou’s approach, because of its high bias, may not be suitable for correction. The NNs approach with two nodes works quite well for correction, whereas the LR approach with significant model works much better in all experimental conditions used in this study.

4. Example

Sukan et al. [21] evaluated the efficacy of dual-phase MIBI parathyroid scintigraphy and US in primary (pHPT) and secondary (sHPT) hyperparathyroidism. In that study, a total of 69 patients who had a histopathology test result were enrolled, while 48 patients without a histopathology test were excluded. For the patients considered, preoperative serum intact parathyroid hormone


Table 5. Results of all methods in real data example.

Methods                Sensitivity      Specificity
Complete case (a)      0.71 ± 0.07      0.87 ± 0.14
Observed result (b)    0.72 ± 0.05      0.93 ± 0.09
BG                     0.75 ± 0.06      0.90 ± 0.04
MI                     0.76 ± 0.06      0.90 ± 0.03
MI with covariates     0.80 ± 0.09      0.89 ± 0.06
Zhou lower bound       0.56 ± 0.04      0.35 ± 0.09
Zhou upper bound       0.87 ± 0.04      0.95 ± 0.05
LR – 1                 0.70 ± 0.07      0.90 ± 0.06
LR – 2                 0.74 ± 0.06      0.85 ± 0.07
NNs – 1                0.74 ± 0.05      0.82 ± 0.08
NNs – 2                0.75 ± 0.05      0.83 ± 0.08
NNs – 3                0.75 ± 0.06      0.85 ± 0.07

Notes: The mean values and standard deviations of the estimates, based on 1000 bootstrap replications, are given.
(a) Excludes unverified subjects.
(b) Includes unverified subjects with a negative test result assumed.

levels, calcium (Ca), phosphate (P), alkaline phosphatase, and 24-h urinary free Ca measurements were obtained, and the diagnostic accuracy of the MIBI and US tests in pHPT and sHPT patients was calculated separately. According to their results, in primary and secondary hyperparathyroidism the sensitivity and specificity of combined US + MIBI were 71% and 87%, respectively [21].

However, the complete data obtained in the study of Sukan et al. [21], which include patients with and without gold standard test results, are used in this study in order to obtain the corrected diagnostic accuracy of combined US + MIBI (both pHPT and sHPT). Although they had 117 patients, Sukan et al. included the results of 69 patients for calculating the sensitivity and specificity of combined US + MIBI. If they had included all patients under the assumption that all unverified cases have negative test results, they would have obtained the results given in Table 5 as the observed result. All the considered methods are applied to the data, and the estimated measures and their standard errors are obtained based on 1000 bootstrap samples (see Table 5).
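The bootstrap summaries in Table 5 follow the usual resampling recipe; a minimal sketch is shown below, where dat stands for the study data and correct_fn() is a placeholder for whichever correction method (BG, MI, LR, NNs, or Zhou's bounds) is being evaluated.

```r
# Bootstrap means and standard deviations of a corrected estimate (sketch)
set.seed(1)
boot_est <- replicate(1000, {
  bs <- dat[sample(nrow(dat), replace = TRUE), ]  # resample patients with replacement
  correct_fn(bs)                                  # returns c(Se, Sp) for the resample
})
apply(boot_est, 1, mean)  # bootstrap means (as reported in Table 5)
apply(boot_est, 1, sd)    # bootstrap standard deviations
```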

For the combined US + MIBI test, the sensitivity and specificity reported by Sukan et al. are different (lower in sensitivity and higher in specificity) from those based on the observed results. As shown in Table 5, all the considered methods, except Zhou's bounds, give almost the same results. Based on the missing data mechanism component in the LR approach, the missing mechanism is more likely to be NMAR (p < 0.05 for D in the fitted model for V). When the missing mechanism is not MAR, the simulation study shows that the LR approach with the significant model yields the smallest bias in the estimations. Therefore, the corrected diagnostic accuracy of the combined US + MIBI test should be approximately 74% for the sensitivity and 84% for the specificity. These results show that, if the verification bias is ignored, the sensitivity and specificity will be under- and overestimated, respectively.

5. Discussions and conclusions

The verification bias is a very common problem in diagnostic medicine, and thus many methods have been proposed to deal with it in the literature. In this work, the performance of some of those methods has been compared under various assumptions.

The verification bias can be corrected by approaching the problem as a missing value problem and using imputation methods like the LR, the NNs, or the MI. We have shown that these techniques may be useful for correcting the verification bias and have concluded that the


appropriate approach among these methods should be chosen according to the type of missing mechanism. We confirmed that when the missing data mechanism is MAR, the BG correction is the most appropriate approach for correcting verification bias. On the other hand, if the mechanism is not MAR, the approach proposed by Zhou gives estimators with rather large bias, whereas the imputation methods work quite well and give better results. Therefore, if there are additional variables that may be related to the gold standard test result, we recommend the use of imputation methods – especially the LR approach with the significant model – for adjusting verification bias.

Our studies on real data have shown that the verification bias should not be ignored. Otherwise, the diagnostic accuracy of the test will be inaccurate, and the resulting diagnostic accuracy measurements will be either under- or overestimated.

For future work, comparative studies will include the Bayesian approach.

6. Limitations of this study

This study only focused on the correction of verification bias for binary markers. Additional studies can be undertaken for the continuous case, as in the study by Fluss et al. [6].

This study seeks the most appropriate method for correcting the verification bias using simulated data that are MAR and not MAR. Data that are MAR are easy to generate; however, this is not the case for not-MAR data, which can be generated in many ways. In this work, the conditional probabilities (λ), which are the probabilities of selecting an individual with a positive/negative test for disease verification, have been used, so that the missingness depends on the unobserved data. Other possible data generation mechanisms for not-MAR data are not considered in this study.

For the situation where the missing mechanism is not MAR, Zhou's bounds and two additional imputation methods (LR and NNs) for correcting the verification bias have also been examined. Both of these imputation methods require the existence of significant covariates. For future work, the development of a methodology with better point or interval estimates than Zhou's bounds and not requiring significant covariates would be an important contribution to the correction of the verification bias.

Acknowledgements

This research was supported by the Scientific Research Project Unit of Cukurova University, Grant No. TF2007D10.

References

[1] C.B. Begg and R.A. Greenes, Assessment of diagnostic tests when disease verification is subject to selection bias, Biometrics 39 (1983), pp. 207–215.

[2] M. Buzoianu and J.B. Kadane, Adjusting for verification bias in diagnostic test evaluation: A Bayesian approach, Stat. Med. 27 (2008), pp. 2453–2473.

[3] J.A.H. De Groot, K.J.M. Janssen, A.H. Zwinderman, K.G.M. Moons, and J.B. Reitsma, Multiple imputation to correct for partial verification bias revisited, Stat. Med. 27 (2008), pp. 5880–5889.

[4] J.A.H. De Groot, K.J.M. Janssen, A.H. Zwinderman, P.M.M. Bossuyt, J.B. Reitsma, and K.G. Moons, Correcting for partial verification bias: A comparison of methods, Ann. Epidemiol. 21 (2011), pp. 139–148.

[5] D. Delen, G. Walker, and A. Kadam, Predicting breast cancer survivability: A comparison of three data mining methods, Artif. Intell. Med. 34 (2005), pp. 113–127.

[6] R. Fluss, B. Reiser, D. Faraggi, and A. Rotnitzky, Estimation of the ROC curve under verification bias, Biom. J. 51 (2009), pp. 475–490.

[7] J.A. Hanley, N. Dendukuri, and C.B. Begg, Letter to the editor: Multiple imputation for correcting verification bias by Ofer Harel and Xiao-Hua Zhou, Stat. Med. 25 (2006), pp. 3769–3786, Stat. Med. 26 (2007), pp. 3046–3047.

[8] O. Harel and X.H. Zhou, Multiple imputation for correcting verification bias, Stat. Med. 25 (2006), pp. 3769–3786.


[9] A.S. Kosinski and H.X. Barnhart, Accounting for nonignorable verification bias in assessment of diagnostic tests, Biometrics 59 (2003), pp. 163–171.

[10] E.Z. Martinez, J.A. Achcar, and F.L. Neto, Estimators of sensitivity and specificity in the presence of verification bias: A Bayesian approach, Comput. Stat. Data Anal. 51 (2006), pp. 601–611.

[11] G.A. Pennello, Bayesian analysis of diagnostic test accuracy when disease state is unverified for some subjects, J. Biopharm. Stat. 21 (2011), pp. 954–970.

[12] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design, Springer, New York, 2001.
[13] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.
[14] D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley, New York, 1987.

[15] D.B. Rubin, Multiple imputation after 18+ years, J. Am. Stat. Assoc. 91 (1996), pp. 473–489.

[16] D.B. Rubin and N. Schenker, Multiple imputation for interval estimation from simple random samples with ignorable nonresponse, J. Am. Stat. Assoc. 81 (1986), pp. 366–374.

[17] J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, London, 1997.

[18] N. Schenker and A.H. Welsh, Asymptotic results for multiple imputation, Ann. Stat. 16 (1988), pp. 1550–1566.
[19] M. Schumacher, R. Robner, and W. Vach, Neural networks and logistic regression: Part I, Comput. Stat. Data Anal. 21 (1996), pp. 661–682.

[20] H.S. Stern, Neural networks in applied statistics (with discussion), Technometrics 38 (1996), pp. 205–220.
[21] A. Sukan, M. Reyhan, M. Aydin, A.F. Yapar, Y. Sert, T. Canpolat, and A. Aktas, Preoperative evaluation of hyperparathyroidism: The role of dual-phase parathyroid scintigraphy and ultrasound imaging, Ann. Nucl. Med. 22 (2008), pp. 123–131.

[22] S. Van Buuren and K. Oudshoorn, Flexible multivariate imputation by MICE, Tech. Rep., TNO Prevention and Health, Leiden, The Netherlands, 1999. Available at http://www.stefvanbuuren.nl/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf (accessed 20 February 2012).

[23] B. Warner and M. Misra, Understanding neural networks as statistical tools, Am. Stat. 50 (1996), pp. 284–293.
[24] J. Yerushalmy, Statistical problems in assessing methods of medical diagnosis with special reference to X-ray techniques, Public Health Rep. 62 (1947), pp. 1432–1449.

[25] X.H. Zhou, Maximum likelihood estimators of sensitivity and specificity corrected for verification bias, Commun. Stat. Part A: Theory Methods 22 (1993), pp. 3177–3198.

[26] X.H. Zhou, Correcting for verification bias in studies of a diagnostic test’s accuracy, Stat. Methods Med. Res. 7 (1998), pp. 337–353.

[27] X.H. Zhou, N.A. Obuchowski, and D.K. McClish, Statistical Methods in Diagnostic Medicine, 2nd ed., Wiley, New York, 2010.

Appendix 1

The variables D and T have prevalence 0.5, while B1 and B2 have prevalences varying between 0.25 and 0.55, with the intention of obtaining different coefficients of determination. The odds ratio between D and T, OR(D,T), is 16.0. Similarly, OR(D,B1) is between 0.1 and 0.2, OR(D,B2) is between 10.0 and 12.0, OR(T,B1) is between 0.3 and 0.5, and OR(T,B2) is between 3.5 and 5.0; thus the dichotomous variables are related to the gold standard test result and the new test result to varying degrees. The variables B1 and B2 have been simulated under the condition that OR(B1,B2 | D=0) is between 0.98 and 1.02 and OR(B1,B2 | D=1) is between 0.99 and 1.1. Also, the odds ratio between B1 and B2 ranges from 0.35 to 0.50. These odds ratios show that there is an interaction and collinearity between B1 and B2.

We simulate multivariate normal distributions for C1 and C2 so that C1 and C2 are correlated with each other according to D being 0 or 1 as

$$(C_1, C_2) \mid D = 0 \sim N\!\left((\mu_{10}, \mu_{20}),\ \begin{pmatrix} 1.0 & \sigma_1 \\ \sigma_1 & 1.0 \end{pmatrix}\right), \qquad (C_1, C_2) \mid D = 1 \sim N\!\left((\mu_{11}, \mu_{21}),\ \begin{pmatrix} 1.0 & \sigma_2 \\ \sigma_2 & 1.0 \end{pmatrix}\right),$$

where the μij range from 2 to 4, and σ1 and σ2 take on values between 1 and 3. For the third continuous variable, C3, we also use a normal distribution with changing mean and standard deviation.
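A small R sketch of this part of the data generation is given below; the class sizes, means, and within-class covariances are example values only (the covariances are kept below 1 so that the covariance matrices remain positive definite).

```r
# Generating the correlated continuous covariates C1 and C2 (illustrative values)
library(MASS)
set.seed(1)

n0 <- 500; n1 <- 500                       # healthy and diseased cases
mu0 <- c(2.0, 2.5); mu1 <- c(3.5, 4.0)     # example means in the stated 2-4 range
rho0 <- 0.5; rho1 <- 0.7                   # example within-class covariances

C_healthy  <- mvrnorm(n0, mu0, matrix(c(1, rho0, rho0, 1), 2))
C_diseased <- mvrnorm(n1, mu1, matrix(c(1, rho1, rho1, 1), 2))

C  <- rbind(C_healthy, C_diseased)         # columns correspond to C1 and C2
D  <- rep(c(0, 1), c(n0, n1))              # gold standard status
C3 <- rnorm(n0 + n1, mean = 5, sd = 2)     # noise variable, unrelated to the others
```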

