Does the decision in a validation process of a surrogate endpoint change with level of significance of treatment effect? A proposal on validation of surrogate endpoints


Y. Sertdemir ⁎, R. Burgut

Cukurova University School of Medicine, Department of Biostatistics, 01130 Balcali-Adana, Turkey

Article info

Article history: Received 2 April 2008; accepted 25 August 2008

Abstract

Background: In recent years the use of surrogate endpoints (S) has attracted growing interest. In clinical trials, it is important to obtain treatment outcomes as early as possible. For this reason there is a need for surrogate endpoints, which are measured earlier than the true endpoint (T). However, before a surrogate endpoint can be used it must be validated. For a candidate surrogate endpoint, for example time to recurrence, the validation result may change dramatically between clinical trials. The aim of this study is to show, with an application to real data, how the validation criterion $R^2_{\text{trial}}$ proposed by Buyse et al. is influenced by the magnitude of the treatment effect.

Methods: The criterion $R^2_{\text{trial}}$ proposed by Buyse et al. (2000) is applied to four data sets from colon cancer clinical trials (C-01, C-02, C-03 and C-04). Each clinical trial is analyzed separately for the treatment effect on survival (true endpoint) and on recurrence-free survival (surrogate endpoint), and the same analysis is repeated for each center within each trial. The results are used for a standard validation analysis. The centers were then grouped by the Wald statistic into 3 equally sized groups.

Results: The validation criterion $R^2_{\text{trial}}$ was 0.641 (95% CI 0.432–0.782), 0.223 (95% CI 0.008–0.503), 0.761 (95% CI 0.550–0.872) and 0.560 (95% CI 0.404–0.687) for C-01, C-02, C-03 and C-04 respectively. The $R^2_{\text{trial}}$ criterion changed with the Wald statistics observed for the centers used in the validation process: the higher the Wald statistics in a group, the higher the observed $R^2_{\text{trial}}$ value.

Conclusion: Recurrence-free survival is not a good surrogate for overall survival in clinical trials with non-significant treatment effects and is a moderate surrogate in trials with significant treatment effects. This shows that the level of significance of the treatment effect should be taken into account in the validation process of surrogate endpoints.

Keywords: Surrogate endpoint; Validation criteria; Colon cancer; Meta-analytic; Clinical trials

1. Introduction

In the year 2006, 542 of 100,000 men and 404 of 100,000 women developed cancer. In the same year 234 of 100,000 men and 160 of 100,000 women died from cancer [1]. In a population of 300,000,000 people this would mean 1,500,000 new cancer cases and 600,000 deaths for the year 2007. Every year, new treatments are developed to reduce deaths from cancer and other fatal diseases. These new treatments need to be tested in clinical trials, but clinical trials often take up to 10 years when the primary (true) endpoint is survival. For this reason there is a need for alternative (surrogate) endpoints that can give the same information on the treatment effect earlier than the primary endpoint. The use of surrogate endpoints can shorten the follow-up time and reduce the number of patients needed for a clinical trial [2,3]. A surrogate endpoint has been defined as an alternative endpoint (such as a biological marker, physical sign, or precursor event) that can be used as a substitute for a clinically meaningful endpoint that measures directly how a patient feels, functions, or survives [4].

⁎ Corresponding author. E-mail addresses: yasarser@cu.edu.tr (Y. Sertdemir), refik@cu.edu.tr (R. Burgut).

Contemporary Clinical Trials

However, before a surrogate endpoint can be used it needs to be validated. For a candidate surrogate endpoint, for example recurrence-free survival (RFS), the validation result may change dramatically among clinical trials. The aim of this study is to show, using real data, how the validation criterion $R^2_{\text{trial}}$ proposed by Buyse et al. [5] is influenced by the magnitude of the treatment effect.

2. Surrogate endpoint validation

In clinical research, the endpoint of greatest relevance to inferences concerning therapeutic efficacy is frequently not practical or even feasible to measure. Sometimes the determination of the true endpoint (T) is difficult, requiring an expensive, invasive or uncomfortable procedure. Sometimes it is unobservable for an impractically long interval. Occasionally the true endpoint is not directly measurable at all. In these cases we must rely on alternative or surrogate endpoints (S) [6].

In the past, the use of a surrogate was based on the correlation between S and T. The existence of such a correlation between the endpoints is not sufficient for using S as a surrogate. As Fleming and DeMets [7] stated: “A correlate does not a surrogate make”. It is required that the effect of treatment on the surrogate predicts the effect of treatment on the true endpoint. To be useful, a surrogate endpoint should be strongly associated with the true outcome, lie in the causal pathway for the definitive outcome, manifest early in the course of follow-up, and be relatively easy to measure. However, the defining characteristic is that the surrogate outcome should be affected by treatment in the same way (direction and relative magnitude) as the definitive outcome. This last characteristic is the one that is difficult to verify [8].

The validation of surrogate endpoints is a difficult task. Prentice [9] defined a surrogate endpoint as “a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint”. Prentice proposed some criteria to validate a surrogate endpoint. Because the Prentice criteria are difficult to verify, Freedman et al. [10] suggested focusing on the proportion of the treatment effect explained by the surrogate (PE). Freedman et al. [10] also noted that the confidence limits of PE are generally too wide to be informative unless the treatment effect on the true endpoint is highly significant. Others showed that the PE can be larger than 1 or negative, which can hardly be justified for a proportion [5,13]. This indicates that the criterion can be useful when treatment effects are significant, but that for non-significant treatment effects it could be a misleading quantity to use in the validation of surrogates [11].

Buyse and Molenberghs [12] proposed two other quantities to replace PE. The first quantity, the “Relative Effect”, is the effect of the treatment on the true endpoint relative to that on the surrogate endpoint; it depends on the scales chosen to measure S and T. The second quantity is the “Adjusted Association”: the treatment-adjusted association $\gamma_Z$ is the subject-specific association between the surrogate and true endpoints, adjusting for treatment. The slope of the linear regression between the trial-level effects of treatment upon both endpoints is useful for prediction purposes; the coefficient of determination ($R^2$) of this linear regression provides a measure of the strength of the association between the effects. This measure was termed $R^2_{\text{trial}}$, and a surrogate is called trial-level valid if $R^2_{\text{trial}}$ is sufficiently close to 1. By analogy, Buyse et al. [5] redefined the individual-level association between both endpoints as a coefficient of determination, which they termed $R^2_{\text{individual}}$. A surrogate is called individual-level valid if $R^2_{\text{individual}}$ is sufficiently close to 1.

3. Colon cancer clinical trials

Colon cancer is a highly treatable and often curable disease when localized to the bowel. It is the second most frequently diagnosed malignancy in the United States as well as the second most common cause of cancer death. The primary treatment is surgery, which results in cure in approximately 50% of patients. Recurrence after surgery is a major problem and is often the ultimate cause of death. Surrogate endpoints for overall survival, such as recurrence-free survival (RFS) or disease-free survival (DFS), have been investigated in various studies [14,15]. In one investigation, time to recurrence seems to be a very weak surrogate [14], and in another it is a moderate to good surrogate. The result of a validation study may differ with the clinical trial, the number of centers or trials, or the type of analysis. Four data sets from the National Surgical Adjuvant Breast and Bowel Project (NSABP), with protocol numbers C-01, C-02, C-03 and C-04, are used as real data sets.

4. Model descriptions and setting

This section describes the meta-analytic approach and the models used for surrogate endpoint validation.

The meta-analytic approach for two normally distributed endpoints was proposed by Buyse et al. [5]. Here the true endpoint (T) and the surrogate endpoint (S) are continuous, normally distributed random variables that are completely observed for all patients. In the notation, $i = 1, 2, \ldots, N$ indexes trials and $j = 1, 2, \ldots, n_i$ indexes subjects within a trial. For each patient the triplets $(S_{ij}, T_{ij}, Z_{ij})$ are assumed to be observed.

The first stage is based upon a trial-specific model:

$$S_{ij} \mid Z_{ij} = \mu_{Si} + \alpha_i Z_{ij} + \varepsilon_{Sij} \tag{1}$$

$$T_{ij} \mid Z_{ij} = \mu_{Ti} + \beta_i Z_{ij} + \varepsilon_{Tij} \tag{2}$$

where $\mu_{Si}$ and $\mu_{Ti}$ are trial-specific intercepts and $\alpha_i$ and $\beta_i$ are the trial-specific effects of treatment (Z) on S and T. $\varepsilon_{Sij}$ and $\varepsilon_{Tij}$ are correlated error terms, which are assumed to be mean-zero normally distributed with covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ & \sigma_{TT} \end{pmatrix} \tag{3}$$

At the second stage, it is assumed that

$$\begin{pmatrix} \mu_{Si} \\ \mu_{Ti} \\ \alpha_i \\ \beta_i \end{pmatrix} = \begin{pmatrix} \mu_S \\ \mu_T \\ \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} m_{Si} \\ m_{Ti} \\ a_i \\ b_i \end{pmatrix} \tag{4}$$

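The second term on the right-hand side of Eq. (4) follows a zero-mean normal distribution with dispersion matrix

$$D = \begin{pmatrix} d_{SS} & d_{ST} & d_{Sa} & d_{Sb} \\ & d_{TT} & d_{Ta} & d_{Tb} \\ & & d_{aa} & d_{ab} \\ & & & d_{bb} \end{pmatrix} \tag{5}$$

The random-effects representation is based upon combining both steps:

$$S_{ij} \mid Z_{ij} = \mu_S + m_{Si} + \alpha Z_{ij} + a_i Z_{ij} + \varepsilon_{Sij} \tag{6}$$

$$T_{ij} \mid Z_{ij} = \mu_T + m_{Ti} + \beta Z_{ij} + b_i Z_{ij} + \varepsilon_{Tij} \tag{7}$$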

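The next step considered by Buyse et al. [5] focused on prediction. Assume that we have data only on S in a new trial, $i = 0$, and that we are interested in the treatment effect of Z on T, $\beta + b_0$, given the effect on S, that is, in $(\beta + b_0 \mid m_{S0}, a_0)$:

$$E(\beta + b_0 \mid m_{S0}, a_0) = \beta + \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}^{T} \begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1} \begin{pmatrix} \mu_{S0} - \mu_S \\ \alpha_0 - \alpha \end{pmatrix} \tag{8}$$

$$\mathrm{Var}(\beta + b_0 \mid m_{S0}, a_0) = d_{bb} - \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}^{T} \begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1} \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix} \tag{9}$$

In relation to the prediction Eqs. (8) and (9), the quantity used to assess the quality of a surrogate at the trial level is the coefficient of determination

$$R^2_{\text{trial}(f)} = R^2_{b_i \mid m_{Si}, a_i} = \frac{\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}^{T} \begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1} \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}}{d_{bb}} \tag{10}$$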

The index “Trial (f)” indicates that this coefficient pertains to the distribution of $\beta_i$ conditional on the full set of trial-specific parameters for S in model (4), i.e., on $\mu_{Si}$ and $\alpha_i$.
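The prediction formulas in Eqs. (8) to (10) can be evaluated numerically once the relevant elements of D are available. The following Python sketch is illustrative only: the function name `predict_true_effect`, the dictionary layout and all numeric values are assumptions made for this example and do not come from the colon cancer data.

```python
def predict_true_effect(beta, mu_S, alpha, mu_S0, alpha_0, d):
    """Return (mean, variance, r2_full) following Eqs. (8), (9) and (10).

    `d` maps 'SS', 'Sa', 'aa', 'Sb', 'ab', 'bb' to the matching elements
    of the dispersion matrix D in Eq. (5).
    """
    # Invert the 2x2 block [[d_SS, d_Sa], [d_Sa, d_aa]] in closed form.
    det = d["SS"] * d["aa"] - d["Sa"] ** 2
    if d["SS"] <= 0 or det <= 0:
        raise ValueError("the (m_S, a) block of D must be positive-definite")
    inv_SS, inv_Sa, inv_aa = d["aa"] / det, -d["Sa"] / det, d["SS"] / det

    # Row vector (d_Sb, d_ab) multiplied by the inverse block.
    w_S = d["Sb"] * inv_SS + d["ab"] * inv_Sa
    w_a = d["Sb"] * inv_Sa + d["ab"] * inv_aa

    mean = beta + w_S * (mu_S0 - mu_S) + w_a * (alpha_0 - alpha)  # Eq. (8)
    explained = w_S * d["Sb"] + w_a * d["ab"]
    variance = d["bb"] - explained                                 # Eq. (9)
    r2_full = explained / d["bb"]                                  # Eq. (10)
    return mean, variance, r2_full


# Hypothetical values, chosen only to exercise the formulas.
D_EXAMPLE = {"SS": 1.0, "Sa": 0.0, "aa": 1.0, "Sb": 0.0, "ab": 0.8, "bb": 1.0}
mean, variance, r2_full = predict_true_effect(
    beta=-0.2, mu_S=0.0, alpha=-0.3, mu_S0=0.0, alpha_0=-0.5, d=D_EXAMPLE
)
print(mean, variance, r2_full)
```

In this hypothetical setting the intercept carries no information about $b_i$ ($d_{Sb} = d_{Sa} = 0$), so the value from Eq. (10) coincides with $d_{ab}^2 / (d_{aa} d_{bb})$.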

In a simpler setting where $b_0$ is predicted independently of $m_{S0}$, the coefficient $R^2_{\text{trial}}$ in Eq. (10) reduces to Eq. (11):

$$R^2_{\text{trial}(r)} = R^2_{b_i \mid a_i} = \frac{d_{ab}^2}{d_{aa} d_{bb}} \tag{11}$$

$R^2_{\text{trial}}$ is the square of the correlation between $a_i$ and $b_i$.

This coefficient measures how precisely one can predict the effect of treatment on the true endpoint in a new trial, based on the previous data and the observed treatment effect on the surrogate endpoint in the new trial.

It is essential to explore the quality of the prediction of the treatment effect on the true endpoint in trial i by a) information obtained in the validation process based on trials i = 1…N and b) the estimate of the effect of Z on S in a new trial i = 0.
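As a rough illustration of the reduced trial-level criterion, the sketch below regresses estimated treatment effects on the true endpoint on those for the surrogate and reports the slope, intercept and $R^2$. It is a naive estimate that treats the estimated effects as if they were the true $a_i$ and $b_i$ and ignores their estimation error; the function name `r2_trial` and the numbers are hypothetical.

```python
def r2_trial(alpha_hat, beta_hat):
    """Slope, intercept and R^2 of the simple regression of beta_hat on alpha_hat.

    For a simple linear regression, R^2 equals the squared Pearson correlation.
    """
    n = len(alpha_hat)
    if n != len(beta_hat) or n < 3:
        raise ValueError("need at least 3 paired estimates")
    mean_a = sum(alpha_hat) / n
    mean_b = sum(beta_hat) / n
    s_aa = sum((a - mean_a) ** 2 for a in alpha_hat)
    s_bb = sum((b - mean_b) ** 2 for b in beta_hat)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(alpha_hat, beta_hat))
    slope = s_ab / s_aa
    intercept = mean_b - slope * mean_a
    r2 = s_ab ** 2 / (s_aa * s_bb)
    return slope, intercept, r2


# Hypothetical log hazard ratios for five units (trials or centers).
alpha_hat = [-0.4, -0.2, 0.0, 0.2, 0.4]   # effects on the surrogate
beta_hat = [-0.3, -0.1, 0.0, 0.1, 0.3]    # effects on the true endpoint
slope, intercept, r2 = r2_trial(alpha_hat, beta_hat)
print(slope, intercept, r2)
```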

It is worth noting that the D matrix is required to be positive-definite for $R^2_{\text{trial}}$ to be a meaningful measure. A surrogate is said to be ‘perfect at the trial level’ when $R^2_{\text{trial}}$ is equal to 1 [5]. After adjustment for the effects of treatment Z, the association between S and T is captured by $\Sigma$. The surrogate is said to be ‘perfect at the individual level’ if the coefficient of determination

$$R^2_{\text{indiv}} = R^2_{\varepsilon_{Tij} \mid \varepsilon_{Sij}} = \frac{\sigma_{ST}^2}{\sigma_{SS}\sigma_{TT}} \tag{12}$$

is equal to 1. $R^2_{\text{indiv}}$ is the squared correlation between S and T after accounting for the trial and treatment effects [5]. This criterion is not discussed in this work.

An additional model is needed to calculate the PE criterion:

$$T_{ij} \mid Z_{ij}, S_{ij} = \tilde{\mu}_T + \beta_S Z_{ij} + \gamma S_{ij} + \tilde{\varepsilon}_{Tij} \tag{13}$$

PE is calculated using formula (14) [10]:

$$PE = 1 - \frac{\beta_S}{\beta} \tag{14}$$
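A small numeric sketch of formula (14) shows why PE is hard to interpret as a proportion when the unadjusted treatment effect $\beta$ is close to zero. The function name `proportion_explained` and the coefficient values are hypothetical and serve only as an illustration.

```python
def proportion_explained(beta, beta_S):
    """PE = 1 - beta_S / beta, as in formula (14).

    beta   : treatment effect on T without adjustment for S
    beta_S : treatment effect on T after adjustment for S, Eq. (13)
    """
    if beta == 0:
        raise ValueError("PE is undefined when beta is zero")
    return 1.0 - beta_S / beta


# Hypothetical coefficients (for example, log hazard ratios).
pe_clear_effect = proportion_explained(beta=-0.40, beta_S=-0.10)   # inside [0, 1]
pe_sign_flip = proportion_explained(beta=-0.05, beta_S=0.02)       # above 1
pe_larger_adjusted = proportion_explained(beta=-0.05, beta_S=-0.08)  # below 0
print(pe_clear_effect, pe_sign_flip, pe_larger_adjusted)
```

With a small $\beta$ in the denominator, modest changes in $\beta_S$ move PE outside the interval [0, 1].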

5. Analysis of case studies

5.1. Data sets

The four data sets on colon cancer clinical trials (C-01, C-02, C-03 and C-04) [16–19] from the NSABP were analyzed as follows. Each trial was first analyzed separately. In data set C-01 the four treatment arms were recoded into 2 groups: operation only (OP) and OP + MOF were recoded into the first group, and Operation + BCG (Pasteur) and Operation + BCG (Connaught) were recoded into the second group. In data set C-04 the treatments FU + LV and FU + LV + LEV were recoded into the same treatment group. This recoding was made to reduce the number of treatment arms to 2 groups, resulting in more cases for each center. For each trial, centers with at least 3 observations and with a minimum of 2 events in each treatment arm were considered for analysis; all other centers were grouped into one center. For the true endpoint (overall survival) and the surrogate endpoint (recurrence-free survival), a Cox regression analysis was applied within each center, where $\hat{\beta}_i$ is the coefficient of the treatment effect on the true endpoint and $\hat{\alpha}_i$ is the coefficient of the treatment effect on the surrogate endpoint. The values $\hat{\beta}_i$, se($\hat{\beta}_i$), Wald($\hat{\beta}_i$), $\hat{\alpha}_i$, se($\hat{\alpha}_i$) and Wald($\hat{\alpha}_i$) were recorded. The Wald statistics, $(\hat{\beta}_i / \text{se}(\hat{\beta}_i))^2$, from the Cox regression analysis for the treatment effect on overall survival, shown in Tables 1 and 2, are the statistics used to test whether the treatment effects are significantly different from zero. The mean Wald statistics for $\hat{\beta}_i$ and their 95% CIs for data sets C-01, C-02, C-03 and C-04 were 0.81 (0.413–1.22), 0.68 (0.078–1.29), 0.85 (0.508–1.19) and 0.48 (0.210–0.744) respectively. The estimated $R^2_{\text{trial}}$ and PE values for the data sets C-01, C-02, C-03 and C-04, with their 95% bootstrap CIs based on 1000 and 500 replications respectively, are given in Table 1.

Table 1
Percent censoring, p-value for treatment effect, number of centers used for analysis, $R^2_{\text{trial}}$ criterion, PE criterion, and bootstrap confidence intervals by clinical trial

Trial   % censoring   p-value⁎⁎⁎   # Centers   Mean Wald   95% CI Wald       $R^2_{\text{trial}}$ (95% CI)⁎   PE (95% CI)⁎⁎
C-01    37            0.210        28          0.81        (0.413, 1.22)     0.641 (0.432, 0.782)             0.361 (−4.65, 5.31)
C-02    53            0.110        12          0.68        (0.078, 1.29)     0.223 (0.008, 0.503)             1.271 (−1.28, 4.95)
C-03    62            <0.001       37          0.85        (0.508, 1.19)     0.761 (0.550, 0.872)             0.864 (0.18, 1.64)
C-04    68            0.081        36          0.48        (0.210, 0.744)    0.560 (0.404, 0.687)             1.67 (−1.11, 6.15)

⁎ Based on 1000 bootstrap replications; ⁎⁎ based on 500 bootstrap replications; ⁎⁎⁎ for treatment effect on true endpoint (survival).

Table 2
Centers from clinical trials by grouped Wald statistic

Wald statistic       C-01   C-02   C-03   C-04   Total
W ≤ 0.072            10     4      9      15     38
0.072 < W ≤ 0.52     8      3      14     13     38
W > 0.52             10     5      14     8      37
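The Wald statistic described above is the squared ratio of a coefficient to its standard error and is referred to a chi-square distribution with 1 degree of freedom. The following Python sketch shows the calculation for a single center; the function name `wald_test` and the input values are hypothetical, and the p-value uses the standard identity between the 1-df chi-square tail and the complementary error function.

```python
import math


def wald_test(coef, se):
    """Wald chi-square statistic (1 df) and p-value for H0: coef = 0."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    wald = (coef / se) ** 2
    # For a chi-square variable with 1 df, P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


# Hypothetical Cox regression output for one center.
wald, p_value = wald_test(coef=-0.392, se=0.2)
print(wald, p_value)
```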

The observed $\hat{\beta}_i$, se($\hat{\beta}_i$), Wald($\hat{\beta}_i$), $\hat{\alpha}_i$, se($\hat{\alpha}_i$) and Wald($\hat{\alpha}_i$) values from the 4 clinical trials were merged and grouped by the Wald statistic of $\hat{\beta}_i$ into 3 groups using equally spaced percentiles. The distribution of centers from each clinical trial by group is given in Table 2.
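Grouping by equally spaced percentiles can be sketched as a split at the tertiles of the pooled Wald statistics. The exact percentile rule used to obtain the published cut points is not stated, so the sketch below relies on the default method of Python's `statistics.quantiles` as an assumption; the function name and the Wald values are hypothetical.

```python
import statistics


def group_by_wald_tertiles(wald_values):
    """Assign each center to group 0, 1 or 2 using the tertiles of the Wald statistics."""
    # quantiles(..., n=3) returns the two cut points that split the data into thirds.
    lower, upper = statistics.quantiles(wald_values, n=3)
    groups = []
    for w in wald_values:
        if w <= lower:
            groups.append(0)
        elif w <= upper:
            groups.append(1)
        else:
            groups.append(2)
    return (lower, upper), groups


# Hypothetical pooled Wald statistics for nine centers.
wald_values = [0.30, 0.01, 1.50, 0.05, 0.80, 0.03, 0.50, 3.00, 0.10]
cuts, groups = group_by_wald_tertiles(wald_values)
print(cuts, groups)
```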

One thousand bootstrap replications were applied to obtain the 95% CI for $R^2_{\text{trial}}$ in each group. In the first group (W ≤ 0.072) the estimated $R^2_{\text{trial}}$ was 0.09, in the second group (0.072 < W ≤ 0.52) it was 0.44, and in the last group (W > 0.52) it was 0.77. In the same way we estimated $R^2$ from the regression of SE($\hat{\beta}_i$) on SE($\hat{\alpha}_i$). The bootstrap results are given in Fig. 1.

Fig. 1. Bootstrapped $R^2_{\text{trial}}$ from the regression of $\hat{\beta}_i$ on $\hat{\alpha}_i$ and $R^2_{\text{trial}}$ from the regression of SE($\hat{\beta}_i$) on SE($\hat{\alpha}_i$) for grouped Wald values (1000 replications for each).
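A bootstrap interval of this kind can be sketched by resampling centers with replacement and recomputing $R^2$ each time. The code below uses a simple percentile interval, which is an assumption, since the type of bootstrap interval used in the analysis is not specified; the function names and the data are hypothetical.

```python
import random


def squared_correlation(x, y):
    """Squared Pearson correlation, or None if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if s_xx == 0 or s_yy == 0:
        return None
    return s_xy ** 2 / (s_xx * s_yy)


def bootstrap_r2_ci(alpha_hat, beta_hat, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for R^2, resampling centers with replacement."""
    rng = random.Random(seed)
    n = len(alpha_hat)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r2 = squared_correlation([alpha_hat[i] for i in idx],
                                 [beta_hat[i] for i in idx])
        if r2 is not None:  # skip degenerate resamples
            estimates.append(r2)
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    m = len(estimates)
    lower = estimates[int((1 - level) / 2 * m)]
    upper = estimates[min(m - 1, int((1 + level) / 2 * m))]
    return lower, upper


# Hypothetical center-level estimates for one Wald group.
alpha_hat = [-0.6, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7]
beta_hat = [-0.5, -0.1, -0.4, 0.0, -0.2, 0.1, -0.1, 0.3, 0.1, 0.5, 0.2, 0.6]
print(bootstrap_r2_ci(alpha_hat, beta_hat))
```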

6. Results for case study

As shown in Table 1, the $R^2_{\text{trial}}$ values are higher for trials with higher Wald values, except for the C-02 data set, where only 12 centers were eligible for analysis. The 95% CI for the $R^2_{\text{trial}}$ value is very wide for this clinical trial.

In Fig. 1 it is observed that $R^2$ from the regression of SE($\hat{\beta}_i$) on SE($\hat{\alpha}_i$) is relatively stable across the grouped Wald values, whereas $R^2_{\text{trial}}$ from the regression of $\hat{\beta}_i$ on $\hat{\alpha}_i$ increases as the Wald statistic increases and reaches the value of $R^2$ from the regression of SE($\hat{\beta}_i$) on SE($\hat{\alpha}_i$) in the group with the highest Wald values (W > 0.52).

In Fig. 2 the range of the PE criterion differs between clinical trials; it is smallest, and closest to the interval 0–1, for the C-03 clinical trial, which has the smallest p-value (highest Wald) for the treatment effect on overall survival.

7. Discussion

The bootstrap results for the PE criterion showed that it can take values outside the range [0, 1], which is difficult to interpret for a proportion. It gave an acceptable result only for the C-03 clinical trial, which had the most significant treatment effect. This shows that the p-value of the treatment effect plays an important role in the validation of surrogate endpoints.

Buyse et al. [20] evaluated progression-free survival (PFS) as a surrogate for survival in advanced colorectal cancer using the $R^2_{\text{trial}}$ criterion. They used a proportional hazards regression model with treatment as the only factor for each trial in their meta-analysis. The effect of treatment on both endpoints was used in a regression analysis.

Using the same approach we estimated the validation criteria for RFS as a surrogate for OS. Our aim, however, is not to prove the validity of RFS but to show how these criteria are affected by the treatment effect. In the validation process of surrogate endpoints in different data sets with different treatments, we observed different $R^2_{\text{trial}}$ estimates with wide confidence intervals. In the case study we observed higher $R^2_{\text{trial}}$ values for groups with higher values of the Wald statistic and a relatively stable Corr(SE($\hat{\beta}_i$), SE($\hat{\alpha}_i$)). At this point it is difficult to decide whether the bias of $R^2_{\text{trial}}$ decreases or increases with more significant treatment effects. Freedman et al. [10] stated that a validation process using the Proportion Explained with adequate statistical power would require $\beta/\sigma > 4$, which supports the observation in our study that clinical trials or centers with more significant treatment effects should have more weight in the validation process.

We also conclude that $R^2_{\text{trial}}$ should be evaluated together with $R^2_{\text{indiv}}$ at different levels of significance and should be compared to Corr(SE($\hat{\beta}_i$), SE($\hat{\alpha}_i$)). The Corr(SE($\hat{\beta}_i$), SE($\hat{\alpha}_i$)) might give information on the direction of the bias of $R^2_{\text{trial}}$. If the observed $R^2$ values from the regression of SE($\hat{\beta}_i$) on SE($\hat{\alpha}_i$) are higher than the $R^2_{\text{trial}}$ values and the observed $R^2_{\text{trial}}$ values increase with more significant treatment effects, this might be an indication that $R^2_{\text{trial}}$ is underestimated. In such a situation, we recommend a weighted regression analysis in which the weights are the test statistics (for example the Wald statistic or its square root) for the treatment effect observed in each center or trial. The validation criterion $R^2_{\text{trial}}$ should be evaluated together with $R^2_{\text{indiv}}$ because the value of $R^2_{\text{indiv}}$ also has an effect on the $R^2_{\text{trial}}$ criterion. Nevertheless, validation results should be handled with care in prediction, since this is only an approximation procedure for the true endpoint.
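One way to read the weighted regression recommended above is as weighted least squares of $\hat{\beta}_i$ on $\hat{\alpha}_i$, with each center weighted by its Wald statistic. The sketch below is only one possible interpretation: the function name `weighted_r2`, the use of the weighted coefficient of determination and all values are assumptions for illustration.

```python
def weighted_r2(alpha_hat, beta_hat, weights):
    """Slope and R^2 of the weighted least-squares regression of beta_hat on alpha_hat."""
    if not (len(alpha_hat) == len(beta_hat) == len(weights)):
        raise ValueError("inputs must have equal length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    w_sum = sum(weights)
    mean_a = sum(w * a for w, a in zip(weights, alpha_hat)) / w_sum
    mean_b = sum(w * b for w, b in zip(weights, beta_hat)) / w_sum
    s_aa = sum(w * (a - mean_a) ** 2 for w, a in zip(weights, alpha_hat))
    s_bb = sum(w * (b - mean_b) ** 2 for w, b in zip(weights, beta_hat))
    s_ab = sum(w * (a - mean_a) * (b - mean_b)
               for w, a, b in zip(weights, alpha_hat, beta_hat))
    slope = s_ab / s_aa
    # Weighted R^2 equals the squared weighted correlation.
    r2 = s_ab ** 2 / (s_aa * s_bb)
    return slope, r2


# Hypothetical estimates: the two centers with large Wald statistics agree closely,
# while the three centers with small Wald statistics are noisy.
alpha_hat = [-0.4, -0.2, 0.0, 0.2, 0.4]
beta_hat = [-0.4, 0.1, 0.0, -0.1, 0.4]
wald = [4.0, 0.1, 0.1, 0.1, 4.0]

_, r2_unweighted = weighted_r2(alpha_hat, beta_hat, [1.0] * 5)
_, r2_weighted = weighted_r2(alpha_hat, beta_hat, wald)
print(r2_unweighted, r2_weighted)
```

With equal weights the function reduces to the ordinary $R^2$; in this hypothetical example the Wald weights raise the estimate because the noisy, non-significant centers contribute less.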

Acknowledgments

We are grateful to the late Harry Samuel Wieand, Ph.D., University of Pittsburgh, for letting us use data from protocols C-01, C-02, C-03 and C-04 as an example.

References

[1] Cancer Statistics Working Group. United States Cancer Statistics: 2003 incidence and mortality. Atlanta: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2006.

[2] Wittes J, Lakatos E, Probstfield J. Surrogate endpoints in clinical trials: cardiovascular diseases. Stat Med 1989;8:415–25.

[3] Herson J. The use of surrogate endpoints in clinical trials (an introduction to a series of four papers). Stat Med 1989;8:403–4.

[4] Temple RJ. A regulatory authority's opinion about surrogate endpoints. In: Nimmo WS, Tucker GT, editors. Clinical Measurement in Drug Evaluation. New York: J. Wiley and Sons; 1995.

[5] Buyse M, Molenberghs G, Burzykowski T, Geys H, Renard D. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics 2000;1:1–19.

[6] Ellenberg S, Hamilton JM. Surrogate endpoints in clinical trials: cancer. Stat Med 1989;8:405–13.

[7] Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med 1996;125:605–13.

[8] Piantadosi S. Some statistical issues in the design of cancer clinical trials with surrogate end points. Abstracts From the Program of the Second Annual Meeting of the American Society for Experimental Neurotherapeutics, Washington, DC, March 23–25; 2000.

[9] Prentice RL. Surrogate markers in clinical trials: definition and operational criteria. Stat Med 1989;8:431–40.

[10] Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Stat Med 1992;11:167–78.

[11] Molenberghs G, Buyse M, Geys H, Renard D, Burzykowski T. Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Control Clin Trials 2002;23:607–25.

[12] Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics 1998;54:1014–29.

[13] Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. Statistical validation of surrogate endpoints: problems and proposals. Drug Inf J 2000;34:447–57.

[14] Burzykowski T, Molenberghs G, Buyse M, Geys H, Renard D. Validation of surrogate end points in multiple randomized clinical trials with failure time end points. J R Stat Soc Ser C Appl Stat 2001;50:405–22.

[15] Sargent D, Wieand S, Haller DG, et al. Disease-free survival (DFS) vs. overall survival (OS) as a primary endpoint for adjuvant colon cancer studies: individual patient data from 20,898 patients on 18 randomized trials. J Clin Oncol 2005;23(34):8664–70.

[16] Wolmark N, Fisher B, Rockette H, Redmond C, et al. Postoperative adjuvant chemotherapy or BCG for colon cancer: results from NSABP protocol C-01. J Natl Cancer Inst 1988;80:30–6.

[17] Wolmark N, Rockette H, Wickerham DL, et al. Adjuvant therapy of Dukes' A, B, and C adenocarcinoma of the colon with portal-vein fluorouracil hepatic infusion: preliminary results of National Surgical Adjuvant Breast and Bowel Project protocol C-02. J Clin Oncol 1990;8:1466–75.

[18] Wolmark N, Rockette HE, Fisher B, et al. The benefit of leucovorin-modulated 5-FU as postoperative adjuvant therapy for primary colon cancer: results from NSABP protocol C-03. J Clin Oncol 1993;11(10):1879–87.

[19] Wolmark N, Rockette H, Mamounas E, et al. A clinical trial to assess the relative efficacy of 5-FU + leucovorin, 5-FU + levamisole, and 5-FU + leucovorin + levamisole in patients with Dukes' B and C carcinoma of the colon: results from NSABP C-04. J Clin Oncol 1999;17(11):3553–9.

[20] Buyse M, Burzykowski T, Carroll K, et al. Progression-free survival is a surrogate for survival in advanced colorectal cancer. J Clin Oncol 2007;25(33).