Background: Clinical validation of a predictive biomarker is especially difficult when the biomarker cannot be assessed retrospectively. A cost-effective, prospective multicenter replication study with rapid accrual is warranted prior to further validation studies such as a marker-based strategy for treatment selection. However, it is often unknown how measurement error and bias in a multicenter trial will differ from those in single-institution studies.
Purpose: Power calculations using simulated data may inform the efficient design of a multicenter study to replicate single-institution findings. This case study used serial standardized uptake value (SUV) measures from 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET) to predict early response to breast cancer neoadjuvant chemotherapy. We examined the impact of accelerating accrual through increased inclusion of secondary sites with greater levels of measurement error and bias. We also examined whether enrichment designs based on breast cancer initial uptake could increase the study power for a fixed budget (200 total scans).
Methods: Reference FDG PET SUV data were selected with replacement from a single-institution trial; pathologic complete response (pCR) data were simulated using a logistic regression model predicting response by mid-therapy percent change in SUV. The impact of increased error for SUV measurements in multicenter trials was simulated by sampling from error and bias distributions: 20%-40% measurement error, 0%-40% bias, and fixed error/bias values. The proportion of patients recruited from secondary sites (with higher additional error/bias compared to primary sites) varied from 25% to 75%.
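The simulation loop described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's code: the reference cohort is a synthetic stand-in (the single-institution data are not reproduced here), the logistic coefficients are hypothetical, error is assumed to be multiplicative Gaussian noise applied independently to each scan, and significance is assumed to be judged by a two-sided Wald test at the 0.05 level. Bias is omitted for brevity.

```python
import math
import random

def wald_z_logistic(x, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return the Wald z for b1."""
    b0 = b1 = 0.0
    for _ in range(iters + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            eta = max(-30.0, min(30.0, b0 + b1 * xi))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * xi   # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:  # degenerate fit (e.g. separation): treat as no evidence
            return 0.0
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Variance of b1 is the (1,1) element of the inverse information matrix.
    return b1 / math.sqrt(h00 / det)

def simulate_power(n=100, frac_secondary=0.5, err_primary=0.2,
                   err_secondary=0.4, n_sims=200, seed=0):
    """Estimate power to detect an SUV-change/response association (assumed model)."""
    rng = random.Random(seed)
    # Hypothetical reference cohort of (baseline SUV, true mid-therapy SUV) pairs.
    reference = []
    for _ in range(50):
        base = rng.lognormvariate(1.8, 0.5)
        change = max(-0.95, rng.gauss(-0.45, 0.25))
        reference.append((base, base * (1.0 + change)))
    b0, b1 = -2.5, -4.0  # hypothetical logistic coefficients, not from the source
    hits = 0
    for _ in range(n_sims):
        x_obs, y = [], []
        for _ in range(n):
            base, mid = rng.choice(reference)  # sampling with replacement
            true_change = mid / base - 1.0
            p_pcr = 1.0 / (1.0 + math.exp(-(b0 + b1 * true_change)))
            y.append(1 if rng.random() < p_pcr else 0)
            # Secondary-site patients get the larger error level.
            cv = err_secondary if rng.random() < frac_secondary else err_primary
            obs_base = base * max(0.1, 1.0 + rng.gauss(0.0, cv))
            obs_mid = mid * max(0.1, 1.0 + rng.gauss(0.0, cv))
            x_obs.append(obs_mid / obs_base - 1.0)
        if abs(wald_z_logistic(x_obs, y)) > 1.96:
            hits += 1
    return hits / n_sims
```

Calling `simulate_power(frac_secondary=0.25)` and `simulate_power(frac_secondary=0.75)` would compare two accrual mixes under these assumed settings; the resulting numbers depend entirely on the placeholder cohort and coefficients and should not be read as reproducing the reported power values.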
Results: Reference power (from source data with no added error) was 0.92 for N = 100 to detect an association between percentage change in SUV and response. With moderate (20%) simulated measurement error for 3/4, 1/2, and 1/4 of measurements and 40% for the remainder, power was 0.70, 0.61, and 0.53, respectively. Reduction of study power was similar for other manifestations of measurement error (bias as a percentage of true value, absolute error, and absolute bias). Enrichment designs, which recruit additional patients by not conducting a second scan in patients with unsuitable pre-therapy uptake (low baseline SUV), did not lead to greater power for studies constrained to the same total cost.
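The budget trade-off behind the enrichment design can be made concrete with a short calculation. The sketch below assumes each enrolled patient costs one scan if screened out after the baseline scan and two scans otherwise; the fraction of patients with low baseline SUV (`frac_low_baseline`) is a hypothetical input, not a value reported here.

```python
def enrichment_sample_sizes(total_scans=200, frac_low_baseline=0.25):
    """Return expected (enrolled, completing_both_scans) under a fixed scan budget."""
    # Expected scans per enrolled patient: 1 for low-baseline patients, 2 otherwise.
    scans_per_patient = 2.0 - frac_low_baseline
    enrolled = total_scans / scans_per_patient
    # Only patients with suitable baseline uptake contribute a percent change in SUV.
    paired = enrolled * (1.0 - frac_low_baseline)
    return enrolled, paired

# Unselected design: everyone gets two scans, so 200 scans cover 100 patients.
print(enrichment_sample_sizes(200, 0.0))
# If a quarter of patients were screened out, about 114 would be enrolled
# but only about 86 would contribute a mid-therapy measurement.
print(enrichment_sample_sizes(200, 0.25))
```

Under this assumption, enrichment enrolls more patients but analyzes fewer paired scans, so it gains power only if the excluded low-baseline patients carry little or no signal.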
Limitations: Simulation parameters could be incorrect or not generalizable. Under a different logistic regression model relating mid-therapy percent change in SUV to pCR (with no relationship for patients with low baseline SUV, rather than the modest point estimate from reference data), the enrichment design did have somewhat greater power than the unselected design.
Conclusion: Even moderate additional measurement error substantially reduced study power under both unselected and enrichment designs.