Statistical noise in PD-(L)1 inhibitor trials: unraveling the durable-responder effect

J Clin Epidemiol. 2024 Nov 5:177:111589. doi: 10.1016/j.jclinepi.2024.111589. Online ahead of print.

Abstract

Background and objectives: Programmed-death-1/ligand-1 inhibitors (PD-1/L1is) have emerged as pivotal treatments for many cancers. A notable feature of this class of medicines is the dichotomous response pattern: a small (but clinically relevant) percentage of patients (5%-20%) benefit from deep and durable responses resembling functional cures (durable responders), while most patients experience only a modest or negligible response. Accurately predicting durable responders remains elusive due to the lack of a reliable biomarker. Another notable feature of these medicines is that different PD-1/L1is have obtained statistically significant results, leading to marketing approval, for some cancer indications but not for others, with no discernible pattern. These puzzling inconsistencies have generated extensive discussion among oncologists. Proposed (but not entirely convincing) explanations include true underlying differences in efficacy for some types of cancer but not others, or subtle differences in trial design. This study investigates a less-explored hypothesis, the durable-responder effect: an initially unidentified group of durable responders generates more statistical noise than anticipated, leading to low-powered randomized controlled trials (RCTs) that report randomly variable results.

Study design: Employing simulation, this investigation divides participants in PD-(L)1i RCTs into two groups: durable responders and patients with a more modest response. Drawing on published data for melanoma, lung, and urothelial cancers, multiple prespecified scenarios are replicated 50,000 times, systematically varying the durable-responder percentage from 5% to 20% and the modest-response hazard ratio for overall survival [HR(OS)] from 0.8 to 1.0. This allows evaluation of the effect of durable responders on power, point estimates of the treatment effect for OS, and the probability of a misleading signal for harm.
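The two-group design described above can be sketched as a small Monte Carlo simulation. This is a minimal illustration under stated assumptions, not the article's actual simulation code: exponential survival times, a control-arm median of 12 months, a hazard ratio of 0.1 for durable responders, 300 patients per arm, uniform accrual followed by fixed follow-up, and a Peto one-step hazard-ratio estimate from the log-rank statistic are all hypothetical choices made here for illustration. The function names `simulate_trial` and `run_scenario` are likewise invented for this sketch.

```python
import math
import random


def simulate_trial(rng, n_per_arm, p_durable, hr_modest,
                   hr_durable=0.1, median_control=12.0,
                   accrual=12.0, min_followup=24.0):
    """Simulate one 1:1 trial; return (hr_estimate, z) from a log-rank analysis.

    All default parameter values are illustrative assumptions.
    """
    lam = math.log(2) / median_control  # control-arm hazard per month
    rows = []  # (observed time, event observed?, arm); arm 1 = PD-(L)1i
    for arm in (0, 1):
        for _ in range(n_per_arm):
            hazard = lam
            if arm == 1:
                # Each treated patient is a durable responder with prob. p_durable
                is_durable = rng.random() < p_durable
                hazard = lam * (hr_durable if is_durable else hr_modest)
            death = rng.expovariate(hazard)
            # Administrative censoring: uniform accrual, then fixed follow-up
            censor = min_followup + rng.random() * accrual
            rows.append((min(death, censor), death <= censor, arm))
    rows.sort(key=lambda r: r[0])

    # Log-rank statistic for the treatment arm (continuous times, so no ties)
    at_risk = [n_per_arm, n_per_arm]
    o_minus_e = 0.0
    variance = 0.0
    for _, event, arm in rows:
        total = at_risk[0] + at_risk[1]
        if event:
            expected = at_risk[1] / total
            o_minus_e += (1.0 if arm == 1 else 0.0) - expected
            variance += expected * (1.0 - expected)
        at_risk[arm] -= 1
    z = o_minus_e / math.sqrt(variance)
    hr_estimate = math.exp(o_minus_e / variance)  # Peto one-step estimator
    return hr_estimate, z


def run_scenario(n_trials, n_per_arm, p_durable, hr_modest, seed=0):
    """Replicate one scenario and summarize power, HR estimates, harm signals."""
    rng = random.Random(seed)
    hrs, significant = [], []
    for _ in range(n_trials):
        hr, z = simulate_trial(rng, n_per_arm, p_durable, hr_modest)
        hrs.append(hr)
        if z < -1.96:  # two-sided 5% test, result favoring treatment
            significant.append(hr)
    return {
        "power": len(significant) / n_trials,
        "mean_hr": sum(hrs) / n_trials,
        "mean_hr_significant": (sum(significant) / len(significant)
                                if significant else None),
        "prob_hr_above_1": sum(hr > 1.0 for hr in hrs) / n_trials,
    }


if __name__ == "__main__":
    # Vary the durable-responder fraction with no benefit for modest responders
    for p in (0.05, 0.10, 0.20):
        print(p, run_scenario(500, 300, p, hr_modest=1.0, seed=1))
```

The sketch uses far fewer replications than the 50,000 reported in the article, and its numerical output depends on the assumed parameters, so it should not be expected to reproduce the published figures.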

Results: When the treatment effect for the modest responders is similar to the comparator arm, statistical power remains below 80%, limiting the ability to reliably detect durable responders. Conversely, there is a material probability of obtaining a statistically significant result that exaggerates the treatment effect by chance. For instance, with an average HR(OS) of 0.93 (corresponding to 5% durable responders), the 7.2% of trials that reach statistical significance show an average HR(OS) of 0.77. Additionally, when 5% are durable responders, there is a 20% probability that the HR(OS) will exceed 1.0, suggesting potential harm when none exists.
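To give a sense of why an average HR(OS) near 0.93 leaves trials underpowered, the sketch below applies the standard Schoenfeld approximation for the number of deaths a 1:1 log-rank comparison requires. This is an illustration only and is not taken from the article: the approximation assumes proportional hazards, which a mixture of durable and modest responders does not strictly satisfy, and the function names `events_required` and `approx_power` are hypothetical.

```python
import math
from statistics import NormalDist


def events_required(hr, power=0.80, alpha=0.05):
    """Schoenfeld approximation: deaths needed for a two-sided log-rank test
    with 1:1 allocation, assuming proportional hazards."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2)


def approx_power(hr, events, alpha=0.05):
    """Approximate power to detect the given HR with a fixed number of deaths."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(math.log(hr)) * math.sqrt(events) / 2 - z_alpha)


if __name__ == "__main__":
    for hr in (0.93, 0.77):
        print(hr, events_required(hr))
```

Under these assumptions, detecting an HR of 0.93 with 80% power requires several thousand deaths, roughly an order of magnitude more than detecting an HR of 0.77, which is consistent with the low power described in the results.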

Conclusion: This article adds to the possible explanations for the puzzlingly inconsistent results from PD-(L)1i RCTs. Initially unidentified durable responders introduce features typical of imprecise, low-powered studies: a propensity for false-negative results, estimates of benefit that might not replicate, and misleading signals for harm.

Plain language summary: Programmed-death-1/ligand-1 (PD-(L)1) inhibitors are crucial cancer treatments, with global spending expected to surpass $75 billion by 2026. Multiple versions of these medicines are available, all designed to boost the immune system to fight cancer. We would expect them all to work similarly, but clinical trials show mixed results: some seem effective for certain cancers but not others, without a clear pattern. This article uses simulations (virtual trials) to suggest that these inconsistent results may be due to chance, caused by a small group of patients who respond very well to the treatment. Larger trials or specific analysis methods could help reduce the chance effects and provide more robust data for clinician and patient decision-making.

Keywords: Pharmaceutical class effects; Randomized controlled trial; Statistical power; Treatment effect heterogeneity.