Rationale, aims, and objectives: Heterogeneity is a critical issue in meta-analysis because it bears on whether combining the collected studies is appropriate and affects the reliability of the synthesized results. The Q test is a traditional method for assessing heterogeneity; however, because it lacks an intuitive interpretation for clinicians and often has low statistical power, many meta-analysts instead use measures such as the I² statistic to quantify the extent of heterogeneity. This article aims to summarize the available tools for assessing heterogeneity and to compare their performance.
Methods: We reviewed four heterogeneity measures (I² , , , and ) and illustrated how they could be treated as test statistics like the Q statistic. These measures were compared with respect to statistical power based on simulations driven by three real-data examples. The pairwise agreement among the four measures was also evaluated using Cohen's κ coefficient.
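As a minimal illustration of the two quantities named above, the following sketch computes Cochran's Q and the I² statistic from study-level effect estimates and their within-study variances, using the standard inverse-variance definitions. It is not the authors' simulation code; the function name `cochran_q_and_i2` and its arguments are hypothetical, chosen only for this example. Under the null hypothesis of homogeneity, Q is conventionally referred to a chi-squared distribution with k − 1 degrees of freedom, where k is the number of studies.

```python
from typing import Sequence, Tuple


def cochran_q_and_i2(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, int, float]:
    """Return (Q, degrees of freedom, I^2 as a proportion in [0, 1)).

    Hypothetical helper for illustration; assumes each study supplies an
    effect estimate and a strictly positive within-study variance.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Inverse-variance weights and the fixed-effect pooled estimate.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I^2 = max(0, (Q - df) / Q); defined as 0 when Q is 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2
```

For example, `cochran_q_and_i2([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])` returns the Q statistic, its degrees of freedom, and I² for three equally precise studies; the resulting I² is read descriptively as the proportion of total variation attributable to between-study heterogeneity.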
Results: Generally, was slightly more powerful than the Q test, while its type I error rate might be slightly inflated. The power of I² was fairly close to that of Q. The and statistics might have low power in some cases. Because the differences between the powers of I² , , and Q were often tiny, meta-analysts might not expect I² and to yield significant heterogeneity if the Q test failed to do so. In addition, I² and had fairly good agreement based on the simulated meta-analyses, but all other pairs of heterogeneity measures generally had poor agreement.
Conclusion: The I² and statistics are recommended for measuring heterogeneity. Meta-analysts should use the heterogeneity measures as descriptive statistics that have intuitive interpretations from the clinical perspective, instead of determining the significance of heterogeneity simply based on their magnitudes.
Keywords: I² statistic; heterogeneity; meta-analysis; statistical power.
© 2019 John Wiley & Sons, Ltd.