Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data

Iris Ivy M Gauran; Junyong Park; Johan Lim; DoHwan Park; John Zylstra; Thomas Peterson; Maricel Kann; John L Spouge

doi:10.1111/biom.12779

Biometrics. 2018 Jun;74(2):458-471. doi: 10.1111/biom.12779. Epub 2017 Sep 22.

Authors

Iris Ivy M Gauran¹ ², Junyong Park¹, Johan Lim³, DoHwan Park¹, John Zylstra¹, Thomas Peterson⁴, Maricel Kann⁴, John L Spouge⁵

Affiliations

¹ Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
² School of Statistics, University of the Philippines Diliman, Quezon City, 1101, Philippines.
³ Department of Statistics, Seoul National University, Seoul, 08826, Republic of Korea.
⁴ Department of Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
⁵ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, U.S.A.

Abstract

In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, since the latter have limitations in considering the functional context that the position of the mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. A main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, so that mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
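
As a rough illustration of the ZIGP model named in the abstract, the following Python sketch evaluates a zero-inflated generalized Poisson probability mass function. It assumes Consul's parametrization of the generalized Poisson distribution, which may differ from the one used in the article. The function names and the parameters pi0 (excess-zero probability), theta (rate) and lam (dispersion) are hypothetical labels chosen for this example. It is not the authors' estimation procedure, only the distribution on which such a procedure would operate.

```python
import math


def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf, assuming Consul's parametrization.

    P(X = x) = theta * (theta + lam * x)**(x - 1) * exp(-theta - lam * x) / x!
    with theta > 0 and 0 <= lam < 1. Setting lam = 0 gives Poisson(theta).
    """
    if x < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (math.log(theta)
             + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x
             - math.lgamma(x + 1))
    return math.exp(log_p)


def zigp_pmf(x, pi0, theta, lam):
    """Zero-inflated generalized Poisson pmf (illustrative sketch).

    With probability pi0 the count is an excess zero; otherwise it is
    drawn from the generalized Poisson, which can itself produce zeros.
    """
    base = (1.0 - pi0) * gp_pmf(x, theta, lam)
    return pi0 + base if x == 0 else base


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for demonstration.
    for count in range(5):
        print(count, zigp_pmf(count, pi0=0.4, theta=2.0, lam=0.3))
```

The zero count receives mass from two sources, pi0 and the generalized Poisson term, which is the distinction the abstract draws between excess zeros and true zeros in the count model.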

Keywords: Local false discovery rate; Protein domain; Zero-inflated generalized Poisson.

Publication types

Research Support, N.I.H., Intramural

MeSH terms

Biometry / methods*
DNA Mutational Analysis
Data Interpretation, Statistical*
Databases, Protein
Humans
Mutation Rate
Poisson Distribution
Protein Domains*
Statistical Distributions*
