Date of Award

1-2014

Embargo Period

10-7-2014

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Statistics

Advisor(s)

William F. Eddy

Abstract

Capture-recapture (CRC) models use two or more samples, or lists, to estimate the size of a population. In the canonical example, a researcher captures, marks, and releases several samples of fish in a lake. When few fish are captured more than once relative to the total number captured, one suspects that the lake contains many more uncaptured fish. This basic intuition motivates CRC models in fields as diverse as epidemiology, entomology, and computer science.

We use simulations to study the performance of conventional log-linear models for CRC. Specifically, we evaluate model selection criteria, model averaging, an asymptotic variance formula, and several small-sample data adjustments. Next, we argue that interpretable models are essential for credible inference, since sets of models that fit the data equally well can imply vastly different estimates of the population size. A secondary analysis of data on survivors of the World Trade Center attacks illustrates this issue.

Our main chapter develops local log-linear models. Heterogeneous populations tend to bias conventional log-linear models. Post-stratification can reduce the effects of heterogeneity by using covariates, such as the age or size of each observed unit, to partition the data into relatively homogeneous post-strata. One can fit a model to each post-stratum and aggregate the resulting estimates across post-strata. We extend post-stratification to its logical extreme by selecting a local log-linear model for each observed point in the covariate space, while smoothing to achieve stability. Local log-linear models serve a dual purpose: besides estimating the population size, they estimate the rate of missingness as a function of covariates. Simulations demonstrate the superiority of local log-linear models for estimating local rates of missingness in special cases in which the generating model varies over the covariate space.

We apply the method to estimate bird species richness in continental North America and to estimate the prevalence of multiple sclerosis in a region of France.
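
The post-stratification idea described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's method: it assumes only two lists and uses the simple Lincoln-Petersen estimator (the two-list estimate under independence) within each post-stratum, rather than a selected log-linear model. The function names and the record layout are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def lincoln_petersen(n1, n2, m):
    """Two-list population estimate n1 * n2 / m.

    n1, n2: number of units on list 1 and on list 2.
    m: number of units that appear on both lists.
    """
    if m == 0:
        raise ValueError("no units on both lists; the estimate is undefined")
    return n1 * n2 / m


def post_stratified_estimate(records, stratum_of):
    """Estimate population size by summing per-stratum estimates.

    records: iterable of (covariate, on_list1, on_list2) for observed units
             (hypothetical layout, assumed for this sketch).
    stratum_of: function mapping a covariate value to a stratum label.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # per stratum: [n1, n2, m]
    for covariate, on_list1, on_list2 in records:
        c = counts[stratum_of(covariate)]
        c[0] += int(on_list1)
        c[1] += int(on_list2)
        c[2] += int(on_list1 and on_list2)
    # Fit the (very simple) model in each post-stratum, then aggregate.
    return sum(lincoln_petersen(n1, n2, m) for n1, n2, m in counts.values())
```

A call such as `post_stratified_estimate(records, lambda size: size < 10)` would split the observed units into two post-strata by a size covariate. Local log-linear models, as the abstract describes them, replace this hard partition with a model chosen at each observed covariate value, with smoothing for stability.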
