Department of Statistics
Copyright (c) 2014 Carnegie Mellon University. All rights reserved.
http://repository.cmu.edu/statistics
Recent documents in Department of Statistics
Thu, 17 Jul 2014 12:34:48 PDT
Mixture models for linkage analysis of affected sibling pairs and covariates
http://repository.cmu.edu/statistics/213
Mon, 08 Apr 2013 12:24:05 PDT
To determine the genetic etiology of complex diseases, a common study design is to recruit affected sib/relative pairs (ASP/ARP) and evaluate their genome-wide distribution of identical by descent (IBD)-sharing using a set of highly polymorphic markers. Other attributes or environmental exposures of the ASP/ARP, which are thought to affect liability to disease, are sometimes collected. Conceivably these covariates could refine the linkage analysis. Most published methods for ASP/ARP linkage with covariates can be conceptualized as logistic models in which IBD-status of the ASP is predicted by pair-specific covariates. We develop a different approach to the problem of ASP analysis in the presence of covariates, one that extends naturally to ARP under certain conditions.
B. Devlin et al.
Outlier Detection and False Discovery Rates for Whole-genome DNA Matching
http://repository.cmu.edu/statistics/212
Mon, 08 Apr 2013 12:24:03 PDT
We define a statistic, called the matching statistic, for locating regions of the genome that exhibit excess similarity among cases when compared to controls. Such regions are reasonable candidates for harboring disease genes. We find the asymptotic distribution of the statistic while accounting for correlations among sampled individuals. We then use the Benjamini and Hochberg false discovery rate (FDR) method for multiple hypothesis testing to find regions of excess sharing. The p-values for each region involve estimated nuisance parameters. Under appropriate conditions, we show that the FDR method based on p-values and with estimated nuisance parameters asymptotically preserves the FDR property. Finally, we apply the method to a pilot study on schizophrenia.
Jung-Ying Tzeng et al.
Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk
http://repository.cmu.edu/statistics/211
Mon, 08 Apr 2013 12:24:01 PDT
We present an iterative Markov chain Monte Carlo algorithm for computing reference priors and minimax risk for general parametric families. Our approach uses MCMC techniques based on the Blahut-Arimoto algorithm for computing channel capacity in information theory. We give a statistical analysis of the algorithm, bounding the number of samples required for the stochastic algorithm to closely approximate the deterministic algorithm in each iteration. Simulations are presented for several examples from exponential families. Although we focus on applications to reference priors and minimax risk, the methods and analysis we develop are applicable to a much broader class of optimization problems and iterative algorithms.
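For orientation, the deterministic Blahut-Arimoto iteration that the abstract builds on can be sketched for a small discrete channel. This is only the classical finite-alphabet version, not the MCMC variant developed in the paper; the function name `blahut_arimoto`, the fixed iteration count, and the binary symmetric channel used as input are illustrative assumptions.

```python
import math


def blahut_arimoto(channel, iterations=200):
    """Classical Blahut-Arimoto iteration (illustrative sketch).

    channel[x][y] is P(y | x). Returns (capacity in bits, input distribution).
    """
    n_in, n_out = len(channel), len(channel[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution

    def output_marginal(dist):
        return [sum(dist[x] * channel[x][y] for x in range(n_in))
                for y in range(n_out)]

    def divergences(q):
        # KL divergence (in nats) between each channel row and the marginal q
        return [sum(c * math.log(c / q[y])
                    for y, c in enumerate(channel[x]) if c > 0)
                for x in range(n_in)]

    for _ in range(iterations):
        d = divergences(output_marginal(p))
        # Multiplicative update: p(x) <- p(x) * exp(D_x), then renormalise
        weights = [p[x] * math.exp(d[x]) for x in range(n_in)]
        total = sum(weights)
        p = [w / total for w in weights]

    # Mutual information at the final input distribution, converted to bits
    d = divergences(output_marginal(p))
    capacity_bits = sum(p[x] * d[x] for x in range(n_in)) / math.log(2)
    return capacity_bits, p


# Hypothetical example: binary symmetric channel with crossover probability 0.1
bsc = [[0.9, 0.1], [0.1, 0.9]]
capacity, prior = blahut_arimoto(bsc)
```

In the reference-prior setting the input distribution plays the role of the prior and the channel rows play the role of the sampling distributions. The paper's contribution is to replace the exact sums above with Monte Carlo estimates, which this sketch does not attempt.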
John Lafferty et al.
A Model of the Joint Distribution of Purchase Quantity and Timing
http://repository.cmu.edu/statistics/210
Mon, 08 Apr 2013 12:23:59 PDT
Prediction of purchase timing and quantity decisions of a household is an important element for success of any retailer. This is especially so for an online retailer, as the traditional brick-and-mortar retailer would be more concerned with total sales. A number of statistical models have been developed in the marketing literature to aid traditional retailers in predicting sales and analyzing the impact of various marketing activities on sales. However, there are two important differences between traditional retail outlets and the increasingly important online retail/delivery companies, differences that prevent these firms from using models developed for the traditional retailers: 1) the profits of the online retailer/delivery company depend on purchase frequency and on purchase quantity, while the profits of traditional retailers are simply tied to total sales, and 2) customers in the tails of the frequency distribution are more important to the delivery company than to the retail outlet. Both of these differences are due to the fact that the delivery companies incur a delivery cost for each sale, while customers themselves travel to retail outlets when buying from traditional retailers. These differences in costs translate directly into needs that a model must address. For a model intended to be useful to online retailers, the dependent variable should be a bivariate distribution of frequency and quantity, and the frequency distribution must accurately represent consumers in the tails. In this article we develop such a model and apply it to predicting the consumer's joint decision of when to shop and how much to spend at the store. Our approach is to model the marginal distribution of purchase timing and the distribution of purchase quantity conditional on purchase timing. We propose a hierarchical Bayes model that disentangles the weekly and daily components of the purchase timing. The daily component has a dependence on the weekly component, thereby accounting for strong observed periodicity in the data. For the purchase times, we use the Conway-Maxwell-Poisson distribution, which we find useful to fit data in the tail regions (extremely frequent and infrequent purchasers).
Peter Boatwright et al.
Using Computational and Mathematical Methods to Explore a New Distribution: The ν-Poisson
http://repository.cmu.edu/statistics/209
Mon, 08 Apr 2013 12:23:57 PDT
A new distribution (the ν-Poisson) and its conjugate density are introduced and explored using computational and mathematical methods. The ν-Poisson is a two-parameter extension of the Poisson distribution that generalizes some well-known discrete distributions (Poisson, Bernoulli, Geometric). It also leads to the generalization of distributions derived from these discrete distributions (viz. the Binomial and Negative Binomial). We use mathematics as far as we can and then employ computational and graphical methods to explore the distribution and its conjugate density further. Three methods are presented for estimating the ν-Poisson parameters. The first is a fast, simple weighted least squares method, which leads to estimates that are sufficiently accurate for practical purposes. The second method, maximum likelihood, can be used to refine the initial estimates. This method requires iterations and is more computationally intensive. The third estimation method is Bayesian. Using the conjugate prior, the posterior density of the ν-Poisson parameters is easily computed. We derive the necessary and sufficient condition for the conjugate family to be proper. The ν-Poisson is a flexible distribution that can account for the over/under dispersion commonly encountered in count data. We also explore an empirical application demonstrating the flexibility of the ν-Poisson in fitting count data that do not seem to follow the Poisson distribution.
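The special cases named in the abstract can be checked numerically. The sketch below assumes the distribution has the usual Conway-Maxwell-Poisson form, with P(X = x) proportional to λ^x / (x!)^ν; the abstract itself does not state the formula. The function name `nu_poisson_pmf` and the truncation of the normalising sum at `terms` are illustrative choices.

```python
import math


def nu_poisson_pmf(x, lam, nu, terms=200):
    """P(X = x) with weights lam**x / (x!)**nu (assumed form, see lead-in).

    The normalising constant is approximated by the first `terms` weights,
    so lam must be positive, and lam < 1 is needed when nu == 0.
    """
    def log_weight(j):
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    log_w = [log_weight(j) for j in range(terms)]
    peak = max(log_w)  # subtract the peak before exponentiating, for stability
    z = sum(math.exp(w - peak) for w in log_w)
    return math.exp(log_weight(x) - peak) / z


# nu = 1 recovers the Poisson distribution with mean lam
poisson_case = nu_poisson_pmf(2, 3.0, 1.0)
# nu = 0 (with lam < 1) gives a geometric distribution: (1 - lam) * lam**x
geometric_case = nu_poisson_pmf(3, 0.5, 0.0)
# A large nu pushes nearly all mass onto {0, 1}, approaching a Bernoulli
bernoulli_case = nu_poisson_pmf(1, 2.0, 50.0)
```

Values of ν above 1 give under-dispersion relative to the Poisson and values below 1 give over-dispersion, which is the flexibility the abstract refers to.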
Galit Shmueli et al.
Hierarchical Modeling of Arsenic Concentrations at Entry Points in US Public Drinking Water Supplies
http://repository.cmu.edu/statistics/208
Mon, 08 Apr 2013 12:23:55 PDT
A Bayesian hierarchical model is built to describe arsenic concentrations in treated water from sources of public drinking water systems. The model allows us to decompose the total variability in arsenic concentration into three components: between-system, between-source (within system) and within-source variabilities. Predictions about what percentage of a state's systems and sources would be affected by the various proposed maximum contaminant level (MCL) regulations are simulated based on the posterior predictive distribution. We investigate the potential impact of the between-source variability on this percentage by comparing predictions based on the full model and based on the reduced model which eliminates the between-source variability. Other issues addressed in the modeling are the possibility that the arsenic concentration in source water is changing over time and the possibility that changes in measurement methods and their detection limits cause a change in the precision and accuracy of the measurement methods. The analysis is conducted based on data from four states: California, Illinois, New Mexico and Utah.
Yangang Zhang
Prediction of Freshmen Academic Performance
http://repository.cmu.edu/statistics/207
Mon, 08 Apr 2013 12:23:53 PDT
The goal of this study is to improve prediction of freshman GPA based on college admission data to better inform the decision as to whom to admit to Carnegie Mellon. This analysis assessed the utility of the non-academic data to find a better algorithm for making this prediction. Data for two consecutive entering classes at CMU were used. Both classical and Bayesian approaches were applied. The classical methods allowed us to better understand the previous criterion of acceptance and to investigate the significance of a difference between students who were admitted and enrolled and the students who were admitted and did not come to CMU. A Bayesian predictive approach was used to identify the cutoff based on admission data for the predictive probability that a student's first-semester GPA is greater than 2.0.
Iuliana Ianus
Operating Characteristics and Extensions of the FDR Procedure
http://repository.cmu.edu/statistics/206
Mon, 08 Apr 2013 12:23:51 PDT
We investigate the operating characteristics of the Benjamini-Hochberg false discovery rate (FDR) procedure for multiple testing. This is a distribution-free method that controls the expected fraction of falsely rejected null hypotheses among those rejected. This paper provides a framework for understanding how and why this procedure works. We start by studying the special case where the p-values under the alternative have a common distribution, where we are able to obtain many insights into this new procedure. We first obtain bounds on the "deciding point" D that determines the critical p-value. From this, we obtain explicit asymptotic expressions for a particular risk function. We introduce the dual notion of false non-rejections (FNR) and we consider a risk function that combines FDR and FNR. We also consider the optimal procedure with respect to a measure of conditional risk.
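As a rough illustration of the procedure this abstract analyzes, the standard Benjamini-Hochberg step-up rule can be sketched in a few lines of Python. The function name `benjamini_hochberg`, the level `q`, and the example p-values are illustrative assumptions and do not come from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up rule at level q (illustrative sketch).

    Returns (threshold, rejected), where `rejected` lists the indices of the
    rejected hypotheses and `threshold` is the largest rejected p-value
    (0.0 when nothing is rejected).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= k * q / m
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    threshold = p_values[order[k_max - 1]] if k_max else 0.0
    rejected = sorted(order[:k_max])
    return threshold, rejected


# Hypothetical p-values from ten tests
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
threshold, rejected = benjamini_hochberg(p, q=0.05)
```

With these inputs only the two smallest p-values fall under their rank-dependent cutoffs, so two hypotheses are rejected. The rule rejects every hypothesis up to the largest rank that passes, even if some smaller ranks fail their own cutoff.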
Christopher Genovese et al.
Meta-analysis: In Practice
http://repository.cmu.edu/statistics/205
Mon, 08 Apr 2013 12:23:49 PDT
The practice of meta-analysis is concerned with the details of implementation of a research synthesis that ensure the validity and robustness of the results from that synthesis. In this article, selected topics are discussed that represent current intellectual themes in the practice of meta-analysis, such as (i) the role of decisions and judgments, particularly judgments about similarity of studies; (ii) the importance of sensitivity analysis to investigate the robustness of those decisions; and (iii) the role research synthesis plays in the process of scientific discovery. Brief illustrations of the role meta-analysis plays in explanation, program evaluation, and informing policy decisions are presented.
Joel B. Greenhouse
Thresholding of Statistical Maps in Functional Neuroimaging Using the False Discovery Rate
http://repository.cmu.edu/statistics/204
Mon, 08 Apr 2013 12:23:47 PDT
Finding objective and effective thresholds for voxelwise statistics derived from neuroimaging data has been a long-standing problem. With at least one test performed for every voxel in an image, some correction of the thresholds is needed to control the error rates, but standard procedures for multiple hypothesis testing (e.g., Bonferroni) tend to not be sensitive enough to be useful in this context. This paper introduces to the neuroscience literature statistical procedures for controlling the False Discovery Rate (FDR). Recent theoretical work in statistics suggests that FDR-controlling procedures will be effective for the analysis of neuroimaging data. These procedures operate simultaneously on all voxelwise test statistics to determine which tests should be considered statistically significant. The innovation of the procedures is that they control the expected proportion of the rejected hypotheses that are falsely rejected. We demonstrate this approach using both simulations and functional Magnetic Resonance Imaging data from two simple experiments.
Christopher Genovese et al.
Bounds for Cell Entries in Contingency Tables Induced by Fixed Marginal Totals
http://repository.cmu.edu/statistics/203
Mon, 08 Apr 2013 12:23:45 PDT
We describe new results for upper and lower bounds on the entries in multi-way tables of counts based on a set of released and possibly overlapping marginal tables which have practical import for assessing disclosure risk. In particular, we present a generalized version of the shuttle algorithm proposed by Buzzigoli and Giusti that is proven to compute sharp integer bounds for an arbitrary set of fixed marginals.
Adrian Dobra et al.
Methods and Criteria for Model Selection
http://repository.cmu.edu/statistics/202
Fri, 05 Apr 2013 09:27:25 PDT
Model selection is an important part of any statistical analysis, and indeed is central to the pursuit of science in general. Many authors have examined this question, from both frequentist and Bayesian perspectives, and many tools for selecting the "best model" have been suggested in the literature. This paper considers the various proposals from a Bayesian decision-theoretic perspective.
Joseph B. Kadane et al.
Algorithms for maximum-likelihood logistic regression
http://repository.cmu.edu/statistics/201
Fri, 05 Apr 2013 09:27:24 PDT
Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note reviews seven different algorithms for finding the maximum-likelihood estimate. Iterative Scaling is shown to apply under weaker conditions than usually assumed. A modified iterative scaling algorithm is also derived, which is equivalent to the algorithm of Collins et al (2000). The best performers in terms of running time are the line search algorithms and Newton-type algorithms, which far outstrip Iterative Scaling.
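To make the Newton-type family mentioned above concrete, here is a minimal sketch of Newton's method for a logistic regression with one predictor and an intercept. It is not taken from the note; the function name `fit_logistic_newton`, the fixed iteration count, and the toy data are assumptions chosen so that the 2x2 linear system can be solved by hand without external libraries.

```python
import math


def fit_logistic_newton(xs, ys, iterations=25):
    """Maximum-likelihood fit of P(y = 1 | x) = sigmoid(a + b * x).

    Plain Newton iteration (illustrative sketch). Assumes the data are not
    perfectly separable, since otherwise the MLE does not exist.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # entries of the negated Hessian
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical, non-separable toy data
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
a_hat, b_hat = fit_logistic_newton(xs, ys)
```

Each pass costs one sweep over the data plus a small linear solve. On well-conditioned problems only a handful of passes are needed, which is consistent with the running-time comparison in the abstract.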
Thomas P. Minka
Practical Regeneration for Markov Chain Monte Carlo Simulation
http://repository.cmu.edu/statistics/200
Fri, 05 Apr 2013 09:27:22 PDT
Regeneration is a useful tool in Markov chain Monte Carlo simulation, since it can be used to side-step the burn-in problem and to construct estimates of the variance of parameter estimates themselves. Unfortunately, it is often difficult to take advantage of, since for most chains, no recurrent atom exists, and it is not always easy to use Nummelin's splitting method to identify regeneration points. This paper describes a simple and practical method of obtaining regeneration in a Markov chain. The application of this method in simulation is discussed, and examples are given.
Anthony Brockwell et al.
A Gridding Method for Sequential Analysis Problems
http://repository.cmu.edu/statistics/199
Fri, 05 Apr 2013 09:27:21 PDT
This paper introduces a numerical method for finding optimal or approximately optimal decision rules and corresponding expected losses in Bayesian sequential decision problems. The method, based on the classical backward induction method, constructs a grid approximation to the expected loss at each decision time, viewed as a function of certain statistics of the posterior distribution of the parameter of interest. In contrast with existing techniques, this method has a computation time which is linear in the number of stages in the sequential problem. It can also be applied to problems with insufficient statistics for the parameters of interest. Furthermore, it is well-suited to be implemented using parallel processors.
Anthony Brockwell et al.
A Non-parametric Analysis of the CMB Power Spectrum
http://repository.cmu.edu/statistics/198
Fri, 05 Apr 2013 09:27:20 PDT
We examine Cosmic Microwave Background (CMB) temperature power spectra from the BOOMERANG, MAXIMA, and DASI experiments. We non-parametrically estimate the true power spectrum with no model assumptions. This is a significant departure from previous research which used either cosmological models or some other parameterized form (e.g. parabolic fits). Our non-parametric estimate is practically indistinguishable from the best fit cosmological model, thus lending independent support to the underlying physics that governs these models. We also generate a confidence set for the non-parametric fit and extract confidence intervals for the numbers, locations, and heights of peaks and the successive peak-to-peak height ratios.
Christopher J. Miller et al.
A new source detection algorithm using FDR
http://repository.cmu.edu/statistics/197
Fri, 05 Apr 2013 09:27:19 PDT
The False Discovery Rate (FDR) method has recently been described by Miller et al. (2001), along with several examples of astrophysical applications. FDR is a new statistical procedure due to Benjamini & Hochberg (1995) for controlling the fraction of false positives when performing multiple hypothesis testing. The importance of this method to source detection algorithms is immediately clear. To explore the possibilities offered we have developed a new task for performing source detection in radio-telescope images, Sfind 2.0, which implements FDR. We compare Sfind 2.0 with two other source detection and measurement tasks, Imsad and SExtractor, and comment on several issues arising from the nature of the correlation between nearby pixels and the necessary assumption of the null hypothesis. The strong suggestion is made that implementing FDR as a threshold-defining method in other existing source-detection tasks is easy and worthwhile. We show that the constraint on the fraction of false detections as specified by FDR holds true even for highly correlated and realistic images. For the detection of true sources, which are complex combinations of source-pixels, this constraint appears to be somewhat less strict. It is still reliable enough, however, for a priori estimates of the fraction of false source detections to be robust and realistic. Further investigation of the relationship between 'source-pixels' and 'sources' is nevertheless important to more strictly constrain the fraction of falsely detected sources.
A. M. Hopkins et al.
Computing Consecutive-Type Reliabilities Non-Recursively
http://repository.cmu.edu/statistics/196
Fri, 05 Apr 2013 09:27:17 PDT
The reliability of consecutive-type systems has been approached from different angles. We present a new method for deriving the generating functions and reliabilities of various consecutive-type systems. Our method, which is based on Feller's run theory, is easy to implement, and leads to both recursive and non-recursive formulas for the reliability. The non-recursive expression is especially advantageous for systems with numerous components. We show how the method can be extended for computing generating functions and reliabilities of systems with multi-state components as well as systems with statistically dependent components. To make our theoretical derivations practical to practitioners, we include short computer programs that do the non-recursive computations yielding the reliabilities of such systems.
Galit Shmueli
Nonparametric Inference in Astrophysics
http://repository.cmu.edu/statistics/195
Fri, 05 Apr 2013 09:27:16 PDT
We discuss nonparametric density estimation and regression for astrophysics problems. In particular, we show how to compute nonparametric confidence intervals for the location and size of peaks of a function. We illustrate these ideas with recent data on the Cosmic Microwave Background. We also briefly discuss nonparametric Bayesian inference.
Woncheol Jang et al.
Association Studies for Quantitative Traits in Structured Populations
http://repository.cmu.edu/statistics/194
Fri, 05 Apr 2013 09:27:14 PDT
Association between disease and genetic polymorphisms often contributes critical information in our search for the genetic components of common diseases. Devlin and Roeder (1999) introduced genomic control, a statistical method that overcomes a drawback to the use of population-based samples for tests of association, namely spurious associations induced by population structure. In essence, genomic control (GC) uses markers throughout the genome to adjust for any inflation in test statistics due to substructure. To date genomic control (GC) has been developed for binary traits and bi- or multiallelic markers. Tests of association using GC have been limited to single genes. In this report, we generalize GC to quantitative traits (QT) and multilocus models. Using statistical analysis and simulations, we show that GC controls spurious associations in reasonable settings of population substructure for QT models, including gene-gene interaction. Through simulations we explore GC power for both random and selected samples, assuming the QT locus tested is causal and its specific heritability is 2.5 - 5%. We find that GC, combined with either random or selected samples, has good power in this setting, and that more complex models induce smaller GC corrections. The latter suggests greater power can be achieved by specifying more complex genetic models, but this observation only follows when such models are largely correct and specified a priori.
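The basic adjustment that genomic control performs can be sketched as follows. This shows only the simple single-marker version for 1-df chi-square statistics, not the quantitative-trait or multilocus generalization developed in the report. The function name `genomic_control`, the floor of 1 on the inflation factor, and the toy statistics are assumptions for illustration.

```python
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(null_marker_stats, candidate_stats):
    """Deflate candidate test statistics by an estimated inflation factor.

    null_marker_stats: 1-df chi-square statistics at markers spread across
        the genome, assumed to be mostly unassociated with the trait.
    candidate_stats: statistics at the loci being tested.
    """
    lam = statistics.median(null_marker_stats) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)  # assumed convention: never inflate the statistics
    return lam, [s / lam for s in candidate_stats]


# Hypothetical statistics: the null markers have twice the expected median
lam, adjusted = genomic_control([0.2, 0.9098, 3.0], [10.0])
```

Using the median rather than the mean keeps the estimate of the inflation factor robust to the few markers that may be truly associated with the trait.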
Silviu-Alin Bacanu et al.