Research Showcase @ CMU
Copyright (c) 2015 Carnegie Mellon University. All rights reserved.
http://repository.cmu.edu
Recent documents in Research Showcase @ CMU (en-us). Sat, 28 Mar 2015 01:31:07 PDT
Consistent Bounded-Asynchronous Parameter Servers for Distributed ML
http://repository.cmu.edu/machine_learning/140
Fri, 27 Mar 2015 14:01:08 PDT
In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. A proper consistency model must therefore be carefully chosen to ensure the algorithm's correctness while providing high throughput. Existing consistency models used in general-purpose databases and modern distributed ML systems are either too loose to guarantee correctness of the ML algorithms, or too strict and thus fail to fully exploit the computing power of the underlying distributed system. Many ML algorithms fall into the category of \emph{iterative convergent algorithms}, which start from a randomly chosen initial point and converge to optima by repeating a set of procedures iteratively. We have found that many such algorithms are robust to a bounded amount of inconsistency and still converge correctly. This property allows distributed ML to relax strict consistency models to improve system performance while theoretically guaranteeing algorithmic correctness. In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness. The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling.
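The bounded-inconsistency idea can be illustrated with a minimal sketch (not the paper's implementation): under a stale synchronous scheme, a worker may advance only while the slowest worker stays within a fixed staleness bound, so every read sees updates that are at most a bounded number of iterations old.

```python
# Minimal sketch of a bounded-staleness ("stale synchronous") progress check.
# Names and the staleness parameter are illustrative, not from the paper.

def can_proceed(worker_clock, all_clocks, staleness):
    """A worker at iteration `worker_clock` may advance only if the slowest
    worker (min of `all_clocks`) is within `staleness` iterations of it."""
    return worker_clock - min(all_clocks) <= staleness

# With a staleness bound of 2, a worker at clock 5 may run while the slowest
# worker is at clock 3, but must wait if the slowest is still at clock 2.
```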
]]>
Jinliang Wei et al.
Identifying graph-structured activation patterns in networks
http://repository.cmu.edu/machine_learning/139
Fri, 27 Mar 2015 12:25:40 PDT
We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (the pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size.
]]>
James Sharpnack et al.
Detecting Weak but Hierarchically-Structured Patterns in Networks
http://repository.cmu.edu/machine_learning/138
Fri, 27 Mar 2015 12:25:38 PDT
Copyright 2010 by the authors
]]>
Aarti Singh et al.
Multi-Manifold Semi-Supervised Learning
http://repository.cmu.edu/machine_learning/137
Fri, 27 Mar 2015 12:25:36 PDT
We study semi-supervised learning when the data consists of multiple intersecting manifolds. We give a finite sample analysis to quantify the potential gain of using unlabeled data in this multi-manifold setting. We then propose a semi-supervised learning algorithm that separates different manifolds into decision sets, and performs supervised learning within each set. Our algorithm involves a novel application of Hellinger distance and size-constrained spectral clustering. Experiments demonstrate the benefit of our multi-manifold semi-supervised learning approach.
]]>
Andrew B. Goldberg et al.
LightLDA: Big Topic Models on Modest Compute Clusters
http://repository.cmu.edu/machine_learning/136
Fri, 27 Mar 2015 12:25:33 PDT
When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens -- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers; 2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed; and 4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server. Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster while still enjoying proportional time cost reductions with increasing cluster size, in comparison with alternative options.
]]>
Jinhui Yuan et al.
Model-Parallel Inference for Big Topic Models
http://repository.cmu.edu/machine_learning/135
Fri, 27 Mar 2015 12:25:31 PDT
In real-world industrial applications of topic modeling, the ability to capture a gigantic conceptual space by learning an ultra-high-dimensional topical representation, i.e., the so-called "big model", is becoming the next desideratum after the enthusiasm for "big data", especially for fine-grained downstream tasks such as online advertising, where good performance is usually achieved by regression-based predictors built on millions if not billions of input features. The conventional data-parallel approach for training gigantic topic models turns out to be rather inefficient in utilizing the power of parallelism, due to the heavy dependency on a centralized image of the "model". Big model size also poses a challenge for storage, where the feasible model size is bounded by the smallest RAM among the nodes. To address these issues, we explore another type of parallelism, namely model-parallelism, which enables training of disjoint blocks of a big topic model in parallel. By integrating data-parallelism with model-parallelism, we show that dependencies between distributed elements can be handled seamlessly, achieving not only faster convergence but also the ability to tackle a significantly bigger model size. We describe an architecture for model-parallel inference of LDA, and present a variant of the collapsed Gibbs sampling algorithm tailored for it. Experimental results demonstrate the ability of this system to handle topic modeling with an unprecedented 200 billion model variables on a low-end cluster with very limited computational resources and bandwidth.
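A minimal sketch of the model-parallel idea: partition the model into P disjoint blocks and rotate block assignments across rounds, so that every worker eventually touches every block but no two workers ever hold the same block concurrently. The scheduling function below is illustrative, not the system's actual scheduler.

```python
def block_schedule(num_workers, num_rounds):
    """Round-robin block rotation: entry [t][w] is the model block that
    worker w processes in round t. Within any round the assignment is a
    permutation, so disjoint blocks are trained in parallel conflict-free."""
    return [[(w + t) % num_workers for w in range(num_workers)]
            for t in range(num_rounds)]
```

With P workers and P rounds, each worker visits all P blocks exactly once, which is the invariant that lets data-parallel sampling proceed without locking a centralized model image.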
]]>
Xun Zheng et al.
Screening Rules for Overlapping Group Lasso
http://repository.cmu.edu/machine_learning/134
Fri, 27 Mar 2015 12:25:29 PDT
Recently, to solve large-scale lasso and group lasso problems, screening rules have been developed whose goal is to reduce the problem size by efficiently discarding zero coefficients using simple rules, each applied independently of the others. However, screening for the overlapping group lasso remains an open challenge because the overlaps between groups make it infeasible to test each group independently. In this paper, we develop screening rules for the overlapping group lasso. To address the challenge arising from groups with overlaps, we take into account overlapping groups only if they are inclusive of the group being tested, and then derive screening rules by adopting the dual polytope projection approach. This strategy allows us to screen each group independently of the others. In our experiments, we demonstrate the efficiency of our screening rules on various datasets.
]]>
Seunghak Lee et al.
Large Scale Distributed Multiclass Logistic Regression
http://repository.cmu.edu/machine_learning/133
Fri, 27 Mar 2015 12:25:26 PDT
Multiclass logistic regression (MLR) is a fundamental machine learning model for multiclass classification. However, it is very challenging to perform MLR on large-scale data where the feature dimension is high, the number of classes is large, and the number of data samples is enormous. In this paper, we build a distributed framework to support large-scale multiclass logistic regression. Using stochastic gradient descent to optimize MLR, we find that the gradient matrix is computed as the outer product of two vectors. This grants us an opportunity to greatly reduce communication cost: instead of communicating the gradient matrix among machines, we need only communicate the two vectors and use them to reconstruct the gradient matrix after communication. We design a Sufficient Vector Broadcaster (SVB) to support this communication pattern. SVB synchronizes the parameter matrix of MLR by broadcasting the sufficient vectors among machines and migrating gradient matrix computation to the receiver side. SVB can reduce the communication cost from quadratic to linear without incurring any loss of correctness. We evaluate the system on the ImageNet dataset and demonstrate the efficiency and effectiveness of our distributed framework.
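The outer-product structure can be sketched in a few lines of numpy: for a single example, the softmax cross-entropy gradient with respect to the J x D weight matrix factors as (p − e_y) xᵀ, so two vectors of lengths J and D suffice to reconstruct it on the receiver. Function names here are illustrative, not the system's API.

```python
import numpy as np

def mlr_gradient_full(W, x, y):
    """Full J x D gradient of softmax cross-entropy for one example (x, y)."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0                  # dL/dlogits = softmax(Wx) - onehot(y)
    return np.outer(p, x)

def mlr_sufficient_vectors(W, x, y):
    """The two 'sufficient vectors': transmit these instead of the matrix."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0
    return p, x                  # receiver reconstructs np.outer(p, x)
```

Communicating p (length J) and x (length D) instead of the full J x D matrix is what reduces the communication cost from quadratic to linear in the dimensions.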
]]>
Pengtao Xie et al.
Petuum: A Framework for Iterative-Convergent Distributed ML
http://repository.cmu.edu/machine_learning/132
Fri, 27 Mar 2015 12:25:23 PDT
A major bottleneck to applying advanced ML programs at industrial scales is the migration of an academic implementation, often specialized for a small, well-controlled computing platform such as desktop PCs and small lab clusters, to a big, less predictable platform such as a corporate cluster or the cloud. This poses enormous challenges: how does one train huge models with billions of parameters on massive data, especially when substantial expertise is required to handle many low-level systems issues? We propose a new architecture of systems components that systematically addresses these challenges, thus providing a general-purpose distributed platform for Big Machine Learning. Our architecture specifically exploits the fact that many ML programs are fundamentally loss function minimization problems, and that their iterative-convergent nature presents many unique opportunities to minimize loss, such as via dynamic variable scheduling and error-bounded consistency models for synchronization. Thus, we treat data, parameter and variable blocks as computing units to be dynamically scheduled and updated in an error-bounded manner, with the goal of minimizing the loss function as quickly as possible.
]]>
Wei Dai et al.
Understanding the Interaction between Interests, Conversations and Friendships in Facebook
http://repository.cmu.edu/machine_learning/131
Fri, 27 Mar 2015 12:25:20 PDT
In this paper, we explore salient questions about user interests, conversations and friendships in the Facebook social network, using a novel latent space model that integrates several data types. A key challenge of studying Facebook’s data is the wide range of data modalities such as text, network links, and categorical labels. Our latent space model seamlessly combines all three data modalities over millions of users, allowing us to study the interplay between user friendships, interests, and higher-order network-wide social trends on Facebook. The recovered insights not only answer our initial questions, but also reveal surprising facts about user interests in the context of Facebook’s ecosystem. We also confirm that our results are significant with respect to evidential information from the study subjects.
]]>
Qirong Ho et al.
Efficient Algorithm for Extremely Large Multi-task Regression with Massive Structured Sparsity
http://repository.cmu.edu/machine_learning/130
Fri, 27 Mar 2015 12:25:17 PDT
We develop a highly scalable optimization method called "hierarchical group-thresholding" for solving a multi-task regression model with complex structured sparsity constraints on both input and output spaces. Despite the recent emergence of several efficient optimization algorithms for tackling complex sparsity-inducing regularizers, true scalability in practical high-dimensional problems, where a huge number (e.g., millions) of sparsity patterns needs to be enforced, remains an open challenge, because all existing algorithms must deal with ALL such patterns exhaustively in every iteration, which is computationally prohibitive. Our proposed algorithm addresses the scalability problem by screening out multiple groups of coefficients simultaneously and systematically. We employ a hierarchical tree representation of group constraints to accelerate the process of removing irrelevant constraints by taking advantage of the inclusion relationships between group sparsities, thereby avoiding dealing with all constraints in every optimization step, and requiring optimization operations only on a small number of outstanding coefficients. In our experiments, we demonstrate the efficiency of our method on simulation datasets, and in an application of detecting genetic variants associated with gene expression traits.
]]>
Seunghak Lee et al.
Mayors' Institute on City Design Midwest Session Meeting Summary
http://repository.cmu.edu/architecture/85
Thu, 26 Mar 2015 09:11:25 PDT
The Mayors’ Institute on City Design is a program that conducts a series of intimate, closed-door two-day symposia intended to offer a small group of invited mayors a better understanding of the design of American cities. Participation is limited to eighteen to twenty people: half are mayors and half are urban design experts and other resource people. The Midwest Session took place on February 10-12, 2010 in Pittsburgh. The mayors came from Charleston, WV, Racine, WI, Huntington, WV, Springfield, IL, Kenosha, WI, Canton, OH, and Elkhart, IN.
]]>
Donald K. Carter et al.
Optimal rates for stochastic convex optimization under Tsybakov noise condition
http://repository.cmu.edu/machine_learning/129
Tue, 24 Mar 2015 15:08:23 PDT
We focus on the problem of minimizing a convex function f over a convex set S given T queries to a stochastic first-order oracle. We argue that the complexity of convex minimization is determined only by the rate of growth of the function around its minimum x^{∗}_{f,S}, as quantified by a Tsybakov-like noise condition. Specifically, we prove that if f grows at least as fast as ∥x−x^{∗}_{f,S}∥^{κ} around its minimum, for some κ>1, then the optimal rate of learning f(x^{∗}_{f,S}) is Θ(T^{−κ/(2κ−2)}). The classic rates Θ(1/√T) for convex functions and Θ(1/T) for strongly convex functions are special cases of our result for κ→∞ and κ=2, and even faster rates are attained for 1<κ<2. We also derive tight bounds for the complexity of learning x^{∗}_{f,S}, where the optimal rate is Θ(T^{−1/(2κ−2)}). Interestingly, these precise rates also characterize the complexity of active learning, and our results further strengthen the connections between the fields of active learning and convex optimization, both of which rely on feedback-driven queries.
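As a quick check on the exponents, the two classic rates fall out of the general exponent κ/(2κ−2):

```latex
\frac{\kappa}{2\kappa-2}\Big|_{\kappa=2} = \frac{2}{2} = 1
  \;\Rightarrow\; \Theta(1/T)\ \text{(strongly convex)},
\qquad
\lim_{\kappa\to\infty} \frac{\kappa}{2\kappa-2} = \frac{1}{2}
  \;\Rightarrow\; \Theta(1/\sqrt{T})\ \text{(general convex)}.
```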
]]>
Aaditya Ramdas et al.
Detecting Activations over Graphs using Spanning Tree Wavelet Bases
http://repository.cmu.edu/machine_learning/128
Tue, 24 Mar 2015 15:08:20 PDT
We consider the detection of clusters of activation over graphs under Gaussian noise. This problem appears in many real-world scenarios, such as detecting contamination or seismic activity by sensor networks, viruses in human and computer networks, and groups with anomalous behavior in social and biological networks. Despite the wide applicability of such a detection algorithm, there has been little success in the development of computationally feasible methods with provable theoretical guarantees. To this end, we introduce the spanning tree wavelet basis over a graph, a localized basis that reflects the topology of the graph. We first provide a necessary condition for asymptotic distinguishability of the null and alternative hypotheses. Then we prove that for any spanning tree, we can hope to correctly detect signals in a low signal-to-noise regime using spanning tree wavelets. We propose a randomized test, in which we use a uniform spanning tree in the basis construction. Using electrical network theory, we show that the uniform spanning tree provides strong guarantees that in many cases match our necessary condition. We prove that for edge transitive graphs, k-nearest neighbor graphs, and ϵ-graphs we obtain nearly optimal performance with the uniform spanning tree wavelet detector.
]]>
James Sharpnack et al.
Changepoint Detection over Graphs with the Spectral Scan Statistic
http://repository.cmu.edu/machine_learning/127
Tue, 24 Mar 2015 15:08:18 PDT
We consider the change-point detection problem of deciding, based on noisy measurements, whether an unknown signal over a given graph is constant or is instead piecewise constant over two induced subgraphs of relatively low cut size. We analyze the corresponding generalized likelihood ratio (GLR) statistic and relate it to the problem of finding a sparsest cut in a graph. We develop a tractable relaxation of the GLR statistic based on the combinatorial Laplacian of the graph, which we call the spectral scan statistic, and analyze its properties. We show how its performance as a testing procedure depends directly on the spectrum of the graph, and use this result to explicitly derive its asymptotic properties on a few graph topologies. Finally, we demonstrate both theoretically and by simulations that the spectral scan statistic can outperform naive testing procedures based on edge thresholding and χ^{2} testing.
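The combinatorial Laplacian the statistic is built from is simple to form; the helper below is a generic sketch (not the paper's code) for an undirected graph given as an edge list.

```python
import numpy as np

def combinatorial_laplacian(edges, n):
    """L = D - A for an undirected graph on n vertices: degree matrix minus
    adjacency matrix. Its spectrum governs the spectral scan statistic's
    behavior (illustrative helper)."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    return L
```

Each row of L sums to zero, and the constant vector is always an eigenvector with eigenvalue 0; the remaining eigenvalues encode the graph's cut structure.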
]]>
James Sharpnack et al.
Subspace Detection of High-Dimensional Vectors using Compressive Sampling
http://repository.cmu.edu/machine_learning/126
Tue, 24 Mar 2015 15:08:16 PDT
We consider the problem of detecting whether a high-dimensional vector in ℝ^{n} lies in an r-dimensional subspace S, where r ≪ n, given few compressive measurements of the vector. This problem arises in several applications such as detecting anomalies, targets, interference and brain activations. In these applications, the object of interest is described by a large number of features, and the ability to detect it using only linear combinations of the features (without the need to measure, store or compute the entire feature vector) is desirable. We present a test statistic for subspace detection using compressive samples and demonstrate that the probability of error of the proposed detector decreases exponentially in the number of compressive samples, provided that the energy off the subspace scales as n. Using information-theoretic lower bounds, we demonstrate that no other detector can achieve the same probability of error for weaker signals. Simulation results also indicate that this scaling is near-optimal.
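As a hedged illustration of the uncompressed version of this test: given an orthonormal basis U for S, the natural statistic is the energy of the vector off the subspace, i.e. the squared norm of its residual after projection. (The paper's detector works from compressive samples; this sketch uses the full vector.)

```python
import numpy as np

def off_subspace_energy(x, U):
    """Energy of x outside span(U), for U an n x r matrix with orthonormal
    columns: ||x - U U^T x||^2. Near zero when x lies in the subspace."""
    residual = x - U @ (U.T @ x)
    return float(residual @ residual)
```

A detector would compare this energy against a threshold calibrated to the noise level: small residual energy supports the in-subspace hypothesis.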
]]>
Martin Azizyan et al.
Efficient Active Algorithms for Hierarchical Clustering
http://repository.cmu.edu/machine_learning/125
Tue, 24 Mar 2015 15:08:14 PDT
Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient, both in terms of the number of measurements needed and running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size Ω(log n) using O(n log^{2} n) similarities and runs in O(n log^{3} n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring.
]]>
Akshay Krishnamurthy et al.
Sparsistency of the Edge Lasso over Graphs
http://repository.cmu.edu/machine_learning/124
Tue, 24 Mar 2015 15:08:11 PDT
The fused lasso was proposed recently to enable recovery of high-dimensional patterns which are piece-wise constant on a graph, by penalizing the ℓ_{1}-norm of differences of measurements at vertices that share an edge. While there have been some attempts at coming up with efficient algorithms for solving the fused lasso optimization, a theoretical analysis of its performance is mostly lacking except for the simple linear graph topology. In this paper, we investigate sparsistency of the fused lasso for general graph structures, i.e. its ability to correctly recover the exact support of piece-wise constant graph-structured patterns asymptotically (for large-scale graphs). To emphasize this distinction over previous work, we will refer to it as the Edge Lasso. We focus on the (structured) normal means setting, and our results provide necessary and sufficient conditions on the graph properties as well as the signal-to-noise ratio needed to ensure sparsistency. We exemplify our results using simple graph-structured patterns, and demonstrate that in some cases the fused lasso is sparsistent at very weak signal-to-noise ratios (scaling as √((log n)/|A|), where n is the number of vertices in the graph and A is the smallest set of vertices with constant activation). In other cases, it performs no better than thresholding the difference of measurements at vertices which share an edge (which requires a signal-to-noise ratio that scales as √(log n)).
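The penalty in question is easy to state concretely; the helper below is an illustrative computation of the Edge Lasso / fused lasso penalty over a graph, not a solver.

```python
import numpy as np

def edge_lasso_penalty(x, edges):
    """Sum of |x_u - x_v| over graph edges: the l1 penalty on edge
    differences that encourages piece-wise constant signals over the
    graph (illustrative helper)."""
    return sum(abs(x[u] - x[v]) for u, v in edges)

# On a 3-vertex path with x = [1, 1, 5], only the second edge contributes:
# |1 - 1| + |1 - 5| = 4.
```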
]]>
James Sharpnack et al.
Stability of Density-Based Clustering
http://repository.cmu.edu/machine_learning/123
Tue, 24 Mar 2015 15:08:09 PDT
High density clusters can be characterized by the connected components of a level set L(λ) = {x: p(x)>λ} of the underlying probability density function p generating the data, at some appropriate level λ ≥ 0. The complete hierarchical clustering can be characterized by a cluster tree T= ∪_{λ}L(λ). In this paper, we study the behavior of a density level set estimate L̂(λ) and cluster tree estimate T̂ based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L̂(λ) and T̂ as a function of h, and investigate the theoretical properties of these instability measures.
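A toy 1-D sketch of the estimate L̂(λ): compute a kernel density estimate on a grid, threshold at level λ, and report connected runs of above-level grid points as cluster estimates. (Illustrative only; the paper's analysis is in general dimension.)

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian kernel density estimate of p, evaluated on `grid`."""
    d = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * d * d).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def level_set_clusters(grid, density, lam):
    """Connected components of {x : p_hat(x) > lam} on a 1-D grid,
    returned as (left, right) interval endpoints."""
    above = density > lam
    clusters, start = [], None
    for i, a in enumerate(above):
        if a and start is None:
            start = i
        if not a and start is not None:
            clusters.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:
        clusters.append((grid[start], grid[-1]))
    return clusters
```

Varying the bandwidth h and re-running this estimate is exactly the experiment whose variability the paper's instability measures quantify.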
]]>
Alessandro Rinaldo et al.
Robust Multi-Source Network Tomography using Selective Probes
http://repository.cmu.edu/machine_learning/122
Tue, 24 Mar 2015 15:08:06 PDT
Knowledge of a network's topology and internal characteristics such as delay times or losses is crucial to maintain seamless operation of network services. Network tomography is a useful approach to infer such knowledge from end-to-end measurements between nodes at the periphery of the network, as it does not require cooperation of routers and other internal nodes. Most current tomography algorithms are single-source methods, which use multicast probes or synchronized unicast packet trains to measure covariances between destinations from a single vantage point and recover a tree topology from these measurements. Multi-source tomography, on the other hand, uses pairwise hop counts or latencies and consequently overcomes the difficulties associated with obtaining measurements for single-source methods. However, topology recovery is complicated by the fact that the paths along which measurements are taken do not form a tree in the network. Motivated by recent work suggesting that these measurements can be well-approximated by tree metrics, we present two algorithms that use selective pairwise distance measurements between peripheral nodes to construct a tree whose end-to-end distances approximate those in the network. Our first algorithm accommodates measurements perturbed by additive noise, while our second considers a novel noise model that captures missing measurements and the network's deviations from a tree topology. Both algorithms provably use O (p polylog p) pairwise measurements to construct a tree approximation on p end hosts. We present extensive simulated and real-world experiments to evaluate both of our algorithms.
]]>
Akshay Krishnamurthy et al.
Noise Thresholds for Spectral Clustering
http://repository.cmu.edu/machine_learning/121
Tue, 24 Mar 2015 15:08:04 PDT
Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.
]]>
Sivaraman Balakrishnan et al.
Minimax Localization of Structural Information in Large Noisy Matrices
http://repository.cmu.edu/machine_learning/120
Tue, 24 Mar 2015 15:08:02 PDT
We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc., is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad hoc methods are available for biclustering, a theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions: i) We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. ii) We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. iii) We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.
]]>
Mladen Kolar et al.
Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities
http://repository.cmu.edu/machine_learning/119
Tue, 24 Mar 2015 15:08:00 PDT
Hierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the possibility of hierarchical clustering of N items based on a small subset of pairwise similarities, significantly fewer than the complete set of N(N-1)/2 similarities. First, we show that, if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as 3N log N similarities. We demonstrate that this order-of-magnitude saving in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. Finally, we propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only O(N log^{2} N) pairwise similarities.
]]>
Brian Eriksson et al.
Column Subset Selection with Missing Data via Active Sampling
http://repository.cmu.edu/machine_learning/118
Mon, 23 Mar 2015 13:23:51 PDT
Column subset selection of massive data matrices has found numerous applications in real-world data systems. In this paper, we propose and analyze two sampling based algorithms for column subset selection without access to the complete input matrix. To our knowledge, these are the first algorithms for column subset selection with missing data that are provably correct. The proposed methods work for row/column coherent matrices by employing the idea of adaptive sampling. Furthermore, when the input matrix has a noisy low-rank structure, one algorithm enjoys a relative error bound.
]]>
Yining Wang et al.
On the Power of Adaptivity in Matrix Completion and Approximation
http://repository.cmu.edu/machine_learning/117
Mon, 23 Mar 2015 13:23:50 PDT
We consider the related tasks of matrix completion and matrix approximation from missing data and propose adaptive sampling procedures for both problems. We show that adaptive sampling allows one to eliminate standard incoherence assumptions on the matrix row space that are necessary for passive sampling procedures. For exact recovery of a low-rank matrix, our algorithm judiciously selects a few columns to observe in full and, with few additional measurements, projects the remaining columns onto their span. This algorithm exactly recovers an n × n rank r matrix using O(nrµ_{0} log^{2} (r)) observations, where µ_{0} is a coherence parameter on the column space of the matrix. In addition to completely eliminating any row space assumptions that have pervaded the literature, this algorithm enjoys a better sample complexity than any existing matrix completion algorithm. To certify that this improvement is due to adaptive sampling, we establish that row space coherence is necessary for passive sampling algorithms to achieve non-trivial sample complexity bounds. For constructing a low-rank approximation to a high-rank input matrix, we propose a simple algorithm that thresholds the singular values of a zero-filled version of the input matrix. The algorithm computes an approximation that is nearly as good as the best rank-r approximation using O(nrµ log^{2} (n)) samples, where µ is a slightly different coherence parameter on the matrix columns. Again we eliminate assumptions on the row space.
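The exact-recovery step described above can be sketched as follows: once a basis U of fully observed columns is in hand, each remaining column lying in span(U) is recovered from a few observed entries by least squares against the corresponding rows of U. Names here are illustrative.

```python
import numpy as np

def complete_column(U, col_partial, observed_rows):
    """Recover a full column lying in span(U) from its entries on
    `observed_rows`. U is an n x r basis built from the fully observed
    columns; U[observed_rows, :] must have full column rank, which needs
    at least r observed entries."""
    coeffs, *_ = np.linalg.lstsq(U[observed_rows, :], col_partial, rcond=None)
    return U @ coeffs
```

For a rank-r matrix this recovers each remaining column exactly from roughly r observed entries, which is why the overall sample complexity can scale like nr up to logarithmic and coherence factors.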
]]>
Akshay Krishnamurthy et al.
Noise-adaptive Margin-based Active Learning and Lower Bounds under Tsybakov Noise Condition
http://repository.cmu.edu/machine_learning/116
http://repository.cmu.edu/machine_learning/116Mon, 23 Mar 2015 13:23:48 PDT
We present a polynomial-time noise-robust margin-based active learning algorithm for finding homogeneous (passing through the origin) linear separators, and analyze its statistical rate of error convergence when labels are corrupted by noise. We show that when the imposed noise satisfies the Tsybakov low noise condition [MT^{+}99, Tsy04], the algorithm adapts to the unknown level of noise and achieves the optimal statistical rate up to polylogarithmic factors. In addition, the presented algorithm is simple and does not require prior knowledge of the amount of noise in the label distribution. We also derive lower bounds for margin-based active learning algorithms under the Tsybakov noise condition (TNC) in the membership query synthesis scenario [Ang88]. Our result implies lower bounds for the stream-based selective sampling scenario [Coh90] under TNC for some fairly simple data distributions. Quite surprisingly, we show that the sample complexity cannot be improved even if the underlying data distribution is as simple as the uniform distribution on the unit ball. Our proof involves the construction of a well-separated hypothesis set on the d-dimensional unit ball along with carefully designed label distributions for the Tsybakov noise condition. Our analysis might provide insights for other forms of lower bounds as well.
]]>
Yining Wang et al.
Feature Selection For High-Dimensional Clustering
http://repository.cmu.edu/machine_learning/115
http://repository.cmu.edu/machine_learning/115Mon, 23 Mar 2015 13:23:46 PDT
We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features and cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds for mode-based clustering.
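The screening step can be illustrated with a toy multimodality check (a crude stand-in for a formal test such as the dip test; the KDE grid and the 10%-of-peak height threshold are ad-hoc choices made for this sketch):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
n, d = 500, 6
X = rng.standard_normal((n, d))
# features 0 and 1 carry the cluster structure (bimodal); the rest are pure noise
labels = rng.integers(0, 2, n)
X[:, 0] += 4 * labels
X[:, 1] -= 4 * labels

def n_modes(x, grid_size=200):
    """Count prominent local maxima of a 1-D kernel density estimate."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = gaussian_kde(x)(grid)
    interior = dens[1:-1]
    # require a genuine peak that is not a tiny tail wiggle
    peaks = (interior > dens[:-2]) & (interior > dens[2:]) & (interior > 0.1 * dens.max())
    return int(peaks.sum())

# keep only the features whose marginal density is multimodal
selected = [j for j in range(d) if n_modes(X[:, j]) >= 2]
print(selected)
```

In the full method, kernel density estimation and mode clustering are then run only on the selected coordinates.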
]]>
Martin Azizyan et al.
Recovering Block-structured Activations Using Compressive Measurements
http://repository.cmu.edu/machine_learning/114
http://repository.cmu.edu/machine_learning/114Mon, 23 Mar 2015 13:23:45 PDT
We consider the problems of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive (linear) measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. Contrary to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that, for reliable localization, the magnitude of the weakest detectable signals is strongly influenced by both structure and the ability to choose measurements adaptively, while for detection neither adaptivity nor structure reduces the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength, and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information theoretic lower bounds.
]]>
Sivaraman Balakrishnan et al.
Subspace Learning from Extremely Compressed Measurements
http://repository.cmu.edu/machine_learning/113
http://repository.cmu.edu/machine_learning/113Mon, 23 Mar 2015 13:23:43 PDT
We consider learning the principal subspace of a large set of vectors from an extremely small number of compressive measurements of each vector. Our theoretical results show that even a constant number of measurements per column suffices to approximate the principal subspace to arbitrary precision, provided that the number of vectors is large. This result is achieved by a simple algorithm that computes the eigenvectors of an estimate of the covariance matrix. The main insight is to exploit an averaging effect that arises from applying a different random projection to each vector. We provide a number of simulations confirming our theoretical results.
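The averaging effect is easy to simulate directly (a sketch with arbitrary problem sizes: back-projected outer products have expectation equal to a scalar multiple of the covariance plus a multiple of the identity, so the top eigenvectors of their average recover the principal subspace):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m, N = 20, 2, 4, 8000   # ambient dim, subspace dim, measurements per vector, #vectors

# ground-truth principal subspace, with every data vector lying in it
A, _ = np.linalg.qr(rng.standard_normal((n, r)))
X = A @ rng.standard_normal((r, N))

# each vector is seen only through its own m-dimensional random projection;
# average the back-projected outer products to estimate the covariance
S = np.zeros((n, n))
for i in range(N):
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    z = Phi.T @ (Phi @ X[:, i])
    S += np.outer(z, z)
S /= N

# the top-r eigenvectors of S estimate the principal subspace
A_hat = np.linalg.eigh(S)[1][:, -r:]

# distance between the estimated and true projection operators
dist = np.linalg.norm(A_hat @ A_hat.T - A @ A.T, 2)
print(dist)  # shrinks as N grows
```

Note that m = 4 measurements per vector is far below the ambient dimension n = 20; the accuracy comes from averaging over many vectors, not from measuring any single one well.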
]]>
Akshay Krishnamurthy et al.
Confidence Sets for Persistence Diagrams
http://repository.cmu.edu/machine_learning/112
http://repository.cmu.edu/machine_learning/112Mon, 23 Mar 2015 13:23:41 PDT
Persistent homology is a method for probing topological properties of point clouds and functions. The method involves tracking the birth and death of topological features as one varies a tuning parameter. Features with short lifetimes are informally considered to be “topological noise,” and those with a long lifetime are considered to be “topological signal.” In this paper, we bring some statistical ideas to persistent homology. In particular, we derive confidence sets that allow us to separate topological signal from topological noise.
]]>
Brittany Therese Fasy et al.
Low-Rank Matrix and Tensor Completion via Adaptive Sampling
http://repository.cmu.edu/machine_learning/111
http://repository.cmu.edu/machine_learning/111Mon, 23 Mar 2015 13:23:39 PDT
We study low-rank matrix and tensor completion and propose novel algorithms that employ adaptive sampling schemes to obtain strong performance guarantees. Our algorithms exploit adaptivity to identify entries that are highly informative for learning the column space of the matrix (tensor); consequently, our results hold even when the row space is highly coherent, in contrast with previous analyses. In the absence of noise, we show that one can exactly recover an n × n matrix of rank r from merely Ω(nr^{3/2} log(r)) matrix entries. We also show that one can recover an order-T tensor using Ω(nr^{T−1/2}T^{2} log(r)) entries. For noisy recovery, our algorithm consistently estimates a low-rank matrix corrupted with noise using Ω(nr^{3/2}polylog(n)) entries. We complement our study with simulations that verify our theory and demonstrate the scalability of our algorithms.
]]>
Akshay Krishnamurthy et al.
Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
http://repository.cmu.edu/machine_learning/110
http://repository.cmu.edu/machine_learning/110Mon, 23 Mar 2015 13:23:38 PDT
While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering.
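A back-of-the-envelope simulation of the sparse setting (a sketch, not the paper's estimator: the screening exploits the fact that a coordinate carrying mean separation Δ has marginal variance 1 + Δ²/4, and clustering is then done by the sign of the top principal direction on the selected coordinates; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, s, delta = 400, 50, 5, 3.0       # samples, ambient dim, relevant dims, separation

# mixture of two isotropic Gaussians whose means differ only on s coordinates
mu = np.zeros(d)
mu[:s] = delta / 2
z = rng.integers(0, 2, n)              # true cluster labels
X = rng.standard_normal((n, d)) + np.where(z[:, None] == 1, mu, -mu)

# screening: coordinates carrying mean separation have variance 1 + delta^2/4,
# versus 1 for the irrelevant ones, so keep the s highest-variance coordinates
variances = X.var(axis=0)
selected = np.argsort(variances)[-s:]

# cluster on the selected coordinates by the sign of the top principal direction
Y = X[:, selected] - X[:, selected].mean(axis=0)
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
pred = (Y @ Vt[0] > 0).astype(int)

# cluster labels are only identifiable up to a swap
accuracy = max(np.mean(pred == z), np.mean(pred != z))
print(accuracy)
```

Because the screening step works coordinate-wise, its sample cost scales with the number of relevant dimensions rather than the ambient dimension, which is the phenomenon the minimax analysis makes precise.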
]]>
Martin Azizyan et al.
Near-optimal Anomaly Detection in Graphs using Lovász Extended Scan Statistic
http://repository.cmu.edu/machine_learning/109
http://repository.cmu.edu/machine_learning/109Mon, 23 Mar 2015 13:23:36 PDT
The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovász extended scan statistic (LESS), that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control the Type 1 error of LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models: the torus, k-nearest neighbor graphs, and ε-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds.
]]>
James Sharpnack et al.
Cluster Trees on Manifolds
http://repository.cmu.edu/machine_learning/108
http://repository.cmu.edu/machine_learning/108Mon, 23 Mar 2015 13:23:34 PDT
In this paper we investigate the problem of estimating the cluster tree for a density f supported on or near a smooth d-dimensional manifold M isometrically embedded in R^{D}. We analyze a modified version of the k-nearest neighbor based algorithm recently proposed by Chaudhuri and Dasgupta (2010). The main results of this paper show that, under mild assumptions on f and M, we obtain rates of convergence that depend only on d and not on the ambient dimension D. Finally, we sketch the construction of a sample complexity lower bound instance for a natural class of manifold-oblivious clustering algorithms.
]]>
Sivaraman Balakrishnan et al.
Recovering Graph-Structured Activations using Adaptive Compressive Measurements
http://repository.cmu.edu/machine_learning/107
http://repository.cmu.edu/machine_learning/107Mon, 23 Mar 2015 13:23:32 PDT
We study the localization of a cluster of activated vertices in a graph from adaptively designed compressive measurements. We propose a hierarchical partitioning of the graph that groups the activated vertices into few partitions, so that a top-down sensing procedure can identify these partitions, and hence the activations, using few measurements. By exploiting the cluster structure, we are able to provide localization guarantees at weaker signal-to-noise ratios than in the unstructured setting. We complement this performance guarantee with an information-theoretic lower bound, providing a necessary signal-to-noise ratio for any algorithm to successfully localize the cluster. We verify our analysis with simulations that demonstrate the practicality of our algorithm.
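The top-down idea can be caricatured on a path graph (a toy sketch, not the paper's procedure: the "compressive" measurements here are noisy normalized interval sums, i.e. linear measurements with indicator-style sensing vectors, and the block location, noise level, and stopping width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1024
signal = np.zeros(n)
signal[200:232] = 1.0            # contiguous cluster of activated vertices

def measure(lo, hi, sigma=0.25):
    """Noisy normalized aggregate of the signal over vertices [lo, hi)."""
    k = hi - lo
    return signal[lo:hi].sum() / np.sqrt(k) + sigma * rng.standard_normal()

# top-down search: repeatedly keep the half with the larger aggregate energy
lo, hi = 0, n
while hi - lo > 32:
    mid = (lo + hi) // 2
    if measure(lo, mid) >= measure(mid, hi):
        hi = mid
    else:
        lo = mid
print(lo, hi)  # a width-32 window overlapping the activated block
```

Aggregating over large intervals is what buys the weaker signal-to-noise requirement: each early measurement pools the energy of the whole cluster rather than sensing one vertex at a time.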
]]>
Akshay Krishnamurthy et al.
On the Bootstrap for Persistence Diagrams and Landscapes
http://repository.cmu.edu/machine_learning/106
http://repository.cmu.edu/machine_learning/106Mon, 23 Mar 2015 13:23:30 PDT
Persistent homology probes topological properties from point clouds and functions. By looking at multiple scales simultaneously, one can record the births and deaths of topological features as the scale varies. In this paper we use a statistical technique, the empirical bootstrap, to separate topological signal from topological noise. In particular, we derive confidence sets for persistence diagrams and confidence bands for persistence landscapes.
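The empirical bootstrap step is generic, and can be illustrated on a plain kernel density estimate in place of a persistence landscape (a sketch; the bandwidth, grid, and number of bootstrap replicates are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(300)
grid = np.linspace(-3, 3, 100)
h = 0.4                                   # fixed, illustrative bandwidth

def kde(sample):
    """Gaussian kernel density estimate evaluated on the grid."""
    u = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

f_hat = kde(x)

# empirical bootstrap: resample the data, re-estimate, record sup-norm deviations
B = 200
devs = np.empty(B)
for b in range(B):
    f_b = kde(rng.choice(x, size=x.size, replace=True))
    devs[b] = np.abs(f_b - f_hat).max()

q = np.quantile(devs, 0.95)               # bootstrap 95% quantile of the sup deviation
lower, upper = f_hat - q, f_hat + q       # uniform confidence band for the smoothed density
print(q)
```

For persistence landscapes the same recipe applies with the landscape function in place of the KDE, yielding the confidence bands derived in the paper.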
]]>
Frederic Chazal et al.