We applied the dBug tool to two distributed systems – the Parallel Virtual File System (PVFS) implemented in C and the FAWN-based key-value storage (FAWN-KV) implemented in C++. In particular, we integrated both systems with dBug to expose the non-determinism due to concurrency. This mechanism was used to verify that the result of concurrent execution of a number of basic operations from a fixed initial state meets the high-level specification of PVFS and FAWN-KV. The experimental evidence shows that the dBug tool is capable of systematically exploring behaviors of a distributed system in a modular, practical, and effective manner.
]]>We consider the following greedy algorithm: Given terminal pairs in a metric space, call a terminal "active" if its distance to its partner is non-zero. Pick the two closest active terminals (say s_i, t_j), set the distance between them to zero, and buy a path connecting them. Recompute the metric, and repeat. Our main result is that this algorithm is a constant-factor approximation.
We also use this algorithm to give new, simpler constructions of cost-sharing schemes for Steiner forest. In particular, we obtain the first "group-strict" cost-shares for this problem, which imply a very simple combinatorial sampling-based algorithm for stochastic Steiner forest.
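The greedy rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the metric is assumed to be the shortest-path metric of a connected graph with positive edge weights, "setting the distance to zero" is modeled by zeroing the weights of the bought path, and the names `dijkstra` and `greedy_steiner_forest` are hypothetical.

```python
import heapq
from itertools import combinations


def dijkstra(adj, weight, src):
    """Shortest-path distances and parent pointers from src (weights >= 0)."""
    dist = {src: 0}
    parent = {src: None}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + weight[frozenset((u, v))]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def greedy_steiner_forest(edges, pairs):
    """edges: (u, v, w) triples with w > 0; pairs: (s, t) terminal pairs.

    Assumption: the graph is connected and every terminal is a graph node.
    Returns the set of bought edges and their total original cost.
    """
    weight = {frozenset((u, v)): w for u, v, w in edges}
    original = dict(weight)
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    terminals = sorted({x for p in pairs for x in p})
    bought = set()
    while True:
        # Recompute the metric from every terminal.
        sp = {t: dijkstra(adj, weight, t) for t in terminals}

        def dist(a, b):
            return sp[a][0].get(b, float("inf"))

        # A terminal is active if it is still at non-zero distance from a partner.
        active = [x for x in terminals
                  if any(dist(s, t) > 0 for s, t in pairs if x in (s, t))]
        if not active:
            break
        # Two closest active terminals that are not already merged.
        d, u, v = min((dist(a, b), a, b)
                      for a, b in combinations(active, 2) if dist(a, b) > 0)
        if d == float("inf"):
            raise ValueError("graph is not connected")
        # Buy the shortest u-v path and set its edge weights to zero.
        parent = sp[u][1]
        node = v
        while parent[node] is not None:
            e = frozenset((node, parent[node]))
            if weight[e] > 0:
                bought.add(e)
                weight[e] = 0
            node = parent[node]
    return bought, sum(original[e] for e in bought)
```

Note that the two terminals picked in a round need not be partners; the loop ends once every terminal is at distance zero from its partner. Recomputing all distances in every round keeps the sketch short at the expense of running time.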
]]>In this paper we show how to achieve this bound for all packing LPs, and also for a wide class of mixed packing/covering LPs. Our algorithms construct dual solutions using a regret-minimizing online learning algorithm in a black-box fashion, and use them to construct primal solutions. The adversarial guarantee that holds for the constructed duals helps us to take care of most of the correlations that arise in the algorithm; the remaining correlations are handled via martingale concentration and maximal inequalities. These ideas lead to conceptually simple and modular algorithms, which we hope will be useful in other contexts.
]]>We first study the multistage matroid maintenance problem, where we need to maintain a base of a matroid in each time step under changing cost functions and acquisition costs for adding new elements. The online version generalizes online paging. E.g., given a graph, we need to maintain a spanning tree T_t at each step: we pay c_t(T_t) for the cost of the tree at time t, and also |T_t ∖ T_{t−1}| for the number of edges changed at this step. Our main result is a polynomial time O(log m log r)-approximation to the online problem, where m is the number of elements/edges and r is the rank of the matroid. This improves on results of Buchbinder et al. [7] who addressed the fractional version of this problem under uniform acquisition costs, and Buchbinder, Chen and Naor [8] who studied the fractional version of a more general problem. We also give an O(log m) approximation for the offline version of the problem. These bounds hold when the acquisition costs are non-uniform, in which case both these results are the best possible unless P=NP.
We also study the perfect matching version of the problem, where we maintain a perfect matching at each step under changing cost functions and costs for adding new elements. Surprisingly, the hardness drastically increases: for any constant ε > 0, there is no O(n^{1−ε})-approximation to the multistage matching maintenance problem, even in the offline case.
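To make the spanning-tree objective concrete, the following Python sketch evaluates a given sequence of trees: at each step it adds the cost of the current tree under that step's cost function plus the number of edges not present in the previous tree. It is an illustration only. The function names are hypothetical, acquisition costs are assumed to be uniform (one unit per new edge), and the tree before the first step is assumed to be empty unless `initial` is given.

```python
def is_spanning_tree(vertices, tree):
    """Check that `tree` (a set of (u, v) pairs) spans `vertices` without cycles."""
    if len(tree) != len(vertices) - 1:
        return False
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in tree:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge would close a cycle
        parent[ru] = rv
    return True


def multistage_cost(vertices, costs, trees, initial=frozenset()):
    """Total cost of a sequence of spanning trees.

    costs[t] maps each edge to its cost at step t; trees[t] is the tree held
    at step t. Each edge that was not in the previous tree costs one unit
    (assumption: uniform acquisition costs).
    """
    total = 0
    previous = set(initial)
    for cost_t, tree_t in zip(costs, trees):
        if not is_spanning_tree(vertices, tree_t):
            raise ValueError("not a spanning tree")
        total += sum(cost_t[e] for e in tree_t)  # holding cost of the tree
        total += len(set(tree_t) - previous)     # newly acquired edges
        previous = set(tree_t)
    return total


# Triangle whose cheap edges change between the two steps.
V = ["a", "b", "c"]
ab, bc, ac = ("a", "b"), ("b", "c"), ("a", "c")
costs = [{ab: 1, bc: 1, ac: 10}, {ab: 10, bc: 1, ac: 1}]
switching = multistage_cost(V, costs, [{ab, bc}, {bc, ac}])
staying = multistage_cost(V, costs, [{ab, bc}, {ab, bc}])
```

In this small instance, swapping one edge in the second step is cheaper than keeping the first tree, which illustrates the trade-off between the two cost terms.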
]]>Motivated by these applications, we consider a generalization of d-dimAP, where the positions of some k of the vertices (pins) are fixed and specified as part of the input. We are asked to extend this partial map to a map of all the vertices, again minimizing the weighted stretch of edges. This generalization, which we refer to as d-dimAP+, arises naturally in these application domains (since it can capture blocked-off parts of the board, or the requirement of power-carrying pins to be in certain locations, etc.). Perhaps surprisingly, very little is known about this problem from an approximation viewpoint.
For dimension d = 2, we obtain an O(k^{1/2} · log n)-approximation algorithm, based on a strengthening of the spreading-metric LP for 2-dimAP. The integrality gap for this LP is shown to be Ω(k^{1/4}). We also show that it is NP-hard to approximate 2-dimAP+ within a factor better than Ω(k^{1/4–∊}). We also consider a (conceptually harder, but practically even more interesting) variant of 2-dimAP+, where the target space is the grid , instead of the entire integer lattice ℤ^{2}. For this problem, we obtain an O(k log k log n)-approximation using the same LP relaxation. We complement this upper bound by showing an integrality gap of Ω(k^{1/2}), and an Ω(k^{1/2–∊})-inapproximability result.
Our results naturally extend to the case of arbitrary fixed target dimension d ≥ 1.
]]>But what if the set of vertices sees both additions and deletions? Again, we would like to obtain a low-cost Steiner tree with as few edge changes as possible. The original paper of Imase and Waxman (SIAM J. Disc. Math, 4(3):369–384, 1991) had also considered this model, and it gave an algorithm that made at most O(n^{3/2}) edge changes for the first n requests, and maintained a constant-competitive tree online. In this paper we improve on these results:
]]>These questions have received a lot of attention in recent years, leading to some known tradeoffs between the sparsifier's quality q and its size |V(H)|. Nevertheless, it remains an outstanding question whether every G admits a flow-sparsifier H with quality q = 1 + ∊, or even q = O(1), and size |V(H)| ≤ f(k, ∊) (in particular, independent of |V(G)| and the edge capacities).
Making a first step in this direction, we present new constructions for several scenarios:
Our main result is that for quasi-bipartite networks G, one can construct a (1 + ∊)-flow-sparsifier of size poly(k/∊). In contrast, exact (q = 1) sparsifiers for this family of networks are known to require size 2^{Ω(k)}.
For networks G of bounded treewidth w, we construct a flow-sparsifier with quality q = O(log w / log log w) and size O(w·poly(k)).
For general networks G, we construct a sketch sk(G), that stores all the feasible multicommodity flows up to factor q = 1 + ∊, and its size (storage requirement) is f(k, ∊).
We answer this question in the affirmative. We give a primal-dual algorithm that makes only a single swap per step (in addition to adding the edge connecting the new point to the previous ones), and such that the tree's cost is only a constant times the optimal cost. Our dual-based analysis is quite different from previous primal-only analyses. In particular, we give a correspondence between radii of dual balls and lengths of tree edges; since dual balls are associated with points and hence do not move around (in contrast to edges), we can closely monitor the edge lengths based on the dual radii. Showing that these dual radii cannot change too rapidly is the technical heart of the paper, and allows us to give a hard bound on the number of swaps per arrival, while maintaining a constant-competitive tree at all times. Previous results for this problem gave an algorithm that performed an amortized constant number of swaps: for each n, the number of swaps in the first n steps was O(n). We also give a simpler tight analysis for this amortized case.
]]>To complement this algorithm, we show the following hardness results: If the non-uniform Sparsest Cut has a ρ-approximation for series-parallel graphs (where ρ ≥ 1), then the MaxCut problem has an algorithm with approximation factor arbitrarily close to 1/ρ. Hence, even for such restricted graphs (which have treewidth 2), the Sparsest Cut problem is NP-hard to approximate better than 17/16 - ε for ε > 0; assuming the Unique Games Conjecture the hardness becomes 1/α_{GW} - ε. For graphs with large (but constant) treewidth, we show a hardness result of 2 - ε assuming the Unique Games Conjecture.
Our algorithm rounds a linear program based on (a subset of) the Sherali-Adams lift of the standard Sparsest Cut LP. We show that even for treewidth-2 graphs, the LP has an integrality gap close to 2 even after polynomially many rounds of Sherali-Adams. Hence our approach cannot be improved even on such restricted graphs without using a stronger relaxation.
]]>the set Q of probed elements satisfies an “outer” packing constraint,
the set S of chosen elements satisfies an “inner” packing constraint.
The kinds of packing constraints we consider are intersections of matroids and knapsacks. Our results provide a simple and unified view of results in stochastic matching [1, 2] and Bayesian mechanism design [3], and can also handle more general constraints. As an application, we obtain the first polynomial-time Ω(1/k)-approximate “Sequential Posted Price Mechanism” under k-matroid intersection feasibility constraints, improving on prior work [3-5].
]]>This problem is closely related to the Asymmetric TSP (ATSP) problem, which seeks to find a tour (instead of an s-t path) visiting all the nodes: for ATSP, a ρ-approximation guarantee implies an O(ρ)-approximation for ATSPP. However, no such connection is known for the integrality gaps of the linear programming relaxations for these problems: the current-best approximation ratio for ATSPP is O(log n / log log n), whereas the best bound on the integrality gap of the natural LP relaxation (the subtour elimination LP) for ATSPP is O(log n).
In this paper, we close this gap, and improve the current best bound on the integrality gap from O(log n) to O(log n / log log n). The resulting algorithm uses the structure of narrow s-t cuts in the LP solution to construct a (random) tree witnessing this integrality gap. We also give a simpler family of instances showing the integrality gap of this LP is at least 2.
]]>For the multistage k-robust set cover problem, we give an O(log m + log n)-approximation algorithm, nearly matching the Ω(log n + log m / log log m) hardness of approximation [4] even for T = 2 stages. Moreover, our algorithm has a useful “thrifty” property: it takes actions on just two stages. We show similar thrifty algorithms for multi-stage k-robust Steiner tree, Steiner forest, and minimum-cut. For these problems our approximation guarantees are O(min{T, log n, log λ_max}), where λ_max is the maximum inflation over all the stages. We conjecture that these problems also admit O(1)-approximate thrifty algorithms.
]]>where A ∈ ℝ^{m×n}_{≥0} and c, u ∈ ℝ^{n}_{≥0}. In the online setting, the constraints (i.e., the rows of the constraint matrix A) arrive over time, and the algorithm can only increase the coordinates of x to maintain feasibility. As an intermediate step, we consider solving the covering linear program (CLP) online, where the requirement x ∈ ℤ^n is replaced by x ∈ ℝ^n.
Our main results are (a) an O(log k)-competitive online algorithm for solving the CLP, and (b) an O(log k · log ℓ)-competitive randomized online algorithm for solving the CIP. Here k ≤ n and ℓ ≤ m respectively denote the maximum number of non-zero entries in any row and column of the constraint matrix A. By a result of Feige and Korman, this is the best possible for polynomial-time online algorithms, even in the special case of set cover (where A ∈ {0,1}^{m×n} and c, u ∈ {0,1}^n).
The novel ingredient of our approach is to allow the dual variables to increase and decrease throughout the course of the algorithm. We show that the previous approaches, which either only raise dual variables, or lower duals only within a guess-and-double framework, cannot give a performance better than O(log n), even when each constraint only has a single variable (i.e., k = 1).
]]>Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs).
We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|^{1.5}). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.
]]>1. We show that the number of statistical queries necessary and sufficient for this task is—up to polynomial factors—equal to the agnostic learning complexity of C in Kearns’ statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern.
2. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions.
While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents significant progress on a key open problem in privacy-preserving data analysis. Our first result on the other hand gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms.
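The second result above relies on query answers being described by a submodular function, with graph cuts named as one example. As a small illustration (not part of the paper's algorithm), the Python sketch below checks by brute force that the cut function of a small graph satisfies the submodularity inequality f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B). The function names and the example graph are assumptions made for this sketch, and the check takes exponential time, so it is only suitable for tiny vertex sets.

```python
from itertools import combinations


def cut_value(edges, subset):
    """Number of edges with exactly one endpoint in `subset`."""
    return sum(1 for u, v in edges if (u in subset) != (v in subset))


def all_subsets(vertices):
    """Yield every subset of `vertices` as a frozenset."""
    vs = list(vertices)
    for size in range(len(vs) + 1):
        for combo in combinations(vs, size):
            yield frozenset(combo)


def is_submodular(f, vertices):
    """Brute-force test of f(A) + f(B) >= f(A | B) + f(A & B) for all A, B."""
    sets = list(all_subsets(vertices))
    return all(f(a) + f(b) >= f(a | b) + f(a & b) for a in sets for b in sets)


# A 5-cycle with one chord (hypothetical example graph).
example_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
cut_is_submodular = is_submodular(lambda s: cut_value(example_edges, s), range(5))

# For contrast, |S|^2 fails the inequality (it is supermodular).
square_is_submodular = is_submodular(lambda s: len(s) ** 2, range(5))
```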
]]>Many practically relevant instances of network design problems are NP-hard, and thus likely intractable. This survey focuses on approximation algorithms as one possible way of circumventing this impasse. Approximation algorithms are efficient (i.e., they run in polynomial-time), and they compute solutions to a given instance of an optimization problem whose objective values are close to those of the respective optimum solutions. More concretely, most of the problems discussed in this survey are minimization problems. We then say that an algorithm is an α-approximation for a given problem if the ratio of the cost of an approximate solution computed by the algorithm to that of an optimum solution is at most α over all instances. In the following we will also sometimes refer to α as the performance guarantee of the respective approximation algorithm.
The last 30 years have seen a tremendous amount of research on approximation algorithms for network design problems. And over this period, several technical themes have emerged, and have been explored and exploited to give algorithms and analyze their performance. Our aim in this survey is to provide an overview of these techniques. Each of the following sections focuses on one technique and has two main parts: first, we present an introductory application to the well-known classical minimum-spanning tree problem. The second part of each section demonstrates more sophisticated recent example applications of the respective technique. Throughout we assume that the reader is familiar with fundamental concepts of graph theory, combinatorial optimization, and approximation algorithms. While we may recap certain key definitions, we rely on the reader to be familiar with others. We refer to the excellent textbooks [45, 160, 163] for background reading.
The minimum spanning tree problem has been studied for at least a century, and it is clearly one of the most prominent network design problems. The input to an instance of this problem consists of an undirected graph G = (V,E) each of whose edges e ∈ E is endowed with an arbitrary cost c_e, and the goal is to compute a spanning tree of smallest cost. The earliest known algorithm for this problem was developed by Borůvka [21], and since then a vast number of techniques have been developed and subsequently used in order to devise increasingly sophisticated algorithms.
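For readers who want a concrete baseline for the minimum spanning tree problem defined above, here is a minimal Python sketch. It uses Kruskal's algorithm with a union-find structure, which is a standard textbook method and not the Borůvka algorithm mentioned in the text; the function name and the edge-list input format are assumptions of this sketch.

```python
def kruskal_mst(num_vertices, edges):
    """Minimum spanning tree of a connected undirected graph.

    edges: list of (u, v, cost) with vertices numbered 0..num_vertices-1.
    Returns (tree_edges, total_cost).
    """
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, total = [], 0
    # Scan edges by increasing cost; keep an edge if it joins two components.
    for u, v, cost in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, cost))
            total += cost
    if len(tree) != num_vertices - 1:
        raise ValueError("graph is not connected")
    return tree, total


example = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 4), (1, 3, 5)]
mst_edges, mst_cost = kruskal_mst(4, example)
```

Since this problem can be solved exactly in polynomial time, the sketch corresponds to the case α = 1 in the approximation terminology used in this survey.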
]]>We apply our method on real datasets, including a phone-call network and a computer-traffic network. The phone-call network consists of 4 million mobile users, with 51 million edges (phone calls), over 14 days. Com2 spots intuitive patterns, that is, temporal communities (comet communities).
We report our findings, which include large ‘star’-like patterns, near-bipartite cores, as well as tiny groups (5 users), calling each other hundreds of times within a few days.
]]>In this paper we present a stacked file system, TABLEFS, which uses another local file system as an object store. TABLEFS organizes all metadata into a single sparse table backed on disk using a Log-Structured Merge (LSM) tree, LevelDB in our experiments. By stacking, TABLEFS asks only for efficient large file allocation and access from the underlying local file system. By using an LSM tree, TABLEFS ensures metadata is written to disk in large, non-overwrite, sorted and indexed logs. Even an inefficient FUSE based user level implementation of TABLEFS can perform comparably to Ext4, XFS and Btrfs on data-intensive benchmarks, and can outperform them by 50% to as much as 1000% for metadata-intensive workloads. Such promising performance results from TABLEFS suggest that local disk file systems can be significantly improved by more aggressive aggregation and batching of metadata updates.
]]>We apply the Johnson-Lindenstrauss transform to the task of approximating cut-queries: the number of edges crossing a (S, S̄)-cut in a graph. We show that the JL transform allows us to publish a sanitized graph that preserves edge differential privacy (where two graphs are neighbors if they differ on a single edge) while adding only O(|S|/ϵ) random noise to any given query (w.h.p). Comparing the additive noise of our algorithm to existing algorithms for answering cut-queries in a differentially private manner, we outperform all others on small cuts (|S| = o(n)).
We also apply our technique to the task of estimating the variance of a given matrix in any given direction. The JL transform allows us to publish a sanitized covariance matrix that preserves differential privacy w.r.t bounded changes (each row in the matrix can change by at most a norm-1 vector) while adding random noise of magnitude independent of the size of the matrix (w.h.p). In contrast, existing algorithms introduce an error which depends on the matrix dimensions.
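The way a JL transform can answer cut queries is easy to illustrate without any privacy machinery. In the Python sketch below, each edge becomes a row of a signed incidence matrix, so the squared norm of that matrix applied to the indicator vector of S equals the number of edges crossing the cut between S and its complement; projecting the rows with a random Gaussian matrix preserves this squared norm approximately. The sketch makes no privacy claim: the steps needed for the differential-privacy guarantee are not described in the text above and are omitted, and all function names and parameters are assumptions of this sketch.

```python
import math
import random


def exact_cut(edges, subset):
    """Number of edges with exactly one endpoint in `subset`."""
    return sum(1 for u, v in edges if (u in subset) != (v in subset))


def jl_sketch(edges, n, rows, seed=None):
    """Project the signed edge-incidence matrix down to `rows` dimensions.

    Returns a rows-by-n matrix R * E, where E has one row per edge (+1 at one
    endpoint, -1 at the other) and R has independent N(0, 1/rows) entries.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(rows)
    sketch = [[0.0] * n for _ in range(rows)]
    for u, v in edges:
        for i in range(rows):
            g = rng.gauss(0.0, 1.0) * scale
            sketch[i][u] += g
            sketch[i][v] -= g
    return sketch


def estimate_cut(sketch, subset):
    """Estimate the cut size as the squared norm of sketch * indicator(subset)."""
    return sum(sum(row[v] for v in subset) ** 2 for row in sketch)


# Complete graph on 8 vertices; the query set S has 3 vertices.
k8 = [(u, v) for u in range(8) for v in range(u + 1, 8)]
S = {0, 1, 2}
true_value = exact_cut(k8, S)
approx_value = estimate_cut(jl_sketch(k8, 8, rows=400, seed=0), S)
```

The estimate is unbiased, and its relative error shrinks roughly like the inverse square root of the number of rows, so `rows` trades sketch size against accuracy.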
]]>In this article, we describe an augmented version of the PAC model designed for semi-supervised learning, that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model where, in addition to a concept class C, one also proposes a compatibility notion: a type of compatibility that one believes the target concept should have with the underlying distribution of data. Unlabeled data is then potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the size of the search space from the whole set of hypotheses C down to those that, according to one's assumptions, are a-priori reasonable with respect to the distribution. As we show, many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in this framework.
After proposing the model, we then analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what the key quantities are that these numbers depend on. We also consider the algorithmic question of how to efficiently optimize for natural classes and compatibility notions, and provide several algorithmic results including an improved bound for Co-Training with linear separators when the distribution satisfies independence given the label.
]]>