October 08, 2018

ApproxJoin: Approximate Distributed Multi-way Joins

  • Akkus I.
  • Bhatotia P.
  • Blanas S.
  • Chen R.
  • Fetzer C.
  • Le Quoc D.
  • Strufe T.

The join operation is a fundamental building block of parallel data processing. Unfortunately, computing an equi-join across massive datasets is very resource-intensive. The approximate computing paradigm allows users to trade accuracy for lower latency in expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is compelling but challenging with respect to output quality. Naive approaches, which perform joins over dataset samples, do not preserve the statistical properties of the join output. To realize this potential, we combine Bloom filter sketching and stratified sampling into a novel approximate join operator. This join operator uses the Bloom filter to avoid shuffling non-joinable tuples around the network and then applies stratified sampling to obtain an unbiased, representative sample of the join output. A unique property of our design is that, instead of decomposing a multi-way join into a sequence of binary joins, it joins multiple datasets in a single processing step. Our analysis shows that our technique scales well and significantly reduces data movement without sacrificing tight error bounds on the accuracy of the final output. We implemented ApproxJoin in Apache Spark and evaluated it using micro-benchmarks and real-world case studies. The evaluation shows that ApproxJoin achieves a speedup of 6-9 times over unmodified Spark-based joins at a sampling fraction of 10%. Furthermore, the speedup is accompanied by a significant reduction in shuffled data: 5-82 times less than unmodified Spark-based joins.
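The two stages described in the abstract, filtering with a Bloom filter and then sampling per stratum, can be illustrated with a small single-machine sketch. This is not the ApproxJoin implementation, which runs on Apache Spark. The names `BloomFilter` and `approx_join`, the filter size, the hash construction, and the choice to treat each join key as one stratum sampled with replacement are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    """Fixed-size Bloom filter (size and hash count are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND: a key present in both inputs keeps all of its bits.
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = [a and b for a, b in zip(self.bits, other.bits)]
        return result


def approx_join(datasets, fraction, rng):
    """Join lists of (key, value) pairs on key; sample each key's output."""
    # Stage 1: build one filter per dataset and AND them into a join filter.
    join_filter = None
    for data in datasets:
        bf = BloomFilter()
        for key, _ in data:
            bf.add(key)
        join_filter = bf if join_filter is None else join_filter.intersect(bf)

    # Drop tuples whose key cannot appear in every dataset. In a distributed
    # setting these are the tuples that would not need to be shuffled.
    grouped = []
    for data in datasets:
        by_key = defaultdict(list)
        for key, value in data:
            if key in join_filter:
                by_key[key].append(value)
        grouped.append(by_key)

    # Exact key matching removes any Bloom filter false positives.
    common = set(grouped[0]).intersection(*grouped[1:])

    # Stage 2: treat each join key as a stratum and draw (with replacement)
    # from that key's cross product instead of materializing it.
    sample = {}
    for key in common:
        value_lists = [g[key] for g in grouped]
        size = 1
        for vals in value_lists:
            size *= len(vals)
        k = max(1, round(fraction * size))
        sample[key] = [
            tuple(rng.choice(vals) for vals in value_lists) for _ in range(k)
        ]
    return sample
```

Calling `approx_join([a, b, c], 0.1, random.Random(0))` on three lists of `(key, value)` pairs joins all three in one pass rather than as two binary joins. Each output tuple holds one value per input dataset, and every key that appears in all inputs is represented by at least one sampled tuple.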

Recent Publications

August 09, 2017

A Cloud Native Approach to 5G Network Slicing

  • Francini A.
  • Miller R.
  • Sharma S.

5G networks will have to support a set of very diverse and often extreme requirements. Network slicing offers an effective way to unlock the full potential of 5G networks and meet those requirements on a shared network infrastructure. This paper presents a cloud native approach to network slicing. The cloud ...

August 01, 2017

Modeling and simulation of RSOA with a dual-electrode configuration

  • De Valicourt G.
  • Liu Z.
  • Violas M.
  • Wang H.
  • Wu Q.

Based on the physical model of a bulk reflective semiconductor optical amplifier (RSOA) used as a modulator in radio over fiber (RoF) links, the distributions of carrier density, signal photon density, and amplified spontaneous emission photon density are demonstrated. One of the limits in the use of RSOA is the lower ...

July 12, 2017

PrivApprox: Privacy-Preserving Stream Analytics

  • Beck M.
  • Bhatotia P.
  • Chen R.
  • Fetzer C.
  • Le D.
  • Strufe T.

How can users' privacy be preserved while supporting high-utility analytics for low-latency stream processing? To answer this question, we describe the design, implementation, and evaluation of PRIVAPPROX, a data analytics system for privacy-preserving stream processing. PRIVAPPROX provides three properties: (i) Privacy: zero-knowledge privacy (ezk) guarantees for users, a privacy bound tighter ...