October 08, 2018

ApproxJoin: Approximate Distributed Multi-way Joins

  • Akkus I.
  • Bhatotia P.
  • Blanas S.
  • Chen R.
  • Fetzer C.
  • Le Quoc D.
  • Strufe T.

The join operation is a fundamental building block of parallel data processing. Unfortunately, computing an equi-join across massive datasets is very resource-intensive. The approximate computing paradigm allows users to trade accuracy for lower latency in expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is a compelling but challenging task with respect to output quality. Naive approaches, which perform joins over samples of the input datasets, do not preserve the statistical properties of the join output. To realize this potential, we combine Bloom filter sketching and stratified sampling into a novel approximate join operator. This join operator uses the Bloom filter to avoid shuffling non-joinable tuples across the network and then applies stratified sampling to obtain an unbiased, representative sample of the join output. Rather than decomposing multi-way joins into a sequence of binary joins, our design joins multiple datasets in a single processing step. Our analysis shows that the technique scales well and significantly reduces data movement without sacrificing tight error bounds on the accuracy of the final output. We implemented ApproxJoin in Apache Spark and evaluated it using micro-benchmarks and real-world case studies. The evaluation shows that ApproxJoin achieves a speedup of 6-9 times over unmodified Spark-based joins with a sampling fraction of 10%. Furthermore, the speedup is accompanied by a significant reduction in shuffled data: 5-82 times less than unmodified Spark-based joins.
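The two ingredients named in the abstract, Bloom filter pre-filtering and per-key stratified sampling, can be illustrated with a small single-machine sketch. This is not the ApproxJoin implementation, which runs on Apache Spark. The `BloomFilter` class, the `approx_join` function, the filter size, the hash scheme, and the idea of combining per-dataset filters with a bitwise AND are all assumptions made for illustration. The sketch treats each join key as one stratum and materializes the full per-key cross product before sampling, which a real system would presumably avoid.

```python
import hashlib
import random
from collections import defaultdict
from itertools import product


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # May return false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND of same-shaped filters: keys present in both inputs still pass.
        out = BloomFilter(self.num_bits, self.num_hashes)
        out.bits = self.bits & other.bits
        return out


def approx_join(datasets, fraction, seed=0):
    """Sample a multi-way equi-join over lists of (key, value) pairs, one stratum per key."""
    rng = random.Random(seed)

    # Step 1: build one filter per dataset and AND them into a single join filter.
    join_filter = None
    for data in datasets:
        bf = BloomFilter()
        for key, _ in data:
            bf.add(key)
        join_filter = bf if join_filter is None else join_filter.intersect(bf)

    # Step 2: drop tuples whose key cannot join; group the survivors by key.
    groups = []
    for data in datasets:
        by_key = defaultdict(list)
        for key, value in data:
            if join_filter.might_contain(key):
                by_key[key].append(value)
        groups.append(by_key)

    # Step 3: keep keys present in every dataset (this removes Bloom false
    # positives), then sample each key's cross product without replacement.
    common = set(groups[0]).intersection(*groups[1:])
    sample = {}
    for key in sorted(common):
        stratum = list(product(*(g[key] for g in groups)))
        k = max(1, round(fraction * len(stratum)))
        sample[key] = rng.sample(stratum, k)
    return sample


if __name__ == "__main__":
    r1 = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
    r2 = [(1, "x"), (2, "y"), (2, "z"), (4, "w")]
    r3 = [(1, "p"), (2, "q"), (5, "r")]
    print(approx_join([r1, r2, r3], fraction=0.5))
```

All three inputs are handled in one pass here rather than as two successive binary joins, mirroring the single-step multi-way join the abstract describes. Sampling within each key keeps every joinable key represented in the output, which plain uniform sampling of the inputs would not guarantee.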

Recent Publications

January 01, 2019

Friendly, appealing or both? Characterising user experience in sponsored search landing pages

  • Bron M.
  • Chute M.
  • Evans H.
  • Lalmas M.
  • Redi M.
  • Silvestri F.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. Many of today's websites have recognised the importance of mobile friendly pages to keep users engaged and to provide a satisfying user experience. However, next to the experience provided by the sites themselves, ...

January 01, 2019

Analyzing Uber's ride-sharing economy

  • Aiello L.
  • Djuric N.
  • Grbovic M.
  • Kooti F.
  • Lerman K.
  • Radosavljevic V.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. Uber is a popular ride-sharing application that matches people who need a ride (or riders) with drivers who are willing to provide it using their personal vehicles. Despite its growing popularity, there exist ...

January 01, 2019

The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race

  • Cresci S.
  • Petrocchi M.
  • Pietro R.
  • Spognardi A.
  • Tesconi M.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. Recent studies in social media spam and automation provide anecdotal argumentation of the rise of a new generation of spambots, so-called social spambots. Here, for the first time, we extensively study this novel ...