High Throughput Heavy Hitter Aggregation

Orestis Polychroniou; Kenneth A. Ross

High Throughput Heavy Hitter Aggregation
Polychroniou, Orestis
Ross, Kenneth A.
Technical reports
Computer Science
Permanent URL:
Columbia University Computer Science Technical Reports
Part Number:
Department of Computer Science, Columbia University
Publisher Location:
New York
Heavy hitters are data items that occur at high frequency in a data set. Heavy hitters are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory. We design cache-resident, shared-nothing structures that hold only the most frequent elements from the table. Our approach works in three phases. It first samples and picks heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these candidates. Finally, if necessary, a validation step identifies the true heavy hitters from among the candidates based on the query specification. We identify trade-offs between the hash table capacity and performance. Capacity determines how many candidates can be aggregated. We optimize performance by the use of perfect hashing and SIMD instructions. SIMD instructions are utilized in novel ways to minimize cache accesses, be- yond simple vectorized operations. We use bucketized and cuckoo hash tables to increase capacity, to adapt to different datasets and query constraints. The performance of our method is an order of magnitude faster than in-memory aggregation over a complete set of items if those items cannot be cache resident. Even for item sets that are cache resident, our SIMD techniques enable significant performance improvements over previous work.
Computer science
Item views:
text | xml
Suggested Citation:
Orestis Polychroniou, Kenneth A. Ross, 2012, High Throughput Heavy Hitter Aggregation, Columbia University Academic Commons, http://hdl.handle.net/10022/AC:P:14295.

In Partnership with the Center for Digital Research and Scholarship at Columbia University Libraries/Information Services | Terms of Use | Copyright