High Throughput Heavy Hitter Aggregation
Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory. We design cache-resident, shared-nothing structures that hold only the most frequent elements of the table. Our approach works in three phases. It first samples the input and picks heavy hitter candidates. It then builds a hash table and computes exact aggregates for these candidates. Finally, if necessary, a validation step identifies the true heavy hitters among the candidates based on the query specification. We identify trade-offs between hash table capacity and performance: capacity determines how many candidates can be aggregated. We optimize performance through perfect hashing and SIMD instructions, using SIMD in novel ways to minimize cache accesses beyond simple vectorized operations. We use bucketized and cuckoo hash tables to increase capacity and to adapt to different data sets and query constraints. Our method is an order of magnitude faster than in-memory aggregation over the complete set of items when those items cannot be cache resident. Even when the item set is cache resident, our SIMD techniques yield significant performance improvements over previous work.
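The three-phase structure described above (sample candidates, aggregate them exactly in a small table, then validate against the query's frequency threshold) can be illustrated with a minimal sketch. The sketch below is only an assumption-laden illustration of the general idea, not the paper's implementation: it uses plain Python dictionaries where the report uses cache-resident perfect-hashed, bucketized, or cuckoo hash tables with SIMD, and the function name, sample rate, capacity, and threshold parameters are all hypothetical.

```python
import random
from collections import Counter

def heavy_hitter_aggregate(items, values, capacity=64, sample_rate=0.01,
                           min_frequency=0.001, seed=0):
    """Return {item: (count, sum)} for items whose frequency >= min_frequency.

    Illustrative sketch of the three-phase approach; parameter names and
    defaults are assumptions, not taken from the report.
    """
    n = len(items)
    rng = random.Random(seed)

    # Phase 1: sample the input and keep the most frequent sampled items as
    # candidates; 'capacity' stands in for the size of a cache-resident table.
    sample_size = max(1, int(n * sample_rate))
    sample = [items[rng.randrange(n)] for _ in range(sample_size)]
    candidates = {item for item, _ in Counter(sample).most_common(capacity)}

    # Phase 2: one pass over the full input, aggregating only the candidates.
    counts = {c: 0 for c in candidates}
    sums = {c: 0.0 for c in candidates}
    for item, value in zip(items, values):
        if item in counts:
            counts[item] += 1
            sums[item] += value

    # Phase 3: validation -- the exact counts reveal which candidates truly
    # meet the query's frequency threshold; false candidates are dropped.
    # (Heavy hitters the sample missed would need a fallback pass, omitted here.)
    threshold = min_frequency * n
    return {c: (counts[c], sums[c]) for c in candidates if counts[c] >= threshold}
```

In this toy form the candidate table is an ordinary dictionary; the report's contribution lies in making that table cache resident and probing it with perfect hashing and SIMD, which the sketch does not attempt to model.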
Files
- cucs-009-12-1.pdf (PDF, 620 KB)
More About This Work
- Academic Units: Computer Science
- Publisher: Department of Computer Science, Columbia University
- Series: Columbia University Computer Science Technical Reports, CUCS-009-12
- Published Here: August 1, 2012