1998 Reports

# Optimizing Selections over Data Cubes

Datacube queries compute aggregates over data base relations at a variety of granularities, and they constitute an important class of decision support queries. Often one wants only datacube output tuples whose aggregate value satisfies a certain condition, such as exceeding a given threshold. For example, one might ask for all combinations of model, color, and year of cars(including the special value "ALL" for each of the dimensions) for which the total sales exceeded a given amount of money. Computing a selection over a datacube can naively be done by computing the entire datacube and checking if the selection condition holds for each tuple in the result. However, it is often the case that selections are relatively restrictive, meaning that a lot of work computing datacube tuples is "wasted" since those tuples don't satisfy the selection condition. Our approach is to develop algorithms for processing a datacube query using the selection condition internally during the computation. By making use of the selection condition within the datacube computation,we can safely prune parts of the computation and end up with a more efficient computation of the answer. Our first technique, called"specialization", uses the fact that a tuple in the datacube does not meet the given threshold to infer that all finer level aggregates cannot meet the threshold. We propose a scheme of specialization transformations on the underlying data sets, using properties of the aggregates and threshold functions. Our second technique is called "generalization", and applies in the case where the actual value of the aggregate is not needed in the output,but used just to compare with the threshold. Generalization uses the fact that a tuple meets the given threshold to infer that all coarser level aggregates also meet the threshold. We also propose a scheme of generalization transformations. We demonstrate the efficiency of these techniques by implementing them within the sparse datacube algorithm of Ross and Srivastava. We present a performance study using synthetic and real-world data sets.Our results indicate substantial performance improvements for queries with selective conditions.

## Subjects

## Files

- cucs-018-98.pdf application/pdf 248 KB Download File

## More About This Work

- Academic Units
- Computer Science
- Publisher
- Department of Computer Science, Columbia University
- Series
- Columbia University Computer Science Technical Reports, CUCS-018-98
- Published Here
- April 25, 2011