Theses Doctoral

Modern Analytics over Wide-Tables

Huang, Zezhou

Enterprises are eager to leverage tables from diverse sources for decision-making. Recent advancements in cloud data warehouses, offering the abstraction of shared storage with unlimited on-demand computing, provide the ideal architecture for these needs. However, individual analysts often find the sheer number of tables overwhelming. What they prefer instead is a simple Wide-Table, where all relevant tables have been comprehensively integrated into a single table. Such a Wide-Table abstraction, originating from the 1982 concept of the universal relation, simplifies data analysis by allowing analysts to focus on key business metrics and dimensions without the hassle of navigating the tables. To support this abstraction, modern business intelligence tools like PowerBI and Tableau implement "semantic layers" that allow business users to declaratively build Wide-Tables by specifying relationships between tables, defining metrics to compute, and selecting dimensions to group by.

Despite this convenience, Wide-Tables are expensive to serve: they are defined as views over joins across tens or even hundreds of underlying tables. Current analytics systems execute these join queries naively on-the-fly, which is notoriously complex and costly in both optimization and execution. Modern analytics requires both interactive query performance for dashboard interactions and the ability to process large-scale data for business intelligence and machine learning applications. The growing demand to incorporate additional data sources, including external and streaming data, further compounds these challenges. This widening gap between current methods and modern requirements necessitates a fundamental redesign of systems to support the Wide-Table abstraction effectively.

This thesis introduces the Calibrated Junction Hypertree (CJT), a novel data structure that enhances analytics over Wide-Tables. While CJT originated from probabilistic graphical models for efficient inference over joint probabilities (similar to aggregations over joins), it was previously limited to probability summation operations. We present the extension of CJT to support general Selection-Projection-Join-Aggregation (SPJA) queries with semi-ring aggregations, enabling common operations like SUM, AVG, MIN, and MAX in DBMSes. We implement CJT as a lightweight query rewriting layer that operates on top of existing DBMSes and data warehouses without requiring internal modifications. Building upon CJT's foundation, we develop a spectrum of Wide-Table analytics applications.
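To make the idea of aggregating over a join with a semi-ring more concrete, the following is a minimal sketch, not the thesis implementation. It assumes two hypothetical toy tables, `sales` and `stores`, and uses the ordinary (+, ×) semi-ring for SUM. It compares materializing the join before aggregating with first aggregating one table into a per-key "message" and then combining that message with the other table.

```python
from collections import defaultdict

# Hypothetical toy tables (not from the thesis):
#   sales(store_id, amount) and stores(store_id, region)
sales = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]
stores = [(1, "east"), (2, "west"), (3, "east")]


def naive_sum_by_region(sales, stores):
    """Materialize every joined row, then aggregate SUM(amount) per region."""
    totals = defaultdict(float)
    for s_id, amount in sales:
        for t_id, region in stores:
            if s_id == t_id:
                totals[region] += amount
    return dict(totals)


def message_passing_sum_by_region(sales, stores):
    """Aggregate sales per join key first (the message), then join with stores."""
    # Message from sales to stores: semi-ring addition per join key.
    message = defaultdict(float)
    for s_id, amount in sales:
        message[s_id] += amount

    # Combine at stores: each store row carries annotation 1.0, and
    # semi-ring multiplication merges it with the incoming message.
    totals = defaultdict(float)
    for t_id, region in stores:
        if t_id in message:
            totals[region] += 1.0 * message[t_id]
    return dict(totals)


naive = naive_sum_by_region(sales, stores)
pushed = message_passing_sum_by_region(sales, stores)
print(naive, pushed)
```

Both functions return the same totals, but the second never builds the joined rows. Swapping the addition and multiplication operators (for example, `min` and `+`) would give other semi-ring aggregates under the same structure.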

First, we present Treant, an interactive dashboard accelerator for Wide-Tables. Leveraging CJT's capabilities, Treant optimizes dashboard performance by exploiting incremental user interactions and utilizing natural user think-times for efficient materialization. The system operates through a two-stage process: offline preprocessing of dashboard queries to build initial CJTs, and online execution that shares computation between queries via message passing. Our evaluation across both single-node and cloud data warehouse environments demonstrates that Treant achieves >100x improvement in dashboard interaction speed compared to traditional query execution methods.
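The offline/online split described above can be illustrated with a small sketch. It is an assumption-laden toy, not Treant's code: the tables `orders` and `customers`, the function names, and the two-table join are all hypothetical. The point is that a message computed once offline can be reused by later interactions, and that a filter on one table only invalidates the messages leaving that table.

```python
from collections import defaultdict

# Hypothetical toy tables (not from the thesis):
#   orders(cust_id, product, amount) and customers(cust_id, city)
orders = [(1, "pen", 4.0), (1, "ink", 6.0), (2, "pen", 3.0)]
customers = [(1, "NYC"), (2, "SF"), (3, "NYC")]


def msg_orders_to_customers(orders):
    """Message orders -> customers: SUM(amount) per cust_id."""
    msg = defaultdict(float)
    for cust_id, _product, amount in orders:
        msg[cust_id] += amount
    return dict(msg)


def msg_customers_to_orders(customers, city=None):
    """Message customers -> orders: COUNT per cust_id, under an optional city filter."""
    msg = defaultdict(int)
    for cust_id, c in customers:
        if city is None or c == city:
            msg[cust_id] += 1
    return dict(msg)


# Offline: compute and cache the message that does not depend on customer filters.
cached_orders_msg = msg_orders_to_customers(orders)


def total_by_city(customers, orders_msg, city=None):
    """Online query at the customers node: reuses the cached orders message."""
    out = defaultdict(float)
    for cust_id, c in customers:
        if (city is None or c == city) and cust_id in orders_msg:
            out[c] += orders_msg[cust_id]
    return dict(out)


def total_by_product(orders, customers, city=None):
    """Online query at the orders node: only the customers message is recomputed."""
    cust_msg = msg_customers_to_orders(customers, city)
    out = defaultdict(float)
    for cust_id, product, amount in orders:
        if cust_id in cust_msg:
            out[product] += amount * cust_msg[cust_id]
    return dict(out)


by_city = total_by_city(customers, cached_orders_msg)
by_city_nyc = total_by_city(customers, cached_orders_msg, city="NYC")
by_product_nyc = total_by_product(orders, customers, city="NYC")
print(by_city, by_city_nyc, by_product_nyc)
```

In this sketch, changing the city filter never touches `cached_orders_msg`; only the small customers-side message is rebuilt, which is the kind of incremental reuse the paragraph above refers to.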

Second, we present JoinBoost, an in-DB ML system for tree-based models (Gradient Boosting and Random Forests) over Wide-Tables. Unlike previous in-DB ML systems that either extract data from DBMSes or modify their internals, JoinBoost leverages CJT to accelerate training through pure query rewriting, enabling large-scale model training while maintaining data security in cloud warehouses. Our multi-node evaluation shows that JoinBoost achieves >9x speedup compared to LightGBM and XGBoost. Building upon JoinBoost's foundation, we develop Reptile, a hierarchical data explanation system for Wide-Tables. Reptile employs models trained on Wide-Table features to repair aggregates and identify potential errors. When evaluated on real-world COVID-19 data, Reptile successfully identifies 21 out of 30 errors, significantly outperforming existing approaches, which detect only 2 errors.
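One way tree training can be expressed as aggregation over joins is through a variance semi-ring, whose elements carry a count, a sum, and a sum of squares. The sketch below is an illustration under that assumption and is not taken from JoinBoost's code; the `Var` class, the toy `fact` and `dim` tables, and the split on feature `f` are hypothetical. It evaluates the squared-error reduction of a split on a dimension-table feature without materializing the join.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Variance semi-ring element: count, sum of y, sum of y squared."""
    c: float
    s: float
    q: float

    def __add__(self, o):
        # Union of two groups of rows.
        return Var(self.c + o.c, self.s + o.s, self.q + o.q)

    def __mul__(self, o):
        # Join of two groups, where joined rows have target y1 + y2.
        return Var(
            self.c * o.c,
            self.s * o.c + o.s * self.c,
            self.q * o.c + o.q * self.c + 2 * self.s * o.s,
        )

    def sse(self):
        """Sum of squared errors around the group mean."""
        return self.q - self.s * self.s / self.c


ZERO = Var(0.0, 0.0, 0.0)
ONE = Var(1.0, 0.0, 0.0)  # annotation for rows that carry no target value

# Hypothetical toy tables: fact(key, y) holds the target, dim(key, f) a feature.
fact = [(1, 1.0), (1, 3.0), (2, 10.0), (3, 12.0)]
dim = [(1, 0), (2, 1), (3, 1)]

# Message fact -> dim: one semi-ring element per join key.
message = defaultdict(lambda: ZERO)
for key, y in fact:
    message[key] = message[key] + Var(1.0, y, y * y)

# Aggregate per feature value at dim, multiplying in the incoming message.
per_feature = defaultdict(lambda: ZERO)
for key, f in dim:
    if key in message:
        per_feature[f] = per_feature[f] + ONE * message[key]

total = per_feature[0] + per_feature[1]
sse_before = total.sse()
sse_after = per_feature[0].sse() + per_feature[1].sse()
reduction = sse_before - sse_after
print(sse_before, sse_after, reduction)
```

Because each step is a group-by aggregate over one table, the same computation could in principle be written as SQL and run inside a DBMS, which is the spirit of the query-rewriting approach described above.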

Next, we present Kitana, a data-centric AutoML system that addresses the limitations of traditional model-centric approaches. While conventional AutoML systems focus exclusively on model search and are constrained by training data quality, Kitana adopts a balanced strategy that allocates time between data augmentation (discovering new features and examples) and model search. Leveraging CJT and tailored materialization, Kitana achieves >1000x acceleration in the data search process compared to previous data augmentation techniques. Building upon Kitana's foundation, we develop Saibot, which extends the search capabilities with differential privacy (DP) guarantees. Saibot introduces a novel DP mechanism that efficiently computes semi-ring statistics across diverse data augmentations for ML tasks. Our evaluation demonstrates that Saibot outperforms state-of-the-art DP mechanisms by multiple orders of magnitude.

We conclude by examining the future directions for Wide-Table analytics. While CJT effectively addresses online query processing performance, transforming raw data from diverse sources into tables usable for Wide-Table analytics remains predominantly manual and time-intensive, requiring semantic understanding of both the data and business concepts. Large Language Models (LLMs) offer a promising solution, as our preliminary research demonstrates significant improvements in relationalization, data cleaning, Text-to-SQL, and data transformation tasks. These developments foreshadow a more automated form of business intelligence where analysts can incorporate new data sources without extensive manual intervention. Our future work aims to explore extensions to CJT with more advanced semi-ring aggregates for a broader range of applications, and to develop an end-to-end automated Wide-Table construction pipeline that addresses both the online analytics capabilities enabled by CJT and the offline construction challenges that currently constrain Wide-Table analytics.

Files

  • Huang_columbia_0054D_19287.pdf (application/pdf, 3.84 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Wu, Eugene
Degree
Ph.D., Columbia University
Published Here
July 30, 2025