2022 Theses Doctoral
Efficient Machine Teaching Frameworks for Natural Language Processing
The past decade has seen tremendous growth in potential applications of language technologies in our daily lives due to increasing data, computational resources, and user interfaces. An important step to support emerging applications is the development of algorithms for processing the rich variety of human-generated text and extracting relevant information. Machine learning, especially deep learning, has seen increasing success on various text benchmarks. However, while standard benchmarks have static tasks with expensive human-labeled data, real-world applications are characterized by dynamic task specifications and limited resources for data labeling, thus making it challenging to transfer the success of supervised machine learning to the real world. To deploy language technologies at scale, it is crucial to develop alternative techniques for teaching machines beyond data labeling.
In this dissertation, we address this data labeling bottleneck by studying and presenting resource-efficient frameworks for teaching machine learning models to solve language tasks across diverse domains and languages. Our goal is to (i) support emerging real-world problems without the expensive requirement of large-scale manual data labeling; and (ii) assist humans in teaching machines via more flexible types of interaction. Towards this goal, we describe our collaborations with experts across domains (including public health, earth sciences, news, and e-commerce) to integrate weakly-supervised neural networks into operational systems, and we present efficient machine teaching frameworks that leverage flexible forms of declarative knowledge as supervision: coarse labels, large hierarchical taxonomies, seed words, bilingual word translations, and general labeling rules.
First, we present two neural network architectures that we designed to leverage weak supervision in the form of coarse labels and hierarchical taxonomies, respectively, and highlight their successful integration into operational systems. Our Hierarchical Sigmoid Attention Network (HSAN) learns to highlight important sentences of potentially long documents without sentence-level supervision by, instead, using coarse-grained supervision at the document level. HSAN improves over previous weakly supervised learning approaches across sentiment classification benchmarks and has been deployed to help inspections in health departments for the discovery of foodborne illness outbreaks. We also present TXtract, a neural network that extracts attributes for e-commerce products from thousands of diverse categories without using manually labeled data for each category, by instead considering category relationships in a hierarchical taxonomy. TXtract is a core component of Amazon’s AutoKnow, a system that collects knowledge facts for over 10K product categories, and serves such information to Amazon search and product detail pages.
Second, we present architecture-agnostic machine teaching frameworks that we applied across domains, languages, and tasks. Our weakly-supervised co-training framework can train any type of text classifier using just a small number of class-indicative seed words and unlabeled data. In contrast to previous work that uses seed words to initialize embedding layers, our iterative seed word distillation (ISWD) method leverages the predictive power of seed words as supervision signals and shows strong performance improvements for aspect detection in reviews across domains and languages. We further demonstrate the cross-lingual transfer abilities of our co-training approach via cross-lingual teacher-student (CLTS), a method for training document classifiers across diverse languages using labeled documents only in English and a limited budget for bilingual translations. Not all classification tasks, however, can be effectively addressed using human supervision in the form of seed words. To capture a broader variety of tasks, we present weakly-supervised self-training (ASTRA), a weakly-supervised learning framework for training a classifier using more general labeling rules in addition to labeled and unlabeled data. As a complete set of accurate rules may be hard to obtain all in one shot, we further present an interactive framework that assists human annotators by automatically suggesting candidate labeling rules.
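The seed-word supervision described above can be illustrated with a small sketch. This is not the ISWD implementation from the thesis: it runs a single teacher-to-student round with no iterative refinement, and it swaps in a multinomial naive Bayes student for simplicity. The seed words, class names, and function names are hypothetical and chosen only for illustration. A teacher labels unlabeled documents by counting seed words and abstains when no seed word occurs or when classes tie; a student is then trained on the teacher's pseudo-labels so that it can also classify documents containing no seed words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed words for a toy aspect-detection task (not from the thesis).
SEED_WORDS = {
    "price": {"cheap", "expensive", "price"},
    "quality": {"sturdy", "broke", "quality"},
}


def teacher_label(tokens, seed_words=SEED_WORDS):
    """Return the class with the most seed-word hits, or None to abstain
    (no seed word present, or a tie between classes)."""
    counts = {c: sum(t in words for t in tokens) for c, words in seed_words.items()}
    best = max(counts, key=counts.get)
    top = counts[best]
    if top == 0 or list(counts.values()).count(top) > 1:
        return None
    return best


def train_student(docs, labels):
    """Fit a multinomial naive Bayes student on teacher-labeled documents."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in zip(docs, labels):
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def student_predict(model, tokens):
    """Predict the most likely class, using add-one smoothing and
    ignoring tokens that never appeared in the pseudo-labeled data."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for t in tokens:
            if t in vocab:
                score += math.log((word_counts[c][t] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


def seed_word_teacher_student(unlabeled_docs):
    """One teacher-to-student round: pseudo-label with seed words, then
    train the student only on documents the teacher did not abstain on."""
    docs, labels = [], []
    for tokens in unlabeled_docs:
        label = teacher_label(tokens)
        if label is not None:
            docs.append(tokens)
            labels.append(label)
    return train_student(docs, labels)
```

Given tokenized reviews, `seed_word_teacher_student(docs)` returns a model that `student_predict` can apply to new text. The student picks up words that co-occur with the seed words, which is why it can label a document such as "fast shipping" even though the teacher abstains on it.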
In conclusion, this thesis demonstrates the benefits of teaching machines with different types of interaction than the standard data labeling paradigm and shows promising results for new applications across domains and languages. To facilitate future research, we publish our code implementations and design new challenging benchmarks with various types of supervision. We believe that our proposed frameworks and experimental findings will influence research and will enable new applications of language technologies without the costly requirement of large manually labeled datasets.
- Karamanolakis_columbia_0054D_17460.pdf (application/pdf, 3.52 MB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Gravano, Luis; Hsu, Daniel
- Degree: Ph.D., Columbia University
- Published Here: September 7, 2022