Theses Doctoral

Toward Annotation Efficiency in Biased Learning Settings for Natural Language Processing

Effland, Thomas

The goal of this thesis is to improve the feasibility of building applied NLP systems for more diverse and niche real-world use-cases of extracting structured information from text. A core factor in determining this feasibility is the cost of manually annotating enough unbiased labeled data to achieve a desired level of system accuracy, and our goal is to reduce this cost. We focus on reducing this cost by making contributions in two directions: (1) easing the annotation burden by leveraging high-level expert knowledge in addition to labeled examples, thus making approaches more annotation-efficient; and (2) mitigating known biases in cheaper, imperfectly labeled real-world datasets so that we may use them to our advantage. A central theme of this thesis is that high-level expert knowledge about the data and task can allow for biased labeling processes that focus experts on only manually labeling aspects of the data that cannot be easily labeled through cheaper means. This combination allows for more accurate models with less human effort. We conduct our research on this general topic through three diverse problems with immediate applications to real-world settings.

First, we study an applied problem in biased text classification. We encounter a rare-event text classification system that has been deployed for several years. We are tasked with improving this system's performance using only the severely biased incidental feedback provided by the experts over years of system use. We develop a method that combines importance weighting and an unlabeled data imputation scheme that exploits the selection-bias of the feedback to train an unbiased classifier without requiring additional labeled data. We experimentally demonstrate that this method considerably improves the system performance.

Second, we tackle an applied problem in named entity recognition (NER) concerning learning tagging models from data that have very low recall for annotated entities. To solve this issue we propose a novel loss, the Expected Entity Ratio (EER), that uses an uncertain estimate of the proportion of entities in the data to counteract the false-negative bias in the data, encouraging the model to have the correct ratio of entities in expectation. We justify the principles of our approach by providing theory that shows it recovers the true tagging distribution under mild conditions. Additionally we provide extensive empirical results that show it to be practically useful. Empirically, we find that it meets or exceeds performance of state-of-the-art baselines across a variety of languages, annotation scenarios, and amounts of labeled data. We also show that, when combined with our approach, a novel sparse annotation scheme can outperform exhaustive annotation for modest annotation budgets.

Third, we study the challenging problem of syntactic parsing in low-resource languages. We approach the problem from a cross-lingual perspective, building on a state-of-the-art transfer-learning approach that underperforms on ``distant'' languages that have little to no representation in the training corpus. Motivated by the field of syntactic typology, we introduce a general method called Expected Statistic Regularization (ESR) to regularize the parser on distant languages according to their expected typological syntax statistics. We also contribute general approaches for estimating the loss supervision parameters from the task formalism or small amounts of labeled data. We present seven broad classes of descriptive statistic families and provide extensive experimental evidence showing that using these statistics for regularization is complementary to deep learning approaches in low-resource transfer settings.

In conclusion, this thesis contributes approaches for reducing the annotation cost of building applied NLP systems through the use of high-level expert knowledge to impart additional learning signal on models and cope with cheaper biased data. We publish implementations of our methods and results, so that they may facilitate future research and applications. It is our hope that the frameworks proposed in this thesis will help to democratize access to NLP for producing structured information from text in wider-reaching applications by making them faster and cheaper to build.


  • thumnail for Effland_columbia_0054D_17587.pdf Effland_columbia_0054D_17587.pdf application/pdf 864 KB Download File

More About This Work

Academic Units
Computer Science
Thesis Advisors
Mckeown, Kathy
Ph.D., Columbia University
Published Here
November 23, 2022