Theses Doctoral

Pattern Mining and Concept Discovery for Multimodal Content Analysis

Li, Hongzhi

With recent advances in computer vision, researchers have demonstrated near-human-level performance on difficult tasks such as image recognition. For example, for images taken under typical conditions, computer vision systems can now recognize whether a dog, cat, or car appears in an image. These advances are made possible by massive image datasets with label annotations, which include category labels and sometimes bounding boxes around the objects of interest. However, a major limitation of current solutions is that when recognition models are applied to a new domain, users must manually define the target classes and label the training data needed to train the models. Manually identifying the target classes and constructing a concept ontology for a new domain is time-consuming, because it requires familiarity with the content of the image collection, and the manual process of defining target classes does not scale to a large number of classes. In addition, there has been significant interest in developing knowledge bases to improve content analysis and information retrieval. A knowledge base is an object model (ontology) with classes, subclasses, attributes, instances, and relations among them; the knowledge base generation problem is to identify the (sub)classes and their structured relations for a given domain of interest. Like ontology construction, knowledge base generation is usually performed manually by human experts, which makes it a time-consuming and difficult task. It is therefore important to find a way to discover the semantic concepts and structural relations that matter for a target data collection or domain of interest, so that an ontology or knowledge base for visual or multimodal content can be constructed automatically or semi-automatically.
Visual patterns are discriminative and representative image content, such as objects or local image regions, that recur across an image collection. They can also be used to summarize the major visual concepts in the collection. Automatic discovery of visual patterns can therefore help users understand the content and structure of a data collection, and in turn help them construct the ontology and knowledge base mentioned above.
In this dissertation, we aim to answer the following question: given a new target domain and associated data corpora, how do we rapidly discover content patterns that are semantically coherent, visually consistent, and can be automatically named with semantic concepts related to the events of interest in the target domain? We develop pattern discovery methods that operate on visual content alone as well as on multimodal data combining text and images.
Traditional visual pattern mining methods focus only on the visual content and cannot automatically name the patterns they discover. To address this shortcoming, we propose a new multimodal visual pattern mining and naming method. The named visual patterns can be used as discovered semantic concepts relevant to the target data corpora. By combining information from multiple modalities, we ensure that the discovered patterns are not only visually similar but also semantically consistent. The capability to accurately name visual patterns is also important for finding relevant classes and attributes during the knowledge base construction process mentioned earlier.
Our framework contains a visual model and a text model that jointly represent the text and visual content. We use this joint multimodal representation together with association rule mining to discover semantically coherent and visually consistent visual patterns. To discover better visual patterns, we further improve the visual model in the multimodal visual pattern mining pipeline by developing a convolutional neural network (CNN) architecture that allows for the discovery of scale-invariant patterns. In this dissertation, we use news as an example domain and image-caption pairs as example multimodal corpora to demonstrate the effectiveness of the proposed methods. However, the overall framework is general and can be easily extended to other domains.
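To make the mining step concrete, the following is a minimal sketch of how association rule mining can link visual and textual items, assuming each image-caption pair has already been converted into a "transaction" of items: indices of the most strongly activated CNN filters (visual items) plus caption keywords (text items). The item names, thresholds, and the use of the mlxtend library are illustrative assumptions, not the dissertation's exact implementation.

```python
# Sketch: mine multimodal patterns from image-caption "transactions".
# Visual items ("filter_17") and text items ("word:protest") are assumed
# to have been extracted beforehand; values below are toy examples.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

transactions = [
    ["filter_17", "filter_203", "word:protest", "word:crowd"],
    ["filter_17", "filter_95",  "word:protest", "word:police"],
    ["filter_17", "filter_203", "word:crowd"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Frequent itemsets that mix visual and text items are candidate multimodal patterns.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)

# High-confidence rules from visual items to text items suggest names for the patterns.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

In this toy example, a rule such as {filter_17, filter_203} → {word:crowd} would indicate a visually consistent pattern that can be named with the word "crowd".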
The problem of concept discovery is made more challenging if the target application domain involves fine-grained object categories (e.g., highly related dog categories or consumer product categories). In such cases, the content of different classes can be quite similar, making automatic separation of the classes difficult. In the proposed multimodal pattern mining framework, the representation models for visual and text data play an important role, as they shape the pool of candidates that are fed to the pattern mining process. General models, such as CNNs trained on ImageNet, generalize well to many domains but are unable to capture the small differences within a fine-grained dataset. To address this problem, we propose a new representation model that uses an end-to-end neural network architecture to discover visual patterns. Because the model can be fine-tuned on a fine-grained dataset, its convolutional layers can be optimized to capture the features and patterns specific to the fine-grained images, which gives it the ability to discover visual patterns in such datasets. Finally, to demonstrate the advantages of the proposed multimodal visual pattern mining and naming framework, we apply the technique to two applications. In the first, we use visual pattern mining to find visual anchors that summarize video news events. In the second, we use the discovered visual patterns as cues to link video news events to social media events.
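The following PyTorch sketch illustrates the general idea behind an end-to-end pattern-discovery network of this kind: fine-tune a CNN backbone on the fine-grained classes, then treat convolutional filters whose globally pooled responses are consistently high for a class as candidate visual patterns. The backbone choice, layer names, and class count are assumptions for illustration, not the exact PatternNet architecture.

```python
# Sketch of an end-to-end network whose conv filters can be fine-tuned on a
# fine-grained dataset; pooled per-filter responses serve as pattern scores.
import torch
import torch.nn as nn
from torchvision import models

class PatternDiscoveryNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet18(weights=None)                        # pretrained weights optional
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # keep conv feature maps
        self.pool = nn.AdaptiveMaxPool2d(1)                             # one response per filter
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        fmap = self.features(x)                  # (B, 512, H, W)
        pooled = self.pool(fmap).flatten(1)      # (B, 512) per-filter responses
        return self.classifier(pooled), pooled   # class logits + filter scores

model = PatternDiscoveryNet(num_classes=120)     # e.g., fine-grained dog categories
logits, filter_scores = model(torch.randn(2, 3, 224, 224))
# Filters whose scores are consistently high on images of one class are
# candidate fine-grained visual patterns for that class.
```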
The contributions of this dissertation can be summarized as follows: (1) We develop a novel multimodal mining framework for discovering visual patterns and nameable concepts from a collection of multimodal data and automatically naming the discovered patterns, producing a large pool of semantic concepts specifically relevant to a high-level event. The framework combines a CNN-based visual representation with an embedding-based text representation. The named visual patterns can be used to construct the event schema needed in the knowledge base construction process. (2) We propose a scale-invariant visual pattern mining model that improves the multimodal visual pattern mining framework; the improved visual model leads to better overall performance in discovering and naming concepts. To localize the discovered visual patterns, we propose a deconvolutional neural network model that localizes each visual pattern within the image (a simplified localization sketch follows below). (3) To learn directly from data in the target domain, we propose a novel end-to-end neural network architecture, called PatternNet, for finding high-quality visual patterns even in datasets that consist of fine-grained classes. (4) We demonstrate visual pattern mining in two novel applications: video news event summarization and video news event linking.
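As a rough stand-in for the deconvolutional localization described in contribution (2), the sketch below simply upsamples a pattern filter's activation map to image resolution and thresholds it to obtain an approximate pattern region; this is a simplified approximation, not the dissertation's deconvolutional network, and all names are illustrative.

```python
# Sketch: approximate localization of a discovered pattern filter by
# upsampling its activation map back to image resolution.
import torch
import torch.nn.functional as F

def localize_pattern(feature_maps: torch.Tensor, filter_idx: int,
                     image_size: tuple) -> torch.Tensor:
    """feature_maps: (B, C, H, W) conv activations; returns a (B, 1, H_img, W_img) heat map."""
    fmap = feature_maps[:, filter_idx : filter_idx + 1]     # pick the pattern filter
    heat = F.interpolate(fmap, size=image_size, mode="bilinear", align_corners=False)
    # Normalize to [0, 1] (single-image case; per-image normalization in general).
    heat = (heat - heat.amin()) / (heat.amax() - heat.amin() + 1e-8)
    return heat

# Example: highlight where filter 17 (a hypothetical discovered pattern) fires.
maps = torch.randn(1, 512, 7, 7)                            # placeholder conv activations
mask = localize_pattern(maps, filter_idx=17, image_size=(224, 224)) > 0.7
```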

Files

  • Li_columbia_0054D_13498.pdf (application/pdf, 12.7 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Chang, Shih-Fu
Degree
Ph.D., Columbia University
Published Here
August 5, 2016