Academic Commons

Presentations (Communicative Events)

Restoring Punctuation and Capitalization in Transcribed Speech

Gravano, Agustin; Jansche, Martin; Bacchiani, Michiel

Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-basedn-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to
55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.

Subjects

Files

More About This Work

Academic Units
Computer Science
Published Here
April 29, 2013
Academic Commons provides global access to research and scholarship produced at Columbia University, Barnard College, Teachers College, Union Theological Seminary and Jewish Theological Seminary. Academic Commons is managed by the Columbia University Libraries.