Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Electrical+Engineering&f%5Bsubject_facet%5D%5B%5D=Applied+mathematics&q=&rows=500&sort=record_creation_date+desc
Computational Methods for Nonlinear Optimization Problems: Theory and Applications
http://academiccommons.columbia.edu/catalog/ac:189943
Madani, Ramtin
http://dx.doi.org/10.7916/D88S4PDM
Thu, 15 Oct 2015 18:09:15 +0000
This dissertation is motivated by the lack of efficient global optimization techniques for polynomial optimization problems. The objective is twofold. First, a new mathematical foundation for obtaining a global or near-global solution will be developed. Second, several case studies will be conducted on a variety of real-world problems. Global optimization, convex relaxation and distributed computation are at the heart of this PhD dissertation. Some of the specific problems addressed in this thesis, spanning both the theory and the applications of nonlinear optimization, are explained below:
Graph theoretic algorithms for low-rank optimization problems: There is rapidly growing interest in the recovery of an unknown low-rank matrix from limited information and measurements. This problem occurs in many areas of engineering and applied science, such as machine learning, control, and computer vision. In Part I we develop a graph-theoretic technique that generates a low-rank solution for a sparse Linear Matrix Inequality (LMI), which is directly applicable to a large set of problems such as low-rank matrix completion with many unknown entries. Our approach finds a solution with a guarantee on its rank, using recent advances in graph theory.
Resource allocation for energy systems: The flows in an electrical grid are described by nonlinear AC power flow equations. Due to the nonlinear interrelation among the physical parameters of the network, the feasibility region represented by the power flow equations may be nonconvex and disconnected. The nonlinearity of the network constraints has been studied since 1962, and various heuristic and local-search algorithms have been proposed for performing optimization over an electrical grid [Baldick, 2006; Pandya and Joshi, 2008]. Part II is concerned with finding convex formulations of the power flow equations using semidefinite programming (SDP). The potential of SDP relaxation for problems in power systems was demonstrated in [Lavaei and Low, 2012], with further studies conducted in [Lavaei, 2011; Sojoudi and Lavaei, 2012]. A variety of graph-theoretic and algebraic methods are developed in Part II to facilitate fundamental yet challenging tasks such as the optimal power flow (OPF) problem, security-constrained OPF, and the classical power flow problem.
Synthesis of distributed control systems: Real-world systems mostly consist of many interconnected subsystems, and designing an optimal controller for them poses several challenges to the field of control theory. The area of distributed control was created to address the challenges arising in the control of these systems. The objective is to design a constrained controller whose structure is specified by a set of permissible interactions between the local controllers, with the aim of reducing the computation or communication complexity of the overall controller. It has long been known that the design of an optimal distributed (decentralized) controller is a daunting task because it amounts to an NP-hard optimization problem in general [Witsenhausen, 1968; Tsitsiklis and Athans, 1984]. Part III is devoted to studying the potential of the SDP relaxation for the optimal distributed control (ODC) problem. Our approach rests on formulating each of several variations of the ODC problem as a rank-constrained optimization problem from which an SDP relaxation can be derived. As the first contribution, we show that the ODC problem admits a sparse SDP relaxation with solutions of rank at most 3. Since a rank-1 SDP solution can be mapped back into a globally optimal controller, the low-rank SDP solution may be deployed to retrieve a near-global controller.
Parallel computation for sparse semidefinite programs: While small- to medium-sized semidefinite programs can be solved efficiently by second-order interior point methods in polynomial time to any arbitrary precision [Vandenberghe and Boyd, 1996a], these methods are impractical for large-scale SDPs due to computation time and memory requirements. In Part IV of this dissertation, a parallel algorithm for solving an arbitrary SDP is introduced based on the alternating direction method of multipliers. The proposed algorithm has guaranteed convergence under very mild assumptions. Each iteration of the algorithm has a simple closed-form solution, consisting of scalar multiplications and eigenvalue decompositions over matrices whose sizes are not greater than the treewidth of the sparsity graph of the SDP problem. The cheap iterations of the proposed algorithm enable solving real-world large-scale conic optimization problems.
Engineering, Applied mathematics, Computer science
rm3122
Electrical Engineering
Dissertations

Improved recognition by combining different features and different systems
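The per-iteration eigenvalue decomposition described in the parallel-SDP abstract above amounts to projecting a symmetric matrix onto the positive semidefinite cone. As a minimal illustrative sketch (not code from the dissertation; the function name is our own):

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone:
    clip negative eigenvalues at zero and reassemble."""
    A = (A + A.T) / 2.0           # symmetrize for numerical safety
    w, V = np.linalg.eigh(A)      # eigendecomposition of a symmetric matrix
    return (V * np.clip(w, 0.0, None)) @ V.T
```

For example, `project_psd(np.array([[1., 2.], [2., 1.]]))`, whose input has eigenvalues 3 and -1, returns the rank-1 matrix `[[1.5, 1.5], [1.5, 1.5]]`.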
http://academiccommons.columbia.edu/catalog/ac:148938
Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13818
Tue, 03 Jul 2012 00:00:00 +0000
Combining multiple estimators to obtain a more accurate final result is a well-known technique in statistics. In the domain of speech recognition, there are many ways in which this general principle can be applied. We have looked at several ways of combining the information from different feature representations, and used these results in the best-performing system in last year's Aurora evaluation: our entry combined feature streams after the acoustic classification stage, then used a combination of neural networks and Gaussian mixtures for more accurate modeling. These and other approaches to combination are described and compared, and some more general questions arising from the combination of information streams are considered.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Tandem connectionist feature stream extraction for conventional HMM systems
http://academiccommons.columbia.edu/catalog/ac:148941
Hermansky, Hynek; Ellis, Daniel P. W.; Sharma, Sangita
http://hdl.handle.net/10022/AC:P:13821
Tue, 03 Jul 2012 00:00:00 +0000
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Tandem acoustic modeling in large-vocabulary recognition
http://academiccommons.columbia.edu/catalog/ac:148916
Ellis, Daniel P. W.; Singh, Rita; Sivadas, Sunil
http://hdl.handle.net/10022/AC:P:13796
Mon, 02 Jul 2012 00:00:00 +0000
In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is first discriminatively trained to estimate posterior probabilities across a phone set. These are then used as feature inputs for a conventional hidden Markov model (HMM) based speech recognizer, which relearns the associations to subword units. We apply the tandem approach to the data provided for the first Speech in Noisy Environments (SPINE1) evaluation conducted by the Naval Research Laboratory (NRL) in August 2000. In our previous experience with the ETSI Aurora noisy digits (a small-vocabulary, high-noise task) the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. For SPINE1, a larger task involving more spontaneous speech, we find that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features. However, these improvements do not carry over to context-dependent models. This may be attributable to several factors which are discussed in the paper.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Error visualization for tandem acoustic modeling on the Aurora task
http://academiccommons.columbia.edu/catalog/ac:148893
Reyes-Gomez, Manuel; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13790
Mon, 02 Jul 2012 00:00:00 +0000
Tandem acoustic modeling consists of taking the outputs of a neural network discriminatively trained to estimate the phone-class posterior probabilities of speech, and using them as the input features of a conventional distribution-modeling Gaussian mixture model (GMM) speech recognizer, thereby employing two acoustic models in tandem. This structure reduces the error rate on the Aurora 2 noisy English digits task by more than 50% compared to the HTK baseline. Even though there are some reasonable hypotheses to explain this improvement, its origins are still unclear. This paper introduces the use of visualization tools for error analysis of some variations of the tandem system. The error behavior is first analyzed using word-level confusion matrices. Posteriorgrams (displays of the variation in time of per-phone posterior probabilities) provide for further analysis. The results corroborate our previous hypothesis that the gains from tandem modeling arise from the very different training and modeling schemes of the two acoustic models.
Electrical engineering, Applied mathematics
mjr59, de171
Electrical Engineering
Articles

Anchor Space for Classification and Similarity Measurement of Music
http://academiccommons.columbia.edu/catalog/ac:148885
Berenzweig, Adam; Ellis, Daniel P. W.; Lawrence, Steve
http://hdl.handle.net/10022/AC:P:13788
Mon, 02 Jul 2012 00:00:00 +0000
This paper describes a method of mapping music into a semantic space that can be used for similarity measurement, classification, and music information retrieval. The value along each dimension of this anchor space is computed as the output from a pattern classifier which is trained to measure a particular semantic feature. In anchor space, distributions that represent objects such as artists or songs are modeled with Gaussian mixture models, and several similarity measures are defined by computing approximations to the Kullback-Leibler divergence between distributions. Similarity measures are evaluated against human similarity judgements. The models are also used for artist classification to achieve 62% accuracy on a 25-artist set, and 38% on a 404-artist set (random guessing achieves 0.25%). Finally, we describe a music similarity browsing application that makes use of the fact that anchor space dimensions are meaningful to users.
Electrical engineering, Applied mathematics
alb63, de171
Electrical Engineering
Articles

Selection, Parameter Estimation, and Discriminative Training of Hidden Markov Models for General Audio Modeling
http://academiccommons.columbia.edu/catalog/ac:148890
Reyes-Gomez, Manuel; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13789
Mon, 02 Jul 2012 00:00:00 +0000
Hidden Markov models (HMMs) permit a natural and flexible way to model time-sequential data. The ease of implementing concatenation and time-warping algorithms on HMMs suits them very well for segmentation and content-based audio classification applications, as is clear from their extensive and successful use in speech recognition. Speech has a natural basic unit, the phone, which normally limits the number of models to one per phone. Moreover, knowledge of the speech structure facilitates the choice of the model parameters. When modeling generic audio, on the other hand, the lack of a natural basic unit and the absence of a clear structure make the selection and parameter estimation of an optimal set of HMMs difficult. In this paper we present different approaches to select and estimate the HMM parameters of a set of representative generic audio classes. We compare these approaches in the context of a content-based classification application using the MuscleFish database. The models are first found through frame clustering or by traditional EM techniques under specific selection criteria, such as the Bayesian information criterion. Further discriminative training of the initial models considerably improves their performance in the content-based classification task, yielding results comparable to those obtained on the same task by inherently discriminative classification methods, such as support vector machines, while preserving the intrinsic flexibility of HMMs.
Electrical engineering, Applied mathematics
mjr59, de171
Electrical Engineering
Articles

Multi-channel Source Separation by Beamforming Trained with Factorial HMMs
http://academiccommons.columbia.edu/catalog/ac:148729
Reyes-Gomez, Manuel; Raj, Bhiksha; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13730
Fri, 29 Jun 2012 00:00:00 +0000
Speaker separation has conventionally been treated as a problem of blind source separation (BSS). This approach does not utilize any knowledge of the statistical characteristics of the signals to be separated, relying mainly on the independence between the various signals to separate them. Maximum-likelihood techniques, on the other hand, utilize knowledge of the a priori probability distributions of the signals from the speakers, in order to effect separation. Previously (Reyes-Gomez, M.J. et al., Proc. ICASSP, 2003), we presented a maximum-likelihood speaker separation technique that utilizes detailed statistical information about the signals to be separated, represented in the form of hidden Markov models (HMMs), to estimate the parameters of a filter-and-sum processor for signal separation. We show that the filters that are estimated for a particular utterance by a speaker generalize well to other utterances by the same speaker, provided the location of the various speakers remains constant. Thus, filters that have been estimated using a "training" utterance of a known transcript can be used to separate all future signals by the speaker from mixtures of speech signals in an unsupervised manner. On the other hand, the filters are ineffective for other speakers, even at the same locations, indicating that they capture the spatio-frequency characteristics of the speaker.
Electrical engineering, Applied mathematics
mjr59, de171
Electrical Engineering
Articles

Multiband audio modeling for single-channel acoustic source separation
http://academiccommons.columbia.edu/catalog/ac:148684
Reyes-Gomez, Manuel; Ellis, Daniel P. W.; Jojic, Nebojsa
http://hdl.handle.net/10022/AC:P:13719
Fri, 29 Jun 2012 00:00:00 +0000
Detailed hidden Markov models (HMMs) that capture the constraints implicit in a particular sound can be used to estimate obscured or corrupted portions from partial observations, the situation encountered when trying to identify multiple, overlapping sounds. However, when the complexity and variability of the sounds are high, as in a particular speaker's voice, a detailed model might require several thousand states to cover the full range of different short-term spectra with adequate resolution. To address the tractability problems of such large models, we break the source signals into multiple frequency bands, and build separate but coupled HMMs for each band, requiring many fewer states per model. To prevent non-natural full spectral states and to enforce consistency within and between bands, at any given frame, the state in a particular band is determined by the previous state in that band and the states in the adjacent bands. Coupling the bands in this manner results in a grid-like model for the full spectrum. Since exact inference of such a model is intractable, we derive an efficient approximation based on variational methods. Results in source separation of combined signals modeled with this approach outperform the separation obtained by full-band models.
Electrical engineering, Applied mathematics
mjr59, de171
Electrical Engineering
Articles

Multi-channel Source Separation by Factorial HMMs
http://academiccommons.columbia.edu/catalog/ac:148739
Reyes-Gomez, Manuel; Raj, Bhiksha; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13733
Fri, 29 Jun 2012 00:00:00 +0000
We present a new speaker-separation algorithm for separating signals with known statistical characteristics from mixed multi-channel recordings. Speaker separation has conventionally been treated as a problem of blind source separation (BSS). This approach does not utilize any knowledge of the statistical characteristics of the signals to be separated, relying mainly on the independence between the various signals to separate them. We present an algorithm that utilizes detailed statistical information about the signals to be separated, represented in the form of hidden Markov models (HMM). We treat the signal separation problem as one of beamforming, where each signal is extracted using a filter-and-sum array. The filters are estimated to maximize the likelihood of the summed output, measured on the HMM for the desired signal. This is done by iteratively estimating the best state sequence through the HMM from a factorial HMM (FHMM) that is the cross-product of the HMMs for the multiple signals, using the current output of the array, and estimating the filters to maximize the likelihood of that state sequence. Experiments show that the proposed method can cleanly extract a background speaker who is 20 dB below the foreground speaker in a two-speaker mixture, when the HMMs for the signals are constructed from knowledge of the utterance transcriptions.
Electrical engineering, Applied mathematics
mjr59, de171
Electrical Engineering
Articles

LP-TRAP: Linear predictive temporal patterns
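The filter-and-sum structure at the core of the beamforming abstract above is simple to state: each channel is passed through its own FIR filter and the outputs are summed. A generic sketch (the paper's contribution, estimating the filter taps from the FHMM, is not shown here):

```python
import numpy as np

def filter_and_sum(channels, filters):
    """Filter each channel with its own FIR filter and sum the outputs;
    this is the array-processing structure whose taps are optimized."""
    return sum(np.convolve(x, h) for x, h in zip(channels, filters))
```

With unit-impulse filters this reduces to plain summation of the channels, which is the delay-and-sum special case with zero delays.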
http://academiccommons.columbia.edu/catalog/ac:148642
Athineos, Marios; Hermansky, Hynek; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13711
Thu, 28 Jun 2012 00:00:00 +0000
Autoregressive modeling is applied for approximating the temporal evolution of spectral density in critical-band-sized subbands of a segment of speech signal. The generalized autocorrelation linear predictive technique allows for a compromise between fitting the peaks and the troughs of the Hilbert envelope of the signal in the sub-band. The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Learning Auditory Models of Machine Voices
http://academiccommons.columbia.edu/catalog/ac:148630
Dobson, Kelly; Whitman, Brian; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13706
Thu, 28 Jun 2012 00:00:00 +0000
Vocal imitation is often found useful in machine therapy sessions as it creates an emphatic relational bridge between human and machine. The feedback of the machine directly responding to the person's imitation can strengthen the trust of this connection. However, vocal imitation of machines often bears little resemblance to the target due to physiological limitations. In practice, we need a way to detect human vocalization of machine sounds that can generalize to new machines. In this study we learn the relationship between vocal imitations of machine sounds and the target sounds to create a predictive model of vocalization of otherwise humanly impossible sounds. After training on a small set of machines and their imitations, we predict the correct target of a new set of imitations with high accuracy. The model outperforms distance metrics between human and machine sounds on the same task and takes into account auditory perception and constraints in vocal expression.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Song-Level Features and Support Vector Machines for Music Classification
http://academiccommons.columbia.edu/catalog/ac:148626
Mandel, Michael I.; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13705
Thu, 28 Jun 2012 00:00:00 +0000
Searching and organizing growing digital music collections requires automatic classification of music. This paper describes a new system, tested on the task of artist identification, that uses support vector machines to classify songs based on features calculated over their entire lengths. Since support vector machines are exemplar-based classifiers, training on and classifying entire songs instead of short-time features makes intuitive sense. On a dataset of 1200 pop songs performed by 18 artists, we show that this classifier outperforms similar classifiers that use only SVMs or song-level features. We also show that the KL divergence between single Gaussians and Mahalanobis distance between MFCC statistics vectors perform comparably when classifiers are trained and tested on separate albums, but KL divergence outperforms Mahalanobis distance when trained and tested on songs from the same albums.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Classifying Music Audio with Timbral and Chroma Features
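The single-Gaussian KL divergence used in the song-classification comparison above has a closed form. A minimal sketch (the function names are our own, not the paper's code; the symmetrized sum is one common way to turn KL into a song-to-song score):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form KL divergence KL( N(mu0, S0) || N(mu1, S1) )."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    k = mu0.size
    S1inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl_symmetric(mu0, S0, mu1, S1):
    """Symmetrized KL, usable as a similarity score (not a true metric:
    it does not satisfy the triangle inequality)."""
    return kl_gauss(mu0, S0, mu1, S1) + kl_gauss(mu1, S1, mu0, S0)
```

The divergence is zero exactly when the two Gaussians coincide and grows as their means or covariances drift apart.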
http://academiccommons.columbia.edu/catalog/ac:148541
Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13676
Wed, 27 Jun 2012 00:00:00 +0000
Music audio classification has most often been addressed by modeling the statistics of broad spectral features, which, by design, exclude pitch information and reflect mainly instrumentation. We investigate using instead beat-synchronous chroma features, designed to reflect melodic and harmonic content and be invariant to instrumentation. Chroma features are less informative for classes such as artist, but contain information that is almost entirely independent of the spectral features, and hence the two can be profitably combined: Using a simple Gaussian classifier on a 20-way pop music artist identification task, we achieve 54% accuracy with MFCCs, 30% with chroma vectors, and 57% by combining the two. All the data and Matlab code to obtain these results are available.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Multiple-Instance Learning For Music Information Retrieval
http://academiccommons.columbia.edu/catalog/ac:148502
Mandel, Michael I.; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13664
Wed, 27 Jun 2012 00:00:00 +0000
Multiple-instance learning algorithms train classifiers from lightly supervised data, i.e. labeled collections of items, rather than labeled items. We compare the multiple-instance learners mi-SVM and MILES on the task of classifying 10-second song clips. These classifiers are trained on tags at the track, album, and artist levels, or granularities, that have been derived from tags at the clip granularity, allowing us to test the effectiveness of the learners at recovering the clip labeling in the training set and predicting the clip labeling for a held-out test set. We find that mi-SVM is better than a control at the recovery task on training clips, with an average classification accuracy as high as 87% over 43 tags; on test clips, it is comparable to the control with an average classification accuracy of up to 68%. MILES performed adequately on the recovery task, but poorly on the test clips.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

A Tempo-Insensitive Distance Measure for Cover Song Identification based on Chroma Features
http://academiccommons.columbia.edu/catalog/ac:148516
Jensen, Jesper Hojvang; Christensen, Mads G.; Ellis, Daniel P. W.; Jensen, Soren Holdt
http://hdl.handle.net/10022/AC:P:13669
Wed, 27 Jun 2012 00:00:00 +0000
We present a distance measure between audio files designed to identify cover songs, which are new renditions of previously recorded songs. For each song we compute the chromagram, remove phase information and apply exponentially distributed bands in order to obtain a feature matrix that compactly describes a song and is insensitive to changes in instrumentation, tempo and time shifts. As the distance between two songs, we use the Frobenius norm of the difference between their feature matrices normalized to unit norm. When computing the distance, we take possible transpositions into account. In a test collection of 80 songs with two versions of each, 38% of the covers were identified. The system was also evaluated in an independent, international evaluation, where, despite having much lower complexity, it performed on par with the previous year's winner.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Evaluation of Distance Measures Between Gaussian Mixture Models of MFCCs
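The core distance in the cover-song abstract above (unit-normalized chroma-derived feature matrices compared under all transpositions) can be sketched in a few lines; assume 12 x N feature matrices, and note that `cover_distance` is a hypothetical name for illustration:

```python
import numpy as np

def cover_distance(F1, F2):
    """Frobenius distance between two 12 x N feature matrices, each
    normalized to unit Frobenius norm, minimized over the 12 possible
    transpositions (cyclic rotations of the chroma axis)."""
    F1 = F1 / np.linalg.norm(F1)
    F2 = F2 / np.linalg.norm(F2)
    return min(np.linalg.norm(F1 - np.roll(F2, s, axis=0))
               for s in range(12))
```

Taking the minimum over the 12 rotations is what makes the measure invariant to the key a cover is performed in.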
http://academiccommons.columbia.edu/catalog/ac:148548
Jensen, Jesper Hojvang; Ellis, Daniel P. W.; Christensen, Mads G.; Jensen, Soren Holdt
http://hdl.handle.net/10022/AC:P:13678
Wed, 27 Jun 2012 00:00:00 +0000
In music similarity and in the related task of genre classification, a distance measure between Gaussian mixture models is frequently needed. We present a comparison of the Kullback-Leibler distance, the earth mover's distance and the normalized L2 distance for this application. Although the normalized L2 distance was slightly inferior to the Kullback-Leibler distance with respect to classification performance, it has the advantage of obeying the triangle inequality, which allows for efficient searching.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

A probability model for interaural phase difference
http://academiccommons.columbia.edu/catalog/ac:148580
Mandel, Michael I.; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13688
Wed, 27 Jun 2012 00:00:00 +0000
In this paper, we derive a probability model for interaural phase differences at individual spectrogram points. Such a model can combine observations across arbitrary time and frequency regions in a structured way and does not make any assumptions about the characteristics of the sound sources. In experiments with speech from twenty speakers in simulated reverberant environments, this probabilistic method predicted the correct interaural delay of a signal more accurately than generalized cross-correlation methods.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
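The generalized cross-correlation baseline mentioned in the interaural-phase abstract above is commonly implemented with the PHAT weighting, which discards magnitude and keeps only phase before inverting back to the lag domain. A self-contained sketch (our own illustration, not the paper's code):

```python
import numpy as np

def gcc_phat_delay(x, y, max_lag):
    """Estimate the delay (in samples) of y relative to x using
    PHAT-weighted generalized cross-correlation."""
    n = len(x) + len(y)                    # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.maximum(np.abs(R), 1e-12)      # phase transform: keep phase only
    cc = np.fft.irfft(R, n)
    # gather lags -max_lag .. +max_lag and pick the peak
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag
```

The PHAT weighting whitens the cross-spectrum, which sharpens the correlation peak and gives some robustness to reverberation.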
http://academiccommons.columbia.edu/catalog/ac:148573
Mandel, Michael I.; Ellis, Daniel P. W.; Jebara, Tony
http://hdl.handle.net/10022/AC:P:13686
Wed, 27 Jun 2012 00:00:00 +0000
We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation.
Electrical engineering, Applied mathematics
de171, tj2008
Computer Science, Electrical Engineering
Articles

Estimating Single-Channel Source Separation Masks: Relevance Vector Machine Classifiers vs. Pitch-Based Masking
http://academiccommons.columbia.edu/catalog/ac:148583
Weiss, Ron J.; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13689
Wed, 27 Jun 2012 00:00:00 +0000
Audio sources frequently concentrate much of their energy into a relatively small proportion of the available time-frequency cells in a short-time Fourier transform (STFT). This sparsity makes it possible to separate sources, to some degree, simply by selecting STFT cells dominated by the desired source, setting all others to zero (or to an estimate of the obscured target value), and inverting the STFT to a waveform. The problem of source separation then becomes identifying the cells containing good target information. We treat this as a classification problem, and train a Relevance Vector Machine (a probabilistic relative of the Support Vector Machine) to perform this task. We compare this classifier both against SVMs (which achieve similar accuracy but are less efficient than RVMs) and against a traditional Computational Auditory Scene Analysis (CASA) technique based on a noise-robust pitch tracker, which the RVM outperforms significantly. Differences between the RVM- and pitch-tracker-based mask estimation suggest benefits to be obtained by combining both.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Short-term audio-visual atoms for generic video concept classification
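The cell-selection idea in the mask-estimation abstract above can be illustrated with an oracle mask computed from the clean target. The paper instead trains an RVM classifier to predict the mask from the mixture alone; the function name and the 0.5 threshold here are our own illustrative choices:

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_mask_separate(mixture, target, fs=8000):
    """Keep STFT cells where the (known) target dominates the mixture,
    zero the rest, and invert the STFT back to a waveform."""
    _, _, Zmix = stft(mixture, fs)
    _, _, Ztgt = stft(target, fs)
    mask = np.abs(Ztgt) > 0.5 * np.abs(Zmix)   # cells dominated by the target
    _, est = istft(Zmix * mask, fs)
    return est[:len(mixture)]
```

Even this crude binary mask removes most interference energy when the sources occupy largely disjoint time-frequency cells, which is the sparsity argument the abstract makes.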
http://academiccommons.columbia.edu/catalog/ac:148465
Jiang, Wei; Cotton, Courtenay Valentine; Chang, Shih-Fu; Ellis, Daniel P. W.; Loui, Alexander C.
http://hdl.handle.net/10022/AC:P:13657
Tue, 26 Jun 2012 00:00:00 +0000
We investigate the challenging issue of joint audio-visual analysis of generic videos targeting semantic concept detection. We propose to extract a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning. Codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm over Kodak's consumer benchmark video set from real users. Experimental results confirm significant performance improvements: over 120% MAP gain compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of AP) over 21 concepts, with many concepts achieving more than 20%.
Electrical engineering, Applied mathematics
cvc2106, sc250, de171
Computer Science, Electrical Engineering
Articles

A variational EM algorithm for learning eigenvoice parameters in mixed signals
http://academiccommons.columbia.edu/catalog/ac:148483
Weiss, Ron J.; Ellis, Daniel P. W.
http://hdl.handle.net/10022/AC:P:13661
Tue, 26 Jun 2012 00:00:00 +0000
We derive an efficient learning algorithm for model-based source separation for use on single channel speech mixtures where the precise source characteristics are not known a priori. The sources are modeled using factor-analyzed hidden Markov models (HMM) where source specific characteristics are captured by an "eigenvoice" speaker subspace model. The proposed algorithm is able to learn adaptation parameters for two speech sources when only a mixture of signals is observed. We evaluate the algorithm on the 2006 speech separation challenge data set and show that it is significantly faster than our earlier system at a small cost in terms of performance.
Electrical engineering, Applied mathematics
de171
Electrical Engineering
Articles

Multi-Voice Polyphonic Music Transcription Using Eigeninstruments
http://academiccommons.columbia.edu/catalog/ac:148452
Grindlay, Graham C.; Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:13654Tue, 26 Jun 2012 00:00:00 +0000We present a model-based approach to separating and transcribing single-channel, multi-instrument polyphonic music in a semi-blind fashion. Our system extends the non-negative matrix factorization (NMF) algorithm to incorporate constraints on the basis vectors of the solution. In the context of music transcription, this allows us to encode prior knowledge about the space of possible instrument models as a parametric subspace we term "eigeninstruments". We evaluate our algorithm on several synthetic (MIDI) recordings containing different instrument mixtures. Averaged over both sources, we achieved a frame-level accuracy of over 68% on an excerpt of Pachelbel's Canon arranged for double bass and piano and 72% on a mixture of overlapping melodies played by flute and violin.Electrical engineering, Applied mathematicsde171Electrical EngineeringArticlesGuided harmonic sinusoid estimation in a multi-pitch environment
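The eigeninstrument system above extends non-negative matrix factorization by constraining the basis vectors to a parametric subspace. As a rough illustration of the unconstrained NMF core only (the subspace constraint is the paper's contribution and is not reproduced here), a minimal multiplicative-update sketch; the function name and toy data are illustrative:

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Basic multiplicative-update NMF: V ~ W @ H, all entries non-negative.

    The eigeninstrument model additionally constrains the columns of W to a
    learned parametric subspace; this sketch shows only the unconstrained core.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):
        # Standard multiplicative updates for the Frobenius-norm objective;
        # they preserve non-negativity by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A toy "spectrogram" built from two non-negative components.
rng = np.random.default_rng(1)
W_true = rng.random((8, 2))
H_true = rng.random((2, 20))
V = W_true @ H_true
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # small relative reconstruction error
```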
http://academiccommons.columbia.edu/catalog/ac:148449
Smit, Christine E.; Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:13653Tue, 26 Jun 2012 00:00:00 +0000We describe an algorithm to accurately estimate the fundamental frequency of harmonic sinusoids in a mixed voice recording environment using an aligned electronic score as a guide. Taking the pitch tracking results on individual voices prior to mixing as ground truth, we are able to estimate the pitch of individual voices in a 4-part piece to within 50 cents of the correct pitch more than 90% of the time.Electrical engineering, Applied mathematicsces2130, de171Electrical EngineeringArticlesImproving MIDI-audio alignment with acoustic features
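The 50-cent accuracy criterion above is measured on a logarithmic frequency scale: the interval between frequencies f and f_ref is 1200·log2(f/f_ref) cents, with 100 cents per equal-tempered semitone. A minimal sketch (the function name is illustrative, not from the paper):

```python
import math

def cents(f, f_ref):
    """Interval between two frequencies in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f / f_ref)

# A4 = 440 Hz; one equal-tempered semitone above is 440 * 2**(1/12).
semitone_up = 440.0 * 2 ** (1 / 12)
print(round(cents(semitone_up, 440.0)))  # 100
print(abs(cents(443.0, 440.0)) < 50)     # True: a ~12-cent error, within the 50-cent criterion
```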
http://academiccommons.columbia.edu/catalog/ac:148456
Devaney, Johanna; Mandel, Michael I.; Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:13655Tue, 26 Jun 2012 00:00:00 +0000This paper describes a technique to improve the accuracy of dynamic time warping-based MIDI-audio alignment. The technique implements a hidden Markov model that uses aperiodicity and power estimates from the signal as observations and the results of a dynamic time warping alignment as a prior. In addition to improving the overall alignment, this technique also identifies the transient and steady state sections of the note. This information is important for describing various aspects of a musical performance, including both pitch and rhythm.Electrical engineering, Applied mathematicsde171Electrical EngineeringArticlesFinding similar acoustic events using matching pursuit and locality-sensitive hashing
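The alignment prior above comes from dynamic time warping, which finds the minimum-cost monotonic alignment between two sequences by dynamic programming. A minimal 1-D sketch of the standard DTW recurrence (not the paper's MIDI-audio feature pipeline):

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping cost between 1-D sequences x and y.

    Returns the minimal cumulative alignment cost; the corresponding DTW path
    is what serves as the prior for an HMM-based refinement.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = [1.0, 2.0, 3.0, 2.0, 1.0]
b = [1.0, 1.0, 2.0, 3.0, 2.0, 1.0]  # same shape, stretched in time
print(dtw(a, b))          # 0.0: warping absorbs the tempo difference
print(dtw(a, [5.0] * 5))  # nonzero: no warping can match these
```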
http://academiccommons.columbia.edu/catalog/ac:148446
Cotton, Courtenay Valentine; Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:13652Tue, 26 Jun 2012 00:00:00 +0000There are many applications for the ability to find repetitions of perceptually similar sound events in generic audio recordings. We explore the use of matching pursuit (MP) derived features to identify repeated patterns that characterize distinct acoustic events. We use locality-sensitive hashing (LSH) to efficiently search for similar items. We describe a method for detecting repetitions of events, and demonstrate performance on real data.Electrical engineering, Applied mathematicscvc2106, de171Electrical EngineeringArticlesThe Ideal Interaural Parameter Mask: A Bound on Binaural Separation Systems
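One common form of locality-sensitive hashing, random-hyperplane (cosine) LSH, illustrates the efficient-search idea above: similar feature vectors receive similar bit signatures and can be bucketed without exhaustive pairwise comparison. A minimal sketch, not the paper's specific hash family:

```python
import numpy as np

def lsh_signature(v, planes):
    """Random-hyperplane LSH: hash a vector to a tuple of sign bits.

    Vectors separated by a small angle agree on most bits, so bucketing by
    signature finds near-duplicates without an exhaustive pairwise search.
    """
    return tuple((planes @ v > 0).astype(int))

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 8))  # 16 random hyperplanes over 8-D features

v = rng.standard_normal(8)
near = v + 0.01 * rng.standard_normal(8)  # a slightly perturbed copy
far = -v                                  # opposite direction

sig_v = lsh_signature(v, planes)
sig_near = lsh_signature(near, planes)
sig_far = lsh_signature(far, planes)

def agree(a, b):
    return sum(x == y for x, y in zip(a, b))

print(agree(sig_v, sig_near), agree(sig_v, sig_far))  # near-duplicate agrees on almost all bits
```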
http://academiccommons.columbia.edu/catalog/ac:148459
Mandel, Michael I.; Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:13656Tue, 26 Jun 2012 00:00:00 +0000We introduce the ideal interaural parameter mask as an upper bound on the performance of mask-based source separation algorithms that are based on the differences between signals from two microphones or ears. With two additions to our Model-based EM source separation and localization system, its performance approaches that of the IIPM upper bound to within 0.9 dB. These additions battle the effects of reverberation by absorbing reverberant energy and by forcing the ILD estimate to be larger than it might otherwise be. An oracle reliability measure was also added, in the hope that estimating parameters from more reliable regions of the spectrogram would improve separation, but it was not consistently useful.Electrical engineering, Applied mathematicsde171Electrical EngineeringArticlesDetecting local semantic concepts in environmental sounds using Markov model based clustering
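The oracle-mask idea above can be illustrated with the simpler ideal binary mask, which keeps each time-frequency cell where the target source dominates the interference. This is a generic sketch of oracle masking, not the IIPM itself (which is built from interaural parameters rather than source magnitudes):

```python
import numpy as np

def ideal_binary_mask(S_target, S_interf):
    """Oracle time-frequency mask: 1 where the target dominates, else 0.

    Like the IIPM, it uses ground-truth knowledge of the sources, so it
    upper-bounds what any mask-based separation system of its class can do.
    """
    return (np.abs(S_target) > np.abs(S_interf)).astype(float)

# Toy magnitude "spectrograms" for a target and an interfering source.
rng = np.random.default_rng(0)
S_t = rng.random((4, 6))
S_i = rng.random((4, 6))
mask = ideal_binary_mask(S_t, S_i)
recovered = mask * (S_t + S_i)  # keep only target-dominated cells of the mixture
print(mask.shape, float(mask.mean()))
```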
http://academiccommons.columbia.edu/catalog/ac:148439
Lee, Keansub; Ellis, Daniel P. W.; Loui, Alexander C.http://hdl.handle.net/10022/AC:P:13650Tue, 26 Jun 2012 00:00:00 +0000Detecting the time of occurrence of an acoustic event (for instance, a cheer) embedded in a longer soundtrack is useful and important for applications such as search and retrieval in consumer video archives. We present a Markov-model based clustering algorithm able to identify and segment consistent sets of temporal frames into regions associated with different ground-truth labels, and simultaneously to exclude a set of uninformative frames shared in common from all clips. The labels are provided at the clip level, so this refinement of the time axis represents a variant of Multiple-Instance Learning (MIL). Evaluation shows that local concepts are effectively detected by this clustering technique based on coarse-scale labels, and that detection performance is significantly better than existing algorithms for classifying real-world consumer recordings.Electrical engineering, Applied mathematicskl2074, de171Electrical EngineeringArticlesStructured Prediction Models for Chord Transcription of Music Audio
http://academiccommons.columbia.edu/catalog/ac:148442
Weller, Adrian Vivian; Ellis, Daniel P. W.; Jebara, Tonyhttp://hdl.handle.net/10022/AC:P:13651Tue, 26 Jun 2012 00:00:00 +0000Chord sequences are a compact and useful description of music, representing each beat or measure in terms of a likely distribution over individual notes without specifying the notes exactly. Transcribing music audio into chord sequences is essential for harmonic analysis, and would be an important component in content-based retrieval and indexing, but accuracy rates remain fairly low. In this paper, the existing 2008 LabROSA Supervised Chord Recognition System is modified by using different machine learning methods for decoding structural information, thereby achieving significantly superior results. Specifically, the hidden Markov model is replaced by a large margin structured prediction approach (SVMstruct) using an enlarged feature space. Performance is significantly improved by incorporating features from future (but not past) frames. The benefit of SVMstruct increases with the size of the training set, as might be expected when comparing discriminative and generative models. Without yet exploring non-linear kernels, these improvements lead to state-of-the-art performance in chord transcription. The techniques could prove useful in other sequential learning tasks which currently employ HMMs.Electrical engineering, Applied mathematicsaw2506, de171, tj2008Computer Science, Electrical EngineeringArticlesModel-Based Scene Analysis
http://academiccommons.columbia.edu/catalog/ac:145152
Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:12743Wed, 07 Mar 2012 00:00:00 +0000When multiple sound sources are mixed together into a single channel (or a small number of channels) it is in general impossible to recover the exact waveforms that were mixed; indeed, without some kind of constraints on the form of the component signals, it is impossible to separate them at all. These constraints could take several forms. For instance, given a particular family of processing algorithms (such as linear filtering, or selection of individual time-frequency cells in a spectrogram), constraints could be defined in terms of the relationships between the set of resulting output signals, such as statistical independence [3, 41], or clustering of a variety of properties that indicate distinct sources [1, 45]. These approaches are concerned with the relationships between the properties of the complete set of output signals, rather than the specific properties of any individual output; in general, the individual sources could take any form. Another way to express the constraints is to specify the form that the individual sources can take, regardless of the rest of the signal. These restrictions may be viewed as "prior models" for the sources, and source separation then becomes the problem of finding a set of signals that combine together to give the observed mixture signal at the same time as conforming in some optimal sense to the prior models. This is the approach to be examined in this chapter.Acoustics, Applied mathematicsde171Electrical EngineeringBook chaptersModeling the auditory organization of speech: a summary and some comments
http://academiccommons.columbia.edu/catalog/ac:144486
Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:12563Wed, 15 Feb 2012 00:00:00 +0000The preceding three chapters have been concerned with the issues arising as a result of the inconvenient fact that our ears are rarely presented with the sound of a single speaker in isolation, but more often with a combination of several speech and nonspeech sounds which may also have been further altered by the acoustic environment. Faced with such a mixture, the listener evidently needs to consider each source separately, and this process of information segregation is known as auditory organization or auditory scene analysis (Bregman, 1990). Pure curiosity as well as the possibility of applications in automatic signal interpretation drive us to investigate auditory scene analysis through psychological experiments and computational modeling. Having sketched this framework and the current limits to our understanding of the process of auditory organization, we can now examine the material of each of the three chapters in more detail, seeing how it fits into this framework and also where the framework may be inadequate. Following these discussions, we will conclude with some remarks suggested by the particular combination of results in this section.Acoustics, Applied mathematicsde171Electrical EngineeringBook chaptersAn Introduction to Signal Processing for Speech
http://academiccommons.columbia.edu/catalog/ac:144483
Ellis, Daniel P. W.http://hdl.handle.net/10022/AC:P:12562Wed, 15 Feb 2012 00:00:00 +0000The formal tools of signal processing emerged in the mid twentieth century when electronics gave us the ability to manipulate signals — time-varying measurements — to extract or rearrange various aspects of interest to us, i.e., the information in the signal. The core of traditional signal processing is a way of looking at the signals in terms of sinusoidal components of differing frequencies (the Fourier domain), and a set of techniques for modifying signals that are most naturally described in that domain, i.e., filtering. Although originally developed using analog electronics, since the 1970s signal processing has more and more been implemented on computers in the digital domain, leading to some modifications to the theory without changing its essential character. This chapter aims to give a transparent and intuitive introduction to the basic ideas of the Fourier domain and filtering, and connects them to some of the common representations used in speech science, including the spectrogram and cepstral coefficients. We assume the absolute minimum of prior technical background, which will naturally be below the level of many readers; however, there may be some value in taking such a ground-up approach even for those for whom much of the material is review.Communication, Applied mathematicsde171Electrical EngineeringBook chaptersAccessing Minimal-Impact Personal Audio Archives
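The spectrogram discussed in the chapter above is computed by taking the Fourier transform of short, overlapping, windowed frames of the signal. A minimal short-time Fourier transform sketch; the window length, hop, and sample rate here are arbitrary illustrative choices:

```python
import numpy as np

def spectrogram(x, n_fft=64, hop=32):
    """Magnitude spectrogram via a short-time Fourier transform.

    Each column is the magnitude DFT of one Hann-windowed frame, showing how
    energy is distributed across sinusoidal frequencies over time.
    """
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq bins, frames)

sr = 1000  # assumed sample rate, Hz
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 125 * t)  # a pure 125 Hz tone
S = spectrogram(x)
peak_bin = int(np.argmax(S.mean(axis=1)))
print(peak_bin * sr / 64)  # the bin's center frequency, ~125 Hz
```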
http://academiccommons.columbia.edu/catalog/ac:144446
Ellis, Daniel P. W.; Lee, Keansubhttp://hdl.handle.net/10022/AC:P:12543Mon, 13 Feb 2012 00:00:00 +0000We've collected personal audio - essentially everything we hear - for two years and have experimented with methods to index and access the resulting data. Here, we describe our experiments in segmenting and labeling these recordings into episodes (relatively consistent acoustic situations lasting a few minutes or more) using the Bayesian information criterion (from speaker segmentation) and spectral clustering.Applied mathematicsde171, kl2074Electrical EngineeringArticles
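The Bayesian information criterion change test used in the segmentation above compares modeling a window with one Gaussian against two Gaussians split at a candidate boundary, penalizing the extra parameters; peaks in the delta-BIC mark segment boundaries. A minimal 1-D sketch on synthetic data (the system operates on audio features, not raw samples, and the names here are illustrative):

```python
import numpy as np

def bic_change(x, t, penalty=1.0):
    """Delta-BIC for a candidate change point at index t in 1-D sequence x.

    Positive values favor modeling x[:t] and x[t:] with separate Gaussians
    over a single Gaussian for the whole window.
    """
    n = len(x)

    def ll(seg):
        # Gaussian log-likelihood term at the MLE: -0.5 * len * log(variance)
        return -0.5 * len(seg) * np.log(np.var(seg) + 1e-12)

    # The extra Gaussian costs 2 parameters (mean and variance).
    return ll(x[:t]) + ll(x[t:]) - ll(x) - penalty * 0.5 * 2 * np.log(n)

# Synthetic signal whose variance jumps at sample 200.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(0, 4, 200)])
scores = [bic_change(x, t) for t in range(50, 350)]
best = 50 + int(np.argmax(scores))
print(best)  # near the true change point at 200
```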