A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures

Berenzweig, Adam; Logan, Beth; Ellis, Daniel P. W.; Whitman, Brian

A valuable goal in the field of Music Information Retrieval (MIR) is to devise an automatic measure of the similarity between two musical recordings based only on an analysis of their audio content. Such a tool, a quantitative measure of similarity, could be used to build classification, retrieval, browsing, and recommendation systems. Developing such a measure, however, presupposes some ground truth: a single underlying similarity that constitutes the measure's desired output. Music similarity is an elusive concept (wholly subjective, multifaceted, and a moving target), but one that must be pursued in support of applications that automatically organize large music collections.

In this article, we explore music similarity measures in several ways, motivated by different types of questions. We are first motivated by the desire to improve automatic, acoustic-based similarity measures. Researchers from several groups have recently tried many variations on a few basic ideas, but it remains unclear which are best suited for a given application. Few authors compare multiple techniques, and it is impossible to compare results across authors, because they lack the required common ground: a shared database and a shared evaluation method. Of course, improving any measure requires an evaluation methodology, a scientific way of determining whether one variant is better than another; otherwise we are left to intuition, and nothing is gained.

In our previous work (Ellis et al. 2002), we examined several sources of human opinion about music similarity, on the premise that human opinion must be the final arbiter of music similarity, because it is a subjective concept. However, as expected, there are as many opinions about music similarity as there are people to be asked, and so the second question is how to unify these various sources of opinion into a single ground truth.
As we shall see, this turns out to be the wrong way to look at the problem, and so we develop the concept of a "consensus truth" rather than a single ground truth. Finally, armed with these evaluation techniques, we present an example cross-site evaluation of several acoustic- and subjective-based similarity measures. We address several main research questions. Regarding the acoustic measures: which feature spaces and which modeling and comparison methods are best? Regarding the subjective measures: which provides the best single ground truth, that is, which agrees best on average with the other sources? In answering these questions, we also address some logistical difficulties peculiar to our field, such as the legal obstacles to sharing music between research sites. We believe this is one of the first and largest cross-site evaluations in MIR. Our work was conducted in three independent labs (LabROSA at Columbia, MIT, and HP Labs in Cambridge), yet by carefully specifying our evaluation metrics and by sharing data in the form of derived features (which pose little threat to copyright holders), we were able to make fine distinctions between algorithms running at each site. We see this as a powerful paradigm that we encourage other researchers to adopt.
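To make the kind of acoustic measure under comparison concrete, the following is a minimal sketch of one common approach from this literature (not necessarily the authors' exact method): model each track's frame-level features, such as MFCCs, with a single full-covariance Gaussian, and compare tracks with a symmetrized Kullback-Leibler divergence. All function names and the toy data are illustrative.

```python
# Hypothetical sketch of an acoustic similarity measure: per-track Gaussian
# models over frame features, compared via symmetrized KL divergence.
import numpy as np

def fit_gaussian(features):
    """Fit a full-covariance Gaussian to a (frames x dims) feature matrix."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    return mu, cov

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) between two multivariate Gaussians (closed form)."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def track_distance(feats_a, feats_b):
    """Symmetrized KL divergence; smaller means more acoustically similar."""
    ga, gb = fit_gaussian(feats_a), fit_gaussian(feats_b)
    return kl_gaussian(*ga, *gb) + kl_gaussian(*gb, *ga)

# Toy usage: random "feature" matrices standing in for 13-dim MFCC frames.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (500, 13))
b = rng.normal(0.0, 1.0, (500, 13))   # drawn from the same distribution as a
c = rng.normal(3.0, 2.0, (500, 13))   # drawn from a different distribution
print(track_distance(a, b) < track_distance(a, c))  # the similar pair scores lower
```

Because the per-track model is only a mean and a covariance matrix, such derived features can be exchanged between research sites without shipping copyrighted audio, which is the data-sharing strategy the abstract describes.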


Also Published In: Computer Music Journal

Academic Units: Electrical Engineering
Published Here: February 14, 2012