Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/7150
Title: Music annotation with augmented MFCC
Authors: Ni, Chenying (倪晨穎)
Department: Department of Computer Science
Issue Date: 2013
Course: CS4514 Project
Programme: Bachelor of Science (Honours) in Computer Science
Instructor: Dr. Chan, Antoni Bert
Subjects: Musical notation -- Data processing.
Citation: Ni, C. (2013). Music annotation with augmented MFCC (Outstanding Academic Papers by Students (OAPS)). Retrieved from City University of Hong Kong, CityU Institutional Repository.
Abstract: MFCC is a widely adopted feature in speech analysis, and research has demonstrated its applicability to music information retrieval systems. Recently, it has been proposed that an augmented Mel-Frequency Cepstral Coefficients (MFCC) feature that is invariant to musical key can improve results for the task of music genre classification. Based on this finding, the goal of this project is to determine whether it also applies to a large music annotation task. To achieve our goal, we implement an audio annotation system that can tag audio tracks with meaningful words, and experiment with key-augmented audio features. In particular, we are interested in finding a key-invariant feature to fit into the system. We adopt the Computer Audition Lab 500-Song (CAL500) dataset, which contains 502 popular Western songs, each with a list of human-annotated tags. We build two classifiers for annotation; in our case the classifiers are called tag models, and they define how audio features gathered from the training tracks are combined into tag models. Using Gaussian mixture models, we learn each track-level distribution by running the Expectation-Maximization (EM) algorithm, and our tag models are built on those track-level distributions. First, we apply Model Averaging to the original CAL500 dataset; this tag model takes a weighted average over the related audio tracks. Although the Model Averaging model produces good results, it is computationally expensive, so we choose the Mixture Hierarchical Model as an alternative for the augmentation experiments. Instead of naive averaging, it computes a word-level distribution from the related track-level distributions by running the hierarchical EM (HEM) algorithm, producing a fixed, small number of Gaussian mixture components. To further find out how MFCC, a commonly used short-term low-level feature, can be made invariant to musical key and how this key-augmented feature performs on the annotation task, we use two different techniques for feature augmentation. One uses Audacity to generate key-shifted songs; the other changes the Mel band filters during the computation of the MFCC. In the Audacity key-shifting approach, we must render the dataset as audio (mp3/wav) for each key shift. With our novel approach of simulating key shifts by changing the MFCC Mel band filters, we obtain the key-shifted features without producing extra audio files. We compare the augmentation results for each approach under four models, depending on whether the augmented dataset is used in the training or testing phase: Original, Augmented Training, Augmented Testing, and Augmented Both (Training and Testing). With the results from the feature augmentations, we analyze each model in terms of precision and recall for annotation and the area under the receiver operating characteristic curve for retrieval. From our experiments, we find that key-augmented MFCC performs better for certain categories; however, some categories, such as Instrument, are still better predicted using the original dataset.
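To illustrate the Mel-filter-based key-shift simulation described in the abstract, below is a minimal Python sketch (assuming librosa and scipy are available). The function name, parameters, and the exact way the filter edges are moved are illustrative assumptions, not the project's actual implementation: it scales the Mel band edges by the key-shift factor 2^(k/12) and pools the unchanged power spectrogram through the shifted filters, so no key-shifted audio files need to be rendered.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def key_shifted_mfcc(y, sr, n_semitones, n_mfcc=13, n_fft=2048,
                     hop_length=512, n_mels=40, fmin=0.0, fmax=None):
    """Approximate MFCCs of a key-shifted track without resynthesising audio.

    Illustrative sketch: the triangular Mel filters are moved by the key-shift
    factor 2**(k/12) and applied to the unchanged power spectrogram, so the
    features resemble those of a song transposed by k semitones.
    """
    if fmax is None:
        fmax = sr / 2.0
    shift = 2.0 ** (n_semitones / 12.0)  # sign convention is an assumption

    # Standard Mel band-edge frequencies, scaled by the key-shift factor.
    edges = librosa.mel_frequencies(n_mels=n_mels + 2, fmin=fmin, fmax=fmax)
    edges = np.clip(edges * shift, 0.0, sr / 2.0)

    # Rebuild the triangular filters over the FFT bins at the shifted edges.
    fft_freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / max(ctr - lo, 1e-8)
        falling = (hi - fft_freqs) / max(hi - ctr, 1e-8)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))

    # Log-Mel spectrogram of the original audio, then a DCT for the cepstrum.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    log_mel = librosa.power_to_db(fb @ S)
    return dct(log_mel, axis=0, type=2, norm='ortho')[:n_mfcc]

# Example usage (hypothetical input file):
# y, sr = librosa.load("track.mp3")
# mfcc_orig = key_shifted_mfcc(y, sr, 0)   # original key
# mfcc_up1  = key_shifted_mfcc(y, sr, 1)   # simulated +1 semitone shift
```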
Appears in Collections: OAPS - Dept. of Computer Science
Files in This Item:
File | Size | Format
---|---|---
fulltext.html | 154 B | HTML
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.