Automatic Music Genre Classification

Mei-Lan Chu, R97945048
christychu@mrilab.org
Graduate Institute of Biomedical Electronics and Bioinformatics,
National Taiwan University, Taipei, Taiwan

Abstract

With the rapidly increasing size of music databases, digital music has become much more popular, and automatic, fast systems are in demand to improve search efficiency and quality in tasks such as music genre classification [1-3], music emotion classification [4], beat tracking [5], and preference recommendation [6]. In this tutorial, we focus on the music genre classification system. Traditionally, the genre of each song is tagged by experts (a so-called expert system), which is subjective and time-consuming. To address this issue, low-level features, which are extracted directly from the music signal and describe the sound on a fine scale, are exploited for classification. Low-level features typically include spectrum, amplitude, periodicity, and chord. Recently, several low-level features borrowed from speech signal processing, such as Mel-frequency cepstral coefficients (MFCCs), have been exploited as well. On the other hand, different data-mining algorithms, including supervised and unsupervised classification, have been proposed to classify these features; classification accuracy reached about 80% as of 2009 [7]. In this tutorial, we review state-of-the-art methods for automatic genre classification and compare experimental results from different works.


I. What is music "genre"?

Music can be divided into many categories based on style, rhythm, and even cultural background. These styles are what we call "genres". The boundaries between music genres are ambiguous, and one song may belong to several genres with different weightings. What is more, different countries and organizations maintain different genre lists, and they may even define the same genre differently. To date, there is no official specification of music genres. Traditionally, the genres of music are tagged by musical experts, who may be musicians, professors, or artists, so the assigned genres may be subjective and affected by cultural background. Moreover, the workload is considerable. Therefore, an objective and convenient genre classification system is in demand.

There are about 500 to 800 genres in music [8, 9]. Different genres may partly cover the same style, and genres may even form a hierarchical structure, so the classification can be quite complex. In research [3, 7] and in some MP3 players with a genre classification function, only about ten genres are used for simplicity and practicability. Please refer to section iv for the genres most often used.

The application of genre classification goes beyond music retrieval and categorization. With the flourishing development of music signal processing, many research areas in music, such as music emotion classification [4] and singer recognition [10], can no longer depend only on low-level features. To address these issues, high-level features are exploited, and music genre is one of them. For example, low-level features give us only signal information, not "how the song sounds to a human" or "whether the song tells a happy or a sad story". Since genres imply the basic style and emotion of music, we may use music genre as a feature to assist further classification.


Figure 1. The automatic genre classification system: a single song or a database of songs passes through music modeling and feature extraction; the features are then either classified directly (supervised approaches) or mapped into a song space and grouped by data mining (unsupervised approaches) to produce the genre result.

II. Works on genre classification

We divide the process of genre classification in music into three steps:
1. Music modeling
2. Feature extraction
3. Classification methods

The flow chart of a music genre classification or recommendation system is shown in Figure 1. The scheme follows a typical data-mining process: a set of features is extracted from a database of songs, each modeled by a specific window (about 7000 30-second songs of 10 genres in the last edition of the MIREX genre classification contest [7]).


Each song can be represented by a set of feature weights in a high-dimensional feature space and then classified by a data-mining process (supervised or unsupervised classification) into one of several genres. Recognition of a new song is performed by "projecting" the new song into the subspace spanned by the features (the "song space") and then classifying the song by comparing its position in the song space with the positions of known genres. The three steps of music genre classification are introduced in the following subsections.

i. Music Modeling

An untrained, non-expert person can detect the genre of a song with an accuracy of 72% after hearing a three-second segment of the song [11]. However, human-like processing of songs has not yet been achieved by computers, so we have to choose which segments of a song best represent it and should be used for feature extraction. Generally, excerpt-level modeling and song-level modeling are the most popular. Excerpt-level modeling takes a short segment of a song for feature extraction, while song-level modeling takes the whole song. The former is more popular than the latter, since the complex structure of a whole song may reduce the representativeness of the features. However, how to choose a segment that best represents a song is still an open problem.

ii. Feature Extraction

Practically, a concise representation of a song is unavailable, and we have to address this issue by dealing with audio samples, which are obtained directly by sampling the recorded signal.
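To make the modeling step concrete, here is a minimal sketch of excerpt-level modeling and framing in Python, assuming the librosa library is available; the file name "song.wav", the excerpt offset, and the frame sizes are illustrative assumptions, not values prescribed by the works cited above.

```python
import librosa

# Excerpt-level modeling: load a 30-second excerpt of the song
# (starting 30 s into the track) at a 22,050 Hz mono sampling rate.
# "song.wav" is a placeholder path.
y, sr = librosa.load("song.wav", sr=22050, mono=True, offset=30.0, duration=30.0)

# Break the excerpt into short analysis frames of 512 samples
# (about 23 ms at 22,050 Hz) with 50% overlap; low-level features
# are then computed per frame.
frames = librosa.util.frame(y, frame_length=512, hop_length=256)
print(frames.shape)  # (512, number_of_frames)
```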


1. Timbre Features

1) Mel-frequency cepstral coefficients (MFCCs) [12]:
MFCCs are computed by taking the short-time Fourier transform of each frame, mapping the power spectrum onto the mel scale with a bank of triangular bandpass filters, taking the logarithm, and applying a discrete cosine transform to obtain the cepstral coefficients. The triangular bandpass filters shown in Figure 3 are used for mel-scale smoothing and are designed based on the sensitivity of human hearing. Since hearing is much more sensitive at low frequencies than at high frequencies, the smoothing triangular filters cover a larger range at high frequencies, while the bandwidth of the smoothing filters at low frequencies is narrower. The main purpose of the triangular bandpass filters is to emphasize the formants, i.e., the local maxima of the power spectrum, and to eliminate the influence of harmonics. Therefore, MFCCs are independent of the pitch and tone of the audio signal and can thus be an excellent feature set for speech recognition and audio processing. The log energy of the signal frame plus 12 cepstral coefficients, i.e., a 13-dimensional feature set, constitutes the basic MFCCs for an audio signal frame.

2) Spectral shape features [1-3]:
Spectral shape features are computed directly from the power spectrum of an audio signal frame and describe the shape and characteristics of the power spectrum. Several popular spectral shape features are introduced in the following.
1. Spectral centroid is the centroid of the magnitude spectrum of the short-time Fourier transform (STFT); it is a measure of spectral brightness.
2. Spectral flux is the squared difference between the normalized magnitudes of successive spectral distributions. It measures the amount of local spectral change.
3. Spectral roll-off is the frequency below which 85% of the magnitude distribution is concentrated. It measures the spectral shape.
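For illustration, the timbre features above can be computed with librosa as follows; the 512-sample frames follow the setup of [3], while the file path and the hand-rolled spectral-flux computation are assumptions of this sketch.

```python
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050, duration=30.0)  # placeholder path

# 13-dimensional MFCCs, one column per frame; librosa's 0th coefficient
# plays the role of the overall log-energy term.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)

# Spectral shape features from the magnitude spectrum of the STFT.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=512, hop_length=256)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85,
                                           n_fft=512, hop_length=256)

# Spectral flux: squared difference between successive normalized
# magnitude spectra.
S = np.abs(librosa.stft(y, n_fft=512, hop_length=256))
S = S / (S.sum(axis=0, keepdims=True) + 1e-10)  # normalize each frame
flux = np.sum(np.diff(S, axis=1) ** 2, axis=0)
```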


3) Temporal features [1]:
Temporal features measure the shape and characteristics of the audio waveform in the time domain. For example, the zero-crossing rate is the number of zero crossings of the signal in the time domain; it measures the noisiness of the signal.

4) Energy features [1]:
Energy features describe the energy content of the signal. The low-energy feature is the most popular one: it measures the percentage of frames whose root mean square (RMS) energy is less than the average RMS energy over the whole signal, and thus characterizes the amplitude distribution of the signal. For example, vocal music with silences has a large low-energy value, while continuous strings have a smaller low-energy value.

5) Texture window [1, 2]:
All the timbre features mentioned above are computed within a small frame (about 10-60 ms) over the whole audio signal; that is, a song is broken into many small frames and the timbre features of each frame are computed. However, in order to capture the long-term variation of the signal, which we call "texture", the actual features classified by an automatic system are the running means or variances of the extracted features described above over a number of small frames. "Texture window" is the term used to describe this larger window. For example, the system of [3] uses a small frame of 23 ms (512 samples at a 22,050 Hz sampling rate) and a texture window of 1 s (43 analysis windows).
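The following is a minimal sketch of the zero-crossing rate, the low-energy feature, and a texture window, assuming librosa and the 23 ms frame / 43-frame texture window of [3]; the file path is a placeholder.

```python
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050, duration=30.0)  # placeholder

# Frame-level features: zero-crossing rate and RMS energy per ~23 ms frame.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=512, hop_length=256)[0]
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256)[0]

# Low-energy feature: fraction of frames whose RMS energy lies below
# the average RMS energy of the whole signal.
low_energy = np.mean(rms < rms.mean())

# Texture window: running mean and standard deviation of a frame-level
# feature over 43 consecutive analysis frames (about 1 s here).
win = 43
texture_mean = np.convolve(zcr, np.ones(win) / win, mode="valid")
texture_std = np.array([zcr[i:i + win].std() for i in range(len(zcr) - win + 1)])
```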


2. Rhythmic Features [13, 14]

Rhythmic features describe the periodicity of the audio signal.

1) Tempo induction:
Measures the number of beats per minute (BPM) and the inter-beat interval.

2) Beat tracking [5]:
Uses bandpass filters and comb filters to extract the beat from musical signals of arbitrary musical structure containing arbitrary timbres. On the other hand, the simplest method is to calculate the beat histogram. For example, the beat histograms of four genres are shown in Figure 4. As you can see, rock and hip-hop show higher BPM with stronger beat strength than classical and jazz music. The histograms are intuitive, since the rhythms of rock and hip-hop music are bouncy while classical and jazz music are gentle.

Figure 4. Beat histograms of different genres; the horizontal axis is beats per minute (BPM), and the vertical axis is the beat strength.
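For illustration, librosa provides an onset-based tempo and beat tracker (a different method from the bandpass/comb-filter approach of [5]), and a crude beat-periodicity profile in the spirit of a beat histogram can be read off the autocorrelation of the onset-strength envelope; the path and parameters below are assumptions of this sketch.

```python
import librosa

y, sr = librosa.load("song.wav", sr=22050, duration=30.0)  # placeholder

# Global tempo estimate (beats per minute) and beat positions.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
print(tempo)

# Autocorrelation of the onset-strength envelope: peaks at a given lag
# correspond to dominant beat periods, i.e. the peaks of a beat histogram.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
ac = librosa.autocorrelate(onset_env, max_size=400)
```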


iii. Classification Methods

1. The unsupervised approaches

1) k-means clustering:
k-means partitions the songs in the feature space into k clusters by repeatedly assigning each song to its nearest cluster centroid and updating the centroids, until the assignments no longer change. The resulting clusters describe the similarity among songs; that is, songs belonging to the same cluster may sound more similar. The drawback of k-means is that "k" must be decided in advance, and how to choose a proper "k" is still an intractable issue.

2) Agglomerative hierarchical classification [16, 17]:
Agglomerative hierarchical classification agglomerates clusters into a tree structure, as in Figure 6. Initially, we take each data point in the feature space as a cluster C_i. We find the cluster C_j with the minimal distance to C_i and agglomerate C_i and C_j into a new cluster. We repeat these operations to build up an agglomerative tree, as in Figure 6, until the minimal distance between clusters exceeds a threshold or the number of clusters is small enough.

The advantage of agglomerative hierarchical classification is that it reveals the similarity among songs in depth. For music recommendation, we can recommend songs within the same sub-cluster first.

Figure 6: Agglomerative tree; the horizontal axis is the cluster number, and the vertical axis is the distance between clusters.
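Both clustering schemes are available in scikit-learn; the sketch below assumes a feature matrix X with one row per song (random data stands in for real song features) and illustrative values for k and the distance threshold.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# X: one row per song, one column per feature (e.g. texture-window
# statistics of MFCCs). Random data stands in for a real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# k-means: k must be chosen in advance.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])

# Agglomerative clustering: repeatedly merge the two closest clusters,
# stopping once the linkage distance exceeds the threshold.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=15.0).fit(X)
print(agg.n_clusters_)
```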


2. The supervised approaches

The supervised approach to music genre classification has been studied more extensively. Instead of clustering with no knowledge about the genres of the songs, as the unsupervised approaches do, a system designed with supervised approaches is first trained on manually labeled data; that is, a supervised approach knows the genres of the training songs. When unlabeled (newly arriving) data comes, the trained system is used to classify it into a known genre. A number of commonly used supervised algorithms are described in the following subsections.

1) Support vector machines (SVM) [18, 19]:
An SVM aims to find a classification hyperplane that maximizes the margin between the data of different genres; the basic idea is shown in Figure 7.

Figure 7: The goal of an SVM is to find a classification hyperplane (the solid line) that maximizes the margin between the two support hyperplanes E1 and E2.

Training an SVM is a convex optimization problem; we do not describe the solution in depth in this tutorial, since it is not the focus of this topic. In order to make the feature space easier to separate, a kernel is used to implicitly transform the feature space into a higher dimension. SVMs have been widely used for genre classification [19].
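As an illustration, here is a minimal SVM training sketch with scikit-learn's SVC; the RBF kernel, the parameter values, and the random stand-in data are assumptions of this sketch rather than the setup used in [18, 19].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: feature vectors of labeled songs; y: integer genre labels.
# Random data stands in for real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 10, size=200)

# An RBF kernel implicitly maps the features into a higher-dimensional
# space where a maximum-margin hyperplane is easier to find.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
pred = clf.predict(X[:5])  # classify newly arriving songs
```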


2) Over-fitting problem:
Some approaches to genre classification achieve accuracies as high as 95%, but the accuracy holds "only" for the specific database. In other words, if the system is tuned too closely to the training data, it may perform perfectly on those samples but is unlikely to work well on a new database.

Figure 9(a) and 9(b): Two decision boundaries in the feature space.

Figure 9 illustrates this idea. If the system is tuned too closely to the training samples, as in Figure 9(a), it will obtain 100% accuracy on those samples. However, the boundary does not honestly reflect the underlying model of the data, and newly arriving data are likely to be misclassified. The challenge is to find the right trade-off for generalization: a system complex enough to capture the differences between the underlying models, but simple enough to avoid overfitting, as in Figure 9(b).
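A standard way to expose this kind of over-fitting is to compare training accuracy with cross-validated accuracy; the following sketch uses scikit-learn and synthetic stand-in data, since the works above do not specify an implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))   # stand-in feature matrix
y = rng.integers(0, 10, size=200)

# Accuracy measured on the training data alone can be misleadingly high;
# k-fold cross-validation estimates how the classifier generalizes to
# songs it has never seen.
train_acc = SVC().fit(X, y).score(X, y)
cv_acc = cross_val_score(SVC(), X, y, cv=5).mean()
print(train_acc, cv_acc)
```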


iv. Experimental Results

The Music Information Retrieval Evaluation eXchange (MIREX) [7] is a famous contest in music retrieval research. It offers standard databases not only for the contest but for any research purpose. There are many topics in this contest, as listed in Figure 10.

Figure 10: Topics for challenge in MIREX 2010.

The dataset for music genre classification in 2009 includes 7000 30-second audio clips in 22.05 kHz mono WAV format drawn from 10 genres (700 clips from each genre). The genres are Blues, Jazz, Country, Baroque, Classical, Romantic, Electronic, Hip-Hop, Rock, and Metal. The mean accuracy of the winner in 2009 was 73.33% (refer to the MIREX 2009 result page: http://www.music-ir.org/mirex/wiki/2009:Audio_Genre_Classification_(Mixed_Set)_Results). On the other hand, the accuracy across different genres for part of the submitted algorithms is shown in Figure 11. As you may see, hip-hop obtains the highest accuracy and rock the lowest. This is intuitive, since hip-hop music is unique and quite different from the other genres, while rock music is easily confused with metal.


Figure 11: Accuracy across genre categories.

v. Conclusion

Automatic genre classification is a complicated and problematic task, but it still has important value in both research and commercial applications. Recently, performance gains in automatic genre classification research have been small, with the result that some have suggested pursuing similarity research instead, and some have argued that we should abandon genre classification because of its limited utility and its ambiguous, subjective nature. C. McKay and I. Fujinaga [20] present a number of counterarguments from both psychology and social science that emphasize the importance of continuing research in automatic genre classification. Most importantly, music information retrieval research could benefit from music datasets that include varied metadata such as genre, emotion, lyrics, and chords. However, subjective and poorly labeled ground truth is a problem, so a large-scale effort to construct high-quality ground truth is still needed.

VI. References

[1] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, 2006.


[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.
[3] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[4] Y.-H. Yang, Y.-C. Lin, Y.-F. Su et al., "A regression approach to music emotion recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448-457, 2008.
[5] E. Scheirer, "Tempo and beat analysis of acoustic musical signals," The Journal of the Acoustical Society of America, vol. 103, no. 1, pp. 588-601, 1998.
[6] H.-C. Chen and A. L. P. Chen, "A music recommendation system based on music and user grouping," Journal of Intelligent Information Systems, vol. 24, no. 2, pp. 113-132, 2005.
[7] Music Information Retrieval Evaluation eXchange (MIREX 2009), http://www.music-ir.org/mirex/wiki/2009:Main_Page.
[8] Allmusic, http://www.allmusic.com/.
[9] Amazon music, http://www.amazon.com/.
[10] W.-H. Tsai and H.-M. Wang, "Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 330-341, 2006.
[11] D. Perrott and R. O. Gjerdingen, "Scanning the dial: an exploration of factors in the identification of musical style," in Proceedings of the 1999 Society for Music Perception and Cognition, p. 88, 1999.
[12] B. Logan, "Mel frequency cepstral coefficients for music modeling," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), pp. 11-23, 2000.
[13] F. Gouyon and S. Dixon, "A review of automatic rhythm description systems," Computer Music Journal, vol. 29, no. 1, pp. 34-54, 2005.
[14] E. Gomez, A. Klapuri, and B. Meudic, "Melody description and extraction in the context of music content processing," Journal of New Music Research, vol. 32, no. 1, pp. 23-40, 2003.
[15] Pandora, http://www.pandora.com/.
[16] A. S. Lampropoulos and G. A. Tsihrintzis, "Agglomerative hierarchical clustering


for musical database visualization and browsing," 2004.
[17] X. Shao, C. Xu, and M. S. Kankanhalli, "Unsupervised classification of music genre using hidden Markov model," pp. 2023-2026, vol. 3.
[18] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, pp. 293-300, 1999.
[19] C. Xu, N. C. Maddage, X. Shao et al., "Musical genre classification using support vector machines," pp. 429-432.
[20] C. McKay and I. Fujinaga, "Musical genre classification: Is it worth pursuing and how can it be improved?," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2006.
