4 OVERVIEW OF AUDIO SIGNAL PARAMETRIZATION
4.1 MUSIC MOOD RECOGNITION PARAMETRIZATION
Figure 4.1 Three layers of music interpretation and description
4.1.1 Music Features and Parameters Related to Mood of Music
In MIR, and especially in MER, studies are performed to determine the relationship between music features and their impact on the listener. Music Emotion Recognition is the area where these relationships are crucial and underlie the whole concept of mood recognition. In the subsequent section, relationships investigated by different researchers are cited and compared. It may also be observed that composers commonly use these rules: a skilled and conscious composer uses particular elements of music to achieve the desired impact on the listener.
Several works related to Music Mood Recognition refer to different groups and combinations of music features and specific parameters. However, it should be remembered that these descriptors have their roots in earlier research performed within general Music Information Retrieval [50,232].
Eronen [79] analyzed several features with regard to recognition performance in a musical instrument recognition system. He took into consideration a wide set of features covering both spectral and temporal properties of sound.
A number of studies investigated the relationship between music features and valence and arousal [91,92,177,327]. Valence and arousal are terms taken from Thayer's model of emotion, which is described in detail in Section 2.5. A summarized set of music features important in the prediction of valence and arousal is listed in Tab. 4.1 [43]. Definitions and descriptions of music features are included in Section 2.2.
Table 4.1 Features in the prediction of valence and arousal [43]

Valence                                                  Arousal
Chroma                                                   Slow tempo
Percussiveness variability across bands                  Loudness
Measure related to the ratio of fast and slow tempos     Chroma eccentricity
Modulation spectrum                                      Fast tempo
Harmonic strangeness                                     Spectral tilt
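Several of the descriptors listed in Tab. 4.1 can be approximated with standard audio analysis tools. Below is a minimal sketch, assuming the third-party librosa library (not the toolchain of the cited works); the function name extract_mood_features is hypothetical, and the tempo and spectral tilt computations are illustrative simplifications.

```python
import numpy as np
import librosa

def extract_mood_features(path):
    # Load audio as mono at 22.05 kHz.
    y, sr = librosa.load(path, sr=22050, mono=True)

    # Chroma: average 12-bin pitch-class profile (valence side of Tab. 4.1).
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)

    # Loudness proxy: mean RMS level in dB (arousal side of Tab. 4.1).
    rms = librosa.feature.rms(y=y)
    loudness_db = float(np.mean(librosa.amplitude_to_db(rms, ref=np.max)))

    # Global tempo estimate, standing in for the fast/slow tempo measures.
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Spectral tilt approximated by the slope of a line fitted to the
    # average log-magnitude spectrum (dB versus frequency).
    spectrum = np.abs(librosa.stft(y)).mean(axis=1)
    freqs = librosa.fft_frequencies(sr=sr)
    tilt = np.polyfit(freqs[1:], librosa.amplitude_to_db(spectrum[1:]), 1)[0]

    return {"chroma": chroma,
            "loudness_db": loudness_db,
            "tempo_bpm": float(np.atleast_1d(tempo)[0]),
            "spectral_tilt_db_per_hz": float(tilt)}
```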
Hevner summarized her findings related to the music features that create the emotional content of music [108]. These findings are shown schematically in Tab. 4.2, arranged according to the eight clusters of adjectives included in Hevner's model of emotions (Fig. 2.20) [108].
The next step of the analysis is to determine the relationship between music features and particular parameters. Brinker [43] tested 79 features and proposed a schematic alignment, which is presented in Tab. 4.3.
Table 4.2 Musical characteristics related to emotion groups with weights proposed by Hevner [108]

No.  Adjectives                                                          Music Characteristic   Weight
1    spiritual, lofty, awe-inspiring, …                                  …                      …
2    pathetic, doleful, sad, mournful, tragic, melancholy, frustrated    …                      …
Table 4.3 Parameters related to musical features proposed by Brinker [43]

Feature class     Descriptor
Spectral          MFCC and modulations
Tonality          Chroma, key, consonance, dissonance, harmonic strangeness, chroma eccentricity
Rhythm            Tempos (fast, slow), onsets, inter-onset intervals
Percussiveness    Characterization and classification of onsets per band
This set allowed Brinker to achieve valence and arousal prediction with a variance of 0.68 for arousal and 0.50 for valence [43].
As mentioned before, within the area of Music Emotion Recognition authors use different sets of parameters and algorithms. Panda employed Marsyas and the MIR Toolbox for audio feature extraction [234]. He fed the parameters into an SVM classification and regression system, reducing the number of features using forward feature selection (FFS).
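Forward feature selection is a wrapper method: starting from an empty set, it greedily adds the feature that most improves the cross-validated performance of the chosen model. A minimal sketch, assuming scikit-learn (not the tooling of [234]) and placeholder data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: (n_songs, n_features) feature matrix, y: mood labels; placeholders here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)

# Forward selection: greedily add the feature that most improves
# cross-validated SVM accuracy until 10 features are kept.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
selector = SequentialFeatureSelector(
    svm, n_features_to_select=10, direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```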
Rauber and his collaborators [256] executed a two-stage feature extraction based on psycho-acoustic models and used the resulting features in a SOM model (Fig. 4.2). They based their parameters on the basics of auditory perception, that is, loudness sensation and rhythm patterns per frequency band.
Figure 4.2 Two-stage feature extraction proposed by Rauber [256]
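A self-organizing map projects high-dimensional feature vectors onto a 2-D grid so that similar songs land on nearby nodes. The following sketch assumes the third-party MiniSom package and random placeholder features; it is not the SOM implementation of [256]:

```python
import numpy as np
from minisom import MiniSom

# Placeholder rhythm-pattern features: one 24-dimensional vector per song.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 24))

# 10x10 map; each node learns a prototype feature vector, so songs with
# similar rhythm patterns map to nearby nodes.
som = MiniSom(10, 10, input_len=24, sigma=1.5, learning_rate=0.5,
              random_seed=0)
som.random_weights_init(features)
som.train_random(features, num_iteration=5000)

# The winning node's grid coordinates serve as a song's 2-D map position.
positions = np.array([som.winner(v) for v in features])
```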
Baume [25] described 47 different types of audio features and evaluated them for the purpose of music mood recognition. He tested these sets with different types of regressors (Fig. 4.3) as well as different subsets for each SVR regressor (Tab. 4.4). Baume used different subset evaluation techniques that can be divided into three categories, following Liu's [184] categorization of feature selection techniques: the filter model, the wrapper model, and the hybrid model [25]. The filter model relies on general characteristics of the data to evaluate feature subsets, whereas the wrapper model uses the performance of a predetermined algorithm (such as a support vector machine) as the evaluation criterion. The wrapper model gives superior performance, as it finds the features best suited to the chosen algorithm, but it is more computationally expensive and specific to that algorithm. The hybrid model attempts to combine the advantages of both. The results of his work are presented in Fig. 4.3 and Tab. 4.4.
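The filter/wrapper distinction can be illustrated in a few lines. A minimal sketch, assuming scikit-learn and placeholder data; the univariate F-test (filter) and recursive feature elimination (wrapper) stand in for the concrete techniques compared in [25]:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 47))             # 47 audio features per excerpt
y = X[:, 0] * 0.8 + rng.normal(size=300)   # placeholder arousal ratings

# Filter model: score each feature from the data alone (F-statistic of a
# univariate regression), independently of any learning algorithm.
filter_mask = SelectKBest(f_regression, k=10).fit(X, y).get_support()

# Wrapper model: let the chosen regressor (a linear SVR) decide which
# features to keep; costlier, but tuned to that specific algorithm.
wrapper = RFE(SVR(kernel="linear"), n_features_to_select=10).fit(X, y)

for name, mask in [("filter", filter_mask), ("wrapper", wrapper.support_)]:
    score = cross_val_score(SVR(kernel="rbf"), X[:, mask], y, cv=5).mean()
    print(f"{name} subset R^2: {score:.3f}")
```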
For the purpose of MIR, including genre classification and MER, Li [175] used MFCC, STFT, DWCH, and lyrics-based feature sets. At the same time, Skowronek and her collaborators [287,288] employed mostly rhythm-based, key, and chroma features in their experiments. Schmidt [273-275] tested several subgroups of features (i.e. MFCC, chroma, and Statistical Spectrum Descriptors) for emotion recognition and time-varying emotion regression. Schmidt analyzed individual sets of features and determined the accuracy of 4-category classification for each of them (Tab. 4.5).
Figure 4.3 The absolute error of the best performing combinations for each of the five regressors [25]
Table 4.4 Best feature combinations for each regressor [25]
Table 4.5 Results of 4-way mood classification for several groups of parameters [275]

Feature group               Accuracy
MFCC & Spectral Contrast    50.18 ± 4.18%
…
As a result, Schmidt found different features appropriate for particular analyses [273-275]. Even within his research, the length and the content of the feature vector varied.
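Evaluating each feature subgroup in isolation, as Schmidt did, amounts to cross-validating the same classifier on each subgroup's columns. A minimal sketch with scikit-learn; the subgroup names, dimensions, and data are placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=400)            # four mood categories
groups = {                                  # hypothetical feature subgroups
    "MFCC": rng.normal(size=(400, 20)),
    "Chroma": rng.normal(size=(400, 12)),
    "Spectral Contrast": rng.normal(size=(400, 7)),
}

# Score each subgroup with the same classifier and report mean accuracy
# plus its spread across folds, as in Tab. 4.5.
for name, X in groups.items():
    clf = make_pipeline(StandardScaler(), SVC())
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")
```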
Panda and collaborators [235] recently proposed a unique feature set consisting of standard and melodic features extracted directly from audio. Their results show that melodic features perform better than standard audio features. They achieved a result of 64% F-measure with only 11 features (9 melodic and 2 standard).
In many studies, different layers of music parametrization are mixed: parameters are commonly included in music feature sets and vice versa. For example, Brinker and his collaborators [43] put chroma and the modulation spectrum on the same stage of analysis (Tab. 4.1). On the other hand, the systematics proposed by Thayer [308] relate only to music features and are very hard to describe without an expert involved in the process.
The methodology proposed by the author of the presented dissertation is based on the three-stage music analysis described before (Fig. 4.1). It involves an attempt to create time-based parameters that describe particular musical content with mathematical tools. The proposed parameters and the motivation behind them are presented in Section 4.6.
Each of the works presented in this section refers to a different set of features and parameters, even though all of them aim at music mood recognition. Moreover, even within one computational method, different settings may require different parameters. Therefore, it is difficult to determine a single valid set of features which would be suitable for every approach to mood description and recognition in music.
4.1.2 Preprocessing
Preprocessing is a very important step that precedes almost any analysis. Its purpose is to prepare or adjust the data for a particular method or goal, extracting the desired information and removing redundant content.
Usually, data values within a dataset differ widely, which is one of the reasons for preprocessing. Normalization is often applied to bring various data to the same range of values. The procedure and the different types of normalization are described in Section 5.1.
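For illustration, the two most common schemes can be written in a few lines of NumPy; the data and ranges below are placeholders, not the procedure of Section 5.1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 8))   # raw feature values

# Min-max normalization: rescale each feature (column) to the range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Z-score standardization: zero mean and unit variance per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```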
Regardless of the type of the extracted feature, segmentation of the analyzed signal is applied first, to set a time resolution appropriate for the particular analysis and recognition task. Segmentation of an audio piece is used to split it into structural components such as vowels, phrases, notes, bars, etc. [159]. It is also commonly used in the analysis of time-varying signals, to obtain more detailed information in the time domain. During the parametrization process, segmentation is implemented, e.g., to avoid bias caused by fragments of silence, to observe differences between fragments, or to perform a proper averaging process. The lengths of the segments, their overlap, etc. are adjusted to match the requirements of the specific feature. In MER, segmentation is used not only for smoothing and determining whether some values are constant, but also for mood tracking [188].
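A common realization is fixed-length overlapping frames with a simple energy gate. A minimal sketch in NumPy; the frame length, hop size, and silence threshold are illustrative assumptions:

```python
import numpy as np

def segment(signal, frame_len=2048, hop=1024, silence_db=-50.0):
    """Split a 1-D signal into overlapping frames, dropping silent ones."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # RMS level per frame, in dB relative to full scale.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    # Keep only frames above the silence threshold to avoid biased averages.
    return frames[level_db > silence_db]
```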