4 OVERVIEW OF AUDIO SIGNAL PARAMETRIZATION
4.2 MPEG-7-BASED AUDIO PARAMETERS
MPEG-7 audio parameters [215] are commonly used in MIR, including MER [29,132,232]; they are therefore listed and described in the following section.
The MPEG-7 standard is a set of standardized tools for describing multimedia content. It provides tools for audio, image and video data that are used both by humans and by automatic systems. MPEG-7 Audio refers to the audio content of any multimedia material.
Since MPEG-7 Audio features are widely described and discussed in the literature [132,215,216,267], they are reviewed only briefly in the following section.
MPEG-7 Audio contains low-level descriptors that can be implemented in many applications, as well as high-level descriptors, which are more specific to a set of applications described in the standard [215]. Low-level descriptors are grouped and listed in Tab. 4.6. High-level tools include more complex schemes and procedures: the audio signature Description Scheme, musical instrument timbre Description Schemes, the melody Description Tools to aid query-by-humming, general sound recognition and indexing Description Tools, and spoken content Description Tools. Since high-level descriptors are dedicated to specific tasks, which do not apply to the topic of presented
The MPEG-7 low-level audio descriptors are constructed to describe general attributes of an audio signal. There are 17 temporal and spectral descriptors that can be extracted from audio automatically and may be used in a variety of applications. MPEG-7 descriptors are often used to determine the similarity between different audio signals, which makes it possible to identify identical, similar or dissimilar audio content. This also provides the basis for classification of audio content.
Table 4.6 MPEG-7 Audio low-level descriptors
Group               Low-level descriptor          Abbreviation
Basic               Audio Waveform                AW
Spectral Basis      Audio Spectrum Basis          ASB
                    Audio Spectrum Projection     ASP
Signal Parameters   Audio Harmonicity             AH
                    Audio Fundamental Frequency   AFF
Timbral Temporal    Log Attack Time               LAT
                    Temporal Centroid             TC
Basic Descriptors provide a simple description of the temporal structure of an audio signal. They are listed below together with essential information.
Audio Waveform
Audio Waveform (AW) provides a compact description of the shape of an audio signal. The whole signal is divided into non-overlapping frames of length hopSize, and the lower (minRange) and upper (maxRange) limits of the audio amplitude in each frame are stored. AW thus consists of the minRange and maxRange time series, indexed by frame. A comparison of the regular waveform and the AW representation is shown in Figs. 4.4a and 4.4b.
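The frame-wise minimum and maximum extraction described above can be illustrated with a minimal Python sketch. The function name `audio_waveform` and the sample values are hypothetical, and the handling of a shorter final frame is an assumption, not something prescribed by the text.

```python
def audio_waveform(signal, hop_size):
    """Return the minRange and maxRange series, one value per frame."""
    min_range, max_range = [], []
    # Walk through the signal in non-overlapping frames of hop_size samples.
    # Assumption: a shorter trailing frame is still described.
    for start in range(0, len(signal), hop_size):
        frame = signal[start:start + hop_size]
        min_range.append(min(frame))
        max_range.append(max(frame))
    return min_range, max_range


# Illustrative samples only (two frames of four samples each).
samples = [0.0, 0.5, -0.2, 0.1, -0.7, 0.3, 0.9, -0.1]
mins, maxs = audio_waveform(samples, hop_size=4)
print(mins, maxs)
```

Each frame is reduced to two numbers, which is why the AW representation in Fig. 4.4b follows the outline of the original waveform at a much lower data rate.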
Audio Power
Audio Power (AP) describes the temporally smoothed instantaneous power of the audio.
The AP coefficient of the m-th frame of the signal is calculated according to the following formula:
\[ AP(m) = \frac{1}{N} \sum_{n=0}^{N-1} \left| S(n + mN) \right|^{2} \tag{4.1} \]
An example of the AP description of a music signal is given in Figure 4.4c.
Figure 4.4 Comparison of representations of audio signal: a) original signal, b) Audio Waveform, c) Audio Power
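As a minimal sketch of Eq. 4.1, the mean squared amplitude of each non-overlapping frame can be computed as shown below. The function name `audio_power` and the test samples are hypothetical, and dropping an incomplete trailing frame is an assumption made for simplicity.

```python
def audio_power(signal, frame_len):
    """AP(m) = (1/N) * sum over n of |s(n + m*N)|^2, with N = frame_len."""
    # Assumption: samples that do not fill a complete last frame are ignored.
    n_frames = len(signal) // frame_len
    return [
        sum(abs(signal[n + m * frame_len]) ** 2 for n in range(frame_len))
        / frame_len
        for m in range(n_frames)
    ]


# Illustrative samples only: a full-scale frame followed by a half-scale frame.
samples = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, -0.5, -0.5]
ap = audio_power(samples, frame_len=4)
print(ap)
```

Halving the amplitude in the second frame reduces its power to a quarter, which reflects the squared dependence in the formula.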
4.2.2 Basic Spectral Descriptors
Basic Spectral Descriptors provide time series of descriptions in the frequency domain.
Frequencies are scaled logarithmically.
Audio Spectrum Envelope
Audio Spectrum Envelope (ASE) is a log-frequency power spectrum, which is obtained by
bands are distributed within the range [loEdge, hiEdge], according to the chosen resolution
where P(k) is the power spectrum (see Eq. 4.1).
Audio Spectrum Centroid
Audio Spectrum Centroid (ASC) is the center of gravity of a log-frequency power spectrum and is calculated as follows:
\[ ASC = \frac{\sum_{k} \log_{2}\left( f(k) / 1000 \right) P(k)}{\sum_{k} P(k)} \]
where f(k) is the frequency (in Hz) of the k-th power spectrum coefficient P(k). Frequencies below 62.5 Hz are treated as a single band to avoid giving disproportionate weight to low-frequency components. Detailed information is included in Kim's work [132].
Audio Spectrum Spread
Audio Spectrum Spread (ASS) is a measure of the spectral shape. It is defined as the second central moment of the log-frequency spectrum.
Audio Spectrum Flatness (ASF) characterizes an audio spectrum and quantifies how noise-like or how tone-like a given sound is [100,189]. It describes the amount of peaks or resonant structure in a power spectrum, as opposed to the flat spectrum of white noise. A high spectral flatness (value 1.0 for white noise) indicates that the spectrum has a similar amount of power in all spectral bands. A low spectral flatness (approaching 0 for a pure tone) indicates that the spectral power is concentrated in a relatively small number of bands (mixture of sine waves) [29]. ASF is calculated by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum [189]. The Spectral Flatness Measure is calculated as follows:
\[ SFM_b(X) = \frac{\left( \prod_{k \in b} |X(k)|^{2} \right)^{1/N_b}}{\frac{1}{N_b} \sum_{k \in b} |X(k)|^{2}} \]
where X(k) is the magnitude spectrum of the signal x(t) and N_b is the number of spectral coefficients in sub-band b. The ASF is calculated within separate sub-bands b.
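The ratio of geometric to arithmetic mean can be sketched in Python as follows. The function names, the representation of sub-bands as (first bin, last bin exclusive) index pairs, and the conventions for all-zero bands and zero-valued coefficients are assumptions introduced for illustration only.

```python
import math


def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of one band's power values."""
    n = len(power)
    arithmetic_mean = sum(power) / n
    if arithmetic_mean == 0.0:
        return 1.0  # assumption: an all-zero band is reported as flat
    if min(power) == 0.0:
        return 0.0  # a single zero coefficient makes the geometric mean zero
    geometric_mean = math.exp(sum(math.log(p) for p in power) / n)
    return geometric_mean / arithmetic_mean


def audio_spectrum_flatness(magnitude, bands):
    """Flatness of |X(k)|^2 inside each (first_bin, last_bin_exclusive) band."""
    return [
        spectral_flatness([abs(x) ** 2 for x in magnitude[lo:hi]])
        for lo, hi in bands
    ]


# Illustrative magnitudes only: a flat band followed by a single-peak band.
magnitude = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0]
flatness = audio_spectrum_flatness(magnitude, bands=[(0, 4), (4, 8)])
print(flatness)
```

The flat band yields the maximum flatness, while the band with a single spectral peak yields the minimum, matching the noise-like versus tone-like interpretation given above.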
4.2.3 Spectral Basis
Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP) descriptors were initially defined for use in the MPEG-7 sound recognition high-level tool [132]. Their main concept is the projection of an audio signal spectrum (a high-dimensional representation) into a low-dimensional representation. This processing is intended for classification systems. The extraction of ASB and ASP is based on normalized techniques spectrum: Harmonic Ratio HR (the ratio of harmonic power to total power) and Upper Limit of Harmonicity ULH (the frequency beyond which the spectrum cannot be considered
Upper Limit of Harmonicity is an estimation of the frequency beyond which the spectrum no longer has any harmonic structure.
Audio Fundamental Frequency
Audio Fundamental Frequency (AFF) provides estimates of the fundamental frequency f0 in segments where the signal is assumed to be periodic. It can be interpreted as an approximation of the pitch of a music or speech signal.
Detailed calculation procedures for the Signal Parameters descriptors are included in [132].
4.2.5 Timbral Temporal
Timbral Temporal descriptors are extracted from the signal envelope in the time domain. They aim to describe perceptual features of instrument sounds based on the ADSR envelope, which is shown schematically in Fig. 4.5.
Figure 4.5 Schema of ADSR envelope of a single sound
Typical phases of the ADSR envelope are: Attack (the sound rises to its maximum volume), Decay (the volume falls to a second level known as the sustain level), Sustain (the volume level at which the sound is held after the decay phase) and Release (the volume falls to zero).
Log Attack Time
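Log Attack Time (LAT) is defined as the logarithm (base 10) of the time it takes the signal to rise from a minimum threshold to its maximum amplitude:
\[ LAT = \log_{10}(T_{stop} - T_{start}) \tag{4.7} \]
where T_{start} is the time at which the signal first exceeds the threshold and T_{stop} is the time at which it reaches its maximum.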
Temporal Centroid
Temporal Centroid (TC) is defined as the time average over the energy envelope of the signal and is calculated as follows:
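\[ TC = \frac{N_{hop}}{F_s} \cdot \frac{\sum_{l=0}^{L-1} l \, Env(l)}{\sum_{l=0}^{L-1} Env(l)} \tag{4.8} \]
where Env(l) is the signal envelope in frame l, L is the number of frames, N_{hop} is the hop size in samples and F_s is the sampling frequency.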
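The envelope-weighted time average can be sketched as follows. The function name and the example values are hypothetical, and converting frame indices to seconds through the hop size and sampling frequency is an assumption about how the envelope is sampled.

```python
def temporal_centroid(envelope, hop_size, sample_rate):
    """Envelope-weighted mean frame index, converted to seconds."""
    weighted_sum = sum(l * env for l, env in enumerate(envelope))
    total = sum(envelope)
    # One envelope value per hop, so one frame lasts hop_size / sample_rate s.
    return (hop_size / sample_rate) * weighted_sum / total


# Illustrative envelope only: symmetric around frame 2, half a second per frame.
envelope = [0.0, 1.0, 2.0, 1.0, 0.0]
tc = temporal_centroid(envelope, hop_size=512, sample_rate=1024)
print(tc)
```

A symmetric envelope places the centroid at its middle frame, while a sound with a fast attack and a long decay would shift it towards the beginning.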
4.2.6 Timbral Spectral Descriptors
Timbral Spectral descriptors describe the structure of harmonic spectra and are extracted in a linear frequency space.
Harmonic Spectral Centroid
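Harmonic Spectral Centroid (HSC) is defined as the average, over the duration of the signal, of the amplitude-weighted mean (on a linear scale) of the harmonic peaks of the spectrum. For a given frame l, the local centroid is defined as:
\[ LHSC_l = \frac{\sum_{h=1}^{N_H} f_{h,l} \, A_{h,l}}{\sum_{h=1}^{N_H} A_{h,l}} \tag{4.9} \]
where f_{h,l} and A_{h,l} are the frequency and amplitude of the h-th harmonic peak in frame l, and N_H is the number of harmonics considered. The HSC value is then obtained by averaging the local centroids over the total number of frames L:
\[ HSC = \frac{1}{L} \sum_{l=0}^{L-1} LHSC_l \]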
Spectral Centroid (SC) is not related to the harmonic structure of the signal. It gives the power-weighted average of the discrete frequencies of the estimated spectrum over the sound segment. SC is highly correlated with the perceptual feature of the brightness of sound [132] and is calculated as follows:
\[ SC = \frac{\sum_{k} f(k) \, P(k)}{\sum_{k} P(k)} \]
where f(k) is the frequency of the k-th spectral coefficient and P(k) is the estimated power spectrum.
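The power-weighted average of frequencies can be illustrated with a short Python sketch. The function name and the example spectrum are hypothetical; the bin frequencies and power values are assumed to be given as equally long lists.

```python
def spectral_centroid(frequencies, power):
    """Power-weighted mean of the bin frequencies (same unit as frequencies)."""
    weighted_sum = sum(f * p for f, p in zip(frequencies, power))
    return weighted_sum / sum(power)


# Illustrative spectrum only: power concentrated symmetrically around 200 Hz.
frequencies = [0.0, 100.0, 200.0, 300.0]
power = [0.0, 1.0, 2.0, 1.0]
sc = spectral_centroid(frequencies, power)
print(sc)
```

Moving power towards higher bins raises the centroid, which is consistent with its interpretation as a correlate of brightness.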
Harmonic Spectral Deviation
Harmonic Spectral Deviation (HSD) measures the deviation of the harmonic peaks from the envelopes of the local spectra. To achieve HSD, local measures are averaged over the
Harmonic Spectral Spread (HSS) is a measure of the average spectrum spread in relation to the HSC. At the frame level, it is defined as the power-weighted RMS deviation from the local HSC, LHSC_l (Eq. 4.9).
Harmonic Spectral Variation
Harmonic Spectral Variation (HSV) reflects the spectral variation between adjacent frames. At the frame level, it is defined as the complement to 1 of the normalized correlation between the amplitudes of harmonic peaks taken from two adjacent frames.
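The frame-level definition can be sketched in Python as shown below. The function name and amplitude values are hypothetical, and normalizing the correlation by the product of the Euclidean norms of the two amplitude vectors is an assumption about the exact form of the normalization.

```python
import math


def local_hsv(prev_amplitudes, curr_amplitudes):
    """1 minus the normalized correlation of harmonic amplitudes in two frames."""
    correlation = sum(a * b for a, b in zip(prev_amplitudes, curr_amplitudes))
    # Assumed normalization: product of the Euclidean norms of both vectors.
    norm = math.sqrt(sum(a * a for a in prev_amplitudes)) * math.sqrt(
        sum(b * b for b in curr_amplitudes)
    )
    return 1.0 - correlation / norm


# Illustrative amplitudes only.
same_shape = local_hsv([1.0, 0.5, 0.25], [2.0, 1.0, 0.5])  # scaled copy
different = local_hsv([1.0, 0.0], [0.0, 1.0])  # energy moved to another harmonic
print(same_shape, different)
```

A frame that is only a louder copy of its predecessor gives a variation close to zero, whereas a complete redistribution of energy among the harmonics gives the maximum value of one.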