
4 OVERVIEW  OF  AUDIO  SIGNAL  PARAMETRIZATION

4.2 MPEG-7-BASED AUDIO PARAMETERS

MPEG-7 audio parameters [215] are commonly used in MIR, including MER [29,132,232]; they are therefore listed and described in this section.

The MPEG-7 standard is a set of standardized tools for describing multimedia content. It provides tools for audio, image and video data, which are used both by humans and by automatic systems. MPEG-7 Audio refers to the audio content of any multimedia item.

Since MPEG-7 Audio features are widely described and discussed in the literature [132,215,216,267], they are reviewed only briefly in the following section.

MPEG-7 Audio contains low-level descriptors, which can be implemented in many applications, as well as high-level descriptors, which are more specific to the set of applications described in the standard [215]. Low-level descriptors are grouped and listed in Tab. 4.6. High-level tools include more complex schemes and procedures: the audio signature Description Scheme, musical instrument timbre Description Schemes, the melody Description Tools to aid query-by-humming, general sound recognition and indexing Description Tools, and spoken content Description Tools. Since high-level descriptors are dedicated to specific tasks, which do not apply to the topic of presented [...]

The MPEG-7 low-level audio descriptors are designed to describe general attributes of an audio signal. There are 17 temporal and spectral descriptors that can be extracted from audio automatically and may be used in a variety of applications. MPEG-7 descriptors are often used to determine the similarity between different audio signals, which makes it possible to identify identical, similar or dissimilar audio content. This also provides the basis for the classification of audio content.

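Table 4.6 MPEG-7 Audio low-level descriptors

Group               Low-level descriptor            Abbreviation
Basic               Audio Waveform                  AW
                    Audio Power                     AP
Basic Spectral      Audio Spectrum Envelope         ASE
                    Audio Spectrum Centroid         ASC
                    Audio Spectrum Spread           ASS
                    Audio Spectrum Flatness         ASF
Spectral Basis      Audio Spectrum Basis            ASB
                    Audio Spectrum Projection       ASP
Signal Parameters   Audio Harmonicity               AH
                    Audio Fundamental Frequency     AFF
Timbral Temporal    Log Attack Time                 LAT
                    Temporal Centroid               TC
Timbral Spectral    Harmonic Spectral Centroid      HSC
                    Harmonic Spectral Deviation     HSD
                    Harmonic Spectral Spread        HSS
                    Harmonic Spectral Variation     HSV
                    Spectral Centroid               SC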

Basic Descriptors provide a simple description of the temporal structure of an audio signal. They are listed below together with essential information.

 

Audio  Waveform  

Audio Waveform (AW) is defined to give a compact description of the shape of an audio signal. The whole signal is divided into non-overlapping frames of length hopSize, and the lower (minRange) and upper (maxRange) limits of the audio amplitude in each frame are stored. AW consists of the minRange and maxRange time series, indexed by frame number. A comparison of the regular waveform and the AW representation is shown in Figs. 4.4a and 4.4b.
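The frame-wise minimum and maximum described above can be sketched in a few lines of Python. This is only a minimal illustration and not the reference extraction procedure of the standard: the function name `audio_waveform` and the choice to keep a shorter final frame are assumptions made here.

```python
def audio_waveform(signal, hop_size):
    """Return the (min_range, max_range) time series, one value per frame.

    Assumption: frames do not overlap and a shorter last frame is kept.
    """
    min_range, max_range = [], []
    for start in range(0, len(signal), hop_size):
        frame = signal[start:start + hop_size]
        min_range.append(min(frame))
        max_range.append(max(frame))
    return min_range, max_range


samples = [0.0, 0.5, -0.2, 0.1, -0.7, 0.3, 0.9, -0.1]
print(audio_waveform(samples, hop_size=4))  # ([-0.2, -0.7], [0.5, 0.9])
```

With a hop size of four samples, the eight input values are reduced to two (minimum, maximum) pairs, which is the compaction that AW is meant to provide.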

Audio  Power  

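Audio Power (AP) describes the temporally smoothed instantaneous power of the audio signal. The AP coefficient of the m-th frame of the signal is calculated according to the following formula:

$$ AP(m) = \frac{1}{N} \sum_{n=0}^{N-1} |S(n + mN)|^2 \tag{4.1} $$

where S(n) is the n-th sample of the signal and N is the number of samples in a frame.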
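Equation 4.1 translates directly into code. The sketch below assumes non-overlapping frames of `frame_len` samples and silently drops an incomplete last frame; both the function name and that choice are assumptions for illustration.

```python
def audio_power(signal, frame_len):
    """AP(m) = (1/N) * sum_{n=0}^{N-1} |S(n + m*N)|^2 for every full frame m."""
    n_frames = len(signal) // frame_len  # an incomplete last frame is ignored
    return [
        sum(abs(signal[n + m * frame_len]) ** 2 for n in range(frame_len)) / frame_len
        for m in range(n_frames)
    ]


print(audio_power([1.0, -1.0, 2.0, 2.0], frame_len=2))  # [1.0, 4.0]
```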

An  example  of  the  AP  description  of  a  music  signal  is  given  in  Figure  4.4c.  

   

Figure  4.4     Comparison  of  representations  of  audio  signal:  a)  original  signal,  b)  Audio  Waveform,   c)  Audio  Power    

4.2.2 Basic  Spectral  Descriptors  

Basic Spectral Descriptors provide time series of descriptions in the frequency domain. Frequencies are scaled logarithmically.

   

Audio  Spectrum  Envelope    

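Audio Spectrum Envelope (ASE) is a log-frequency power spectrum, which is obtained by summing the energy of the power spectrum within a series of frequency bands. The bands are distributed logarithmically within the range [loEdge, hiEdge], according to the chosen resolution:

$$ ASE(b) = \sum_{k=loK_b}^{hiK_b} P(k) $$

where P(k) is the power spectrum (see Eq. 4.1), and loK_b and hiK_b are the lowest and highest frequency bins of band b.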
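As a rough illustration of how a log-frequency envelope can be obtained from a power spectrum, the sketch below sums power-spectrum bins into bands whose edges are spaced by a fixed number of octaves between `lo_edge` and `hi_edge`. The function name, the parameter names and their default values are assumptions, and the additional out-of-band coefficients defined in the standard are deliberately left out.

```python
import math


def spectrum_envelope(power, sample_rate, n_fft,
                      lo_edge=62.5, hi_edge=16000.0, resolution=1.0):
    """Sum power-spectrum bins into log-spaced bands inside [lo_edge, hi_edge).

    power      : power spectrum, bin k corresponds to k * sample_rate / n_fft Hz
    resolution : band width in octaves (assumed parameterization)
    """
    n_bands = math.ceil(math.log2(hi_edge / lo_edge) / resolution)
    envelope = [0.0] * n_bands
    for k, p in enumerate(power):
        freq = k * sample_rate / n_fft
        if lo_edge <= freq < hi_edge:
            band = int(math.log2(freq / lo_edge) / resolution)
            band = min(band, n_bands - 1)  # guard against rounding at the top edge
            envelope[band] += p
    return envelope


# Bins at 0, 1000, 2000, 3000 and 4000 Hz; bands [500,1000), [1000,2000), [2000,4000)
print(spectrum_envelope([1, 2, 3, 4, 5], sample_rate=8000, n_fft=8,
                        lo_edge=500.0, hi_edge=4000.0))  # [0.0, 2.0, 7.0]
```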

Audio  Spectrum  Centroid    

Audio Spectrum Centroid (ASC) stands for the center of gravity of a log-frequency power spectrum and is calculated as follows:

$$ ASC = \frac{\sum_{k} \log_2\left(f(k)/1000\right) P(k)}{\sum_{k} P(k)} $$

where P(k) is the power spectrum coefficient at frequency f(k), expressed in Hz. In this calculation, frequencies below 62.5 Hz are treated as a single band to avoid giving disproportionate weight to low-frequency components. Detailed information is included in Kim's work [132].

Audio  Spectrum  Spread    

Audio Spectrum Spread (ASS) is a measure of the spectral shape. It is defined as the second central moment of the log-frequency spectrum.
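The centroid and the second central moment of a log-frequency spectrum can be computed together, as sketched below. Several details are assumptions taken from the usual formulation of these descriptors rather than from the text above: the log-frequency axis is expressed in octaves relative to 1 kHz, and the merged band below 62.5 Hz is given a nominal frequency of 31.25 Hz. The function name is hypothetical.

```python
import math


def spectrum_centroid_and_spread(freqs, power, ref_hz=1000.0, lo_hz=62.5):
    """Return (centroid, second central moment) on a log2(f / ref_hz) axis.

    Bins below lo_hz are merged into one band at lo_hz / 2 (assumption).
    """
    low_power = sum(p for f, p in zip(freqs, power) if f < lo_hz)
    bands = [(lo_hz / 2.0, low_power)]
    bands += [(f, p) for f, p in zip(freqs, power) if f >= lo_hz]

    total = sum(p for _, p in bands)
    if total == 0:
        raise ValueError("power spectrum is empty or all zero")

    centroid = sum(math.log2(f / ref_hz) * p for f, p in bands) / total
    moment2 = sum((math.log2(f / ref_hz) - centroid) ** 2 * p
                  for f, p in bands) / total
    return centroid, moment2


# Equal power one octave below and one octave above 1 kHz
centroid, moment2 = spectrum_centroid_and_spread([500.0, 2000.0], [1.0, 1.0])
print(centroid, moment2)
```

In this example the centroid lies at 1 kHz (0 octaves) and the second central moment is 1 octave squared; taking its square root gives the spread as an RMS deviation in octaves.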

 

Audio Spectrum Flatness

Audio Spectrum Flatness (ASF) characterizes an audio spectrum and provides a way to quantify how noise-like or how tone-like a given sound is [100,189]. It describes the amount of peaks or resonant structure in a power spectrum, as opposed to the flat spectrum of white noise. A high spectral flatness (value 1.0 for white noise) indicates that the spectrum has a similar amount of power in all spectral bands. A low spectral flatness (close to 0.0 for a pure tone) indicates that the spectral power is concentrated in a relatively small number of bands (a mixture of sine waves) [29]. ASF is calculated by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum [189]. The Spectral Flatness Measure is calculated as follows:

$$ SFM_b(X) = \frac{\left(\prod_{k \in b} |X(k)|^2\right)^{1/N_b}}{\frac{1}{N_b} \sum_{k \in b} |X(k)|^2} $$

where X(k) is the magnitude spectrum of the signal x(t) and N_b is the number of spectral coefficients in sub-band b. The ASF is calculated within separate sub-bands b.
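The ratio of geometric to arithmetic mean described above can be sketched as follows for the coefficients of a single sub-band. The function name is hypothetical, and returning 0.0 when any coefficient is zero is a convention chosen here, since the geometric mean vanishes in that case.

```python
import math


def spectral_flatness(magnitudes):
    """Geometric mean / arithmetic mean of the power spectrum of one sub-band."""
    power = [abs(x) ** 2 for x in magnitudes]
    arithmetic_mean = sum(power) / len(power)
    if arithmetic_mean == 0.0 or min(power) == 0.0:
        return 0.0  # convention used in this sketch
    # The geometric mean is computed in the log domain to avoid overflow
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / arithmetic_mean


flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])    # flat, noise-like band
peaky = spectral_flatness([0.1, 0.1, 4.0, 0.1])   # one dominant component
print(flat, peaky)
```

The flat band yields a value of 1, while the band dominated by a single component yields a value close to 0.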

4.2.3 Spectral  Basis  

Audio Spectrum Basis (ASB) and Audio Spectrum Projection (ASP) descriptors were initially defined for use in the MPEG-7 sound recognition high-level tool [132]. Their main concept is the projection of an audio signal spectrum (a high-dimensional representation) onto a low-dimensional representation. This processing is intended for classification systems. The extraction of ASB and ASP is based on normalized techniques [...] spectrum: Harmonic Ratio HR (the ratio of harmonic power to total power) and Upper Limit of Harmonicity ULH (the frequency beyond which the spectrum cannot be considered [...]

Upper  Limit  of  Harmonicity  is  an  estimation  of  the  frequency  beyond  which  the  spectrum   no  longer  has  any  harmonic  structure.    

Audio  Fundamental  Frequency  

Audio Fundamental Frequency (AFF) provides estimates of the fundamental frequency f0 in segments where the signal is assumed to be periodic. It can be interpreted as an approximation of the pitch of a music or speech signal.

Detailed calculation procedures for the Signal Parameters descriptors are included in [132].

4.2.5 Timbral  Temporal  

Timbral Temporal descriptors are extracted from the signal envelope in the time domain. They aim at describing perceptual features of instrument sounds based on the ADSR envelope, which is shown schematically in Fig. 4.5.

         

Figure 4.5 Schematic of the ADSR envelope of a single sound

Typical phases of ADSR are: Attack (the time in which the sound reaches its maximum volume), Decay (the time in which the volume falls to a second level, known as the sustain level), Sustain (the volume level at which the sound is held after the decay phase) and Release (the time in which the volume falls to zero).

 

Log  Attack  Time    

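Log Attack Time (LAT) is defined as the base-10 logarithm of the time the signal takes to rise from a minimum threshold to its maximum amplitude:

$$ LAT = \log_{10}(T_{stop} - T_{start}) \tag{4.7} $$

where T_start is the time at which the signal exceeds the minimum threshold and T_stop is the time at which it reaches its maximum amplitude.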
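A minimal sketch of Eq. 4.7 on a sampled envelope is given below. The threshold of 2% of the envelope maximum, the clamping of the attack to at least one envelope frame and the function name are all assumptions made for illustration; the text above only states that a minimum threshold is used.

```python
import math


def log_attack_time(envelope, frame_rate, start_fraction=0.02):
    """LAT = log10(T_stop - T_start), with times derived from envelope frames.

    envelope       : non-negative envelope values, one per frame
    frame_rate     : envelope frames per second
    start_fraction : threshold as a fraction of the maximum (assumed value)
    """
    peak = max(envelope)
    stop_idx = envelope.index(peak)
    start_idx = next(i for i, v in enumerate(envelope)
                     if v >= start_fraction * peak)
    # Clamp to one frame so that log10 is always defined (assumption)
    duration = max(stop_idx - start_idx, 1) / frame_rate
    return math.log10(duration)


# Attack from frame 1 to frame 3 at 100 frames per second: 0.02 s
print(log_attack_time([0.0, 0.1, 0.5, 1.0, 0.8], frame_rate=100))
```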

   

Temporal  Centroid  

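Temporal Centroid (TC) is defined as the time average over the energy envelope of the signal and is calculated as follows:

$$ TC = \frac{N_{hop}}{F_s} \cdot \frac{\sum_{l=0}^{L-1} l \cdot Env(l)}{\sum_{l=0}^{L-1} Env(l)} \tag{4.8} $$

where Env(l) is the signal envelope, N_hop is the hop size in samples between consecutive envelope values, F_s is the sampling frequency and L is the number of frames.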
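The envelope-weighted time average can be sketched as below. The function name is hypothetical, and `frame_rate` (envelope frames per second, that is, the sampling frequency divided by the hop size) is a parameter introduced here to convert frame indices into seconds.

```python
def temporal_centroid(envelope, frame_rate):
    """Envelope-weighted mean time, in seconds, of a sampled signal envelope."""
    total = sum(envelope)
    if total == 0:
        raise ValueError("envelope is empty or all zero")
    weighted = sum(l * env for l, env in enumerate(envelope))
    return weighted / total / frame_rate


print(temporal_centroid([1.0, 1.0, 1.0], frame_rate=1.0))  # 1.0
print(temporal_centroid([0.0, 0.0, 2.0], frame_rate=2.0))  # 1.0
```

A symmetric envelope has its centroid in the middle of the sound, whereas an envelope whose energy arrives late shifts the centroid towards the end.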

4.2.6 Timbral  Spectral  Descriptors  

Timbral Spectral descriptors describe the structure of harmonic spectra and are extracted in a linear frequency space.

Harmonic  Spectral  Centroid    

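Harmonic Spectral Centroid (HSC) is defined as the average, over the duration of the signal, of the amplitude-weighted mean (on a linear scale) of the harmonic peaks of the spectrum. For a given frame l, the local centroid is defined as:

$$ LHSC_l = \frac{\sum_{h=1}^{N_H} f_{h,l} A_{h,l}}{\sum_{h=1}^{N_H} A_{h,l}} \tag{4.9} $$

where f_{h,l} and A_{h,l} are the frequency and the amplitude of the h-th harmonic peak in frame l, and N_H is the number of harmonics. Thus, the HSC value is obtained by averaging the local centroids over the total number of frames L:

$$ HSC = \frac{1}{L} \sum_{l=0}^{L-1} LHSC_l $$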

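Spectral Centroid

Spectral Centroid (SC) is not related to the harmonic structure of the signal. It gives the power-weighted average of the discrete frequencies of the estimated spectrum over the sound segment. SC is highly correlated with the perceptual feature of the brightness of sound [132] and is calculated as follows:

$$ SC = \frac{\sum_{k} f(k) P(k)}{\sum_{k} P(k)} $$

where P(k) is the estimated power spectrum of the sound segment and f(k) is the frequency of the k-th bin.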
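Both centroids described above are weighted averages of frequency and differ only in what is averaged: HSC uses the amplitudes of the harmonic peaks frame by frame, while SC uses the power spectrum of the whole segment. The sketch below assumes that harmonic peak frequencies and amplitudes have already been estimated; all function names are hypothetical, and frames with zero total amplitude are not handled.

```python
def local_harmonic_centroid(freqs, amps):
    """Amplitude-weighted mean frequency of the harmonic peaks of one frame."""
    return sum(f * a for f, a in zip(freqs, amps)) / sum(amps)


def harmonic_spectral_centroid(frames):
    """Average of the local centroids; frames is a list of (freqs, amps) pairs."""
    local = [local_harmonic_centroid(f, a) for f, a in frames]
    return sum(local) / len(local)


def spectral_centroid(freqs, power):
    """Power-weighted mean of the bin frequencies of a segment spectrum."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


frames = [([100.0, 200.0], [1.0, 1.0]),   # local centroid 150 Hz
          ([100.0, 200.0], [1.0, 3.0])]   # local centroid 175 Hz
print(harmonic_spectral_centroid(frames))                          # 162.5
print(spectral_centroid([0.0, 100.0, 200.0], [0.0, 1.0, 1.0]))     # 150.0
```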

Harmonic  Spectral  Deviation    

Harmonic Spectral Deviation (HSD) measures the deviation of the harmonic peaks from the envelopes of the local spectra. To obtain HSD, local measures are averaged over the [...]

Harmonic Spectral Spread

Harmonic Spectral Spread (HSS) is a measure of the average spectrum spread in relation to the HSC. At the frame level, it is defined as the power-weighted RMS deviation from the local HSC, LHSC (Eq. 4.9).
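The frame-level spread can be sketched as a power-weighted RMS deviation of the harmonic peak frequencies from the local centroid. The optional division by the local centroid, which makes the result dimensionless, is an assumption exposed here as the `normalize` flag; the function name is hypothetical as well.

```python
import math


def local_harmonic_spread(freqs, amps, normalize=False):
    """Power-weighted RMS deviation of harmonic peaks from the local centroid."""
    centroid = sum(f * a for f, a in zip(freqs, amps)) / sum(amps)
    weights = [a * a for a in amps]  # power = squared amplitude
    variance = sum(w * (f - centroid) ** 2
                   for f, w in zip(freqs, weights)) / sum(weights)
    rms = math.sqrt(variance)
    # Normalizing by the centroid is an assumption of this sketch
    return rms / centroid if normalize else rms


print(local_harmonic_spread([100.0, 300.0], [1.0, 1.0]))                  # 100.0
print(local_harmonic_spread([100.0, 300.0], [1.0, 1.0], normalize=True))  # 0.5
```

Averaging these local values over all frames, in the same way as for HSC, gives a single spread value for the sound.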

Harmonic  Spectral  Variation  

Harmonic   Spectral   Variation   (HSV)   reflects   the   spectral   variation   between   adjacent   frames.  At  the  frame  level,  it  is  defined  as  the  complement  to  1  of  the  normalized  correlation   between  the  amplitudes  of  harmonic  peaks  taken  from  two  adjacent  frames.
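The frame-level definition above, one minus the normalized correlation of harmonic amplitudes in adjacent frames, can be sketched as follows. Averaging the local values over all frame pairs to obtain a single HSV value is an assumption made by analogy with HSC; the function names are hypothetical, and frames whose amplitudes are all zero are not handled.

```python
import math


def local_harmonic_variation(prev_amps, amps):
    """1 - normalized correlation of harmonic amplitudes of two adjacent frames."""
    correlation = sum(a * b for a, b in zip(prev_amps, amps))
    norm = (math.sqrt(sum(a * a for a in prev_amps))
            * math.sqrt(sum(b * b for b in amps)))
    return 1.0 - correlation / norm


def harmonic_spectral_variation(frames):
    """Mean local variation over all pairs of adjacent frames (assumption)."""
    local = [local_harmonic_variation(prev, cur)
             for prev, cur in zip(frames, frames[1:])]
    return sum(local) / len(local)


# First pair of frames is uncorrelated (1.0), second pair is identical (0.0)
print(harmonic_spectral_variation([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]))  # 0.5
```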