
Abstract—Despite the growing importance of Voice Transmission over IP (VoIP) systems, there is still a shortage of thorough analyses of the effect of VoIP transmission on the speech signal and on speech and speaker recognition performance. In this study the effect on speaker verification performance was investigated. An automatic system based on EM-MAP-GMM modeling methods was used. Four coders of the H.323 standard were investigated, with special emphasis placed on the packet loss phenomenon.

Index Terms— Automatic Speaker Verification, packet loss, speech compression, voice over IP.

I. INTRODUCTION

The popularity of automatic speaker recognition as a method of biometric human identification is constantly growing. A natural consumer of this type of technology is the banking sector, where voice biometrics is becoming a supplement to traditional security methods such as passwords.

Another potential beneficiary of voice biometrics is forensic science. Forensic speaker identification experts may use not only aural-perceptual methods but also unbiased parametric analysis based on automatic verification [14][19].

VoIP technology is a result of widespread access to the Internet and the continuing development of IP technologies. Its most important advantages are lower costs of telephone conversations and the possibility of parallel non-audio data transmission. According to telecommunication market predictions, there will be 88 million VoIP users in Western Europe by the end of 2012, which represents 240% growth since 2007 [4]. These market data show that the role of VoIP is still growing.

The effect of GSM and PSTN transmission degradations on speaker verification performance has been the subject of many previous studies. The majority of investigations focused on evaluating the influence of additive noise, such as white and colored noise [6], as well as the influence of non-linear spectral distortions [6][9]. The results showed that these types of distortion degrade speaker verification performance. Other researchers evaluated the influence of GSM audio codecs, such as GSM 06.10, GSM 06.20 and GSM 06.60 [7][8]. All GSM and PSTN speech transmission phenomena lead to a decline in speaker verification performance; the degree of degradation depends on the transmission technology.

II. VOICE TRANSMISSION OVER IP

Each telecommunication system is a collection of transmission and switching devices. Older telephone systems such as the Public Switched Telephone Network (PSTN) used circuit switching, in which two network nodes established a communication channel. Once the connection was established, the full bandwidth was guaranteed and the communication path remained busy for the entire duration of the session [3].

VoIP systems, in contrast, use packet switching. A packet is the fundamental unit of binary information in a telecommunication network, circulating between terminals as a series of bits. A packet consists of two kinds of data: the header and the body. The header contains, among other things, control information such as the terminal IP address, the number of packets into which the message has been divided, and synchronization information. The body comprises the coded and divided speech signal. Each packet is transmitted through the network individually. Unlike in circuit switching, the communication path remains busy only during packet transmission. When packet transmission is finished, the path becomes immediately accessible [3].

A. Packet loss and packet loss concealment

IP transmission may cause many different packet errors: among others, packets can be lost, damaged or delayed. Most IP traffic is carried by the TCP protocol, which provides packet retransmission. VoIP technology, however, uses the UDP protocol, which does not provide any recovery method.

Investigation of the packet loss phenomenon reveals that voice packets are lost in bursts. IP traffic can be described by a simple two-state Gilbert model [15][16], which uses only two parameters to describe the loss process and provides a good approximation of this phenomenon [17]. An example of the Gilbert model is presented in Fig 1 [5].

Fig 1 Gilbert model

The Effect of Voice Transmission over IP on Text-Independent Speaker Verification Performance

Waldemar Maciejko

2012

The probability $p_{11}$ stands for the conditional loss probability (clp). The Mean Burst Loss Length (MBLL), based on [15], is computed as:

$$\mathrm{MBLL} = \frac{1}{1 - \mathrm{clp}} \tag{1}$$

Based on equation (1), clp can be computed as [15][16]:

$$\mathrm{clp} = 1 - \frac{1}{\mathrm{MBLL}} \tag{2}$$

The probability of being in state B, representing the mean loss, is denoted as the unconditional loss probability ($P_Z$):

$$P_Z = p_{01}\, p_G + p_{11}\, p_B \tag{3}$$

where $p_G$ and $p_B$ are the stationary probabilities, with $p_G + p_B = 1$ and $P_Z = p_B$. Consequently, $p_{01}$ can be expressed as

$$p_{01} = \frac{P_Z\,(1 - \mathrm{clp})}{1 - P_Z} \tag{4}$$

Based on equations (4) and (2), the Gilbert model is described by [17]:

$$p_{01} = \frac{P_Z}{\mathrm{MBLL}\,(1 - P_Z)} \tag{5}$$
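The relations (2) and (5) are enough to drive a loss simulator. The following is only a minimal sketch of such a simulator, not the tool used in this study: the function names `gilbert_params` and `simulate_losses`, the seed handling and the choice to draw the first state from the stationary distribution are all assumptions made for illustration.

```python
import random


def gilbert_params(p_z, mbll):
    """Map (P_Z, MBLL) to the transition probabilities (p01, p11)."""
    if not 0.0 <= p_z < 1.0:
        raise ValueError("p_z must lie in [0, 1)")
    if mbll < 1.0:
        raise ValueError("mbll must be at least 1 packet")
    clp = 1.0 - 1.0 / mbll              # equation (2); p11 equals clp
    p01 = p_z / (mbll * (1.0 - p_z))    # equation (5)
    if p01 > 1.0:
        raise ValueError("p_z is too large for this mbll")
    return p01, clp


def simulate_losses(n_packets, p_z, mbll, seed=0):
    """Return a list of booleans, True where a packet is lost (state B)."""
    p01, p11 = gilbert_params(p_z, mbll)
    rng = random.Random(seed)
    lost = rng.random() < p_z           # assumed start: stationary distribution
    trace = []
    for _ in range(n_packets):
        trace.append(lost)
        # Two-state Markov chain: the next state depends only on the current one.
        lost = rng.random() < (p11 if lost else p01)
    return trace


if __name__ == "__main__":
    trace = simulate_losses(100_000, p_z=0.10, mbll=2.0)
    print("observed loss rate:", sum(trace) / len(trace))
```

Parameterizing by $P_Z$ and MBLL rather than by the raw transition probabilities keeps the two quantities that are easiest to interpret, the mean loss rate and the mean burst length, directly under control.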

B. Speech signal compression

The final effect of speech signal compression (quality and compression ratio) depends on the applied codec. Usually, higher compression lowers quality but allows a narrower band to be used for voice transmission.

The first standard for audio-video transmission over IP networks was defined by H.323. This standard uses many audio codecs [2], of which G.711 (the codec with the highest bit rate), G.729 (a codec with a medium bit rate) and G.723.1 (the codec with the lowest bit rate) were investigated in this study. They are summarized in Table I.

TABLE I
SELECTED H.323 STANDARD AUDIO CODERS [2].

Standard   Compression algorithm   Frame (ms)   Compression ratio   Bit rate (kbps)
G.711      PCM                     0.125        1:1                 64.0
G.729      +CS-ACELP               10           9:1                 6.4
G.729      CS-ACELP                10           8:1                 11.8
G.723.1    MP-MLQ                  30           10:1                6.3
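As a small arithmetic illustration of Table I, the number of coded bits carried by one frame follows from the bit rate and the frame length (kbit/s multiplied by ms gives bits). The sketch below only restates the table values; the names `CODERS` and `bits_per_frame` are hypothetical and the calculation ignores packet headers.

```python
# Values copied from Table I: (coder, bit rate in kbps, frame length in ms).
CODERS = [
    ("G.711", 64.0, 0.125),
    ("G.729", 6.4, 10.0),
    ("G.729", 11.8, 10.0),
    ("G.723.1", 6.3, 30.0),
]


def bits_per_frame(bit_rate_kbps, frame_ms):
    """Coded payload of one frame in bits (kbit/s * ms = bits)."""
    return bit_rate_kbps * frame_ms


if __name__ == "__main__":
    for name, rate, frame in CODERS:
        print(f"{name} at {rate} kbps: {bits_per_frame(rate, frame):.0f} bits per frame")
```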

III. AUTOMATIC SPEAKER VERIFICATION

The task of automatic speaker verification is to determine whether a test utterance Y was spoken by a hypothesized speaker M. The general approach to this task is a likelihood ratio test. In automatic systems, the similarity between the test utterance Y and the voice of the target speaker M is quantified by a similarity score which, following Bayes' theorem, is defined by:

$$\mathrm{score} = \frac{p(Y \mid M)}{p(Y \mid \neg M_{UBM})} > \delta \;\Rightarrow\; \text{match (accept)} \tag{6}$$

The likelihood ratio score is compared to a threshold δ and, in case of a match, the speaker is accepted. The target speaker model M can be estimated accurately using training speech. The model $\neg M_{UBM}$, called the Universal Background Model (UBM), represents the entire spectrum of possible alternatives to the hypothesized speaker and cannot be estimated as accurately [12]. The UBM represents a set of selected speakers by means of Gaussian distributions. The criteria for selecting speakers for the alternative population are, for example, quality of speech, gender and language. An example application of automatic speaker verification based on a UBM is presented in Fig 2.

Determining the likelihood functions in the numerator and denominator of equation (6) is the main objective of an automatic verification model. These functions depend, among other things, on the low-level spectral features used. The current system extracts 16 MFCC coefficients, 16 ΔMFCC first-order derivatives and the signal log energy using the HTK tool HCopy [20]. The core of the Mel Frequency Cepstral Coefficients is the short-term Fourier spectrum. The mel-frequency analysis uses filters spaced linearly at low frequencies and logarithmically at high frequencies. The mel scale is related to human pitch perception. Such a signal representation captures both speaker-specific and phonetically important characteristics of speech [10]. The MFCC signal representation underlies the subsequent speech analysis.

Fig 2 Basic building blocks of speaker verification system.

The next step is cepstral mean and variance normalization (CMS). This technique allows the recognizer to be used even when there is a mismatch between the training and testing environments [11].

The applied algorithm uses voice activity detection (VAD), and the CMS normalization is conducted only on speech (non-zero) frames. The VAD method is based on a 2nd-order GMM of the energy distribution. The method assumes that the energy of speech frames differs from the energy of non-speech frames [23].
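The normalization step can be sketched as follows. This is an illustrative sketch only, not the implementation used in the study: the function name `cmvn`, the list-of-lists frame layout and the externally supplied speech mask `is_speech` are assumptions, and discarding the non-speech frames from the output is likewise an assumption.

```python
import math


def cmvn(frames, is_speech):
    """Normalize each feature dimension to zero mean and unit variance.

    Statistics are computed over the frames flagged as speech, and only those
    frames are returned (assumed behavior).
    """
    speech = [f for f, keep in zip(frames, is_speech) if keep]
    if not speech:
        return []
    n, dim = len(speech), len(speech[0])
    means = [sum(f[d] for f in speech) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in speech) / n
        stds.append(math.sqrt(var) if var > 0.0 else 1.0)  # guard constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in speech]


if __name__ == "__main__":
    feats = [[1.0, 10.0], [0.0, 0.0], [3.0, 14.0], [5.0, 18.0]]
    mask = [True, False, True, True]   # second frame treated as non-speech
    print(cmvn(feats, mask))
```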

In text-independent speaker recognition there is no prior knowledge of what the speaker will say. Over the past dozen years, Gaussian mixture models (GMM) have remained the most successful method of estimating text-independent speaker models. The GMM is based on the assumption that the probability density function of features defined in a multidimensional space can be estimated with a mixture of component densities. Each component of the mixture represents some high-level phonetic sound [12]. A Gaussian mixture model is a weighted linear combination of R Gaussian densities $p_i(y)$:

$$p(y \mid M) = \sum_{i=1}^{R} \omega_i\, p_i(y) \tag{7}$$

where R is the model order and $\omega_i$ are the mixture weights.

The Gaussian PDF $p_i(y)$ for D-dimensional spectral features is given by:

$$p_i(y) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (y - \mu_i)'\, \Sigma_i^{-1}\, (y - \mu_i) \right\} \tag{8}$$

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix, for i = 1…R. Because MFCC feature vectors are assumed to be independent, the log-likelihood of a sequence of MFCC vectors $Y = \{y_1, y_2, \ldots, y_T\}$ given model M is defined as:

$$\log p(Y \mid M) = \sum_{t=1}^{T} \log p(y_t \mid M) \tag{9}$$

where model M is described by the parameters $\{\mu_i, \Sigma_i, \omega_i\}$, i = 1, …, R.
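Equations (7) to (9), together with the ratio in equation (6), can be sketched in a few lines. The sketch below assumes diagonal covariance matrices (the paper does not state the covariance type) and evaluates everything in the log domain for numerical stability; all function names and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import math


def log_gaussian_diag(y, mean, var):
    """Log of equation (8) for a diagonal covariance matrix (assumption)."""
    d = len(y)
    log_det = sum(math.log(v) for v in var)
    maha = sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def log_gmm(y, weights, means, variances):
    """Log of equation (7), evaluated with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(y, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_likelihood(frames, weights, means, variances):
    """Equation (9): sum of the per-frame log-likelihoods."""
    return sum(log_gmm(y, weights, means, variances) for y in frames)


def llr_score(frames, target, ubm):
    """Logarithm of the ratio in equation (6).

    `target` and `ubm` are (weights, means, variances) tuples.
    """
    return log_likelihood(frames, *target) - log_likelihood(frames, *ubm)


if __name__ == "__main__":
    target = ([1.0], [[0.0]], [[1.0]])
    ubm = ([0.5, 0.5], [[-3.0], [3.0]], [[1.0], [1.0]])
    print(llr_score([[0.1], [-0.2], [0.05]], target, ubm))
```

Working with the logarithm of the score turns the ratio into a difference, so the decision threshold applies to `llr_score` as the logarithm of δ.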

To estimate the Universal Background Model ($\neg M_{UBM}$), the Expectation Maximization (EM) algorithm described in [12], [13] is used, with model order 128. The EM method is an iterative algorithm for approximating the Maximum Likelihood estimate in the case of incomplete data. To estimate the target speaker model (M), the Maximum a Posteriori (MAP) algorithm is used, with relevance factor r equal to 14 [12][13]. The MAP procedure adapts the parameters of the Universal Background Model using the speaker's training speech. To build the probabilistic speaker models and background models, the BECARS C++ library was used [22].
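The MAP step can be illustrated with the mean-adaptation rule of [12]: for each component, the soft count $n_i$ and the posterior-weighted mean $E_i(y)$ of the training frames are combined with the UBM mean using the coefficient $\alpha_i = n_i / (n_i + r)$. The sketch below is not the BECARS implementation; it assumes diagonal covariances and adapts the means only, and the function names are hypothetical.

```python
import math


def _log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumption)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(y, mean, var))


def posteriors(y, weights, means, variances):
    """Pr(i | y): responsibility of each UBM component for frame y."""
    logs = [math.log(w) + _log_gauss_diag(y, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def map_adapt_means(frames, weights, means, variances, r=14.0):
    """Return speaker-adapted means; weights and variances stay those of the UBM."""
    n_comp, dim = len(means), len(means[0])
    n = [0.0] * n_comp                              # soft counts n_i
    first = [[0.0] * dim for _ in range(n_comp)]    # sum over t of Pr(i|y_t) * y_t
    for y in frames:
        for i, g in enumerate(posteriors(y, weights, means, variances)):
            n[i] += g
            for d in range(dim):
                first[i][d] += g * y[d]
    adapted = []
    for i in range(n_comp):
        alpha = n[i] / (n[i] + r)                   # data-dependent adaptation coefficient
        e = [s / n[i] for s in first[i]] if n[i] > 0.0 else means[i]
        adapted.append([alpha * e[d] + (1.0 - alpha) * means[i][d]
                        for d in range(dim)])
    return adapted


if __name__ == "__main__":
    ubm = ([0.5, 0.5], [[-5.0], [5.0]], [[1.0], [1.0]])
    print(map_adapt_means([[6.0]] * 14, *ubm))
```

Components that see little training data keep a small $\alpha_i$ and therefore stay close to the UBM, which is what makes the adapted model usable with only about 30 seconds of speech per speaker.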

IV. SPEAKER VERIFICATION OVER IP NETWORKS EXPERIMENTS

A. Database

The verification experiments were conducted on a database containing recordings of 38 Polish-language speakers, made with a high-quality condenser microphone in PCM format with a 44.1 kHz sampling frequency and 16-bit resolution. Both the training and the test utterances consisted of approximately 30-second-long, phonetically rich sentences.

B. Speaker verification performance degradation due to packet loss and H.323 audio coders

The speaker database was transcoded with the coders described in section II.B. The simulation of packet loss was performed according to the Gilbert model described in section II.A. The packet loss phenomenon was analyzed at varying degrees of packet loss, from 0 up to 25%. The results of the tests are presented in Figs 3-6 in the form of Detection Error Tradeoff curves [18].

Fig 3 Effect of voice transmission over IP on text-independent speaker verification performance for varying degrees of packet loss, from 0 up to 25%. The G.711 coder was used.

Fig 4 Effect of voice transmission over IP on text-independent speaker verification performance for varying degrees of packet loss, from 0 up to 25%. The G.723.1 6.3 kbps coder was used.

Fig 5 Effect of voice transmission over IP on text-independent speaker verification performance for varying degrees of packet loss, from 0 up to 25%. The G.729 11.8 kbps coder was used.

Fig 6 Effect of voice transmission over IP on text-independent speaker verification performance for varying degrees of packet loss, from 0 up to 25%. The G.729 6.4 kbps coder was used.

In all plots, Pz represents the probability of packet loss, while EER denotes the Equal Error Rate, which is marked with black dots.
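The EER is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from two lists of trial scores is given below; the function name `equal_error_rate` and the threshold sweep over observed scores are assumptions, and this is not the tool used to produce the figures.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a target trial whose score does not exceed the threshold.
        frr = sum(s <= thr for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial whose score exceeds the threshold.
        far = sum(s > thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    targets = [2.0, 3.0, 4.0, 5.0]
    impostors = [0.0, 1.0, 2.5, 3.5]
    print(equal_error_rate(targets, impostors))
```

With a finite number of trials the two error rates rarely cross exactly, so the sketch returns the mean of the two rates at the threshold where they are closest.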

V. SUMMARY

Speaker verification performance is summarized in Fig 7. Notably, the results do not depend on the applied speech compression method: there are no obvious differences between coders with a high bit rate and coders with a low bit rate.

Additionally, the obtained results contradict those presented in [1], where bit rate was an important parameter. The source of this discrepancy probably lies in the methods used for coding the target and test utterances. Bit rate becomes important if the test and target utterances are coded with different methods [1].

Fig 7 Summary of the experiment. Equal Error Rate as a function of the packet loss probability for various methods of speech compression.

The main factor in speaker verification performance degradation is packet loss. The dependence between packet loss probability and Equal Error Rate is almost linear regardless of the applied codec, as shown in Fig 7. On the other hand, the highest degree of packet loss that still ensures acceptable voice quality must not exceed 1%. At this rate, the EER increased by no more than 1% compared to the case with no packet loss.

References

[1] L. Besacier, A. M. Ariyaeeinia, J. S. Mason, J. F. Bonastre, P. Mayorga, C. Fredouille, S. Meignier, J. Siau, N. W. D. Evans, R. Auckenthaler and R. Stapert, “Voice biometrics over the internet in the framework of COST action 275”, EURASIP Journal on Applied Signal Processing 2004:4, pp. 466-479, Hindawi Publishing Corporation.

[2] ITU-T H.323, Series H: Audiovisual and multimedia systems. Infrastructure of audiovisual services – Systems and terminal equipment for audiovisual services. Packet-based multimedia communications systems. Recommendation ITU-T H.323, International Telecommunication Union, 12/2009.

[3] A. Jajszczyk, “Wstęp do telekomunikacji”, Podręczniki akademickie, WNT, 2009.

[4] Fierce Enterprise Communications (2008, March), “Europe VoIP to reach 88 million”. Available: http://www.fierceenterprisecommunications.com.

[5] E. N. Gilbert, “Capacity of a burst-noise channel”, The Bell System Technical Journal, September 1960.

[6] D. A. Reynolds, M. A. Zissman, T. F. Quatieri, G. C. O’Leary, B. A. Carlson, “The effect of telephone transmission degradation on speaker recognition performance”, Acoustics, Speech, and Signal Processing, 1995, ICASSP-95.

[7] L. Besacier, S. Grassi, A. Dufaux, M. Ansorge, F. Pellandini, “GSM speech coding and speaker recognition”, ICASSP 2000.

[8] C. Byrne, P. Foulkes, “The mobile phone effect on vowel formants”, International Journal of Speech Language and the Law, vol. 11, no. 1, 2004.

[9] D. A. Reynolds, “The effects of handset variability on speaker recognition performance: experiments on the switchboard corpus”, Acoustics, Speech and Signal Processing, ICASSP-96, Conference Proceedings.

[10] S. B. Davis, P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4), 1980, pp. 357–366.

[11] S. Furui, “Cepstral analysis technique for automatic speaker verification”, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29, 1981, pp. 254-272.

[12] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, “Speaker verification using adapted Gaussian mixture models”, Digital Signal Processing, no. 10, 2000, pp. 19–41.

[13] D. A. Reynolds, R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models”, IEEE Transactions on Speech and Audio Processing, no. 3(1), 1995, pp. 72–83.

[14] P. Rose, “Forensic speaker identification”, Taylor & Francis, 2002, p. 72.

[15] H. Sanneck, “Packet loss recovery and control for voice transmission over the Internet”, Ph.D. thesis, Technische Universität Berlin, 2000, unpublished.

[16] S. Jelassi, G. A. Rubino, “A study of artificial speech quality assessor of VoIP calls subject to limited bursty packet losses”, EURASIP Journal on Image and Video Processing 2011, 2011:9.

[17] S. Mohamed, G. Rubino, M. Varela, “Performance evaluation of real-time speech through a packet network: a random networks-based approach”, Performance Evaluation. An International Journal, 57 (2004), pp. 141-161.

[18] A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, “The DET curve in assessment of detection task performance”, in Proc. Eurospeech ’97, pp. 1895–1898, Rhodes, Greece, 1997.

[19] W. Maciejko, “Biometryczne rozpoznawanie mówców w kryminalistyce”, Problemy Kryminalistyki 275, Warszawa 2012.

[20] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, “The HTK book v3.4”, Cambridge, 2009.

[21] O. Viikki, K. Laurila, “Cepstral domain segmental feature vector normalization for noise robust speech recognition”, Speech Communication 25 (1998), pp. 133-147.

[22] C. Mokbel, H. Mokbel, R. Blouet, G. Aversano, “BECARS library and tools for speaker verification 1.1.0”, April 2005.

[23] I. Magrin-Chagnolleau, G. Gravier, R. Blouet, “Overview of the 2000-2001 ELISA consortium research activities”, ISCA “A Speaker Odyssey”, The Speaker Recognition Workshop, Crete, 2001.
