Faculty of Electrotechnics, Automatics, Computer Science and Biomedical Engineering

Analysis of non-linguistic content of speech signals

mgr inż. Magdalena Igras-Cybulska

Ph.D. Thesis

Supervisor: prof. dr hab. inż. Mariusz Ziółko
Auxiliary Supervisor: dr inż. Bartosz Ziółko

Kraków 2016

© 2016 Magdalena Igras

DECLARATION

I hereby declare that the work in this Thesis is my own original work, except where indicated in the text. The Thesis is based on the following publications, with the consent of co-authors:

— M. Igras and B. Ziółko, "The role of acoustic features in marking accent and delimiting sentence boundaries in spoken Polish," Acta Physica Polonica A, 2014, vol. 126, no. 6, pp. 1246–1257.
— M. Igras and B. Ziółko, "Detection of Sentence Boundaries in Polish Based on Acoustic Cues," Archives of Acoustics, 2016, vol. 41, no. 2, pp. 233–243.
— M. Igras-Cybulska, B. Ziółko, P. Żelasko and M. Witkowski, "Structure of pauses in speech in the context of speaker verification and classification of spontaneous speech type," EURASIP Journal on Audio, Speech, and Music Processing, 2016(1):18, 2016.
— M. Igras and B. Ziółko, "Akustyczne korelaty intonacji ironicznej," In: Sens i brzmienie, ed. Magdalena Danielewiczowa et al., Warszawa: Wydawnictwo Uniwersytetu Kardynała Stefana Wyszyńskiego, 2015, pp. 33–48.
— M. Igras and B. Ziółko, "Rodzaje pauz akustycznych i ich korelacje z interpunkcją w transkrypcjach mówionego języka polskiego," In: Bogactwo współczesnej polszczyzny, eds. Piotr Żmigrodzki, Sylwia Przęczek-Kisielak, Kraków: Towarzystwo Miłośników Języka Polskiego, 2014, pp. 61–69.
— M. Igras and B. Ziółko, "Baza danych nagrań mowy emocjonalnej," Studia Informatica, 2013, vol. 34, no. 2B, pp. 67–77.
— M. Igras and W. Wszołek, "Pomiary parametrów akustycznych mowy emocjonalnej – krok ku modelowaniu wokalnej ekspresji emocji," Pomiary, Automatyka, Kontrola, 2012, vol. 58, no. 4, pp. 335–338.
— M. Igras and B. Ziółko, "Wavelet method for breath detection in audio signals," In: ICME 2013: 2013 IEEE International Conference on Multimedia and Expo, July 15–19, 2013, San Jose, USA, proceedings, pp. 1–6.
— M. Igras, M. Ziółko and J. Gałka, "Wavelet evaluation of speaker emotion," In: Proceedings of the Eighteenth National Conference on Applications of Mathematics in Biology and Medicine, Krynica Morska, 23–27 September 2012, pp. 54–59.
— M. Igras, J. Grzybowska and M. Ziółko, "Classification of emotions in emergency call center conversations," In: ICACII 2015: 17th International Conference on Affective Computing and Intelligent Interaction, May 18–19, 2015, Paris, France, p. 1448.
— M. Ziółko, P. Jaciów and M. Igras, "Combination of Fourier and wavelet transformations for detection of speech emotions," In: 7th International Conference on Human System Interactions (HSI), 16–18 June 2014, Costa da Caparica, pp. 49–54.
— M. Igras and B. Ziółko, "Different types of pauses as a source of information for biometry," In: Models and Analysis of Vocal Emissions for Biomedical Applications: 8th International Workshop, Firenze, Italy, December 16–18, 2013, proceedings, pp. 197–200.
— M. Igras, J. Grzybowska and M. Ziółko, "Emotional profiles of emergency phone callers," In: IACS-2014: the First Conference of the International Association for Cognitive Semiotics, September 25–27, 2014, Lund, book of abstracts, pp. 196–197.
— M. Majdak and M. Igras, "Metafory głosu – analiza akustyczna," Prace Filologiczne, 2015, vol. 66, pp. 179–199.
— M. Majdak, M. Igras and A. Domeracka-Kołodziej, "Looking for natural voice – the effectiveness of the program of Postgraduate Studies of Voice and Speech Training," In: 2014 XXII Annual Pacific Voice Conference, Kraków, Poland, 11–13 April 2014, pp. 1–6.
— K. Barczewska and M. Igras, "Detection of disfluencies in speech signal," Challenges of Modern Technology, 2013, vol. 4, no. 2, pp. 3–10.
— J. Grzybowska, M. Igras and M. Ziółko, "Affects associations analysis in emergency situations," In: ICACII 2015: 17th International Conference on Affective Computing and Intelligent Interaction, May 18–19, 2015, Paris, France, p. 1447.
— M. Witkowski, M. Igras, J. Grzybowska, P. Jaciów, J. Gałka and M. Ziółko, "Caller identification by voice," In: 2014 XXII Annual Pacific Voice Conference (PVC), Kraków, Poland, 11–13 April 2014, proceedings, pp. 1–7.
— J. Gałka, J. Grzybowska, M. Igras, P. Jaciów, K. Wajda, M. Witkowski and M. Ziółko, "System supporting speaker identification in emergency call center," In: INTERSPEECH 2015, September 6–10, 2015, Dresden, Germany, eds. Sebastian Möller et al., International Speech Communication Association, p. 110.
— M. Witkowski, J. Gałka, J. Grzybowska, M. Igras, P. Jaciów and M. Ziółko, "Online caller profiling solution for a call centre," The Speaker and Language Recognition Workshop - Odyssey 2016, Show&Tell session, June 21–24, Bilbao, Spain.

Work presented in this Thesis was supported: in part by the National Science Centre under grant no. DEC-2011/03/D/ST6/00914; in part by the National Centre for Research and Development under decision no. 0072/R/ID1/2013/03; in part by DOCTUS – Lesser Poland Scholarship Fund for PhD Students under contract no. ZS.4112-195/12; and in part by the statutory research of AGH University of Science and Technology under grant no. 15.11.120.208.

September 2016

Abstract

Voice carries a great deal of information: about the semantic content we want to communicate, about our identity, and about the affective, psycho-social or physical attributes of the speaker. A speaker's states and traits significantly affect the voice itself as well as the manner of speaking, syntax and semantic content. From the technical point of view, all this information is mixed in a one-dimensional signal; only proper parameterization and statistical analysis methods make it possible to extract vocal correlates of speaker profile features. This work combines various aspects of the non-linguistic information conveyed in the speech signal: the form and the content of speech that lies beyond the linguistic message. The way we speak was investigated in terms of different functions, starting from paralinguistic aspects such as accents or sentence boundaries and ending with non-linguistic information such as speaker emotions or attitude. The crucial aim of the research was to automate the detection of different aspects of the speaker profile with machine learning methods.

Research on paralinguistics has been emerging as a new branch of speech technology for several decades. This interdisciplinary field lies on the borders of computer science, signal processing, linguistics, phonetics, phonology, psychology and sociology, and it also deals with medical and artistic aspects. Therefore, the thesis starts with a theoretical framework for the discipline, including definitions, taxonomies and a review of the state of the art. First, the author looks for acoustic parameters, models and methods that describe voice and speaking style. The temporal structure of speech (including pauses, accents, phonemic variations and prosody) and voice quality are measured and analyzed quantitatively and qualitatively. Then, problems of extraction of high-level information (speaker states and traits) are explored. State-of-the-art methods were adapted and optimized for the purposes of the study, and some new parameterization techniques were introduced, using the Wavelet Transform and the Wavelet-Fourier Transform. It was also shown that selected state-of-the-art speech or speaker recognition algorithms can be applied directly to paralinguistics, given proper training material. Automatic classification algorithms were developed for recognition of emotions and attitude. Diversified speech corpora were used in the research (some of them collected by the author) to provide a broad comparative framework: monologues/dialogs, read/spontaneous speech, recordings of studio, telephone and real-life quality, formal/informal situations, inexperienced/professional speakers. The results of the research are applied in other speech technology systems or can be used as separate tools. The majority of the research has multilingual relevance, as some of the investigated phenomena are culture-independent. The thesis also contributes to Polish phonology and phonetics by bringing new evidence on the properties of phones and the nature of accent in spoken Polish.

Theses

T1. Acoustic descriptors of voice and speech are universal cues of paralinguistic contents of speech as well as speaker states and traits (emotional states and attitudes).

T2. The state-of-the-art algorithms borrowed from speech or speaker recognition techniques can be directly applied also to automatic classification of paralinguistics and improve its performance, given proper training material.
Keywords: speech processing, paralinguistics, emotion detection, emotion recognition, biometrics, speaker recognition, speech signal analysis, voice analysis, speech corpora, psychoacoustic models, prosody, voice timbre, spectral analysis, pattern recognition.

Streszczenie (Summary)

Voice is a carrier of many kinds of information: starting from the content we want to convey, through the speaker's identity, up to information about the affective, psycho-sociological or physical attributes of the speaker. The speaker's state and traits influence both the voice itself and the manner of speaking, the syntax and the semantics of speech. From the technical point of view, all this information is contained in a one-dimensional signal; only appropriate parameterization and statistical analysis methods allow the extraction of vocal correlates of speaker profile features. This work combines various aspects of the analysis of para- and non-linguistic information contained in the speech signal: the forms and contents that go beyond the verbal message. The manner of speaking was examined with respect to such functions as signaling phrase and sentence boundaries and accents, as well as expressing speaker states, that is emotions and attitude. The key aim of the work was to automate the recognition of particular aspects of the speaker profile using machine learning algorithms.

Research on paralinguistics constitutes a branch of speech technology that has been taking shape for several decades. This interdisciplinary field lies at the intersection of computer science, signal processing, linguistics, phonetics, phonology, psychology and sociology, as well as medical and artistic disciplines. The first part of the work describes the theoretical background of the discipline, including definitions, taxonomies and a literature review. In the experimental part, the author searches for acoustic parameters, models and methods describing voice timbre and manner of speaking, including the temporal structure of speech (pauses, accents, segmental features, prosody), using quantitative and qualitative analysis. The subsequent part is devoted to the extraction of high-level information about the speaker's state and traits. For this purpose, algorithms known from other branches of speech technology were adapted and optimized, including wavelet and wavelet-Fourier analysis. Models and algorithms were developed for automatic recognition of the speaker's emotions and attitude. Diverse speech corpora were used in the research (some of them collected and prepared by the author), including monologues/dialogs, read/spontaneous speech, recordings in formal/informal situations, professional/inexperienced speakers, and recordings of various quality: studio, telephone and everyday-life situations. The results of the work are applied within other speech technology systems or can be used as separate tools. The part of the research concerning culture-independent phenomena is also relevant for languages other than Polish. With respect to Polish, the work provides evidence concerning Polish phonemes, relations between acoustic features and punctuation, as well as the nature of accent in spoken Polish.

Theses

T1. Acoustic parameters of voice and speech constitute universal indicators of the paralinguistic content of speech, as well as of speaker traits and states (emotions and attitude).

T2. State-of-the-art algorithms for speech and speaker recognition can be used directly for automatic classification of paralinguistic and non-linguistic features, provided that appropriate training material is available.
Słowa kluczowe (Keywords): speech processing, paralinguistics, emotion detection, emotion recognition, biometrics, speaker recognition, speech signal analysis, voice analysis, speech corpora, psychoacoustic models, prosody, voice timbre, voice quality, spectral analysis, pattern recognition.

Acknowledgements

Firstly, I would like to express my sincere gratitude to my advisors, prof. Mariusz Ziółko and dr Bartosz Ziółko, for their continuous support of my Ph.D. study and related research, and for their patience, motivation and knowledge. I thank dr Magdalena Majdak, mgr Daniela Hekiert and all my lab colleagues, especially mgr inż. Joanna Równicka, mgr inż. Stanisław Kacprzak, mgr inż. Marcin Witkowski, mgr inż. Paweł Jaciów and dr inż. Jakub Gałka, for the stimulating discussions and cooperation in many projects. Last but not least, I would like to thank my Husband Artur, my Parents and my Friends for supporting me spiritually throughout writing this thesis and my life in general.

Acronyms

ANN - Artificial Neural Networks
ANOVA - Analysis of Variance
ANSI - American National Standards Institute
ASR - Automatic Speech Recognition
b_p - Breath Pause
BPNN - Back-Propagation Neural Networks
CART - Classification-Regression Trees
CRF - Conditional Random Field
DFT - Discrete Fourier Transform
DWT - Discrete Wavelet Transform
DWFT - Discrete Wavelet-Fourier Transform
DTW - Dynamic Time Warping
EER - Equal Error Rate
EM - Expectation Maximization
f_p - Filled Pause
F0 - Fundamental frequency of vocal cords
F1, F2, F3, ... - Formant frequencies
FFT - Fast Fourier Transform
FN - False Negatives
FNR - False Negative Ratio
FP - False Positives
FPR - False Positive Ratio
FSM - Finite State Model
GMM - Gaussian Mixture Model
GRBAS - Grade, Roughness, Breathiness, Asthenia, Strain
GUI - Graphical User Interface
HMM - Hidden Markov Model
HNR - Harmonics to Noise Ratio
IPA - International Phonetic Alphabet
IVR - Interactive Voice Response
LDA - Linear Discriminant Analysis
LLR - Log-Likelihood Ratio
LM - Language Model
LPC - Linear Predictive Coding
MAP - Maximum A-Posteriori
ME - Maximum Entropy
MECC - Małopolska Emergency Call Center
MFCC - Mel-Frequency Cepstral Coefficients
MLF - Master Label File
MLP - Multi-Layered Perceptron
NB - Naive Bayes
NHR - Noise to Harmonics Ratio
NLP - Natural Language Processing
PCA - Principal Components Analysis
PCM - Pulse Code Modulation
PSAP - Public Safety Answering Point
QDA - Quadratic Discriminant Analysis
RMS - Root Mean Square
s_p - Silent Pause
SAMPA - Speech Assessment Methods Phonetic Alphabet
SER - Slot Error Rate
SIP - Session Initiation Protocol
SNR - Signal to Noise Ratio
SSP - Social Signal Processing
STFT - Short-Time Fourier Transform
STT - Speech to Text
SVM - Support Vector Machines
TED - Time-Energy Distribution
TTS - Text to Speech
TN - True Negatives
TP - True Positives
UBM - Universal Background Model
VAD - Voice Activity Detector
VAD - Valence-Arousal Diagram
VHI - Voice Handicap Index
VoIP - Voice over Internet Protocol
WER - Word Error Rate
WPT - Wavelet Packet Transform
ZCR - Zero-Crossing Rate

List of Figures

1.1 Number of scientific articles containing keyword 'paralinguistic' [5]
2.1 Layered figure-ground relationship of paralinguistic functions [256]
2.2 Multidimensional space of paralinguistics description based on Schuller and Batliner work [256]
2.3 General phases of typical speech processing path
2.4 Enrollment and verification in automatic emotion recognition system
2.5 An example DTW path between two vectors of different lengths
4.1 Sparse matrix of tags in the MECC corpora, sorted by features from top to bottom. White color indicates presence of the feature [107]
4.2 Negative emotions distribution in the MECC corpora
4.3 Positive emotions distribution in the MECC corpora
4.4 Valence categories of emotions distribution in MECC corpora
5.1 Frequency subbands obtained from perceptual scale wavelet decomposition [88]
5.2 Word "cztery" (Eng. four) as a waveform (a), its wavelet spectrum (b) and STFT spectrum (c)
5.3 Example of DFWT amplitude spectrum
5.4 Speech processing path in the proposed algorithm for emotion recognition using DFWT
5.5 Method of calculating brightness of sound for music sample [142]
5.6 Relative pleasantness as a function of: (a) relative roughness with band-width, (b) relative sharpness [341]
6.1 Different types of filled pauses and proportion of their occurrence
6.2 Types of pauses determining full stops and commas, and types of filled pauses signalizing punctuation [133]
6.3 Different types of pauses determining full stops and commas [133]
6.4 Proportions of filled pauses and breath pauses occurrences correlated with full stops or commas [133]
6.5 a) Correlation matrix for 30 speakers, b) cumulative distribution function and histogram of correlation coefficients for pairs of speakers
6.6 Boxplot for correlation coefficients distribution for each speaker
6.7 Histogram of coefficients γs for all speakers
6.8 Results of ANOVA analysis for 30 speakers: a) breath duration, b) filler /yyy/ duration
6.9 Gaussian models fitted for 9 speakers for a) breaths duration, b) fillers /yyy/ duration
6.10 a) Pauses-related features matrix with quantized values, b) heatmap with dendrograms for speakers and features
6.11 Differences in distribution of each feature between 3 groups of speakers (P - presentations/orations, T - translation, R - radio transmissions)
6.12 Verification process in Pauses-MFCC biometric speaker verification evaluation system
6.13 Performance of baseline system
6.14 Performance of system based on pauses-related features only
6.15 EER of a speaker recognition task in function of coefficient a
6.16 Confusion matrix for automatic recognition of types of spontaneous speech
6.17 Confusion matrix for automatic recognition of read/spontaneous speech
6.18 Percentage of respondents who indicated given types of speech disfluency as the most negatively affecting the reception of speech
6.19 Number of respondents who placed filled pauses on places 1-6 in the ranking of disfluencies most negatively affecting the reception of speech
6.20 Histogram of I duration in the training set
6.21 Pitch variability in segments with /yyy/ filled pause (SAMPA /I/). Left: pitch courses for all filled pauses annotated for W1. Right: comparison of pitch variability for filled pause and normal speech for M2
6.22 Characteristics of an example filled pause: a) waveform, b) formants, c) F0
6.23 Polish vowels located on F1-F2 plane, original figure from [290] and modified version with points corresponding to all disfluencies from training set
6.24 Contribution of breath pauses to determining full stops and commas - comparison between read and spontaneous speech; b_p_ - percent of breaths not correlated with full stops nor commas, b_p. - percent of breath pauses denoting full stops, b_p, - percent of breath pauses denoting commas
6.25 A part of a speech signal containing a breath: a) waveform, b) F0 contour, c) formants, d) energy contour
6.26 Wavelet spectra of a speech signal containing breath
6.27 Wavelet spectra of a breath
6.28 Block diagram illustrating the concept of the algorithm
6.29 An example of DTW path between two vectors representing energy parameters in one decomposition level
6.30 Map of Polish phonemes on a duration-energy two-dimensional µ(df) - µ(Ef) plane for a) Db1, b) Db2
6.31 Graphical presentation of values of speakers coefficients from Tab. 6.21 (7 coefficients in columns, for 45 speakers in rows)
6.32 Preboundary phonemes lengthening (stairs) for two example speakers (PS1 upper and BC1 lower) in comparison to average for all speakers (dotted line). Lower figures show histograms of last phonemes relative durations
6.33 Distribution of the last phonemes' duration ratios (black) compared to all phonemes' duration ratios (white): a) Db1, b) Db2
6.34 Distribution of duration ratios of the last phonemes before a comma (dotted line) and before a full stop (solid line)
6.35 Probability that a phoneme with a given duration ratio is the last one in a phrase
6.36 Comparison of probability that a phoneme with a given duration ratio is the last one in a phrase depending on different speakers
6.37 Distribution of a slope coefficient for linear regression measured for last 10 phonemes of each sentence: a) Db1, b) Db2
6.38 Distribution of all phonemes' energy ratios (white) and last phonemes' energy ratios (black): a) Db1, b) Db2
6.39 Distribution of all phonemes' power ratios (white) and last phonemes' power ratios (black): a) Db1, b) Db2
6.40 Probability that a phoneme with a given energy (left) or power (right) ratio is the last one in the sentence: a) Db1, b) Db2
6.41 Distribution of phonemes (black) and end-of-sentence phonemes (grey) in the duration-energy space
6.42 Example of pitch contour within a sentence, frames 10 ms
6.43 Segmentation of speech signal into phrases separated by pauses
6.44 Division of speech signal into frames of 500 ms
6.45 Distribution of accented vowels' duration ratios (black histogram) against distribution of all vowels (white histogram): a) Db1, b) Db2
6.46 Models of distribution of accented and non-accented /a/ phoneme
6.47 Distribution of energy ratios for accented and non-accented vowels (left) and the probability model describing probability that the vowel is accented, given its energy ratio (right)
6.48 Comparison of energy in 11 frequency sub-bands for two pairs of timbres: glow-mat and bright-dark
6.49 Graphical representation of correlation matrix of voice timbres
6.50 Dendrogram grouping voice timbres on the basis of their acoustic similarity
6.51 Voice timbres located in the plane of two principal components
6.52 Association analysis scheme
6.53 Representation of emotions appearing in MECC corpora in models of emotions known from psychology and medicine, sources: [232, 308, 230, 185]
6.54 Suggested model dedicated to situational context of an emergency call
6.55 Proportion of positive and negative emotions among men and women in MECC corpora
6.56 Proportion of intense negative emotions among men and women in MECC corpora
6.57 Mean values of energy coefficients in 11 sub-bands for female speakers
6.58 Mean values of energy coefficients in 11 sub-bands for male speakers
6.59 Mean values of energy coefficients in 11 sub-bands for all speakers
6.60 Mean values of energy coefficients in 11 sub-bands and their standard deviations in training data set
6.61 Comparison of DWFT subbands for the same speaker (female) in joy and sadness, bands 1-6
6.62 Comparison of DWFT subbands for the same speaker (female) in joy and sadness, bands 7-11
6.63 Example confusion matrices for automatic distinguishing between neutral speech and anger using GMMs; left - female speakers, right - male speakers
6.64 EER of emotion detection system depending on the number of Gaussian mixtures and combination of acoustic features used for modeling
6.65 EER of emotion detection system depending on the analysis window length [s]
6.66 Example confusion matrix from cross-validation of emotion recognition system
6.67 A part of classification-regression tree which distinguishes ironic and neutral intonation
6.68 Comparison of F0 of Polish and Canadian voices
6.69 Comparison of modulation of Polish and Canadian voices
6.70 Comparison of proportion of unvoiced frames in Polish and Canadian voices
6.71 Comparison of time duration of Polish and Canadian conversations
7.1 Speaker profiling system GUI: A - communication panel, B - results panel (1 - acoustic background, 2 - language in the background, 3 - identity, 4 - speech rate, 5 - language, 6 - emotional state, 7 - gender and age, 8 - physical features, 9 - voice characteristics, 10 - degree of intoxication)
7.2 Architecture of the speaker profiling system
7.3 Enrollment and recognition in the system, presented on the example of speaker ID recognition
7.4 Example of emotion tracking in the system

List of Tables

2.1 Review of recent example works on paralinguistic categories
2.2 Interspeech Computational Paralinguistic Challenges from 2010 to 2016
3.1 Comparison of works on sentence boundary detection
3.2 Comparison of the most common types of emotional speech corpora
4.1 Lexical content of the recordings in the acted emotional speech corpus
4.2 Results of perceptual tests of AGH acted emotional speech corpus with respect to test type, sex of speakers and listeners
4.3 Confusion matrix of perceptual tests of AGH acted emotional speech corpus
4.4 Semantic differential scale adopted for investigation of voice timbre
4.5 Example time annotation in an MLF file
4.6 Summary of the content of speech corpora used in the Thesis
4.7 Summary of the technical parameters of speech corpora used in the Thesis
6.1 Explanation of abbreviations adopted for description of pauses
6.2 Frequency of silent, breath and filled pauses per minute: mean (standard deviation)
6.3 Frequency of punctuation in transcripts: mean (standard deviation)
6.4 Percent of pause events denoting full stops and commas; n_p. - no pause, s_p. - full stops signalized by silent pauses, f_p. - by filled pauses and b_p. - by breath pauses. Color intensity of cells denotes the frequency of each type of pause for each speaker. P - presentations/orations, T - translation, R - radio interviews [133]
6.5 Comparison of selected features for experienced and inexperienced speakers: average values and standard deviation (in brackets)
6.6 Results of automatic recognition of types of spontaneous speech
6.7 Comparison of classifiers applied to recognition of speech type
6.8 List of the disfluencies used to make a ranking of those most disturbing when listening to long speeches
6.9 Statistics of the types of disfluencies for the whole database (3 male, 3 female)
6.10 Calculated parameters of disfluencies I for training recordings (3 male, 3 female). IQR is interquartile range
6.11 Summary of formants F1 and F2 values [Hz] for filled pauses with prolongated I for each speaker
6.12 Evaluation of filled pauses detection algorithm
6.13 Number of counts of breath pauses in 1 hour of spontaneous speech (P - presentations, T - translations, C - whole corpus; b_p. - number of breath pauses denoting full stops, b_p, - number of breath pauses denoting commas, b_p_ - number of breaths not correlated with full stops nor commas, #b_p/min - mean number of breaths per minute)
6.14 Statistical parameters of breaths length in audio
6.15 Statistical parameters of breaths energy in audio
6.16 Results of automatic breath detection: tp - true positives, fp - false positives, fn - false negatives, R - recall [%], P - precision [%], F - F-measure [%]
6.17 Results of evaluation of false positives by human perception, % - percent of false positives classified as breaths by human perception
6.18 Frequency of appearance of phonemes' realizations in databases Db1 (isolated sentences) and Db2 (continuous speech). Notation of phonemes according to CORPORA
6.19 Distribution of duration, energy and power parameters for 3 example phonemes with lognormal probability density function added
6.20 Expected values of duration, energy and power for phoneme classes
6.21 Speaker-specific coefficients; in column 1: speaker acronym, 2: d, 3: E, 4: P, 5: d standard deviation, 6: E standard deviation, 7: P standard deviation, 8: γs (6.16). All the values are standardised
6.22 Percentage of phonemes much longer than the average for a particular class according to its localization in a sentence
6.23 Example of phoneme durations within a sentence (blue squares) with a dynamic average of the ratio (red line with standard deviations as column markers)
6.24 Coefficients of linear regression for energy and power parameters
6.25 Parameters describing pitch cadence in endings of informative sentences
6.26 Comparison of classifiers using resubstitution error and cross-validation error
6.27 Comparison of sentence boundary detection performance on the testing set
6.28 Local increase of F0 on accented vowels
6.29 Comparison of selected properties of glow-mat and bright-dark antonyms: spectrograms with F0 and formants, values of F0 and F1-F4 [Hz], jitter and pulses standard deviation
6.30 Coordinate values of the voice timbres in the space of three PCA components
6.31 Categories and levels of tagging emotional content in MECC corpus
6.32 Results of automatic classification of recordings using DWT approach
6.33 Inner products of characteristic vectors with average characteristic vectors appropriate for each speaker sex in the experiment of emotion detection using DWFT
6.34 Position of proper detection of emotional state by the DWFT-based algorithm
6.35 Confusion matrix of the results of automatic classification of emotions in recordings
6.36 Perceptual assessment of the emotional recordings in the original form
6.37 Perceptual assessment of the emotional recordings in the form of hum - without semantic content
6.38 Binary assessment of the quality of realization of 2 neutral and ironic sentences spoken by 27 speakers; 0 - rejected, 1 - accepted
6.39 Distribution of acoustic features which are significantly different between neutral intonation (left) and irony (right)
6.40 Example of intonation contours of ironic and neutral statement; line - neutral, bold line - ironic
6.41 Differences between ironic and neutral state, measured within a sentence and a phrase (* denotes p>0.05)
6.42 Results of automatic classification of irony in voice
6.43 Multi-lingual comparison of manifestation of irony in speech prosody
6.44 Comparison of emotional states proportions detected in Polish and Canadian voices
7.1 Structure of the database collected for evaluation of emotional recognition module
7.2 Results of the emotion recognition module evaluation
7.3 Change of the selected parameters within sessions - values of parameters and their linear regression (blue)
7.4 Trends in changes observed for all the speakers within the entire course of sessions: in green - the decreasing trend, in red - the increasing trend
7.5 Decrease of the selected parameters for the voice training participants
7.6 VHI questionnaire with mean scores of participants before and after 5-months training
7.7 Results of VHI self-evaluation (scores) of participants
7.8 Results of GRBAS evaluation of all training participants by the phoniatrician

Contents

1. Introduction
  1.1. Motivation
  1.2. Thesis and aims
2. Para- and non-linguistic information in speech
  2.1. Background and definitions
  2.2. Taxonomy
  2.3. Applications
  2.4. Computational analysis of speech signal
    2.4.1. Parameterization methods
    2.4.2. Statistical methods
    2.4.3. Classification methods
    2.4.4. Evaluation methods
3. Literature review
  3.1. Paralinguistic descriptors of voice and speech
    3.1.1. Pauses
    3.1.2. Segmental structure of speech
    3.1.3. Sentence boundaries
    3.1.4. Accents
    3.1.5. Voice quality and timbre
  3.2. Non-linguistics: speaker state and traits
    3.2.1. Emotional state
    3.2.2. Attitude: irony
    3.2.3. Attitude: politeness
4. Databases
  4.1. Acted emotional speech
  4.2. Emergency calls database
  4.3. Media database
  4.4. Voice quality database
  4.5. Voice timbre database
  4.6. Corpus for research on pauses
  4.7. Other corpora used in experiments
  4.8. Summary of databases used in the research
5. New methods of parameterization and improvements of existing methods
  5.1. Wavelet analysis
  5.2. Wavelet-Fourier analysis
  5.3. Timbre
6. Models, algorithms and their evaluation
  6.1. Paralinguistic descriptors of voice and speech
    6.1.1. Pauses
    6.1.2. Segmental structure of speech
    6.1.3. Sentence boundaries
    6.1.4. Accents
    6.1.5. Voice timbre
  6.2. Non-linguistics: speaker state and traits
    6.2.1. Emotional state
    6.2.2. Attitude: irony
    6.2.3. Attitude: politeness
7. Applications
  7.1. Caller profile assessment for emergency call center
  7.2. Voice emission training
8. Summary and conclusion
9. Author's achievements
Bibliography

1. Introduction

The development of human-computer interfaces tends toward less absorbing, more intuitive methods of interaction. Mice and keyboards are being supplemented with microphones, cameras or touchpads, which draw more on natural human ways of communication. Speech is the most natural, intuitive and fast modality of communication; therefore a speech-based system interface would be more comfortable and satisfactory for users, given sufficient efficiency of speech recognition. Along with the dynamically increasing quality, miniaturization and speed of computing devices, as well as their popularity in the information society, there is an observable trend of growing importance of voice interfaces.

Speech technology is an interdisciplinary field, incorporating signal processing, computational linguistics, phonetics and phonology as well as computer science. Regarding commercial applications, the mainstreams of speech technology are focused on automatic speech recognition (ASR), automatic speaker recognition, speech synthesis and translation.

Speech recognition systems convert spoken language into text (STT, Speech to Text systems). The systems vary in the range of recognized speech: from single words, e.g. control commands or voice dialing, to complex systems recognizing natural, spontaneous speech (dictation, transcription of speech, multimedia information retrieval). A speech recognition engine usually includes both acoustic modeling and language modeling.

Speaker recognition is the process of analyzing a person's identity based on voice characteristics. The main tasks of a speaker recognition system are verification and identification. The aim of identification is to choose one of many speakers based on a speech signal, whereas verification determines whether the claimed speaker identity is correct. Depending on the particular usage, these systems may be divided into text-dependent and text-independent. A text-dependent system assumes that the recognition process is based on a specific fixed phrase, i.e. each analyzed recording contains the same sentence. In the text-independent scenario, speakers may be identified or verified from an arbitrary utterance. The enrollment process creates a compact speaker model (voiceprint) that can be used to discriminate one speaker from another or to identify or verify the speaker.

Speech synthesis systems produce speech artificially, usually on the basis of text (TTS, Text to Speech systems). Synthesis can be performed by a computer or by dedicated hardware. Among computer speech synthesizers, the most popular techniques are concatenation of segments of recorded speech stored in a database and formant synthesis, where artificial speech is created from acoustic models. The main challenge of the technology is to achieve the best possible level of naturalness and intelligibility of the synthesized speech.

The systems mentioned above are focused on the semantic content ("what is said?") or on speaker identity ("who is speaking?"). At the same time, the content of speech that lies beyond the linguistic message, classified as paralinguistic or non-linguistic information, is an integral part of the speech signal. It covers both the form of speech ("how do we talk?": intonation, voice modulation, prosody) and any other cues expressing speaker states or traits. Voice timbre and manner of speaking carry information which is complementary to the semantic content of speech but which is lost in standard speech-to-text transcription. Recognition of this non-verbal information not only enriches the meaning of the spoken words, but also provides significant information about the speaker profile.
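
As a minimal illustration of the enrollment and verification steps described above, the sketch below treats an averaged MFCC vector as a "voiceprint" and accepts or rejects a test recording by cosine similarity. The file names, the librosa-based feature extraction and the decision threshold are assumptions made only for this example; practical systems use richer speaker models such as GMM-UBM or neural embeddings.

```python
# Minimal sketch of text-independent speaker verification (illustration only).
# Assumes: librosa and numpy are installed; 'enroll.wav' and 'test.wav' exist.
import numpy as np
import librosa

def voiceprint(path, sr=16000, n_mfcc=20):
    """Enrollment: average MFCC frames into a single 'voiceprint' vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)

def verify(enrolled, test_vec, threshold=0.9):
    """Verification: accept the claimed identity if cosine similarity exceeds the threshold."""
    score = np.dot(enrolled, test_vec) / (
        np.linalg.norm(enrolled) * np.linalg.norm(test_vec) + 1e-12)
    return score, score >= threshold

enrolled = voiceprint("enroll.wav")                  # claimed speaker's enrollment recording
score, accepted = verify(enrolled, voiceprint("test.wav"))
print(f"similarity={score:.3f}, accepted={accepted}")
```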

1.1. Motivation

From the speech technology point of view, research on paralinguistics is relevant for speech and speaker recognition systems. Their efficiency is decreased by the variability of the way of speaking caused by emotions and other speaker states or attributes. Usually such systems are trained with emotionally neutral recordings of healthy speakers aged 20-40. Therefore, research on paralinguistics will directly contribute not only to a better understanding of the complexity of the speech signal, but also to the development of normalization techniques that could compensate for these differences and make the systems more robust.

While speech and speaker recognition systems will benefit from the reduction of inter- and intra-personal variability of the speech signal, extraction of the non-linguistic information itself can also enrich their functionality. For example, improving ASR with automatic insertion of punctuation marks makes a voice interface more comfortable (there is no need to dictate punctuation marks), increases the legibility of transcripts and allows them to be adapted directly for natural language processing systems. Recently, there has been a trend in speaker recognition to explore high-level prosodic features that could increase system efficiency, especially under high levels of environmental noise. Prosodic models will also enable more naturally sounding voices in speech synthesis systems. The Digital Signal Processing Group (Department of Electronics, AGH University; more information at http://dsp.agh.edu.pl) develops a speech recognition system dedicated to Polish and a speaker recognition system. One of the motivations of the research presented in this Thesis was to elaborate methods and models that could contribute to the development of those systems.

Beyond the mainstreams of speech technology, there is much to investigate concerning paralinguistics for other disciplines. The scope of potential applications of voice analysis in human-computer interaction, cognitive science, voice emission, medicine and forensic science is outlined below.

For voice-based human-computer interfaces, automatic detection of user emotions and attitudes can directly help in making systems more adaptable, increasing user satisfaction and, at the same time, improving application effectiveness. Likewise, Interactive Voice Response (IVR) systems can be supported with emotion detection. The development of emotional robots and virtual reality systems will also benefit greatly from judging users' attitudes and adapting according to this feedback. It is assumed that future computing will be human-centered [222] and that it should adapt automatically to users' behavioral patterns, taking into account their affect and individual habits. Vocal emotion recognition has become an important part of affective computing technology [226].

An integral part of social relationships is to judge and to be judged by impressions, even unconsciously. Vocal cues have a large impact on the attitude towards a person. When listening to a person we have not yet met, we attribute to them a set of psychosocial features, and our attitude to the speaker forms spontaneously. The factors that decide the attractiveness of a speaker's voice are also an issue that raises interest in voice perception analysis. Observations and conclusions drawn from tests of speaker image evaluation may deepen the knowledge of the factors that create a person's impression and influence others' attitudes towards them, determining the character of interpersonal relations (a contribution to cognitive science). As a result, speech technology systems could be enriched with modules supporting the assessment of a speaker's psychosocial image.

Recently in Poland, there has been an observable trend of increasing popularity of voice emission techniques, not only among professions based on working with the voice (teachers, speakers, actors, politicians, journalists, call center workers). Other people also notice that improving their speaking abilities is an element of personal growth and soft skills development. The role of voice and its perception as a basic medium of everyday social interactions and a factor in image building has become more appreciated. The current market offer concerning voice training starts from a broad range of handbooks and guidebooks, but also covers postgraduate studies (e.g. SWPS University: "Voice and speech training", "Voice as a work tool"), numerous courses and workshops, and even on-line education [4] and coaching by e-mail [7] or Skype [8]. In line with the development of this market, there is also a growing demand for applications with automatic voice analysis.

The increasing number of job-related diseases of the voice organ is another argument for raising awareness of proper voice use. According to an Occupational Health Institute report, chronic voice organ diseases rank third in Poland and account for 10.6% of all occupational diseases [3]; in the education sector they constitute 90% of cases. As the report states, there is a need for unification of diagnosis and preventive care tools. Objective voice analysis tools could fit that need well. Another important problem is the surprisingly high percentage of citizens with speech impediments. According to statistics in the USA, they affect almost 9% of children [2]. In Poland, this phenomenon shows a critical tendency to broaden: speech impediments affect about 40-60% of 7-year-old children, most of them in the group of peripheral dyslalia [139]. Addressing this problem could be supported by educational applications with automatic voice analysis tools.

Voice research can also be applied in the medical field: as a non-invasive method, it might be used as a bio-marker of the onset of a disease or in monitoring patients' therapy. Psychiatric disorders (autism, bipolar disorder, depression) as well as cardiovascular or endocrine disorders affect the way patients talk. Stroke and neurodegenerative diseases often present with speech disturbances as key symptoms. Earlier diagnosis may improve treatment and rehabilitation of these civilization diseases, which create a quickly growing problem for society.

In forensic psychology, detection of emotion and other features from voice is crucial in phonoscopic analysis, especially in cases where no other evidence is available (e.g. phone calls). Since 2013, AGH University has been working on the project "The development of a system allowing the identification of a voice of people calling an emergency number" as a part of a homeland security program [316]. To get help in emergency situations, we usually call a public safety answering point (PSAP) in the first place. This essential service has to be very fast and as reliable as possible given current technology capabilities. A huge effort has already been made to improve the quality of public safety with speech and language technology (see [6, 68, 199, 299]). The experience and abilities of operators are crucial during emergency notification. This mentally exhausting work leads to loss of concentration. During a single call, emergency responders have to process a lot of information quickly, including creating a description of a notification. Therefore, any system that effectively supports such a process reduces call time and increases the efficiency of operators. Consequently, they may concentrate on the conversation and provide efficient help to citizens in emergency situations. For example, multiple calls from one person during the handling of an emergency situation can be automatically recognized and the data form can be initially filled in. What is more, the system can help in early detection of speakers who have frequently abused PSAP services.

The research undertaken in this thesis tries to answer the need for a better understanding of those aspects of speech signal complexity that could help in paralinguistic and non-linguistic analysis.
Demand for majority of the listed above applications came directly from the specialists who noticed and recognized the potential of using voice analysis in their fields (physicians, voice emission trainers, phoniatricians, public safety organizations). Part of the presented research was conducted by author in collaboration with the specialists. During the past two decades, an interest in paralinguistics has significantly grown. The tendency is reflected in the number of scientific papers (Fig. 1.1), as well as numerous conferences dedicated to or including in their scope sessions on paralinguistics. As a key example, it has become an annual tradition that Interspeech conference hosts a computational paralinguistics challenge (each year on different themes) since 2010. The authors of the challenges noticed and highlighted that the field of paralinguistics is currently emerging from loosely connected research in speech analysis: state-of-the-art computational approaches and basic - phonetic or linguistic - research [261]. Consequently, there is a gap between those approaches. Schuler et al. [257] have described the need of combining them together as a future need of this field:. 19.

Figure 1.1: Number of scientific articles containing the keyword 'paralinguistic' [5].

"The basics of paralinguistics, rooted in phonetics, linguistics, and all the other scholarships, are not yet tied to the methodologies of automatic speech processing. (...) Results or interpretations from basic research are simply taken as suggestion for putting into operation the machinery of data mining and pattern recognition. Most likely, for the next time to come, the two approaches will simply run in parallel, without too much interdependence. It is evident, though, that optimal performance can be reached only with a deeper understanding and harnessing of the underlying processes."

This thesis tries to contribute to answering that need by covering both basic research (phonetic, linguistic) and computational machine learning of selected aspects of paralinguistics and, where possible, finding connections between those approaches.

1.2. Thesis and aims

In this work the following theses were formulated:
T1. Acoustic descriptors of voice and speech are universal cues of paralinguistic contents of speech as well as speaker states and traits (emotional states and attitudes).
T2. The state-of-the-art algorithms borrowed from speech or speaker recognition techniques can be directly applied also to automatic classification of paralinguistics and improve its performance, given proper training material.

The scope of investigated categories of speech signal content covers:
— speech descriptors: temporal structure of speech, phonemic variation, prosody, metaphorical descriptors of voice timbre, voice quality;
— features of phonological or structural functions: pauses, accents and sentence boundaries;
— speaker states and traits: emotions, attitude (irony), psycho-social features.

In order to prove the theses, the following objectives were identified:
1. Collection and review of state-of-the-art information;
2. Preparing a categorization of paralinguistics;

3. Design and/or collection of speech corpora;
4. Determination of acoustic descriptors of para- and non-linguistic features;
5. Applying machine learning methods for automatic recognition of para- and non-linguistics;
6. Evaluation of their performance with respect to particular para- and non-linguistic properties.

Investigation of different aspects of paralinguistics (objective 5) consists of choosing or preparing a model of classes, training and testing datasets, and an evaluation measure, and conducting a series of validation tests to optimize recognition efficiency.

This work has been divided into eight chapters. At the beginning of Chapter 2, the main issues of paralinguistics are introduced: the background, the terminology, the discrepancies that concern the definitions and, finally, different taxonomies. Then, the fields of application are summarized. The last part of Chapter 2 introduces the fundamentals of acoustic parameter measurement and provides a general description of analysis, classification and evaluation methods. Chapter 3 reviews the state-of-the-art research on particular categories of paralinguistics in terms of databases, the methods used by researchers and the results they obtained. Chapter 4 first introduces the databases prepared by the author, and then the other corpora used in the experiments. Chapter 5 presents the new methods applied to the analysis of paralinguistics in this thesis. Chapter 6 is the most important part of this thesis: the experiments and their results are presented and discussed. The structure of this chapter mirrors the review in Chapter 3: for each sub-section's theme in Chapter 6, the review of the state of the art is provided in the corresponding sub-section of Chapter 3. Chapter 7 indicates the major practical applications which used the results of the research. The last chapter summarizes the major achievements presented in the Thesis and indicates further possible directions of development in the field.

2. Para- and non-linguistic information in speech

The place of paralinguistics within the field of speech technology has not been strictly established yet; there are some inconsistencies in terminology as well as in researchers' views. This chapter briefly covers the fundamentals of paralinguistics. Then, the taxonomies are presented and example application fields are reviewed. The last section of this chapter reviews the technical apparatus of speech signal processing.

2.1. Background and definitions

Information conveyed in speech is typically divided into linguistic information (sharing the language code and the corresponding meaning, like thoughts and opinions, expressed in verbal communication) and a second type of information lying beyond linguistics - expressing the speaker's affective states and attitudes or revealing their personal characteristics - classified as non-linguistic, paralinguistic or extralinguistic. Before taking a closer look at these terms, some general definitions will be recalled. The dualism and the defining ex negativo are observable also in a broader perspective:
— Communication is the purposeful activity of information exchange between two or more participants in order to convey or receive the intended meanings through a shared system of signs and semiotic rules. Meta-communication is a secondary communication (including indirect cues) about how a piece of information is meant to be interpreted.
— While verbal communication is based on symbols (sometimes known as lexemes) and the grammars (rules) by which the symbols are manipulated, non-verbal communication describes the process of conveying meaning in the form of non-word messages.
— While a language is defined as the method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way, paralanguage is a component of meta-communication that may modify or nuance meaning, or convey emotion.

The authors of the Interspeech computational paralinguistic challenges (Tab. 2.2) defined the field as: "the discipline dealing with those phenomena that are modulated onto or embedded into the verbal message, be this in acoustics (vocal, non-verbal phenomena) or in linguistics (connotations of single units or of bunches of units)" [261]. Karpiński describes paralinguistics as a component of language situated at its boundaries. The study suggests that "there should be more reliance on continua than on binary categorizations of features, that multi-functionality and multimodality should be fully acknowledged and that clear distinctions should be made among the levels of description, and between the properties of speakers and the speech signal itself." [147]. Among the main future trends of development in this field, the following were pointed out in [261]: more tasks and coupling of tasks; more continuous modeling; more synthesized, agglomerated, and cross data; more and novel features; more coupling of linguistics and non-linguistics; more optimization, standardization and robustness; more realism; and more cross-cultural and cross-lingual evaluation. The terms non-linguistic, paralinguistic and extralinguistic are often used interchangeably or as synonyms. The term paralinguistic currently seems to be the more common one.

Table 2.1: Review of recent example works on paralinguistic categories

Category | Feature | Works
affect | emotions | Koolagudi and Rao, 2012 [162]
affect | attitude | Madzlan et al., 2015 [191]
affect | mood | Bachmann et al., 2016 [14]
physical | age | Grzybowska and Kacprzak, 2016 [108]
physical | weight and height | Pisanski et al., 2016 [228]
pathology | voice pathology | Panek et al., 2015 [221]
pathology | Parkinson's condition | Bandini et al., 2015 [16]
pathology | bipolar disorder | Faurholt-Jepsen et al., 2016 [77]
pathology | autism | Fusaroli et al., 2016 [84]
pathology | schizophrenia | Graux et al., 2015 [99]
psycho-social | personality | Mohammadi and Vinciarelli, 2012 [206], McAleer et al., 2014 [201]
psycho-social | attractiveness | Feredenzi et al., 2013 [80]
psycho-social | likability | Schuller et al., 2012 [260], Hantke et al. [110]
psycho-social | charisma | Signorello et al., 2012 [273], Rhee and Signorello, 2016 [244]
state | sleepiness | Günsel et al., 2013 [109]
state | cognitive or physical load | Schuller et al., 2014 [258], Quatieri et al., 2015 [239]
state | intoxication | Jung et al. [141], Bone et al., 2014 [29]
traits | origin, dialect | Gałka and Jabłoński, 2015 [86]
traits | degree of nativeness | Black et al., 2015 [26], Schuller et al., 2016 [259]

Table 2.2: Interspeech Computational Paralinguistic Challenges from 2010 to 2016

Year | Sub-challenges
2010 | Age, Gender, Affect
2011 | Intoxication, Sleepiness
2012 | Personality, Likability, Pathology
2013 | Social Signals, Conflict, Emotion, Autism
2014 | Cognitive & Physical Load
2015 | Degree of Nativeness, Parkinson's, Eating Condition
2016 | Deception, Sincerity, Native Language

In this thesis, the author suggests another approach, in which paralinguistic refers to the form of speech and language, its prosody and quality (how do we speak?), while non-linguistic refers to the information about the speaker - the speaker's traits and states - conveyed in the speech signal (what does it say about the speaker?). In parallel to this distinction, Section 6.1 of this work deals with paralinguistic phenomena like accents, pausing behavior or voice timbre, while Section 6.2 addresses problems like detection of the speaker's emotion or attitude. Although both categories are embedded into the verbal message, paralinguistics seems to be more low-level, while non-linguistics is a high-level message conveyed by the speech signal.

2.2. Taxonomy

The most common taxonomy divides paralinguistics into three groups [257]:

1. Long-term traits:
— biological trait primitives such as height, weight, age, gender;
— group/ethnicity membership: race/culture/social class, with a weak borderline towards other linguistic concepts, i.e., speech registers such as dialect or nativeness;
— personality traits: likability;
— personality in general;
— a bundle of traits constitutes speaker idiosyncrasies, i.e., speaker-ID; speaker traits can be used to mutually improve classification performance.
2. Medium-term, between traits and states:
— (partly) self-induced, more or less temporary states: sleepiness, intoxication (e.g., alcoholization), health state, mood (e.g. depression);
— structural signals (behavioral, interactional, social): role in dyads, groups, likability, friendship, identity, positive/negative attitude, (non-verbal) social signals, entrainment.
3. Short-term states:
— mode: speaking style and voice quality;
— emotions (full-blown, prototypical);
— emotion-related states or affects: for example, stress, intimacy, interest, confidence, uncertainty, deception, politeness, frustration, sarcasm, pain.

Another taxonomy (Fig. 2.1) is a layered model [256]. Schuller et al. [256] distinguished 12 dimensions (pairs of antonyms) in which every paralinguistic feature can be assessed (Fig. 2.2).

Figure 2.1: Layered figure-ground relationship of paralinguistic functions [256].

2.3. Applications

Beyond the applications pointed out in the Introduction, the outcome of the research on paralinguistics/non-linguistics will contribute to the following fields:

Figure 2.2: Multidimensional space of paralinguistics description, based on the work of Schuller and Batliner [256].

User Interfaces
For the needs of voice human-computer interfaces, automatic detection of user emotions can directly help in making systems more adaptable, increasing user satisfaction and, at the same time, improving application effectiveness. Likewise, Interactive Voice Response (IVR) systems can be supported with emotion detection. The development of emotional robots and virtual reality systems will also benefit greatly from judging the user's attitude and adapting according to this feedback. Regarding the cognitive aspects of human-computer interaction, investigation of vocal cues of emotions can also contribute to creating more naturally sounding emotional voices in speech synthesis systems.

Speech modeling on the segmental and suprasegmental level
For the needs of an automatic speech recognition system, models of each phoneme are created to build a database of patterns [337]. The acoustic features of the same phoneme can differ for many reasons, the most important being the neighborhood of other phonemes, individual speaker features [335], the speaker's emotional state, speech pathology, accentuation and, finally, the position of the phoneme in a sentence. Therefore, understanding the most significant location-dependent changes in phoneme characteristics will be useful for language modeling for ASR. From the speech technology point of view, the research on emotions in speech is relevant for speech and speaker recognition systems: their efficiency is decreased by the variability of the way of speaking caused by emotions, and proper normalization techniques will compensate for these differences and make the systems more robust.

Computational linguistics
Both the lack of punctuation and the occurrence of disfluencies in spontaneous speech transcripts are factors that disturb their processing by Natural Language Processing (NLP) systems, parsers, summarization, topic detection or information extraction systems, mainly because language models usually do not contain disfluencies and operate on full sentences. Automatic segmentation of speech into phrases will adapt it to be processed by language models in large-scale NLP [298].

Descriptive linguistics
Modeling the acoustic correlates of punctuation can bring additional arguments to the discussion on the language-dependent nature of punctuation and to discourse analysis.

Speaker biometry
Correlations between pauses and punctuation, as well as the frequency and types of pauses, vary between individuals and depend on the speaking style of each person, speech quality, culture, experience and preparation for oral presentations. Thereby, temporal features can be used for speaker biometry [335], speaker diarization or evaluation of a speaker's oratorical skills [18, 126].

Speech synthesis
More natural-sounding synthesis can be achieved thanks to proper prediction of the positions and durations of prosodic breaks on the basis of the syntactic structure and to accurate modeling of intonation within prosodic phrases [46, 211].

Biomedical field
Beyond applications of the research in speech technology systems, it can also be used directly in the biomedical field. Among the reasons for the variability of disfluency frequency and duration are the speaker's personality and mental condition - the quantity and duration of silent pauses can be indicators of the speaker's emotional state or a measurable symptom of mental disorders like schizophrenia or bipolar affective disorder [241]. The frequency of filled pauses and breath pauses is a significant marker of speaker stress, emotional arousal, physical effort level or physical fitness. In the medical field, as a non-invasive method, it might be used in monitoring the therapy of mental disorders (autism, bipolar disorder). In forensic psychology, detection of emotion from voice is crucial especially when no other evidence is available (e.g. phone calls).

2.4. Computational analysis of speech signal

The mathematical model of the analog speech signal $s_a(t)$, $s_a \in L^2(0, T)$, is a function which describes variations of sound pressure in time. The model of the discrete, one-dimensional speech signal is the numerical sequence $s_\Delta(n)$, $n = 0, 1, 2, ..., N-1$, which is the effect of discretization (sampling) of the analog signal

$$s_\Delta(n) = s_a(n \Delta t), \qquad (2.1)$$

where $\Delta t$ is the sampling period. The frequency representation of a signal $s(t)$ can be obtained by applying the Fourier transformation:

$$\hat{s}(f) = \int_{-\infty}^{+\infty} s(t)\, e^{-2\pi j f t} \, dt, \qquad (2.2)$$

where $j^2 = -1$. The obtained spectrum gives an overall look into the frequency characteristics of the speech signal. The most specific information is provided by the spectral structure investigated locally, on a small segment centered at time $\tau$. For this purpose, the Short-Time Fourier Transform (STFT) is used:

$$\hat{s}_w(f, \tau) = \int_{-\infty}^{+\infty} s(t)\, w(t - \tau)\, e^{-2\pi j f t} \, dt, \qquad (2.3)$$

where $w(t)$ is the window function, commonly a Hamming or Gaussian window. For the discrete signal $s_\Delta(n)$, the Fourier transform has the form

$$\hat{s}(k) = \sum_{n=0}^{N-1} s_\Delta(n)\, e^{-2\pi j k n / N}, \qquad k = 0, 1, ..., N-1, \qquad (2.4)$$

and its short-time version is

$$\hat{s}_w(k, m) = \sum_{n=0}^{N-1} s_\Delta(n)\, w(n - m)\, e^{-2\pi j k n / N}. \qquad (2.5)$$

In speech processing, the STFT is usually computed for 20 ms frames (with 1/2 overlap). In its basic form, both the temporal and the spectral representation of speech carry redundant information. In order to extract from speech the characteristics which are the subject of investigation, it is necessary to parameterize the signal by describing it with a set of representative parameters - the feature vector. The parameterization process is usually preceded by some basic preprocessing techniques (energy normalization) or by applying voice activity detection (VAD). The feature vectors are usually too large and redundant because of correlated phenomena; therefore, dimensionality reduction methods are applied before elaborating numerical models. The typical path of speech processing is shown in Fig. 2.3. In the process of building an automatic classifier of some feature of speech, the elements of this path are utilized in the order shown in Fig. 2.4. In the enrollment process, models of each class are calculated on the basis of the acoustic parameters of the training voice samples and saved in the database. In the identification process, based on the calculated similarity scores, the system decides which user from the database is the most likely to have generated the given unknown voice sample.

Figure 2.3: General phases of a typical speech processing path.

2.4.1. Parameterization methods

This section essentially reviews some crucial parameters generally known and applied in speech technology, with special focus on research on paralinguistics and non-linguistics. However, there are also some standard feature sets, for example the Geneva minimalistic acoustic parameter set (GeMAPS) [71] extracted with the openSmile toolkit [72], or Praat features [27].

Pitch and fundamental frequency (F0)
Pitch is an auditory sensation in which a listener assigns musical tones to relative positions on a musical scale based primarily on their perception of the frequency of vibrations. Pitch is closely related to frequency, but is not
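To make the frame-based processing of Section 2.4 more concrete, the following minimal Python sketch frames a sampled signal into 20 ms Hamming-windowed segments with 1/2 overlap and computes the magnitude of the discrete STFT of Eq. (2.5) with NumPy. It is an illustration only, not the software used in this thesis; the sampling rate, the synthetic test tone and the function names are assumptions made for the example.

import numpy as np

def frame_signal(s, fs, frame_ms=20, overlap=0.5):
    # Split the sampled sequence s_Delta(n) into overlapping frames
    # (20 ms frames with 1/2 overlap, as described in Section 2.4).
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + max(0, (len(s) - frame_len) // hop)
    return np.stack([s[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_spectrum(s, fs):
    # Discrete STFT of Eq. (2.5): each frame is multiplied by a Hamming
    # window w(n) and transformed with the FFT; the magnitude is returned.
    frames = frame_signal(s, fs)
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

if __name__ == "__main__":
    fs = 16000                          # assumed sampling rate (1 / Delta_t)
    t = np.arange(0, 1.0, 1.0 / fs)     # sampling instants n * Delta_t, cf. Eq. (2.1)
    s = np.sin(2 * np.pi * 150.0 * t)   # synthetic 150 Hz tone standing in for speech
    spectrogram = short_time_spectrum(s, fs)
    print(spectrogram.shape)            # (number of frames, number of frequency bins)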

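The enrollment/identification scheme referred to in Fig. 2.4 can likewise be sketched in a few lines. The snippet below is again only an assumed illustration: it fits one Gaussian mixture model per class on training feature vectors and identifies an unknown sample by the highest average log-likelihood, using scikit-learn; the class labels, the feature dimensionality and the random data are invented for the example, and the actual classifiers and feature sets used in this work are described in the later chapters.

import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(training_sets, n_components=4):
    # Enrollment: fit one Gaussian mixture model per class on the
    # acoustic feature vectors of the training voice samples.
    models = {}
    for label, features in training_sets.items():   # features: (frames, parameters)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        models[label] = gmm.fit(features)
    return models

def identify(models, features):
    # Identification: score the unknown sample against every stored model
    # and return the class with the highest average log-likelihood.
    scores = {label: model.score(features) for label, model in models.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 12-dimensional feature vectors standing in for e.g. GeMAPS
    # parameters of two hypothetical classes.
    training = {"neutral": rng.normal(0.0, 1.0, size=(200, 12)),
                "excited": rng.normal(1.5, 1.0, size=(200, 12))}
    models = enroll(training)
    unknown = rng.normal(1.4, 1.0, size=(50, 12))
    label, scores = identify(models, unknown)
    print(label, scores)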