
AGH University of Science and Technology

Challenges in Speech Recognition Industry: Data Collection, Text Normalization and Punctuation Modelling

mgr inż. Piotr Żelasko

Ph.D. Thesis

Supervisor

dr hab. inż. Bartosz Ziółko


I hereby declare that the work in this thesis is my own original work, except where indicated in the text. The following is a list of the candidate's publications; parts of them were used in the thesis.

Conferences:

• P. Żelasko, P. Szymański, J. Mizgajski, A. Szymczak, Y. Carmiel, N. Dehak, Punctuation Prediction Model for Conversational Speech, accepted for the Interspeech 2018 Conference

• B. Ziółko, P. Żelasko, I. Gawlik, T. Pędzimąż, T. Jadczyk, An Application for Building a Polish Telephone Speech Corpus, Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, 2018

• P. Żelasko, Expanding Abbreviations in a Strongly Inflected Language: Are Morphosyntactic Tags Sufficient?, Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, 2018

• M. Witkowski, S. Kacprzak, P. Żelasko, K. Kowalczyk, J. Gałka, Audio replay attack detection using high-frequency features, Proceedings of Interspeech 2017, p. 27-31, Stockholm, 2017

• P. Żelasko, B. Ziółko, T. Jadczyk, T. Pędzimąż, Linguistically motivated tied-state triphones for Polish speech recognition, Proceedings of the IEEE 2nd International Conference on Cybernetics (CYBCONF), Gdynia, 2015

• B. Ziółko, T. Jadczyk, D. Skurzok, P. Żelasko, J. Gałka, T. Pędzimąż, I. Gawlik, S. Pałka, SARMATA 2.0 Automatic Polish Language Speech Recognition System, Proceedings of Interspeech 2015, Dresden, 2015

• …ination in ASR, Proceedings of the IEEE International Conference on Signal Processing and Multimedia Applications (SIGMAP), p. 255-260, Vienna, 2014

• B. Ziółko, P. Żelasko, D. Skurzok, Statistic of diphones and triphones presence on the word boundaries in the Polish language. Applications to ASR, Proceedings of the XXII Annual Voice Pacific Conference, p. 1-6, Kraków, 2014

Journals:

• M. Igras-Cybulska, B. Ziółko, P. Żelasko, M. Witkowski, Structure of pauses in speech in the context of speaker verification and classification of speech type, EURASIP Journal on Audio, Speech, and Music Processing (2016 IF = 1.579), vol. 2016, issue 1, Springer, 2016

• P. Żelasko, B. Ziółko, T. Jadczyk, D. Skurzok, AGH corpus of Polish speech, Language Resources and Evaluation (2016 IF = 0.922), vol. 50, issue 3, p. 585-601, Springer, 2016

• P. Żelasko, A. Trawińska, B. Ziółko, M. Czyżyk, E. Stanisławek, M. Ślusarz, Application of DTW algorithm as a tool in speaker identification, Problemy Kryminalistyki, 280, p. 53-57, Centralne Laboratorium Kryminalistyczne Policji, 2013


This thesis investigates several approaches to building an automatic speech recognition system with an application-oriented focus. Three major hypotheses are investigated.

The first one is that a careful design of an automated annotated recording collection process provides superior data for acoustic model training compared to existing Polish corpora. The second relates to text normalization for language model preparation and states that a substantial number of abbreviations in strongly inflected languages can be expanded to their full, morphologically correct forms with an application of a recurrent neural network model which predicts based only on the morphosyntactic features of a sentence. The last main point is that punctuation can be restored in transcripts of conversational speech by means of deep neural network models and word timing features, where the model processes both sides of the conversation at once.

The research starts with an overview of datasets available for Polish speech recognizer development and outlines the problem of limited data availability. An initial approach is suggested where expert rules are incorporated into the system, but due to its limited efficiency, the research steers towards finding a method to produce a high-quality acoustic dataset. A Voice over IP recording tool capable of automatic annotation is then described. Recordings obtained in this part of the research were used to train the acoustic model in a commercial speech recognition system.

Another part of the research investigates the text-related aspects of automatic speech recognition system development. It briefly describes the specific text normalization problems encountered in strongly inflected languages and explains why this problem is important for speech recognition. Then a method for automatically expanding abbreviations is suggested and evaluated on two publicly available Polish text corpora.

Finally, the last part of the research is focused on designing a punctuation prediction model for conversational speech. To that end, a novel application of the Needleman-Wunsch algorithm is proposed to create a training and evaluation dataset using the English Fisher corpus. Two punctuation prediction models are investigated, one based on a convolutional neural network and the other on a bidirectional recurrent neural network. Both models achieve competitive results and have been successfully implemented as part of a commercial system.


This thesis focuses on building automatic speech recognition systems oriented towards industrial application. Three main hypotheses are investigated.

The first states that a carefully designed process of collecting automatically annotated recordings yields better training data for the acoustic model than the existing Polish corpora. The second concerns text normalization for language model training and states that a substantial number of abbreviations occurring in strongly inflected languages can be expanded to their full, morphologically correct forms with a model based on a recurrent neural network, making predictions solely from the morphosyntactic features present in the sentence. The last hypothesis states that punctuation can be restored in conversation transcripts with a model based on deep neural networks and per-word timing features, where the model processes data from both sides of the conversation at once.

The research begins with a review of the datasets available for building and developing a Polish speech recognition system, highlighting the problems of data availability. An initial study is described in which, to overcome this problem, the system is enriched with expert rules; however, due to its limited effectiveness, the research turns towards acquiring a high-quality set of recordings. A recording tool based on Internet telephony, which performs automatic annotation of the recordings, is then described. The recordings acquired during this research were used to train the acoustic model in a commercially available speech recognition system.

Further research concerns the language-related aspects of automatic speech recognition systems. Text normalization problems specific to strongly inflected languages are briefly described, and it is explained why this problem matters for speech recognition. A method for automatically replacing abbreviations with their corresponding full forms is proposed, and the results of its evaluation on two publicly available Polish datasets are presented.

The last part of the research focuses on designing a model capable of predicting punctuation in conversation transcripts. To this end, a novel application of the Needleman-Wunsch algorithm is proposed to create training and evaluation datasets based on the English Fisher corpus. Two punctuation prediction models are proposed, one based on a convolutional neural network and the other on a bidirectional recurrent neural network. Both models achieved strong results and were implemented in a commercially available system.


First I want to thank my family: my wife, Małgorzata, who continually supported me in my pursuit of completing this thesis; my daughter, Malwina, for bringing a new kind of joy in my life; and my parents, whose support made it possible for me to discover and hone my passions.

I also thank the brilliant scientists and engineers who had an impact on my professional career: beginning with my supervisor, Dr Bartosz Ziółko, for giving me the opportunity to learn the craft of speech recognition. I owe Tomasz Pędzimąż sincere gratitude for infecting me with his thoroughness and attention to detail, today indispensable in my work. I would also like to thank my colleagues from the Digital Signal Processing group [1] and Techmo Voice Technologies [2] for many interesting conversations and opportunities to learn: Tomasz Jadczyk, Dawid Skurzok, Marcin Witkowski, Dr Magdalena Igras-Cybulska, Stanisław Kacprzak and Dr Jakub Gałka, as well as the head of the DSP group, Professor Mariusz Ziółko. Finally, I thank Yishay Carmiel and Jan Mizgajski for numerous insightful conversations.

And last but not least, I thank my friends from Kraków Street Band for making music a part of my life.

[1] http://dsp.agh.edu.pl
[2] http://techmo.pl/


AJAX - Asynchronous JavaScript and XML
ASR - Automatic Speech Recognition (or Recognizer)
BLSTM - Bidirectional Long Short-Term Memory
BOP - Bag of Phones
CNN - Convolutional Neural Network
CRF - Conditional Random Field
DBN - Deep Belief Network
DNA - Deoxyribonucleic Acid
DNN - Deep Neural Network
DTMF - Dual-Tone Multi-Frequency
ELRA - European Language Resources Association
EM - Expectation-Maximization
GMM - Gaussian Mixture Model
GPU - Graphics Processing Unit
GSM - Global System for Mobile Communications
HMM - Hidden Markov Model
HTK - Hidden Markov Model Toolkit
HTTP - Hypertext Transfer Protocol


IVR - Interactive Voice Response
LDC - Linguistic Data Consortium
LM - Language Model
LSTM - Long Short-Term Memory
LVR - Large Vocabulary Automatic Speech Recognition
MB - Megabyte
ME - Maximum Entropy
MFCC - Mel-Frequency Cepstral Coefficients
MLF - Master Label Format
MMI - Maximum Mutual Information
MT - Machine Translation
NCP - National Corpus of Polish
NER - Named Entity Recognition
NKJP - Narodowy Korpus Języka Polskiego
NLP - Natural Language Processing
PC - Personal Computer
PCM - Pulse Code Modulation
POS - Part of Speech
PSC - Polish Sejm Corpus
PSTN - Public Switched Telephone Network
ReLU - Rectified Linear Unit


RNNLM - Recurrent Neural Network Language Model
RTP - Real-time Transport Protocol
SAMPA - Speech Assessment Methods Phonetic Alphabet
SELU - Self-Normalizing Linear Unit
SIP - Session Initiation Protocol
SLU - Spoken Language Understanding
SPA - Single Page Application
SVM - Support Vector Machine
TDNN - Time-Delay Neural Network
TTS - Text to Speech
URL - Uniform Resource Locator
WCRFT - Wrocław Conditional Random Fields Tagger
WER - Word Error Rate
WFST - Weighted Finite State Transducer
XHR - XMLHttpRequest


4.1 Percentage of each subcorpus's contribution to our corpus in terms of recordings length.
4.2 Distribution of occurrence frequency of unique words in the corpus, presented using a logarithmic scale. Single words and words with count higher than 250 were omitted.
4.3 The application starts with collecting information about the speaker (from the top): the nickname, phone number, gender and the choice of a phrase set. All the information can also be inserted by a link to speed up the process.
4.4 After the user fills the questionnaire, a short explanation of how the application works and what is expected from the speaker is displayed.
4.5 The user is informed that the recordings may be further processed, demonstrated and sold as part of a corpus, without revealing information about the user's identity. In order to proceed, the user has to agree.
4.6 Information that the user is being called by the server (after he agrees to the terms of service).
4.7 The main working screen consists of a display of the phrase to be read (here 'Stare Miasto'), information about the next phrase (here 'Grzegorzki') and steering buttons - from the left (previous, repeat and next) and the large red one - finish.
5.1 A comparison of abbreviated and full form of the lemma numeru in the Polish equivalent of the English sentence "I don't have your number". The upper block contains lemmas and the lower block their corresponding lexemes and morphological tags.
5.2 Tag prediction neural network architecture.
6.1 An example of alignment between two word sequences in Fisher: the time-annotated and the punctuation-annotated. The s stands for start time and d stands for duration, both in seconds. The circles represent a blank symbol, i.e. no match for a given word in the second sequence.
6.2 Confusion matrix for the BLSTM+T model, normalized with regard


3.1 Results of recognition for monophone model, a full triphone model and the proposed generalized triphone model.
4.1 Contribution of each subcorpus in terms of recordings length. Numbers following the subcorpus' name indicate in which section its description can be found.
4.2 List of 60 most frequent words along with their translation to English and their occurrence frequency in our corpus and in a 1 billion words corpus from earlier works [Ziółko and Skurzok, 2011]. Numbers denoted as 0.00 are less than 0.005.
4.3 Continuation of table 4.2.
4.4 Recognition results for each data set in the cross-validation testing procedure. The results shown are percentages of correctly recognized phrases.
4.5 Recognition results for tests conducted on GlobalPhone with the system being trained separately on CORPORA, the AGH corpus and their sum. The results shown are percentages of correctly recognized phrases.
4.6 Detailed information about sets of collected recordings. The duration is given in [hh:mm] format.
4.7 WER of the SARMATA ASR system trained exclusively on: I. GlobalPhone; II. Continuous corpus; III. Sum of GlobalPhone and Continuous. The test corpora are the Streets and Bus corpora.
5.1 Illustration of the correct and incorrect ASR transcripts for each of the categories.
5.2 Data sets used in the experiment.
5.3 Abbreviations expanded in the experiment, along with their base form and frequency relative to the total word count in the PSC corpus.
6.1 The total count of labels for each of the punctuation classes available in our training data set.
6.2 The per-class precision, recall and F1-score (in %) achieved by the CNN and BLSTM models with pre-trained GloVe embeddings. All models used 300-dimensional word embeddings and 1-dimensional boolean conversation side features, and the +T models additionally used two 1-dimensional time features. The blank symbol denotes a blank prediction.


1 Introduction
1.1 Thesis and aims
1.2 Contents
2 Speech recognition - a review
2.1 Language models
2.2 Acoustic models
3 Linguistically motivated triphone state-tying
3.1 Acoustic model description
3.2 Generalized triphones
3.3 Experiment
3.4 Results
4 Data collection
4.1 Polish language datasets
4.1.1 Speech corpora
4.1.2 Text corpora
4.2 AGH corpus of Polish speech
4.2.1 Master Label File Format
4.2.2 Contents of the corpus
4.2.3 Corpus statistics
4.2.4 Corpus evaluation
4.3 Recording tool
4.3.1 Early Development Stage
4.3.3 Speech diversification
4.3.4 Issues encountered during corpus collection
4.3.5 Collected resources
4.3.6 Evaluation
5 Inflected abbreviation expansion
5.1 Relation to text normalization
5.2 Problem statement
5.3 Proposed method
5.4 Dataset
5.5 Experiment
5.6 Results
6 Punctuation modelling for conversational speech
6.1 Data preparation
6.2 Punctuation model
6.2.1 Features
6.2.2 Architecture
6.3 Results
7 Summary and conclusions
7.1 Data collection
7.2 Text normalization


1 Introduction

Research on automatic speech recognition (ASR) started around the middle of the 20th century [Fry, 1959, Denes, 1960, Denes and Mathews, 1960, Halle and Stevens, 1962], but in the last twenty years it has grown much more rapidly. This accelerated pace of research can be attributed mostly to breakthroughs in the area of deep learning. Many contributions fostered progress in our ability to train Deep Neural Network (DNN) models, starting with (now passé) Deep Belief Networks (DBN) [Hinton et al., 2006], then progressing to renewed interest in Recurrent Neural Networks (RNN) and their frequently used Long Short-Term Memory (LSTM) network variant [Hochreiter and Schmidhuber, 1997, Mikolov et al., 2010, Graves et al., 2013], and improved training and regularization techniques such as the rectified linear unit (ReLU) [Nair and Hinton, 2010], dropout [Srivastava et al., 2014] and batch normalization [Ioffe and Szegedy, 2015]. These and other discoveries led to the development of increasingly elaborate DNN architectures for the acoustic models [Hannun et al., 2014, Peddinti et al., 2015, Chan et al., 2016, Cheng et al., 2017] and, to a somewhat lesser extent, the language models [Mikolov et al., 2010, Merity et al., 2017] used in ASR.

The successes of the deep learning approach, however, come at a price: these models typically require vast amounts of data in order to learn representations which generalize well to new, previously unseen data. In order to perform supervised training of a DNN acoustic model, we require not only the recordings of speech, but also their text transcripts. Since most ASR research is done for the English language, we have observed the rise of several large language resources, both public ones (such as Fisher [Cieri et al., 2004], Switchboard [Godfrey et al., 1992], CallHome [Canavan et al., 1997]) and proprietary ones (and hence restricted: often created and used by private companies such as Baidu, Google, Microsoft or Amazon for their in-house research). Unfortunately, the effort and cost of preparing such data sets is typically very large, which makes it difficult to gather such resources in most of the world's natural languages.

The best published ASR systems - for English - are claimed to beat humans in terms of recognition accuracy [Xiong et al., 2017, Saon et al., 2017, Han et al., 2017]. However, upon closer investigation, a careful reader will observe the complexity of these systems, which are often composed of several different DNN architectures whose predictions are fused together by a top-level classifier. In practice, deployment of such systems in production would be very costly due to their great computational complexity - and in cases where real-time (online) recognition is required, perhaps not feasible at all with modern hardware.

Deployment of speech recognition systems is an intricate and multidimensional problem. The architect of such a system has to consider multiple criteria to deliver a satisfactory product: accuracy, latency and scalability. Accuracy is typically measured by the word error rate (WER) metric, often presented as the only relevant evaluation criterion in the research literature. Latency determines how fast the system is able to recognize utterances. Speech recognizers can be broadly classified as online - i.e. real-time - and offline systems. The former offer more possibilities in terms of usage - e.g. in chatbots, voice search or dictation, where the feedback must be quick in order not to degrade the user experience. Finally, scalability depends on how compute-intensive the system is. Highly scalable systems require less computation to perform the recognition task, thus lowering the cost of system operation.

Besides these three basic criteria for production deployment, the ASR developer also needs to consider the specific requirements of the ASR system's application. Recognized speech will either serve as an artifact to be presented to the end user, or as an input to a different system - e.g. a natural language processing (NLP) system - which will perform additional tasks on top of it. In the former case, the text will need additional post-processing to improve readability - such as addition of punctuation or truecasing (for named entities). The latter application can also benefit from these, but it will also depend on the consistency of the recognized text. The research presented in this work constitutes the author's contribution to solving the problems of data acquisition, text normalization and punctuation modelling, which are relevant to the implementation of ASR systems in the industry.

At the beginning, an initial approach to the data scarcity problem is presented. It incorporates expert linguistic knowledge into a triphone GMM-HMM system to improve its effectiveness in the face of data limitations. GMM-HMM systems were the state of the art before DNN-based systems and are sometimes still used when dealing with smaller data sets.

The first major part of the thesis is focused on the data aspects of Polish ASR. Although there exist some Polish speech corpora, they are not nearly as large as their English counterparts, which makes it difficult to use state-of-the-art methods for ASR development, and thus to develop competitive and functional ASR products in the industry. The author's aim was to develop a methodology which would allow gathering a set of recordings sufficient for DNN acoustic model training, especially in the telephone speech domain, which is one of the most promising markets for application of ASR technology. This research, which took place in the AGH-UST Digital Signal Processing group and Techmo Voice Technologies, has resulted in methods and tools for acoustic data acquisition and a corpus of over 100 hours of high quality telephonic speech. These recordings have been used in the development of the commercially available Polish ASR system SARMATA.

The second part of the thesis addresses text normalization, a problem related to data acquisition for large vocabulary language model (LM) development in Polish. Typically, LMs for continuous speech are implemented as statistical models, such as n-grams (for the ASR decoding pass) or DNNs (for the lattice rescoring pass). As such, they also benefit from larger quantities of data - a trick that is often applied is to train and interpolate two LMs, one of them being general, trained on a large quantity of out-of-domain data, and the other trained only on a small amount of text from a specific domain. In Polish, there are several text resources which can help in training the general LM, such as the NCP, PSC, Wikipedia and OpenSubtitles corpora (described in chapter 4); however, all of them require some kind of text normalization before they can be applied to LM training. In particular, this research shows that an RNN model can be trained to help in abbreviation expansion with proper inflection by using the words' morphosyntactic tags in the context of the whole sentence.
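To make the idea concrete, the sketch below shows one possible shape of such a tag-driven sequence model: a minimal PyTorch illustration in which a bidirectional LSTM reads the morphosyntactic tags of a sentence and classifies the inflected form of an abbreviation at a marked position. The layer sizes, tag vocabulary and class inventory are assumptions for illustration, not the architecture reported later in chapter 5.

```python
import torch
import torch.nn as nn

class TagExpansionModel(nn.Module):
    """Illustrative bidirectional LSTM that reads a sentence as a sequence of
    morphosyntactic tags and predicts the inflected-form class of the
    abbreviation at a marked position. Sizes are arbitrary placeholders."""
    def __init__(self, n_tags: int, n_classes: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_tags, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, n_classes)

    def forward(self, tag_ids: torch.Tensor, abbrev_pos: torch.Tensor) -> torch.Tensor:
        # tag_ids: (batch, seq_len) morphosyntactic tag indices for the sentence
        # abbrev_pos: (batch,) index of the abbreviation token within each sentence
        states, _ = self.lstm(self.embed(tag_ids))        # (batch, seq_len, 2*hidden)
        batch_idx = torch.arange(tag_ids.size(0))
        at_abbrev = states[batch_idx, abbrev_pos]         # hidden state at the abbreviation
        return self.classify(at_abbrev)                   # logits over inflected-form classes

# Tiny usage example with random data.
model = TagExpansionModel(n_tags=1000, n_classes=14)
logits = model(torch.randint(0, 1000, (2, 12)), torch.tensor([3, 7]))
print(logits.shape)  # torch.Size([2, 14])
```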

Punctuation modelling is the topic of the last part of the thesis. In many ASR systems, a serious limitation is the lack of any punctuation or capitalization (with the exception of some recent end-to-end models). This can be problematic both in the case of visual presentation of the outputs, where non-punctuated transcripts are confusing and difficult to read, and when these transcripts are used as inputs for downstream tasks such as those in the domain of NLP. Off-the-shelf NLP systems are usually trained on punctuated text, thus a lack of punctuation can cause a significant deterioration of their performance. This issue is especially interesting in the domain of telephone conversational speech, where the dialogue interactions can be very dynamic, and the task is made more difficult by disfluencies in spontaneous speech [Igras-Cybulska et al., 2016]. This work was performed by the author during his affiliation with Intelligent Wire, a company working on an application transcribing telephone calls between customers and agents and performing semantic annotation on them to find specific events, as well as the intents and moods of the interlocutors. Displaying punctuation has been crucial to providing a high quality service.

1.1 Thesis and aims

The main aim of the presented research is to address the productization aspect of speech recognition by proposing solutions to issues such as data collection, text normalization and punctuation modelling. The following aims are of particular interest:

• Design the recordings acquisition process and identify good and bad practices with regard to data quality.

• Gather high quality audio data for Polish ASR system training and evaluation.

• Improve the performance of a GMM-HMM triphone acoustic model trained on several hours of recordings by implementing linguistically motivated state-tying rules.

• Identify the main issues in application of publicly available text data in Polish ASR system development.

• Investigate the performance of an LSTM network in the task of inflected abbreviation expansion.

• Propose a model for punctuation restoration in transcripts of conversational speech, applicable in real-time speech recognition.

The following theses were set in this work:

• The proposed process of recording acquisition yields training data which results in an ASR acoustic model significantly outperforming those trained on publicly available Polish data sources such as CORPORA or GlobalPhone.

• Application of linguistically motivated state-tying rules in an acoustic hidden Markov model improves the speech recognizer when training data is scarce.

• Abbreviations frequently encountered in available Polish text data can be expanded to their full forms with an application of a context-aware LSTM model.

• High accuracy punctuation restoration in conversational speech can be performed by DNNs which use features from both sides of the conversation, as well as relative word timing information.

1.2 Contents

Chapter 1 introduces the reader to the ASR domain and presents a discussion of problems which are important from the point of view of the industry. Basic principles of ASR are reviewed in chapter 2, and the initially adopted method of applying generalized triphone contexts to overcome data scarcity issues is presented in chapter 3. The problem of data collection is presented in chapter 4, where related work is described in section 4.1, and the contributions of the author are presented in the subsequent sections: section 4.2 presents the AGH speech corpus and section 4.3 the methodology and tools developed for high quality speech corpus creation. In chapter 5 the problem of text data preparation is presented along with the author's contribution in the area of abbreviation expansion for strongly inflected languages. Chapter 6 describes a novel approach to punctuation modelling in conversational speech transcripts, with the experimental part being performed on the English Fisher corpus (as no Polish resources exist which would allow performing the experiment). The thesis is then concluded in chapter 7.


2 Speech recognition - a review

Given a language L, consisting of words w : w ∈ L, a classical speech recognition system solves the following equation, derived from Bayes' rule:

$$W^* = \operatorname*{argmax}_{W} p(W \mid X) = \operatorname*{argmax}_{W} p(X \mid W, \Theta)\, p(W, \Psi) \qquad (2.1)$$

where W ∈ Lⁿ denotes an arbitrary-length sequence of words (an utterance), W* is the most likely path given the acoustic and language models, p(W | X) is the a posteriori likelihood of utterance W given the acoustic evidence X, p(X | W, Θ) is the likelihood of observing the acoustic evidence X given utterance W and the parameters of the acoustic model Θ, and p(W, Ψ) is the a priori likelihood of utterance W existing in the language given the language model with parameters Ψ.

In other words, the task of an ASR system is to identify the most likely utterance given its acoustic and linguistic likelihood. Therefore, the effectiveness of the ASR depends strongly on how well these models represent the true distribution of acoustic evidence and words in the language. In order to solve the equation, the speech recognizer must perform a search over the space of all possible utterances - since this space grows exponentially with each word added to the utterance, the search typically incorporates some kind of pruning, such as beam search [Steinbiss et al., 1994].
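As an illustration of the pruning idea, the following minimal sketch keeps only the best `beam_size` partial hypotheses at every step instead of expanding the full exponential space. This is a generic toy example rather than the decoder used in this work, and the scoring function is an assumed placeholder standing in for combined acoustic and language model scores.

```python
from typing import Callable, List, Tuple

def beam_search(vocab: List[str],
                score_fn: Callable[[List[str], str], float],
                max_len: int,
                beam_size: int) -> List[str]:
    """Generic beam search over word sequences.

    score_fn(prefix, word) returns the (log-)score gained by appending `word`
    to `prefix`; in an ASR decoder this would combine acoustic and language
    model likelihoods. Only `beam_size` hypotheses survive each step, so the
    cost is O(max_len * beam_size * |vocab|) instead of |vocab| ** max_len.
    """
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for total, prefix in beam:
            for word in vocab:
                candidates.append((total + score_fn(prefix, word), prefix + [word]))
        # Prune: keep only the `beam_size` best-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=lambda c: c[0])[1]

# Toy usage: a scorer that prefers repeating the previous word, just to show the interface.
words = ["tak", "nie", "może"]
best = beam_search(words, lambda prefix, w: 1.0 if prefix and prefix[-1] == w else 0.5,
                   max_len=3, beam_size=2)
print(best)
```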

In recent years, several types of ASR systems (especially end-to-end systems) have been proposed which in some ways modify this paradigm [Graves et al., 2006, Chan et al., 2016] - however, they are beyond the scope of this thesis.


2.1 Language models

The language model G defines the vocabulary which can be recognized. There are two categories of LMs:

• A context-free grammar, strictly defining which utterances can exist and which cannot;

• A statistical language model, which, in essence, allows any sequence of words w ∈ L to be recognized, however it might assign them an appropriately low likelihood.

Context-free grammars are often used in simple ASR applications, such as IVR systems with a limited and strictly controlled vocabulary. Large vocabulary ASR systems most frequently use n-gram models [Goodman, 2001], with the interpolated Kneser-Ney n-gram with backoff being the n-gram state of the art. Another popular class of statistical LMs are the DNN-based models, especially RNNLMs [Mikolov et al., 2010]. The applicability of RNNLMs is, however, limited due to their computational complexity in beam search. It can be easily illustrated with a somewhat simplistic, but illustrative example: given that the ASR is considering 1000 competing active paths at time step t, and at time step t + 1 there are 100 possible next words, it has to compute the RNNLM state for each of the active paths separately, giving 100000 computations to perform. Since the RNNLM state computation involves multiplication of several, usually large, matrices, this cost cannot be afforded at decoding time. This is why the RNNLM is often used as a rescoring model - after the decoding phase is finished, the RNNLM is used to obtain the a priori language likelihood of the competing hypotheses - under the assumption that the RNNLM is trained to better resemble the true distribution of the language.
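The sketch below works through that cost comparison numerically, contrasting per-step RNNLM evaluation during beam search with a single rescoring pass over a fixed n-best list. The counts are the illustrative numbers from the paragraph above plus assumed placeholders (utterance length, n-best size), not measurements.

```python
# Illustrative cost accounting, using the figures from the text above.
active_paths = 1000      # competing hypotheses kept by the decoder at step t
next_words = 100         # candidate continuations considered at step t + 1
utterance_steps = 20     # assumed utterance length in words (placeholder)
nbest = 100              # assumed size of the n-best list kept for rescoring

# In-decoder use: one RNNLM state update per (path, next word) pair, every step.
in_decoder_evals = active_paths * next_words * utterance_steps

# Rescoring use: one RNNLM evaluation per word of each finished hypothesis.
rescoring_evals = nbest * utterance_steps

print(f"RNNLM evaluations inside beam search:     {in_decoder_evals:,}")  # 2,000,000
print(f"RNNLM evaluations during n-best rescoring: {rescoring_evals:,}")  # 2,000
```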

2.2 Acoustic models

Words are the usual output of an ASR system; however, at a lower level, they are divided into smaller chunks of acoustic signal, called phones. A phone is assumed to be a basic unit of speech [Fry, 1959]. A lexicon L defines the phonetic transcription of words found in the language model. In practice, the same phone can sound very different when found in different contexts. To better model this phenomenon, ASR systems use the notion of contextual phones - these will be explained in more detail in chapter 3. The final level of granularity is attained when each of the contextual phones is modelled with a hidden Markov model (HMM) [Rabiner and Juang, 1993]. This approach allows incorporating temporal effects into the model - such as the beginning, middle and ending of a phone, where each of them is represented by a corresponding HMM state.

ASR systems are often based on the Weighted Finite State Transducer (WFST) framework [Mohri, 2009]. The discussion of WFSTs is beyond the scope of this thesis; for details, the reader is directed towards [Mohri, 2009] and the documentation of the open-source Kaldi ASR system. However, it is worth mentioning that the four components: the language model G, the lexicon L, the phonetic context dependency C and the HMM structure H can each be represented as a WFST and efficiently composed to create a static (i.e. fully known ahead of time) decoding graph HCLG, which is a transducer consuming HMM states as input and producing words from L on the output. This framework is used e.g. in the open-source Kaldi ASR system [Povey et al., 2011b].

The acoustic model is a probabilistic model which, given acoustic evidence X = x, i.e. a sequence of acoustic feature frames x, defines the posterior probability distribution over the HMM states. To make this explanation more straightforward, let us assume that our phonetic alphabet has 10 phones, and we are using diphones (a phone with the context of the previous phone) as our contextual phone representation, which gives us 100 possible diphones. Each of the diphones is modelled by a 3-state HMM, giving us a total of 300 HMM states in our system. Then, at each timestep t, the acoustic model's task is to predict a 300-dimensional vector of posterior acoustic probabilities given the HMM parameters and the acoustic features xt. Given the acoustic likelihood distribution, a decoder iteratively traverses the HCLG graph until there are no more acoustic frames.
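The toy numbers from this example can be written down directly; the snippet below is only a shape check, where the random "posteriors" stand in for an actual acoustic model's outputs.

```python
import numpy as np

n_phones = 10
n_diphones = n_phones * n_phones              # 100 contextual phones
states_per_hmm = 3
n_hmm_states = n_diphones * states_per_hmm    # 300 HMM states in total

n_frames = 500  # e.g. 5 seconds of audio at a 10 ms frame shift

# A stand-in for the acoustic model output: one posterior distribution
# over all HMM states per feature frame x_t.
rng = np.random.default_rng(0)
scores = rng.random((n_frames, n_hmm_states))
posteriors = scores / scores.sum(axis=1, keepdims=True)

print(posteriors.shape)        # (500, 300)
print(posteriors[0].sum())     # ~1.0: a distribution over the 300 states
```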

The acoustic parameters are extracted from the time-frequency spectrum of the signal divided into small segments (typically of 10-15 ms duration) - popular choices are the Mel-frequency cepstral coefficients (MFCC) [Vergin et al., 1999], filterbanks, and sometimes even the raw audio waveform [Deng et al., 2013].
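For a concrete picture of such a front end, here is a brief sketch of MFCC extraction with deltas using the librosa library. The file path, sampling rate and feature dimensions are assumptions chosen for illustration; the systems described in this thesis use their own feature pipelines.

```python
import librosa
import numpy as np

# Hypothetical input file; 16 kHz is assumed here to match telephone-band
# recordings upsampled for the recognizer.
signal, sr = librosa.load("utterance.wav", sr=16000)

# 20 ms windows with a 10 ms shift, 13 cepstral coefficients.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=320, hop_length=160)

# Append first- and second-order derivatives (deltas and double deltas).
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])   # shape: (39, n_frames)

print(features.shape)
```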

Until approximately 2006, the most popular choice for an acoustic model was a Gaussian mixture model - at that time, Hinton proposed an algorithm for training a DBN [Hinton et al., 2006], which was initially adopted in ASR, but then superseded by models such as RNNs [Hochreiter and Schmidhuber, 1997, Graves et al., 2006, Graves et al., 2013] or TDNNs [Lang et al., 1990, Peddinti et al., 2015]. A detailed discussion of acoustic model architectures is beyond the scope of this thesis; the reader is instead directed towards the works mentioned here, especially [Rabiner and Juang, 1993] for the GMM-HMM models, as well as [Graves et al., 2006, Veselý et al., 2013, Hannun et al., 2014, Peddinti et al., 2015, Amodei et al., 2016, Chan et al., 2016, Cheng et al., 2017, Hadian et al., 2018] for various DNN-based models.


3 Linguistically motivated triphone state-tying

This chapter presents the initial approach taken towards solving the problem of data scarcity in the design and development of a Polish ASR system. As argued in section 4.1, Polish is not abundant with commonly available data for acoustic modelling, and at the time this research was conducted, the data was even harder to obtain. At that stage, the Digital Signal Processing team, which the author is a part of, had been working on an ASR system which would be able to recognize utterances generated by simple grammars, describing e.g. a list of streets in a particular city or combinations of digits. At the time, there was no ready-to-use large vocabulary ASR (LVR) software for Polish; however, several attempts had been made [Demenko et al., 2008, Pawlaczyk and Bosky, 2009, Pułka and Kłosowski, 2008, Marasek et al., 2009, Ziółko et al., 2011, Ziółko et al., 2008]. The following research was based on the premise of using the available data in a more efficient manner to be able to train a triphone acoustic model.

3.1 Acoustic model description

A typical GMM-HMM acoustic model involves a separate HMM for each phone in the phonetic alphabet, which consists of several hidden states responsible for modelling temporal changes in the phone's articulation. The emission probability of each of these states is modelled by a GMM. The simplest variant of this model is based on monophones, i.e. phones which are contextually independent. This kind of model requires less data to achieve its full efficiency, but its modelling capabilities are limited by the lack of context. To create a contextual model, we enrich the phones with information about their context, obtaining a new set of phonetic symbols - e.g. if we take into consideration the previous phone, we obtain diphones. Examples of such transforms will be shown in section 3.2. Typically, the phonetic unit used in acoustic models is a triphone (a phone with left and right context).

Context modelling is frequently used in ASR on all layers, from the processing of acoustic features to semantic modelling [Niewiadomy and Pelikant, 2008]. It can highly improve recognition if used properly and with enough representative training data; otherwise, the context model can easily be over-trained. On the acoustic level, context is typically modelled by neighboring frames, triphones or syllables [Kępiński, 2005]. They are all strongly motivated by anatomy as a consequence of co-articulation. For a discussion and some examples of the most common triphones in Polish, see [Ziółko et al., 2009].

Acoustic models trained on all of the triphones can be more accurate than monophone models, but they face data sparsity problems. One solution to that is state-tying - a parameter-sharing technique used in HMMs which, in the context of ASR, allows re-using the same model for different phonetic units. ASR systems can benefit from properly tied-state triphone models; however, the choice of rules for tying states is not trivial. There are methods for automatic discovery of state-tying rules with the use of phonetic decision trees, which yield good results with larger data sets [Young et al., 1994]. We present a different approach, where we construct simple rules based on prior linguistic knowledge and demonstrate their usefulness in the context of training generalized triphone acoustic models.

Before we dive into details, we should review the notation we use for triphones. Given three phones, A, B and C, when we write about the triphone A+B-C we mean phone A, with left context of C and right context of B. Let us now give an example of state-tying: given a three-state hidden Markov model (HMM) of a triphone A+B-C and a three-state HMM of another triphone A+D-E, the second (middle) state in both models shares a common probability distribution. This technique may also be pushed further - given similar models A+B-C and A+B-D, the third (last) state, responsible for the context of B, can also share a common probability distribution. In general, this technique employs state-tying rules, thus causing the models to be less specific towards a given context, but trained on a larger amount of data.

3.2 Generalized triphones

Before we begin the description of our method, we would like to show a short example of what it is capable of and what its consequences are. Let us take a Polish word, tak (Eng. yes), and its phonetic transcription: t, a, k. With regular triphones, the transcription looks like t+a-#, a+k-t, k+#-a, where # denotes silence. After applying our method, the transcription looks like t+VOWEL-#, a+STOP-STOP, k+#-VOWEL. The information about the current phone is preserved, but its context is generalized. If we now take another Polish word, kat (Eng. executioner), with the analogous generalized triphone transcription k+VOWEL-#, a+STOP-STOP, t+#-VOWEL, it means that we can use the same context-dependent model of phone a surrounded by two stops during recognition, or use both utterances of a to train this model. This is in contrast to a regular triphone model, where a would have different contexts and thus separate models.

Our method, in essence, involves the following steps:

1. Define phonetic categories, which will later allow to generalize the context of a phone (e.g. a category of “vowels”);

2. Prepare phonetic transcription for text to be recognized - use regular triphones to mark the context of each phone;

3. In each triphone, swap the phones marking the context with their categories (defined in step 1).

The effect of this procedure is that we end up with a much smaller number of context-dependent models to be trained. The number of all possible triphone combinations is substantial [Ziółko et al., 2009]. Suppose we have a phonetic alphabet with 39 phones - to model all possible contexts with regular triphones, we need 39³ = 59319 acoustic models. In our approach, with 8 linguistic categories, we can model all possible contexts with only 39 ∗ 8² = 2496 acoustic models. It also means that the resulting acoustic models are trained on a larger amount of data and have better reusability when applied to a recognizer module. The described method is speaker-independent, with an isolated-words use case.

In our research we used slightly modified versions of Grocholewski's Polish phone categories [Grocholewski, 1997]. (They were also used by Kępiński [Kępiński, 2005] and by Steliga [Steliga, 2011] in research on clustering of Polish phones, as well as by our group in research on the evaluation of errors in Polish phone segmentation for different types of transitions [Ziółko et al., 2010].) The categories are listed below; a minimal code sketch of the resulting context mapping follows the list:

1. Stops (/p/, /b/, /t/, /d/, /k/, /g/)
2. Nasal consonants (/m/, /n/, /ni/, /N/)
3. Mouth vowels (/i/, /y/, /e/, /a/, /o/, /u/, /l_/)
4. Nasal vowels (/e_/, /a_/)
5. Glides (/j/)
6. Unstables (/l/, /r/)
7. Fricatives (/w/, /f/, /h/, /z/, /s/, /zi/, /si/, /rz/, /sz/)
8. Closed fricatives (/dz/, /c/, /dzi/, /ci/, /drz/, /cz/)
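A minimal sketch of the mapping, assuming the A+B-C notation and the tak example from above; the category dictionary below covers only the phones needed for that example (the full set is given in the list above):

```python
# Minimal illustration of generalizing triphone contexts, assuming the
# A+B-C notation used above (A = current phone, +B = right context,
# -C = left context) and '#' for silence.
CATEGORY = {
    "p": "STOP", "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP", "g": "STOP",
    "a": "VOWEL", "e": "VOWEL", "i": "VOWEL", "o": "VOWEL", "u": "VOWEL", "y": "VOWEL",
    "#": "#",  # silence keeps its own symbol
}

def triphones(phones):
    """Turn a phone sequence into regular triphones, padding with silence."""
    padded = ["#"] + list(phones) + ["#"]
    return [f"{cur}+{right}-{left}"
            for left, cur, right in zip(padded, padded[1:], padded[2:])]

def generalize(triphone):
    """Replace the left and right context phones with their categories."""
    cur_right, left = triphone.split("-")
    cur, right = cur_right.split("+")
    return f"{cur}+{CATEGORY[right]}-{CATEGORY[left]}"

tak = triphones(["t", "a", "k"])
print(tak)                           # ['t+a-#', 'a+k-t', 'k+#-a']
print([generalize(t) for t in tak])  # ['t+VOWEL-#', 'a+STOP-STOP', 'k+#-VOWEL']
```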

The idea of merging context-dependent acoustic models is well known and established. In contrast to our linguistic categorization, some of the data-driven approaches include the use of phonetic decision trees [Bahl et al., 1991], while others define a distance measure and automatically merge the closest models or share the probability distributions of the closest models [Lee, 1990]. To the best of our knowledge, linguistically motivated generalization of triphones for the Polish language had not yet been investigated. The method we describe has several advantages: no metric has to be established in order to group the models, the rules for grouping are provided by the linguistic categories of a given language, and acoustic model training times are lower due to no computational overhead associated with the method. One may argue that a data-driven approach allows for a better fit to the available training data - the question is, however, whether a data-driven method provides correct results when data is scarce.

We investigated the opportunity of applying linguistic categorization of phones to create tied-state triphones according to the rules listed above.

3.3 Experiment

The models were trained and tested on three groups of recordings coming from the AGH speech corpus: Streets, Digits and Commands. The speech was recorded using a Voice over Internet Protocol (VoIP) based recorder program which answered calls from regular telephones and cellphones. The Streets subcorpus contains recordings of 14 people reading out 1250 unique names of streets in the Polish city of Bielsko-Biała (not every speaker has a complete set). The total duration of the recordings is 12 hours and 30 minutes. The Digits subcorpus contains recordings of 72 people who were asked to repeat random combinations of three digits, which were read out by a lector. The total duration of these recordings is 5 hours and 40 minutes. The last group consists of recordings of isolated Commands, such as "oferta" (Eng. "offer"), "internet" or "awaria" (Eng. "breakdown"). They were recorded by 55 persons, resulting in 3 hours and 20 minutes of recordings. Although the speech was encoded and recorded at an 8 kHz sampling rate, we converted the recordings to 16 kHz WAVE Pulse Code Modulation (PCM) format to comply with the requirements of the SARMATA system.

In order to check whether the proposed method improves the overall recognition of our ASR system, we checked three scenarios: a system with monophone acoustic models (no context modelling), a system with triphone acoustic models, and a system with linguistically generalized triphone acoustic models. For each training/test set, a monophone dictionary was automatically created from the transcriptions using the OrtFon software, and was later converted to an appropriate triphone dictionary. For parametrization we used standard MFCCs along with frame log-energy, deltas and double deltas, with a 20 ms window length and a 10 ms frame shift [Davis and Mermelstein, 1980]. To initialize monophone model training, we used 5 minutes of manually time-aligned phone-level annotations, assigning a single multivariate Gaussian probability density function to model each phone. The models were then trained with 6 passes of the Viterbi algorithm to obtain an alignment of the training data, and re-estimated after each pass. As a result of this training, we obtained 38 HMM 3-state acoustic models (including a silence model) with a Gaussian mixture model containing 20 components for each state.

Because many triphones occur rarely, and thus are provided with a non-satisfactory amount of training data, we counted their occurrence frequency in our corpus and picked those which occur frequently enough to provide at least 100 frames per HMM state per GMM component (with the assumption that the average duration of a triphone is 100 ms, which may not necessarily be true, but is irrelevant in the creation of a ranking list due to the fact that every acoustic model has the same number of states and the same number of Gaussians in the mixture). Because of this criterion, there are more linguistically generalized triphones than standard triphones in the training set. The previously trained monophone model was used as a starting point for triphone model training, which also included 6 passes of the Viterbi algorithm and model re-estimation. The resulting acoustic models were 95 regular tied-state triphone models and 147 linguistically generalized tied-state triphone models - note that both model sets include the monophones from the first training stage, which are used in every context that was too rare by our criterion to be modelled by triphones.
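One way to read that selection criterion numerically is sketched below; the per-occurrence frame count is derived from the assumed 100 ms triphone duration and the 10 ms frame shift mentioned earlier, and the exact bookkeeping used in the original experiment may differ.

```python
# Rough reading of the triphone selection criterion described above.
frames_per_state_per_component = 100   # required by the criterion
states_per_model = 3                   # 3-state HMMs
gmm_components = 20                    # Gaussians per state
frame_shift_ms = 10
assumed_triphone_duration_ms = 100     # assumption stated in the text

frames_required = frames_per_state_per_component * states_per_model * gmm_components
frames_per_occurrence = assumed_triphone_duration_ms // frame_shift_ms

min_occurrences = frames_required / frames_per_occurrence
print(f"Frames required per triphone model: {frames_required}")            # 6000
print(f"Approx. occurrences needed in the corpus: {min_occurrences:.0f}")  # 600
```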

Because we are interested in the improvement of a speaker-independent system, we designed an evaluation procedure to verify the improvement in a scenario where the test user's speech is excluded from the training data. The procedure is as follows: for each speaker, monophone and triphone models are trained using all available data of the other speakers, and then each model is used for evaluation of ASR performance on the current speaker. Because the largest amount of available data is in Streets, we decided that the procedure should be run for each speaker in Streets, thus allowing us to re-use their models in testing ASR performance on every other subcorpus. However, not every speaker who spoke in Streets has spoken in Digits or Commands. Thus, we have 14 test speakers in Streets (14 175 utterances), 7 test speakers in Digits (512 utterances) and 6 test speakers in Commands (828 utterances).
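The per-speaker hold-out scheme can be summarized in a few lines; the sketch below only builds the train/test splits, and the speaker IDs and utterance file names are illustrative placeholders.

```python
from typing import Dict, List, Tuple

def leave_one_speaker_out(utterances_by_speaker: Dict[str, List[str]]
                          ) -> List[Tuple[str, List[str], List[str]]]:
    """For each speaker, train on everyone else's data and hold that speaker out."""
    splits = []
    for held_out in utterances_by_speaker:
        train = [utt for spk, utts in utterances_by_speaker.items()
                 if spk != held_out for utt in utts]
        test = utterances_by_speaker[held_out]
        splits.append((held_out, train, test))
    return splits

# Illustrative data: a few speakers from the Streets subcorpus.
streets = {"ALJ": ["street_001.wav", "street_002.wav"],
           "JGR": ["street_003.wav"],
           "KPE": ["street_004.wav", "street_005.wav"]}
for speaker, train, test in leave_one_speaker_out(streets):
    print(speaker, len(train), "training utterances,", len(test), "test utterances")
```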

Test corpus        Monophones   Triphones   Generalized triphones
Commands
  speaker BZI      84,6%        90,6%       85,5%
  speaker DSK      98,9%        98,9%       98,9%
  speaker IGA      100,0%       99,1%       100,0%
  speaker MWI      98,2%        98,8%       98,8%
  speaker PZE      94,2%        96,0%       95,6%
  speaker TPE      84,0%        88,0%       88,0%
Digits
  speaker ALJ      91,3%        88,4%       85,5%
  speaker JGR      100%         96,2%       100%
  speaker KPE      46,9%        71,4%       71,4%
  speaker DSK      94,7%        100%        100%
  speaker MWI      95,5%        96,3%       94,8%
  speaker TJA      90,3%        92,9%       92,9%
  speaker TPE      93,8%        93,8%       93,8%
Streets
  speaker ALJ      86,2%        87,6%       88,6%
  speaker JGR      71,7%        72,4%       70,9%
  speaker KPE      78,9%        80,8%       82,2%
  speaker MIG      89,9%        92,3%       90,7%
  speaker ETH      81,3%        81,7%       84,1%
  speaker MNO      49,5%        49,0%       54,5%
  speaker BZI      73,7%        76,3%       77,7%
  speaker DSK      90,0%        88,0%       87,0%
  speaker IGA      51,1%        48,0%       49,6%
  speaker MWI      84,6%        85,1%       85,4%
  speaker PJA      85,0%        82,6%       85,0%
  speaker PZE      78,9%        77,9%       76,1%
  speaker TJA      80,7%        81,7%       82,8%
  speaker TPE      83,7%        85,4%       85,6%
Commands           95.2%        96.6%       95.9%
Digits             88.5%        91.8%       91.0%
Streets            77.5%        78.1%       78.6%
average            78.8%        79.6%       80.0%

Table 3.1: Results of recognition for monophone model, a full triphone model and the proposed generalized triphone model.


3.4 Results

Before we go on with the analysis of the results, we would like to point out the differences between our three test data sets. The Streets set is characterized by the greatest phonetic variety, and so we believe that it is the greatest beneficiary of context generalization. The Commands and Digits sets, on the other hand, offer a relatively low variety of phonetic contexts, given that Commands consist of just about 30 different words and Digits utterances are combinations of three randomly selected digits.

In each test data set, we observed that monophones yield worse results than any kind of triphones, which was expected. Linguistically generalized triphones scored better results overall than regular triphones; however, their introduction resulted in lower scores in Digits and Commands. This is caused by the small number of phonetic contexts of utterances in those sets - regular triphone models were fit very tightly to this data. When generalized, the acoustic models fit those particular contexts worse, but also gained representativeness for a large number of other contexts, as shown in the results from the Streets set. Overall, given our data, the proposed method improved the recognition rate by 0.4% with regard to regular triphones and by 1.2% with regard to monophones.

Another consequence of triphone generalization is that it allowed us to use 52 more context-dependent models which had the minimal amount of training data available, as explained in section 3.3 (a gain of about 50%). It also means that more training data was used efficiently.

In our experiment, the average time needed to train a monophone model was about 1 hour, and the time needed to decode all test data for a given speaker was about 10 minutes. Triphone model training took roughly the same time as training the monophone models. To perform these tasks, we ran the SARMATA system with 15 working threads on a server equipped with Intel Xeon E5 series processors.


4 Data collection

4.1 Polish language datasets

Only a handful of the world's languages, like English, benefit from resources such as a wide selection of hundreds-of-hours-long speech corpora or representative text corpora. Librispeech [Panayotov et al., 2015] (about 1000 hours), TEDLIUM [Rousseau et al., 2012] (about 120 hours) and the AMI meeting corpus [McCowan et al., 2005] (about 100 hours) are only a few examples of publicly available English speech databases which can be obtained free of charge. There are even more databases which can be purchased from institutions such as the Linguistic Data Consortium (LDC), such as Switchboard [Godfrey et al., 1992], which serves as an ASR benchmarking corpus, the Fisher corpus [Cieri et al., 2004] or the Wall Street Journal corpus [Paul and Baker, 1992]. Finally, large private organizations such as Google or Baidu report results of their research using their own internal data sets, which exceed even the 10000 hour count [Amodei et al., 2016, Kim et al., 2017]. In spite of this abundance of data, the great majority of the world's languages must be considered under-resourced. Although Polish can hardly still be called under-resourced in light of the available data sets which will be presented in this section, it is still behind e.g. English in terms of available data - especially the kind which can be obtained either freely or via an accessibly priced license purchase.

The main problem in creating language resources is the high cost of their production. Typically, the work associated with the process of language resource creation is mostly manual, which is both costly and not very effective - e.g. manual transcription can take several times as long as the recordings' duration, and manual annotation on the phone level takes even tens of times longer. On the other hand, to satisfy the needs of the statistical models employed in modern speech-related technologies, a large amount of training data is required. This makes corpus creation a very difficult task for researchers who wish to provide such data. A consequence of the high corpus creation cost is that the cost of purchase is typically also very high, which - from a practical point of view - is often the main prohibitive factor in the acquisition of an existing corpus - take the Polish Speecon corpus' [Iskra et al., 2002] academic license costing 50 000 € as an example [1].

In the case of text corpora, collecting text data from the Internet mitigates some of these problems. The Internet is a source which seems to be almost unlimited in size, and it also offers a wide variety of resource types: social media such as Facebook or Twitter offer short, colloquial texts, while blogs and news portals might offer more formal and longer texts. Also, a lot of literature has already been converted to digital form and is freely available on the Internet. It has been shown that corpora built on Internet resources bring promising results [Scannell, 2007, Kilgarriff and Grefenstette, 2003]. Just a few examples of works which utilized the Internet in building a language resource or using it for some purpose are an attempt to build a clean bilingual corpus [Resnik, 1999] and building an n-gram model [Ziółko and Skurzok, 2011].

This is, however, not the case in the construction of speech corpora. Although lots of recorded speech can be found on the Internet, it is almost never transcribed, which rules out supervised training or evaluation of a system under development. Even with the abundance of unlabeled recordings, which could be used for the development of unsupervised learning algorithms, their usage (especially commercial) is typically restricted by licensing (as is the case with e.g. YouTube). Another issue with freely available speech recordings is their quality - most people do not own a high quality recording device, and so the resulting recordings tend to feature any combination of the following: distortion, narrow spectrum, high levels of noise (both environmental and originating from the recording device itself), dynamics compression, effects of encoding the signal with a lossy codec such as MP3, and others. While some of these features may be useful to have in a corpus dedicated to some particular application (e.g. low-quality telephonic recordings for a telephonic ASR system), they can be obstacles in the development of other speech-related applications. There are also legal issues that need to be covered when using a recording of somebody's speech, which vary among different countries. Asking the speaker for permission to use his voice might be a severe obstacle in automatic retrieval of speech samples from the Internet.

[1] Price at ELRA's website on 23.02.2018 - http://catalog.elra.info/en-us/repository/browse/polish-speecon-database/b0b2ace2a9de11e7a093ac9e1701ca02e55350cd3b1544809e60565f90b23e49

Therefore, the approaches which are left for the development of speech databases are:

• creation of a new set of annotated recordings, which meet certain requirements established by the corpus designer (e.g. phonetic variety, high number of unique speakers, continuous speech, etc.), and

• adaptation of available collections of recordings, such as recordings from call-centers or public speeches and lectures, which most often involves creating a manual annotation of time boundaries of recorded utterances.

An example of the first kind of corpus for Polish is CORPORA [Grocholewski, 1997] and an example of the second one is LUNA [Marciniak, 2010] - both are described in the next section.

It needs to be mentioned that some researchers found a way to deal with the lack of resources during the construction of phonetic models for the purpose of ASR by means of bootstrapping [Schultz and Waibel, 1997, Le and Besacier, 2009]. The idea is to transfer models created for the phonemes of one language to another, with some kind of transformation or retraining involved. The results of this technique seem promising, and come with a great advantage - it may not require as many resources as the standard training techniques. It still has to be considered that an annotated corpus is needed for the evaluation of the resulting ASR system.

4.1.1 Speech corpora

Speech databases consist of two resources - audio recordings and text transcripts. Both are extremely useful in ASR development, as their combination allows training acoustic models with some kind of expectation-maximization (EM) algorithm. The EM procedure makes it possible to train the acoustic model with just a sentence-level time alignment (i.e. the time markers for the beginning and ending of a sentence, not of individual words). During training, it is converted to a phone-level (or, more precisely, HMM-state-level) alignment by the forced alignment technique - the language model applied in the recognition allows only the recognition of the transcript, and nothing else. The transcripts, especially in the case of larger corpora, are also useful in the development of the language model.

The most well-known speech corpus for Polish is CORPORA, prepared in 1993 by Stefan Grocholewski [Grocholewski, 1997]. It contains 365 utterances (numbers, names and 114 sentences) spoken by 45 speakers. The sentences are semantically incoherent, so that they can provide maximal phonetic diversity. An example of such a sentence is "on myje wróble w zoo", which in English means roughly "he is washing sparrows in a zoo". Recordings of two speakers (a woman and a man) were manually annotated to phonemes and then, with the help of dynamic programming methods, the rest of the speakers were annotated automatically and the annotations were manually corrected.

A relatively large corpus of speech related to legal matters is named Juristic [Demenko et al., 2008]. It contains recordings of about 1000 different speakers from different parts of Poland. Half of the recordings come from the Court, Police and Prosecution, and the other half from universities and offices. Each speaker recorded about half an hour of both spontaneous and read speech. To the best of the author's knowledge, it is not publicly available.

LUNA is a corpus of spontaneous telephonic speech [Marciniak, 2010]. It contains about 11 hours of dialogues, where one person - the caller - asks for information concerning public transport in Warsaw, and the other person - the agent - gives that information. Originally, the corpus had only morphosyntactic and semantic annotation. A word-level time annotation was made separately for the training of the SARMATA ASR system by the Digital Signal Processing group at AGH-UST.

NKJP stands for "Narodowy Korpus Języka Polskiego" - in English NCP, or "National Corpus of Polish" [Przepiórkowski et al., 2008]. It is a relatively large resource of Polish texts, which consists of literature, journalism, letters, Internet texts and others. The public subset of this corpus contains 1 million words with various kinds of annotations (morphosyntactic, semantic, named entities, etc.). It is also a resource of recorded conversations and media speeches, which, unfortunately, are not provided with time alignment.

Recordings of Polish speech are also available as parts of larger, multilingual corpora. GlobalPhone is a corpus consisting of 20 languages, where for each language 100 sentences were read by 100 speakers [Schultz, 2002a]. Although the name of the corpus suggests that it contains telephonic speech, it does not - the idea of the authors was to find a global phone set shared by different languages. The chosen texts focus mostly on national and international politics as well as economic news. The Polish variant of the corpus has about 13 hours of speech in the training set and about 2 hours of speech in the test set.

Another database is SpeechDat-E, which contains speech recorded over fixed telephone networks for five Eastern European languages, including Polish. It contains recordings of 1000 speakers, each reading a number of sentences and isolated words.

There is also a corpus consisting of recordings from the European Parliament, which is claimed by its authors to contain several hundred hours of speech of Polish politicians and interpreters. Unfortunately, this data has only approximate transcriptions or none at all [Lööf et al., 2009].

EASR is a corpus of elderly speech for 4 languages, Polish among them [Hämäläinen et al., 2014]. The authors claim to have 205 hours of read speech from 781 Polish speakers.

The previously mentioned Polish Speecon corpus [Iskra et al., 2002] has 550 adult and 50 child speakers and amounts to over 200 hours of transcribed audio, mixing both spontaneous and read speech.

4.1.2 Text corpora

With numerous people and organizations producing content such as blogs and news portals on the Internet, text resources are seemingly infinite. However, in order to construct a strong language model for a specific application, we need some amount of text representative of the application's domain. Furthermore, many sources of text data require a considerable amount of effort to preprocess them for usage in language modelling, e.g. text normalization, which in the case of strongly inflected languages can be far from straightforward. The resources presented below are not the only ones available in Polish; however, they are the most relevant from the ASR system design point of view.

One of the commonly used resources in the text domain is the NKJP, presented earlier in section 4.1.1. Due to its manual annotations, it is commonly used as the training data for various kinds of NLP models, such as Named Entity Recognizers (NER) or Part-of-Speech (POS) taggers. However, in the domain of language modelling it should be viewed as a very small resource, with only 1 million words in its publicly available subset.

A relatively large Polish text resource is the Polish Sejm Corpus (PSC) [Ogrodniczuk, 2012], which is a collection of transcripts from the Polish parliament since 1993. The corpus is constantly growing in size and had more than 300 million words at the moment of publication. The contents of the corpus are representative of the domain of politics and have been shown to be significantly less linguistically rich than Polish in general [Ogrodniczuk, 2012]. One major obstacle in the application of this corpus to language modelling is the notation conventions used by the transcribers, who heavily use abbreviated forms and denote numerals as either Arabic or Roman numbers. This problem is addressed in chapter 5.

Another large resource, which is also multilingual, is the OpenSubtitles corpus [Tiedemann, 2009, Lison and Tiedemann, 2016], often used in Machine Translation (MT) tasks. It consists of more than 2.6 billion sentences from movie and TV subtitles in 60 different languages, including Polish. The Polish subset in the 2016 release of the corpus contains about 143 million sentences and about 0.9 billion tokens. Many of the subtitles are retrieved from video streams using Optical Character Recognition and applying error correction. The texts require thorough normalization in order to be used in language modelling, including text encoding correction and filtering of mistyped or misrecognized words.
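As an illustration of this kind of filtering, below is a minimal Python sketch (not part of any published pipeline) that drops subtitle lines containing too many tokens absent from a reference word list; the file names and the threshold are hypothetical and would need to be tuned for a real corpus.

import re

# Hypothetical path to a reference word list (e.g. an open-source Polish dictionary).
with open("polish_wordlist.txt", encoding="utf-8") as f:
    lexicon = {line.strip().lower() for line in f if line.strip()}

def keep_sentence(sentence, max_oov_ratio=0.1):
    """Keep a sentence only if the fraction of out-of-vocabulary tokens is small."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return False
    oov = sum(1 for t in tokens if t not in lexicon)
    return oov / len(tokens) <= max_oov_ratio

with open("subtitles_raw.txt", encoding="utf-8") as fin, \
        open("subtitles_clean.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if keep_sentence(line):
            fout.write(line)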

Domain-specific text resources for training language models can sometimes also be obtained through web scraping; however, it has to be kept in mind that they are often not representative of conversational speech. An interesting approach to mitigating this problem has been presented by researchers at the University of Washington2 - they constructed a language model using conversational speech text data and filtered the web-scraped text to retain only the sentences with sufficiently high likelihood given this model. To the best of the author's knowledge, this approach has not been applied to Polish.

2 The procedure is described at https://ssli.ee.washington.edu/tial/projects/ears/WebData/web_data_collection.html and, to the best of the author's knowledge, has not been otherwise published.
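The sketch below illustrates the general idea, assuming a simple unigram model estimated from an in-domain (conversational) text file; the file names and the threshold are hypothetical, and a practical implementation would typically rely on a proper n-gram toolkit rather than this toy model.

import math
import re
from collections import Counter

def tokenize(line):
    return re.findall(r"\w+", line.lower())

# Estimate a unigram model from in-domain (conversational) text.
counts = Counter()
with open("conversational.txt", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        counts.update(tokenize(line))
total = sum(counts.values())
vocab = len(counts)

def log_prob(token):
    # Add-one smoothing so that unseen tokens get a small, non-zero probability.
    return math.log((counts[token] + 1) / (total + vocab))

def score(sentence):
    """Average log-probability per token; higher means closer to the in-domain model."""
    tokens = tokenize(sentence)
    if not tokens:
        return float("-inf")
    return sum(log_prob(t) for t in tokens) / len(tokens)

# Keep only the web-scraped sentences that look sufficiently "conversational".
THRESHOLD = -9.0  # hypothetical value, would be tuned on held-out data
with open("scraped.txt", encoding="utf-8") as fin, \
        open("scraped_filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if score(line) >= THRESHOLD:
            fout.write(line)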


4.2 AGH corpus of Polish speech

During our work on various projects, we managed to collect several sets of recordings which are not comprehensive by themselves, but which, put together in one corpus, should be large enough to satisfy the needs of ASR system training, thereby, hopefully, adding another resource to the pool.

4.2.1 Master Label File Format

The AGH corpus is annotated with the Master Label File (MLF) format. The kind of annotation we use carries information about the beginning and ending times of either a word or a phrase. It may also be used to mark the range of time in which an agent (e.g. a human transcriber or an ASR system) perceives the basic unit of speech - a phone. Files containing the annotation have the .mlf extension and must contain a proper header, which consists of the #!MLF!# symbol in the first line. Multiple annotations may be contained in one MLF, provided that each annotation begins with the path to the annotated wave file and ends with a dot. The basic time unit in MLF is 100 ns, which gives much better precision than is typically needed in any ASR system. The basic time unit in our annotations is 1 ms. The MLF format was originally designed for the HTK software [Young et al., 2005]. Below, an example of annotation in MLF format is shown.

#!MLF!#                                # The header.
"waves/spk_1/1.wav"                    # The path to the annotated file.
                                       # A comment.
0 12490000 heja                        # Annotated word of 1249 ms duration.
12890000 24310000 co_u_ciebie          # Annotated phrase.
24310000 25720000 o
25720000 27260000 l
.                                      # The dot marks the end of the 1.wav annotation.
"waves/spk_1/2.wav"                    # The path to the next file.
#... Rest of the annotation.
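To illustrate how such annotations can be consumed programmatically, the following minimal Python sketch parses an MLF file of the form shown above into per-file lists of (start, end, label) segments, converting the 100 ns units to milliseconds. It is only a sketch written for this simplified example (it ignores the illustrative # comments) and is not part of the HTK toolkit.

def parse_mlf(path):
    """Parse a simple MLF file into {wave_path: [(start_ms, end_ms, label), ...]}."""
    annotations = {}
    segments = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#")[0].strip()    # drop the illustrative comments
            if not line:                        # skips blanks, the #!MLF!# header and comment-only lines
                continue
            if line.startswith('"') and line.endswith('"'):
                segments = []                   # start collecting segments for a new wave file
                annotations[line.strip('"')] = segments
            elif line == ".":
                continue                        # end of the current file's annotation
            else:
                start, end, label = line.split(maxsplit=2)
                # MLF times are expressed in 100 ns units; convert them to milliseconds.
                segments.append((int(start) // 10000, int(end) // 10000, label))
    return annotations

if __name__ == "__main__":
    for wav, segs in parse_mlf("example.mlf").items():   # "example.mlf" is a hypothetical file
        print(wav, segs)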

4.2.2 Contents of the corpus

We present a description of every major part of the AGH corpus. The issues discussed are: how and why each set of recordings was created, what kind of speech the recordings contain, the recording equipment used and the gender ratio of the speakers, as well as a more general description.

The recorded speech is contained in single-channel (mono) WAVE files with a sampling frequency of 16 kHz and 16-bit precision. Annotations were automatically checked for orthographic correctness using OpenSJP, an open source distribution of a Polish dictionary [sjp.pl, 2014], and manually corrected. Some parts (e.g. the students' recordings) were also checked manually. The post-processing of the corpus included the creation of a list of words which are foreign or phonetically ambiguous, the preparation of a transformation dictionary for them, and transforming them in such a way that ORTFON (our software performing automatic phonetic transcription based on an algorithm created by Steffen-Batóg [Steffen-Batóg and Nowakowski, 1993]) is able to deal with these problematic words. An example of such a transformation: let us suppose that we have found an annotated phrase "earl grey", in which both words are foreign to Polish. In order to help ORTFON transcribe it properly, we change it to "erl grej", which is more phonetically compliant with Polish and allows the usage of the same transcription rules as for proper Polish words. The process is reversible thanks to the transformation dictionary containing the rules used for the transformation, which also makes it possible to apply them to new data added to the corpus. It also means that in this case the ASR system will output "erl grej" as the recognized phrase, although we can postprocess the results and convert it automatically back to "earl grey".
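A minimal sketch of how such a reversible transformation dictionary could be applied is shown below; the dictionary contents and function names are purely illustrative and do not reflect the actual ORTFON pre-processing code.

# Illustrative (not actual) transformation dictionary for foreign or
# phonetically ambiguous words, mapping them to Polish-compliant spellings.
TRANSFORM = {
    "earl": "erl",
    "grey": "grej",
}
# Reverse mapping used to restore the original spelling in ASR output;
# this assumes the mapping is unambiguous in both directions.
REVERSE = {v: k for k, v in TRANSFORM.items()}

def to_phonetic_spelling(text):
    """Rewrite problematic words so the phonetic transcriber can handle them."""
    return " ".join(TRANSFORM.get(w, w) for w in text.split())

def restore_original_spelling(text):
    """Post-process recognition output back to the original orthography."""
    return " ".join(REVERSE.get(w, w) for w in text.split())

assert to_phonetic_spelling("earl grey") == "erl grej"
assert restore_original_spelling("erl grej") == "earl grey"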

The prevalent dialect in the corpus is the Lesser Polish dialect, because the majority of speakers come from the south-eastern regions of Poland. However, other dialects are also present. We did not mark information about the descent of each speaker because it is not needed for our current research. Still, we feel that this fact requires some explanation. Dialects in Polish are weaker than e.g. in English and are generally avoided in formal situations. This fact is reflected in the census collected by the Central Statistical Office of Poland in 2011, which indicates that only 2% of Poles use other languages or dialects at home [GUS, 2011]. Furthermore, recent research on the tolerance of dialects in Poland suggests that although Poles declare to be tolerant of regional dialects, they prefer not to hear them and expect others not to use them in formal situations [Hansen, 2014]. Another example of dialect marginalization is that the NCP [Przepiórkowski et al., 2012] does not mention dialect at all. Therefore, we did not consider the tagging of dialects significant during the preparation of our corpus.

[Figure 4.1: Percentage contribution of each subcorpus to our corpus in terms of recording length: Colloquial 39%, Students 26%, TTS 18%, VoIP 11%, Various 6%.]

Our choices of data collection methods, as well as decisions on the statistical profile of the corpus, were mainly dictated by the need for a large number of speakers and a large amount of recordings. We focused primarily on building a large and well annotated training corpus rather than on representing a complete set of dialects, ages or topics. We designed our corpus to be dedicated to ASR training and testing, and therefore provided only the metadata required for those tasks.

Colloquial speech recordings

This set of recordings has been created by 10 speakers (5 males and 5 females), each one reading out about 1000 phrases in the near field of a microphone inside a small room with no audible reverberance, isolated from outside noise. For each speaker, the total recording length is about 1 hour, resulting in a total of 10 hours and 4 minutes. Each phrase is saved in a separate WAVE file and is annotated in an MLF file as a whole (the beginning and ending of each phrase is marked, without distinguishing single words). The recorded utterances are short sentences or fragments of longer sentences, picked from the Internet. They are derived from everyday language and are ensured to be orthographically correct. These recordings were prepared for us by an external company.


Students’ recordings

One of the classes we teach concerns speech-dedicated technologies. In order to pass the course, the students are expected to build a simple, purpose-specific ASR system for their own voice using HTK [Young et al., 2005]. Examples of such projects are systems handling pizza ordering or ticket booking, or providing a voice interface for some application. Two major steps leading towards the creation of such systems are grammar design and preparation of recordings. The prepared grammars are simple and designed for the recognition of one sentence, which contains all the relevant information (such as quantity, size and topping of the ordered pizzas). For each project, about 3 minutes of recordings are prepared. The recorded utterances are sentences compliant with the grammar obtained in the earlier step or enumerations of words from the dictionary. Later on, the recordings are annotated to words by the students, and then converted to phoneme-level annotation by the SARMATA ASR system [Ziółko et al., 2011].

At the moment we have 125 students recorded and we expect this number to grow by around 60 persons each year. The gender distribution is 86 males and 39 females, giving a ratio of roughly 2:1. Also, the vast majority of speakers are in the age group of 20-25. The duration of this subcorpus is 6 hours and 33 minutes. The equipment used to prepare these recordings (and thus their quality) varies: some students use cheap PC microphones and cell phones, but some have professional recording equipment at their disposal, such as dictaphones or high-end microphones with suitable audio interfaces. This allows for testing how dependent the ASR system is on the recording device.

TTS training corpus

During our efforts to develop a Text-To-Speech (TTS) synthesizer, a large set of recordings was prepared in order to be used by the system. It consists of 2132 sentences uttered by a young woman, who is a trained speaker. The text comes from the 1-million-word subcorpus of the NCP corpus (see section 4.1) and was designed to be both phonetically rich and balanced. This means that the distribution of IPA phoneme and diphone occurrence frequencies is as close to the Polish language as possible. Accomplishing this condition is important for training and testing of both
