Recent Advances in Error Correction of ASR
2019-04-09
Tomasz Ziętkiewicz
Outline
1 Introduction
2 Dataset
3 Method
4 Evaluation
Tomasz Ziętkiewicz – Recent Advances in Error Correction of ASR (2/19)
[GSW19]
IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-17, 2019
Motivation
- Popularity of end-to-end ASR models
- Acoustic, pronunciation, and language models combined in one neural network
- Problem: needs annotated audio data
- The LM is trained on a small dataset compared with the "traditional" approach
- Worse performance on rare words
Possible solutions
- Incorporating an external LM trained on text-only data
- Rescoring the n-best hypotheses decoded from the end-to-end ASR:
  y* = argmax_y log P(y|x) + λ log P_LM(y)
- Incorporating an RNN-LM into the first-pass beam search via shallow, cold, or deep fusion
- Using TTS to generate audio-text training pairs from text-only data
- Rare words and proper nouns remain problematic with these approaches
- Why? Hypothesis: the LM is trained with a different objective than correcting the e2e model's errors
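The rescoring step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the `lm_score` callable and the λ value are hypothetical placeholders for an external LM and a tuned interpolation weight.

```python
def rescore_nbest(hypotheses, lm_score, lam=0.5):
    """Return y* = argmax_y log P(y|x) + lam * log P_LM(y).

    hypotheses: list of (text, asr_log_prob) pairs from the e2e decoder.
    lm_score:   function mapping text -> log P_LM(text) from the external LM.
    lam:        interpolation weight (hypothetical default; tuned in practice).
    """
    return max(hypotheses, key=lambda pair: pair[1] + lam * lm_score(pair[0]))

# Toy example: the external LM strongly prefers "the cat sat" over "the kat sat",
# overturning the slightly better acoustic score of the misrecognized hypothesis.
toy_lm = {"the kat sat": -6.0, "the cat sat": -1.0}
nbest = [("the kat sat", -1.2), ("the cat sat", -1.5)]
best_text, _ = rescore_nbest(nbest, toy_lm.get, lam=0.5)
```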
Solution
Proposed solution: a spelling correction model trained on text-to-text (hypothesis-to-reference) pairs.
- Identify likely errors in the ASR output
- Propose alternatives
- Combine with LM rescoring
ASR train/eval dataset
- LibriSpeech [PCPK15]
- Large-scale (1000-hour) corpus of read English speech
- Audiobooks from the LibriVox project
- Carefully segmented and aligned
- http://www.openslr.org/12/
- License: CC BY 4.0
Text-only dataset
- Spelling correction needs a parallel corpus: ASR hypotheses + ground-truth text
- 800M-word LibriSpeech language-modeling corpus
- Selected 40M sentences not overlapping with the test set
- Generated audio using TTS (WaveNet [vdOLB+17])
- Added noise and reverberation to obtain an additional 40M utterances
- Decoded them using a pretrained ASR model
- From each TTS utterance the ASR produces 8 hypotheses
- All of them are used to form 640M hypothesis-reference pairs
- Also added to the ASR training set
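A minimal sketch of this data-generation pipeline, assuming hypothetical stand-in callables: `tts` for the WaveNet synthesizer, `corrupt` for the noise/reverberation step, and `decode_nbest` for the pretrained ASR decoder.

```python
def build_parallel_corpus(sentences, tts, corrupt, decode_nbest, n_best=8):
    """Pair every ASR hypothesis with its reference transcript.

    For each text-only sentence: synthesize audio, also make a corrupted
    (noise + reverberation) copy, decode both with the pretrained ASR,
    and keep all n-best hypotheses as (hypothesis, reference) pairs.
    """
    pairs = []
    for ref in sentences:
        clean = tts(ref)
        for audio in (clean, corrupt(clean)):
            for hyp in decode_nbest(audio, n_best):
                pairs.append((hyp, ref))
    return pairs
```

With 40M sentences, two audio versions each, and 8 hypotheses per utterance, this structure accounts for the 640M pairs reported above.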
Baseline ASR model
- LAS: Listen, Attend and Spell [CJLV16]
- Encoder-decoder with attention
- Encoder: 2 convolutional layers, 3 bidirectional LSTM layers
- Decoder: a single unidirectional LSTM layer
Spelling correction model
- Attention-based encoder-decoder sequence-to-sequence model
- Similar to the Neural Machine Translation model from [CFB+18]
- Encoder: 3 bidirectional LSTM layers
- Decoder: 3 unidirectional LSTM layers
Architecture
Language model
- 2 unidirectional LSTM layers
- Used to rescore the n-best list generated by the ASR
Inference
- The ASR produces an N-best list of hypotheses with log-probability scores p_i
- The spelling correction model produces an M-best list for each ASR hypothesis, with scores q_ij
- LM rescoring of each of the M × N hypotheses gives a score r_ij
- Most likely hypothesis:
  A* = argmax_A (λ_LAS · p_i + λ_SC · q_ij + λ_LM · r_ij)
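The three-way score combination can be sketched as follows. This is an illustrative implementation, not the paper's code: the helper callables are hypothetical, and the default λ weights are placeholders for values tuned in practice.

```python
def pick_best(asr_nbest, correct_mbest, lm_score,
              lam_las=1.0, lam_sc=1.0, lam_lm=1.0):
    """A* = argmax over the M x N candidates of
    lam_las * p_i + lam_sc * q_ij + lam_lm * r_ij.

    asr_nbest:     list of (text_i, p_i) pairs from the ASR (N entries).
    correct_mbest: function text_i -> list of (text_ij, q_ij) corrections (M entries).
    lm_score:      function text_ij -> LM log probability r_ij.
    """
    best, best_score = None, float("-inf")
    for text_i, p_i in asr_nbest:
        for text_ij, q_ij in correct_mbest(text_i):
            score = lam_las * p_i + lam_sc * q_ij + lam_lm * lm_score(text_ij)
            if score > best_score:
                best, best_score = text_ij, score
    return best
```

Note that p_i is shared by all M corrections of hypothesis i, so the ASR score acts as a prior over which candidates the corrector's outputs compete within.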
Results
Thank you
Thank you for your attention!
References I
[CFB+18] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes, The best of both worlds: Combining recent advances in neural machine translation, CoRR abs/1804.09849 (2018).
[CJLV16] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, ICASSP, 2016.
[GSW19] Jinxi Guo, Tara N. Sainath, and Ron J. Weiss, A spelling correction model for end-to-end speech recognition, arXiv preprint arXiv:1902.07178 (2019).