Recent Advances in Error Correction of ASR
2019-04-09
Tomasz Ziętkiewicz
Outline
1 Introduction
2 Dataset
3 Method
4 Evaluation
Tomasz Ziętkiewicz – Recent Advances in Error Correction of ASR (2/19)
[GSW19]
IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-17, 2019
Motivation
- Popularity of end-to-end ASR models
- Acoustic, pronunciation, and language models combined in one neural network
- Problem: needs annotated audio data
- The LM is trained on a small dataset compared with the "traditional" approach
- Worse performance on rare words
Possible solutions
- Incorporating an external LM trained on text-only data
- Rescoring the n-best hypotheses decoded from the end-to-end ASR:
  y* = argmax_y log P(y|x) + λ log P_LM(y)
- Incorporating an RNN-LM into the first-pass beam search via shallow, cold, or deep fusion
- Using TTS to generate audio-text training pairs from text-only data
- Rare words and proper nouns remain problematic with these approaches
- Why? Hypothesis: the LM is trained with a different objective than correcting the e2e model's errors
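The rescoring step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the `lm_score` callable and the λ value are hypothetical placeholders for an external LM and a tuned interpolation weight.

```python
def rescore_nbest(hypotheses, lm_score, lam=0.5):
    """Return y* = argmax_y log P(y|x) + lam * log P_LM(y).

    hypotheses: list of (text, asr_log_prob) pairs from the e2e decoder.
    lm_score:   function mapping text -> log P_LM(text) from the external LM.
    lam:        interpolation weight (hypothetical default; tuned in practice).
    """
    return max(hypotheses, key=lambda pair: pair[1] + lam * lm_score(pair[0]))

# Toy example: the external LM strongly prefers "the cat sat" over "the kat sat",
# overturning the slightly better acoustic score of the misrecognized hypothesis.
toy_lm = {"the kat sat": -6.0, "the cat sat": -1.0}
nbest = [("the kat sat", -1.2), ("the cat sat", -1.5)]
best_text, _ = rescore_nbest(nbest, toy_lm.get, lam=0.5)
```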
Solution
Proposed solution: a spelling correction model trained on text-to-text (hypothesis-to-reference) pairs.
- Identify likely errors in the ASR output
- Propose alternatives
- Combine with LM rescoring
ASR train/eval dataset
- LibriSpeech [PCPK15]
- Large-scale (1000-hour) corpus of read English speech
- Audiobooks from the LibriVox project
- Carefully segmented and aligned
- http://www.openslr.org/12/
- License: CC BY 4.0
Text-only dataset
- Spelling correction needs a parallel corpus: ASR hypotheses + ground-truth text
- 800M-word LibriSpeech language-modeling corpus
- Selected 40M sentences not overlapping with the test set
- Generated audio using TTS (WaveNet [vdOLB+17])
- Added noise and reverberation to obtain an additional 40M utterances
- Decoded them using a pretrained ASR model
- From each TTS utterance the ASR produces 8 hypotheses
- All of them are used to form 640M hypothesis-reference pairs
- Also added to the ASR training set
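A minimal sketch of this data-generation pipeline, assuming hypothetical stand-in callables: `tts` for the WaveNet synthesizer, `corrupt` for the noise/reverberation step, and `decode_nbest` for the pretrained ASR decoder.

```python
def build_parallel_corpus(sentences, tts, corrupt, decode_nbest, n_best=8):
    """Pair every ASR hypothesis with its reference transcript.

    For each text-only sentence: synthesize audio, also make a corrupted
    (noise + reverberation) copy, decode both with the pretrained ASR,
    and keep all n-best hypotheses as (hypothesis, reference) pairs.
    """
    pairs = []
    for ref in sentences:
        clean = tts(ref)
        for audio in (clean, corrupt(clean)):
            for hyp in decode_nbest(audio, n_best):
                pairs.append((hyp, ref))
    return pairs
```

With 40M sentences, two audio versions each, and 8 hypotheses per utterance, this structure accounts for the 640M pairs reported above.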
Baseline ASR model
- LAS: Listen, Attend and Spell [CJLV16]
- Encoder-decoder with attention
- Encoder: 2 convolutional layers, 3 bidirectional LSTM layers
- Decoder: a single unidirectional LSTM layer
Spelling correction model
- Attention-based encoder-decoder sequence-to-sequence model
- Similar to the Neural Machine Translation model from [CFB+18]
- Encoder: 3 bidirectional LSTM layers
- Decoder: 3 unidirectional LSTM layers
Architecture
Language model
- 2 unidirectional LSTM layers
- Used to rescore the n-best list generated by the ASR
Inference
- The ASR produces an N-best list of hypotheses with log-probability scores p_i
- The spelling correction model produces an M-best list for each ASR hypothesis, with scores q_ij
- LM rescoring of each of the M × N hypotheses gives a score r_ij
- Most likely hypothesis:
  A* = argmax_A (λ_LAS · p_i + λ_SC · q_ij + λ_LM · r_ij)
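The three-way score combination can be sketched as follows. This is an illustrative implementation, not the paper's code: the helper callables are hypothetical, and the default λ weights are placeholders for values tuned in practice.

```python
def pick_best(asr_nbest, correct_mbest, lm_score,
              lam_las=1.0, lam_sc=1.0, lam_lm=1.0):
    """A* = argmax over the M x N candidates of
    lam_las * p_i + lam_sc * q_ij + lam_lm * r_ij.

    asr_nbest:     list of (text_i, p_i) pairs from the ASR (N entries).
    correct_mbest: function text_i -> list of (text_ij, q_ij) corrections (M entries).
    lm_score:      function text_ij -> LM log probability r_ij.
    """
    best, best_score = None, float("-inf")
    for text_i, p_i in asr_nbest:
        for text_ij, q_ij in correct_mbest(text_i):
            score = lam_las * p_i + lam_sc * q_ij + lam_lm * lm_score(text_ij)
            if score > best_score:
                best, best_score = text_ij, score
    return best
```

Note that p_i is shared by all M corrections of hypothesis i, so the ASR score acts as a prior over which candidates the corrector's outputs compete within.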
Results
Thank you
Thank you for your attention!
References I
[CFB+18] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes, The best of both worlds: Combining recent advances in neural machine translation, CoRR abs/1804.09849 (2018).
[CJLV16] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, ICASSP, 2016.
[GSW19] Jinxi Guo, Tara N. Sainath, and Ron J. Weiss, A spelling correction model for end-to-end speech recognition, arXiv preprint arXiv:1902.07178 (2019).