
(1)

Recent Advances in Error Correction of ASR

2019-04-09

Tomasz Ziętkiewicz

(2)

Outline

1 Introduction

2 Dataset

3 Method

4 Evaluation

Tomasz Ziętkiewicz – Recent Advances in Error Correction of ASR (2/19)

(3)

[GSW19]

IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-17, 2019

(4)

Motivation

• Popularity of end-to-end ASR models
• Acoustic, pronunciation and language models combined in one neural network
• Problem: requires annotated audio data
• The LM is trained on a small dataset compared with the "traditional" approach
• Worse performance on rare words


(9)

Possible solutions

• Incorporating an external LM trained on text-only data
• Rescoring the n-best decoded hypotheses from the end-to-end ASR:

  y* = argmax_y [ log P(y|x) + λ log P_LM(y) ]

• Incorporating an RNN-LM into the first-pass beam search by shallow, cold or deep fusion
• Using TTS to generate audio-text training pairs from text-only data
• Rare words and proper nouns are still problematic with this approach
• Why? Hypothesis: the LM is trained with an objective other than correcting the e2e model's errors
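The rescoring rule above can be sketched in a few lines of Python. The toy unigram LM and the λ = 0.5 weight are illustrative assumptions, not values from the paper:

```python
import math

def lm_logprob(sentence, unigram):
    # Stand-in for a real LM: sum of word log-probabilities under a toy
    # unigram model, with a small floor for unseen words.
    return sum(math.log(unigram.get(w, 1e-6)) for w in sentence.split())

def rescore(nbest, unigram, lam=0.5):
    """Pick argmax_y [ log P(y|x) + lam * log P_LM(y) ] from an n-best list
    of (hypothesis, asr_logprob) pairs."""
    return max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0], unigram))

# The LM prefers "cat" over the acoustically similar "cap", overriding
# the slightly better ASR score of the wrong hypothesis.
unigram = {"the": 0.07, "cat": 0.01, "sat": 0.005, "cap": 0.0001}
nbest = [("the cap sat", -1.2), ("the cat sat", -1.3)]
best = rescore(nbest, unigram)
```

This is exactly the failure mode the slide describes: the external LM can only re-rank hypotheses the ASR already produced.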


(15)

Solution

Proposed solution: train a spelling corrector model on text-to-text (hypothesis-to-reference) pairs.

• Identify likely errors in ASR output
• Propose alternatives
• Combine with LM rescoring


(18)

ASR train/eval dataset

• LibriSpeech [PCPK15]
• Large-scale (1000 hours) corpus of read English speech
• Audiobooks from the LibriVox project
• Carefully segmented and aligned
• http://www.openslr.org/12/
• License: CC BY 4.0


(24)

Text-only dataset

• Spelling correction needs a parallel corpus: ASR hypotheses + ground-truth text
• 800M-word LibriSpeech language modeling corpus
• Selected 40M sentences not overlapping with the test set
• Generated audio using TTS (WaveNet [vdOLB+17])
• Added noise and reverberation to get an additional 40M utterances
• Decoded them using a pretrained ASR model
• From each TTS utterance the ASR produces 8 hypotheses
• All of them are used to form hypothesis-reference pairs: 640M pairs in total
• Also added to the ASR training set
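The data-generation pipeline above can be sketched as follows. Here `tts_synthesize`, `add_noise` and `asr_decode_nbest` are hypothetical stand-ins for the real WaveNet TTS, the augmentation step, and the pretrained ASR decoder:

```python
def make_pairs(sentences, tts_synthesize, add_noise, asr_decode_nbest, n=8):
    """Build (ASR hypothesis, reference) training pairs from text-only data:
    synthesize audio, make a noisy copy, decode n hypotheses per utterance."""
    pairs = []
    for ref in sentences:
        clean = tts_synthesize(ref)                 # text -> audio
        for audio in (clean, add_noise(clean)):     # clean + augmented version
            for hyp in asr_decode_nbest(audio, n):  # n hypotheses per utterance
                pairs.append((hyp, ref))
    return pairs

# Toy stubs: 1 sentence x 2 audio versions x 8 hypotheses = 16 pairs
# (40M sentences would give 640M pairs, as on the slide).
pairs = make_pairs(["hello world"],
                   tts_synthesize=lambda s: s,
                   add_noise=lambda a: a + "#noise",
                   asr_decode_nbest=lambda a, n: [a.split("#")[0]] * n)
```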


(33)

Baseline ASR model

• LAS – Listen, Attend and Spell [CJLV16]
• Encoder-decoder with attention
• Encoder: 2 convolutional layers, 3 bidirectional LSTM layers
• Decoder: single unidirectional LSTM layer

(34)

Spelling correction model

• Attention-based encoder-decoder sequence-to-sequence model
• Similar to the Neural Machine Translation model from [CFB+18]
• Encoder: 3 bidirectional LSTM layers
• Decoder: 3 unidirectional LSTM layers


(35)

Architecture

(36)

Language model

• 2 unidirectional LSTM layers
• Used to rescore the n-best list generated by the ASR


(37)

Inference

• The ASR produces an N-best list of hypotheses with log-probability scores p_i
• The spelling correction model produces an M-best list for each ASR hypothesis, with scores q_ij
• LM rescoring of each of the M × N hypotheses yields a score r_ij
• Most likely hypothesis:

  A* = argmax_{i,j} [ λ_LAS · p_i + λ_SC · q_ij + λ_LM · r_ij ]
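A minimal sketch of this score combination over the M × N candidate set; the λ weights and the toy scoring callables are illustrative assumptions:

```python
def pick_best(asr_nbest, correct_mbest, lm_score,
              lam_las=1.0, lam_sc=1.0, lam_lm=0.3):
    """Return argmax over all M x N hypotheses of
    lam_las * p_i + lam_sc * q_ij + lam_lm * r_ij."""
    best, best_score = None, float("-inf")
    for hyp_i, p_i in asr_nbest:                   # N ASR hypotheses
        for hyp_ij, q_ij in correct_mbest(hyp_i):  # M corrections of each
            s = (lam_las * p_i + lam_sc * q_ij
                 + lam_lm * lm_score(hyp_ij))      # r_ij from the LM
            if s > best_score:
                best, best_score = hyp_ij, s
    return best

# Toy stubs: the corrector proposes the hypothesis itself and an
# "upper-cased" variant; the LM strongly prefers the variant.
nbest = [("a", -1.0), ("b", -2.0)]
best = pick_best(nbest,
                 correct_mbest=lambda h: [(h, -0.5), (h.upper(), -0.1)],
                 lm_score=lambda h: 0.0 if h.isupper() else -1.0)
```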

(38)

Results

[Results tables and plots not reproduced in this transcript]
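The results are reported as word error rate (WER), the standard ASR metric. A minimal sketch of its computation, as word-level Levenshtein distance (substitutions, insertions, deletions) divided by reference length:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))          # DP row for an empty reference
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i             # prev holds d[i-1][j-1]
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution / match
            prev = cur
    return d[len(h)] / len(r)
```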

(41)

Thank you

Thank you for your attention!

(42)

References I

[CFB+18] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes, The best of both worlds: Combining recent advances in neural machine translation, CoRR abs/1804.09849 (2018).

[CJLV16] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, ICASSP, 2016.

[GSW19] Jinxi Guo, Tara N. Sainath, and Ron J. Weiss, A spelling correction model for end-to-end speech recognition, arXiv preprint arXiv:1902.07178 (2019).


(43)

References II

[PCPK15] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5206–5210.

[vdOLB+17] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis, Parallel WaveNet: Fast high-fidelity speech synthesis, CoRR abs/1711.10433 (2017).
