SOCCEsRs: Open Challenge for Correcting Errors of Speech Recognition Systems
2019-04-09
Outline
1 Task description
2 Dataset
3 Evaluation
Task description
"Develop a method that improves the result of speech recognition process on the basis of the (erroneous) output of the ASR system and the correct human-made transcription"
Task description
[Diagram: Speech corpus → Recording → ASR → ASR response → Error correction system → Corrected ASR response; Transcription → Reference]
Rationale
Why is such a transformation needed? It can be used in an ASR system as a post-processing stage for:
- re-scoring of the hypotheses produced by the ASR using information not present at earlier stages of processing (e.g. in a uni-directional LM)
- adaptation of a black-box ASR system to a specific domain
Dataset
- 9000 sentences from Polish Wikinews
- Recorded in a studio
- Transcribed by human annotators
- Recognized by a state-of-the-art ASR system
- 8000:1000 train/test split
Dataset normalization
- all words are UPPERCASED
- punctuation marks are removed
- numbers and special characters are replaced by their spoken forms
Example lines
Example line from the TSV file containing the training dataset:
Evaluation
- average relative WER change
- relative SER change
- CharMatch
Average relative WER change
I relative difference between:
I Word Error Rate of ASR hypothesis averaged over all test sentences
I Word Error Rate of hypothesis corrected by the proposed system averaged over all test sentences
Word Error Rate
WER = (S + D + I) / N,    N = H + S + D

where S = number of substitutions, D = number of deletions, I = number of insertions, H = number of hits, and N = length of the reference sentence. See e.g. [?] for an in-depth explanation.
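The WER formula above can be sketched as a standard dynamic-programming edit distance over words. This is a minimal illustration assuming whitespace tokenization; `wer` is a hypothetical helper, not part of the challenge toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimal edit cost between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # hit or substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(round(wer("ALA MA KOTA", "ALA MA TOTA"), 3))  # 0.333 (1 substitution / 3 words)
```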
Relative WER change
WER_r∆ = 1 − (1/|C|) · Σ_{i=1..|C|} (WER_i / WER⁰_i)

where |C| = number of sentences in the corpus, WER_i = WER of the i-th sentence in the corpus after processing by the system, and WER⁰_i = WER of the i-th sentence in the corpus as returned by the ASR.

For example: if the average WER⁰ of the raw ASR system is 8 and, after processing the sentences through the error correction system, the average WER is 6, then WER_r∆ = 25%.
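The averaging of per-sentence WER ratios can be sketched as below. This is a minimal illustration assuming every WER⁰_i is non-zero; `relative_wer_change` is a hypothetical helper, not the official scorer.

```python
def relative_wer_change(wer_corrected, wer_asr):
    """WER_r-delta = 1 - mean over sentences of WER_i / WER0_i.

    Assumes all raw-ASR WER0_i values are > 0 (division by zero otherwise).
    """
    ratios = [c / a for c, a in zip(wer_corrected, wer_asr)]
    return 1.0 - sum(ratios) / len(ratios)

# Per-sentence WERs of the corrected output vs. the raw ASR output
print(relative_wer_change([0.06, 0.06], [0.08, 0.08]))  # 0.25, i.e. 25%
```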
Relative SER change
- relative difference between the Sentence Error Rate (SER) of the ASR hypotheses and that of the hypotheses corrected by the proposed system
- SER: the ratio of the number of sentences with WER > 0 (incorrectly recognized sentences) to the number of all sentences in the corpus
- relative SER reduction is defined similarly to WER_r∆:

SER_r∆ = 1 − SER / SER⁰
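The SER-based metric can be sketched from per-sentence WER values as below, taking SER as the fraction of sentences containing at least one word error. These helpers (`ser`, `relative_ser_change`) are hypothetical illustrations, not the official scorer.

```python
def ser(sentence_wers) -> float:
    """Sentence Error Rate: fraction of sentences with WER > 0."""
    return sum(1 for w in sentence_wers if w > 0) / len(sentence_wers)

def relative_ser_change(wers_corrected, wers_asr) -> float:
    """SER_r-delta = 1 - SER / SER0 (assumes SER0 > 0)."""
    return 1.0 - ser(wers_corrected) / ser(wers_asr)

# Correction system fixed one of three erroneous sentences
print(relative_ser_change([0.0, 0.1, 0.0, 0.2], [0.1, 0.1, 0.0, 0.2]))
```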
CharMatch
Introduced in [?].

F_0.5-measure, defined as follows:

F_0.5 = (1 + 0.5²) · P · R / (0.5² · P + R)

where P is precision and R is recall:

P = Σ_i T_i / Σ_i d_L(h_i, s_i),    R = Σ_i T_i / Σ_i