Replies: 7 comments 1 reply
-
The LibriSpeech KenLM should be built on the concatenation of the entire LibriSpeech LM text corpus (after normalization to lower case) plus the 3 train transcripts. Overall it should be close to 41 M lines of text rather than the 300k lines of text here. Secondly, this sentence is confusing - "However I got higher WER using LM i.e. approx 16.4% as compared to 3.4% WER using only CitriNet without LM on my audios." Is the 3.4% with Citrinet without LM or Citrinet with LM? Did you get the 16.4% with an LM on a different dataset, or on the same dataset where you got 3.4%? If it's your own audio clip, then Citrinet might simply not be good with greedy inference either. Check whether the greedy WER is close to (or higher than) the 16% you see with the LM.
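A rough sketch of that corpus preparation is below. The file names, manifest paths and the train_kenlm.py flags shown are assumptions - adjust them for your setup and NeMo version:

```python
# Sketch: concatenate the full LibriSpeech LM corpus with the three train
# transcripts, lowercased, into one text file for KenLM training.
# Paths and manifest names are placeholders.
import gzip
import json

out_path = "librispeech_lm_plus_train.txt"

with open(out_path, "w", encoding="utf-8") as out:
    # 1) Full LibriSpeech LM corpus (~40M lines), lowercased.
    #    librispeech-lm-norm.txt.gz is available from openslr.org/11.
    with gzip.open("librispeech-lm-norm.txt.gz", "rt", encoding="utf-8") as f:
        for line in f:
            out.write(line.strip().lower() + "\n")

    # 2) Transcripts from the three training manifests (NeMo-style JSONL,
    #    one {"audio_filepath": ..., "text": ...} object per line).
    for manifest in ["train_clean_100.json", "train_clean_360.json", "train_other_500.json"]:
        with open(manifest, encoding="utf-8") as f:
            for line in f:
                out.write(json.loads(line)["text"].strip().lower() + "\n")

# The combined file is then passed to NeMo's train_kenlm.py, e.g. something like:
#   python train_kenlm.py --nemo_model_file citrinet.nemo \
#       --train_file librispeech_lm_plus_train.txt --ngram_length 6 ...
# (exact flag names depend on the NeMo version; check train_kenlm.py --help)
```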
-
Thank you for your response. I got 3.4% using Citrinet without LM, and 16.4% using Citrinet with LM. Both of the above WERs are on the same input data. I used only 1-2 minutes of audio as input.
-
The Citrinet checkpoint available in NeMo is trained on roughly 7000 hours of speech - far more than just LibriSpeech. So a LibriSpeech LM (especially one without the actual LibriSpeech text corpus) will only go so far in improving or hurting WER. As an experiment, could you try what the result is with alpha and beta set to 0?
-
Okay, I'll try to train the LM on the larger corpus. When I set alpha=0 and beta=0, the WER is 3.50%.
-
Ok, so the fact that alpha 0 / beta 0 gets close to the greedy score shows that your steps are correct and the LM will help - it just needs a proper dataset plus a hyperparameter search to get good scores. You can tune your LM on the LibriSpeech dev set.
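One way to run that search is a simple grid over alpha and beta, picking the pair with the lowest dev WER. This is only a skeleton: `evaluate_wer` below is a placeholder (not a real NeMo function) that you would back with a run of eval_beamsearch_ngram.py or your own decoding call, and the value ranges are just a starting point. Note that (0.0, 0.0) is included as the sanity check discussed above.

```python
# Skeleton of an alpha/beta grid search on the LibriSpeech dev set.
# Fill in evaluate_wer() before running -- e.g. wrap a call to
# NeMo's eval_beamsearch_ngram.py and parse the reported WER.
import itertools

alphas = [0.0, 0.5, 1.0, 1.5, 2.0]   # LM weight
betas  = [0.0, 0.5, 1.0, 1.5]        # word insertion bonus

def evaluate_wer(alpha: float, beta: float) -> float:
    """Run beam search with the given alpha/beta on dev_clean and return the WER (as a fraction)."""
    raise NotImplementedError("wrap your eval_beamsearch_ngram.py run here")

best = min(
    ((a, b, evaluate_wer(a, b)) for a, b in itertools.product(alphas, betas)),
    key=lambda x: x[2],
)
print(f"best alpha={best[0]}, beta={best[1]}, dev WER={best[2]:.2%}")
```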
-
Ok, understood. Thank you for your time.
-
Due to lack of resources I couldn't train the LM on the full LibriSpeech LM corpus (~40 million text lines). So this time I trained the LM on 13 M text lines (out of the 40 million), and I got 2.35% WER on LibriSpeech's dev_clean and 5.75% WER on dev_other. WER has improved on my own audio dataset, but again some words are missing from some transcripts, or only half a word is produced (e.g. "ph" instead of "phone"), and sometimes merged words appear (e.g. "asper", which should be the two separate words "as" and "per"). I observed that the missing words are low-frequency words, so their probability is low and the LM seems to be discarding them. What could be the reason? Is it because the LM was trained on only 13 M text lines and I need to train on the full corpus?
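One quick way to check the low-frequency hypothesis is to count how often the dropped words actually occur in the LM training text. A plain-Python sketch (the corpus path and word list below are placeholders):

```python
# Count how often the dropped words occur in the LM training text.
from collections import Counter

dropped_words = ["phone", "as", "per"]  # placeholder: words missing or mangled in the predictions

counts = Counter()
with open("lm_training_text.txt", encoding="utf-8") as f:  # placeholder path to the 13M-line corpus
    for line in f:
        counts.update(line.lower().split())

for word in dropped_words:
    print(f"{word!r}: {counts[word]} occurrences")
```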
-
Hi,
I trained a language model with train_kenlm.py (6-gram, on 286,967 lines of text collected from LibriSpeech's train_clean_100, train_clean_360 and train_other_500 transcripts plus some other standard English text).
Results of counting:
```
=== 1/5 Counting and sorting n-grams ===
Reading /home/lang_model/model.tmp.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 16718614 types 937
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11244 2:822115520 3:1541466624 4:2466346752 5:3596755712 6:4932693504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 937 D1=0.5 D2=1 D3+=1.5
2 287721 D1=0.506285 D2=1.04345 D3+=1.59677
3 3215127 D1=0.697273 D2=1.12271 D3+=1.48872
4 8173916 D1=0.82992 D2=1.20302 D3+=1.43574
5 11728534 D1=0.908197 D2=1.24826 D3+=1.3993
6 13666388 D1=0.918745 D2=1.31875 D3+=1.46753
Memory estimate for binary LM:
type MB
probing 770 assuming -p 1.5
probing 904 assuming -r models -p 1.5
trie 337 without quantization
trie 168 assuming -q 8 -b 8 quantization
trie 291 assuming -a 22 array pointer compression
trie 122 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11244 2:4603536 3:64302540 4:196173984 5:328398952 6:437324416
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11244 2:4603536 3:64302540 4:196173984 5:328398952 6:437324416
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Name:lmplz VmPeak:13261380 kB VmRSS:11264 kB RSSMax:2523692 kB user:25.2489 sys:5.2742 CPU:30.5231 real:19.9458
[NeMo I 2021-05-07 01:48:58 train_kenlm:135] Running binary_build command
/home/lang_model/kenlm/build/bin/lmplz -o 6 --text /home/lang_model/model.tmp.txt --arpa /home/lang_model/model.tmp.arpa --discount_fallback
Reading /home/lang_model/model.tmp.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
SUCCESS
```
The language model trained without any error, and in evaluation using eval_beamsearch_ngram.py I got a better WER, i.e. 2.61% on dev_clean and 6.27% on dev_other, with alpha=1.0 and beta=1.0.
However I got higher WER using LM i.e. approx 16.4% as compared to 3.4% WER using only CitriNet without LM on my audios.
When I looked at the predicted output I found that some consecutive words are missing, hence the WER is high.
One of the examples is given below:
```
Ground truth: " hello welcome how can i help you today eh what kind of products you guys make sorry i couldn't this can please help me with more i recently here purchase reflects three dot o from one of the stores and this product is not behaving"
Predicted output: "hello welcome how can i help you today what kind of guys make sorry i recently purchase reflects three doo from one of the stores and this product is not behaving"
```
In the above example you can see that several words from the ground truth are missing in the predicted output; this is the output with beam_alpha=1.0 and beam_beta=1.0.
And there is a lot of variation in which words go missing just by changing the alpha and beta values.
Why are the words missing in the prediction, and how do I decide the right alpha and beta values?
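For reference, a quick way to see exactly which words get dropped for a given alpha/beta (and a rough WER for this single pair) is a word-level diff with Python's standard library. This is just an illustrative sketch, not part of NeMo; paste the full ground truth and prediction strings from above in place of the shortened ones:

```python
# Word-level diff of the ground truth vs. the predicted output above,
# plus an approximate WER derived from the edit operations (standard library only).
from difflib import SequenceMatcher

ground_truth = "hello welcome how can i help you today eh what kind of products you guys make".split()  # paste full sentence from above
prediction   = "hello welcome how can i help you today what kind of guys make".split()                  # paste full sentence from above

sm = SequenceMatcher(a=ground_truth, b=prediction)
errors = 0
for op, a0, a1, b0, b1 in sm.get_opcodes():
    if op == "delete":
        print("missing:", " ".join(ground_truth[a0:a1]))
        errors += a1 - a0
    elif op == "replace":
        print("substituted:", " ".join(ground_truth[a0:a1]), "->", " ".join(prediction[b0:b1]))
        errors += max(a1 - a0, b1 - b0)
    elif op == "insert":
        print("inserted:", " ".join(prediction[b0:b1]))
        errors += b1 - b0

print(f"approx WER: {errors / len(ground_truth):.2%}")
```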
@EmreOzkose did you face the same?