Replies: 7 comments 1 reply
-
The LibriSpeech KenLM should be built on the concatenation of the entire LibriSpeech LM text corpus (after normalization to lower case) plus the 3 train transcripts. Overall it should be close to 41 M lines of text rather than the 300k lines of text here. Secondly, this sentence is confusing - "However I got higher WER using LM i.e. approx 16.4% as compared to 3.4% WER using only CitriNet without LM on my audios." Is the 3.4% with Citrinet without LM or Citrinet with LM? Did you get the 16.4% with an LM on a different dataset, or on the same dataset where you got 3.4%? If it's your own audio clip, then Citrinet might simply not be good with greedy inference either. Check whether the greedy WER is close to (or higher than) the 16% you see with the LM.
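A rough sketch of that corpus preparation is below. The file names, manifest paths and the train_kenlm.py flags shown are assumptions - adjust them for your setup and NeMo version:

```python
# Sketch: concatenate the full LibriSpeech LM corpus with the three train
# transcripts, lowercased, into one text file for KenLM training.
# Paths and manifest names are placeholders.
import gzip
import json

out_path = "librispeech_lm_plus_train.txt"

with open(out_path, "w", encoding="utf-8") as out:
    # 1) Full LibriSpeech LM corpus (~40M lines), lowercased.
    #    librispeech-lm-norm.txt.gz is available from openslr.org/11.
    with gzip.open("librispeech-lm-norm.txt.gz", "rt", encoding="utf-8") as f:
        for line in f:
            out.write(line.strip().lower() + "\n")

    # 2) Transcripts from the three training manifests (NeMo-style JSONL,
    #    one {"audio_filepath": ..., "text": ...} object per line).
    for manifest in ["train_clean_100.json", "train_clean_360.json", "train_other_500.json"]:
        with open(manifest, encoding="utf-8") as f:
            for line in f:
                out.write(json.loads(line)["text"].strip().lower() + "\n")

# The combined file is then passed to NeMo's train_kenlm.py, e.g. something like:
#   python train_kenlm.py --nemo_model_file citrinet.nemo \
#       --train_file librispeech_lm_plus_train.txt --ngram_length 6 ...
# (exact flag names depend on the NeMo version; check train_kenlm.py --help)
```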
-
Thank you for your response. I got 3.4% using Citrinet without LM, and 16.4% using Citrinet with LM. Both of the above WERs are on the same input data. I used only 1-2 minutes of audio as input.
-
The Citrinet checkpoint available in NeMo is trained on roughly 7000 hours of speech - far more than just LibriSpeech. So a LibriSpeech LM (especially one without the actual LibriSpeech text corpus) will only go so far in improving or hurting WER. As an experiment, could you try what the result is with alpha and beta set to 0?
-
Okay, I'll try to train the LM on the larger corpus. When I set alpha=0 and beta=0, the WER is 3.50%.
-
Ok, so the fact that alpha 0 / beta 0 gets close to the greedy score shows that your steps are correct and the LM will help - it just needs a proper dataset plus a hyperparameter search to get good scores. You can tune your LM on the LibriSpeech dev set.
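One way to run that search is a simple grid over alpha and beta, picking the pair with the lowest dev WER. This is only a skeleton: `evaluate_wer` below is a placeholder (not a real NeMo function) that you would back with a run of eval_beamsearch_ngram.py or your own decoding call, and the value ranges are just a starting point. Note that (0.0, 0.0) is included as the sanity check discussed above.

```python
# Skeleton of an alpha/beta grid search on the LibriSpeech dev set.
# Fill in evaluate_wer() before running -- e.g. wrap a call to
# NeMo's eval_beamsearch_ngram.py and parse the reported WER.
import itertools

alphas = [0.0, 0.5, 1.0, 1.5, 2.0]   # LM weight
betas  = [0.0, 0.5, 1.0, 1.5]        # word insertion bonus

def evaluate_wer(alpha: float, beta: float) -> float:
    """Run beam search with the given alpha/beta on dev_clean and return the WER (as a fraction)."""
    raise NotImplementedError("wrap your eval_beamsearch_ngram.py run here")

best = min(
    ((a, b, evaluate_wer(a, b)) for a, b in itertools.product(alphas, betas)),
    key=lambda x: x[2],
)
print(f"best alpha={best[0]}, beta={best[1]}, dev WER={best[2]:.2%}")
```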
-
Ok, understood. Thank you for your time.
-
Due to lack of resources I couldn't train the LM on the full LibriSpeech LM corpus (~40 million text lines). So this time I trained the LM on 13 M text lines (out of the 40 million), and I got 2.35% WER on LibriSpeech's dev_clean and 5.75% WER on dev_other. WER has improved on my own audio dataset, but again some words are missing from some transcripts, or only half a word is produced (e.g. "ph" instead of "phone"), and sometimes merged words appear (e.g. "asper", which should be the two separate words "as" and "per"). I observed that the missing words are low-frequency words, so their probability is low and the LM seems to be discarding them. What could be the reason? Is it because the LM was trained on only 13 M text lines and I need to train on the full corpus?
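One quick way to check the low-frequency hypothesis is to count how often the dropped words actually occur in the LM training text. A plain-Python sketch (the corpus path and word list below are placeholders):

```python
# Count how often the dropped words occur in the LM training text.
from collections import Counter

dropped_words = ["phone", "as", "per"]  # placeholder: words missing or mangled in the predictions

counts = Counter()
with open("lm_training_text.txt", encoding="utf-8") as f:  # placeholder path to the 13M-line corpus
    for line in f:
        counts.update(line.lower().split())

for word in dropped_words:
    print(f"{word!r}: {counts[word]} occurrences")
```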
-
Hi,
I trained a language model with train_kenlm.py (6-gram, on 286,967 lines of text collected from LibriSpeech's train_clean_100, train_clean_360 and train_other_500 transcripts plus some other standard English text).
Results of counting:
```
=== 1/5 Counting and sorting n-grams ===
Reading /home/lang_model/model.tmp.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 16718614 types 937
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11244 2:822115520 3:1541466624 4:2466346752 5:3596755712 6:4932693504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 937 D1=0.5 D2=1 D3+=1.5
2 287721 D1=0.506285 D2=1.04345 D3+=1.59677
3 3215127 D1=0.697273 D2=1.12271 D3+=1.48872
4 8173916 D1=0.82992 D2=1.20302 D3+=1.43574
5 11728534 D1=0.908197 D2=1.24826 D3+=1.3993
6 13666388 D1=0.918745 D2=1.31875 D3+=1.46753
Memory estimate for binary LM:
type MB
probing 770 assuming -p 1.5
probing 904 assuming -r models -p 1.5
trie 337 without quantization
trie 168 assuming -q 8 -b 8 quantization
trie 291 assuming -a 22 array pointer compression
trie 122 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11244 2:4603536 3:64302540 4:196173984 5:328398952 6:437324416
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11244 2:4603536 3:64302540 4:196173984 5:328398952 6:437324416
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Name:lmplz VmPeak:13261380 kB VmRSS:11264 kB RSSMax:2523692 kB user:25.2489 sys:5.2742 CPU:30.5231 real:19.9458
[NeMo I 2021-05-07 01:48:58 train_kenlm:135] Running binary_build command
/home/lang_model/kenlm/build/bin/lmplz -o 6 --text /home/lang_model/model.tmp.txt --arpa /home/lang_model/model.tmp.arpa --discount_fallback
Reading /home/lang_model/model.tmp.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
SUCCESS
```
The language model trained without any error, and in evaluation using eval_beamsearch_ngram.py I got a better WER, i.e. 2.61% on dev_clean and 6.27% on dev_other, with alpha=1.0 and beta=1.0.
However I got higher WER using LM i.e. approx 16.4% as compared to 3.4% WER using only CitriNet without LM on my audios.
When I looked at the predicted output I found that some consecutive words are missing, hence the WER is high.
One of the examples is given below:
```
Ground truth: " hello welcome how can i help you today eh what kind of products you guys make sorry i couldn't this can please help me with more i recently here purchase reflects three dot o from one of the stores and this product is not behaving"
Predicted output: "hello welcome how can i help you today what kind of guys make sorry i recently purchase reflects three doo from one of the stores and this product is not behaving"
```
In the above example you can see that several words from the ground truth are missing in the predicted output; this is the output with beam_alpha=1.0 and beam_beta=1.0.
And there is a lot of variation in which words go missing just by changing the alpha and beta values.
Why are the words missing in the prediction, and how do I decide the right alpha and beta values?
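For reference, a quick way to see exactly which words get dropped for a given alpha/beta (and a rough WER for this single pair) is a word-level diff with Python's standard library. This is just an illustrative sketch, not part of NeMo; paste the full ground truth and prediction strings from above in place of the shortened ones:

```python
# Word-level diff of the ground truth vs. the predicted output above,
# plus an approximate WER derived from the edit operations (standard library only).
from difflib import SequenceMatcher

ground_truth = "hello welcome how can i help you today eh what kind of products you guys make".split()  # paste full sentence from above
prediction   = "hello welcome how can i help you today what kind of guys make".split()                  # paste full sentence from above

sm = SequenceMatcher(a=ground_truth, b=prediction)
errors = 0
for op, a0, a1, b0, b1 in sm.get_opcodes():
    if op == "delete":
        print("missing:", " ".join(ground_truth[a0:a1]))
        errors += a1 - a0
    elif op == "replace":
        print("substituted:", " ".join(ground_truth[a0:a1]), "->", " ".join(prediction[b0:b1]))
        errors += max(a1 - a0, b1 - b0)
    elif op == "insert":
        print("inserted:", " ".join(prediction[b0:b1]))
        errors += b1 - b0

print(f"approx WER: {errors / len(ground_truth):.2%}")
```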
@EmreOzkose did you face the same?