I've run some experiments with the BEST-RQ model, including SSL pretraining and supervised finetuning, both on the WenetSpeech dataset. Luckily, the pretraining is stable: with the number of codebooks set to 1, the training accuracy reaches around 0.3. The attached image shows the training curves (the yellow line is WenetSpeech only, the blue one is WenetSpeech + some industrial data).
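For context, my understanding of the BEST-RQ pretraining targets with a single codebook is: a frozen random projection of the input features, followed by a nearest-neighbour lookup in a frozen random codebook, and the encoder is trained to predict those ids at masked frames. A minimal PyTorch sketch of that target generation (the class name, `feat_dim`, `codebook_size`, and `codebook_dim` are placeholders, not taken from any particular recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjectionQuantizer(nn.Module):
    """Frozen random projection + frozen random codebook (single codebook)."""

    def __init__(self, feat_dim=80, codebook_size=8192, codebook_dim=16):
        super().__init__()
        # Both the projection and the codebook are randomly initialized
        # and never updated during pretraining.
        proj = torch.empty(feat_dim, codebook_dim)
        nn.init.xavier_uniform_(proj)
        codebook = F.normalize(torch.randn(codebook_size, codebook_dim), dim=-1)
        self.register_buffer("proj", proj)
        self.register_buffer("codebook", codebook)

    @torch.no_grad()
    def forward(self, feats):
        # feats: (batch, time, feat_dim) normalized log-mel features
        x = F.normalize(feats @ self.proj, dim=-1)            # (B, T, codebook_dim)
        # Nearest codebook entry on the unit sphere becomes the target id.
        dist = torch.cdist(x, self.codebook.unsqueeze(0).expand(x.size(0), -1, -1))
        return dist.argmin(dim=-1)                            # (B, T) target ids

# The ids then serve as labels for a cross-entropy loss on masked positions.
quantizer = RandomProjectionQuantizer()
feats = torch.randn(2, 100, 80)
labels = quantizer(feats)   # (2, 100) integer targets
```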

During the supervised finetuning step, I basically froze all encoder parameters and finetuned only the CTC projection layer on WenetSpeech, as a kind of "probing test", but the result was a mess (see the sketch below for what I mean by probing).
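To make the setup comparable, this is roughly the probing loop I mean: the pretrained encoder is frozen and used as a fixed feature extractor, and only a linear CTC head on top is trained. A rough sketch (the encoder here is just a stand-in module; `encoder_dim`, `vocab_size`, and the learning rate are arbitrary placeholders, not my actual config):

```python
import torch
import torch.nn as nn

encoder_dim, vocab_size = 512, 5000
encoder = nn.GRU(80, encoder_dim, batch_first=True)  # stand-in for the pretrained encoder

# Freeze every encoder parameter so only the CTC head gets gradient updates.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()  # keep dropout / normalization layers in inference mode while probing

ctc_head = nn.Linear(encoder_dim, vocab_size)
optimizer = torch.optim.Adam(ctc_head.parameters(), lr=1e-3)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(feats, feat_lens, targets, target_lens):
    with torch.no_grad():                         # encoder acts as a fixed feature extractor
        enc_out, _ = encoder(feats)               # (B, T, encoder_dim)
    log_probs = ctc_head(enc_out).log_softmax(dim=-1)        # (B, T, vocab)
    loss = ctc_loss(log_probs.transpose(0, 1),    # CTCLoss expects (T, B, vocab)
                    targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```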

I noticed there are a few discussions about SSL models in this community, so I'm opening this discussion issue to see if anyone has run into a similar problem.