Where do I get the training data? It isn't in the repository. #38

sherylchenn · 2025-03-31T04:27:58Z

Hi, this is Sheryl. I am applying for GSoC (Unicode) and am interested in improving the LSTM model for word segmentation. However, I have hit a roadblock running the "Thai_graphclust_model4_heavy" model with train_data = "BEST". I think the file at '/content/lstm_word_segmentation/Data/Best/news/news_00040.txt' has been removed. Please let me know how I should fix this, as I'd love to run the model and note any room for improvement.

Answered by sffc

sffc · 2025-03-31T16:34:35Z

Maintainer

The data sets are not checked in to the repository. The README's "Data sets" section contains instructions on how to obtain them.

https://github.com/unicode-org/lstm_word_segmentation?tab=readme-ov-file#data-sets
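
Once the data has been obtained per the README, a quick check before training can confirm the files are where the training code looks for them. This is a minimal sketch, not part of the repository: the helper `missing_data_files` is hypothetical, and the data root and file name are taken from the path quoted in the question, so adjust them to your own checkout.

```python
from pathlib import Path


def missing_data_files(data_root, relative_paths):
    """Return the entries of relative_paths that are not files under data_root.

    Hypothetical helper for illustration; it is not part of
    lstm_word_segmentation.
    """
    root = Path(data_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed location, copied from the path in the question above.
    data_root = "/content/lstm_word_segmentation/Data"
    expected = ["Best/news/news_00040.txt"]

    missing = missing_data_files(data_root, expected)
    if missing:
        print("Missing data files (see the README's data sets section):")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected data files are present.")
```

If the list comes back non-empty, the data still needs to be downloaded and placed under the data directory as the README describes.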
