To find the most similar documents in a corpus of texts, we can try several approaches. One is to build k-shingles from the words of each text, hash them to numbers, and compare the resulting sets of shingles directly by computing their Jaccard similarity (https://en.wikipedia.org/wiki/Jaccard_index). Another approach, usually used for big datasets, is locality-sensitive hashing (LSH): we look for documents that hash to the same value and consider them similar.
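
Both approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not necessarily how this repository implements them: the function names (`shingles`, `jaccard`, `minhash_signature`, `lsh_candidates`) are hypothetical, `zlib.crc32` stands in for whatever hash is actually used, and MinHash with banding is assumed as the LSH scheme because it is a common choice for Jaccard similarity.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # large prime used as the modulus of the hash family


def shingles(text, k=3):
    """Return the set of hashed word-level k-shingles of `text`."""
    words = text.lower().split()
    if not words:
        return set()
    # A text shorter than k words yields one shingle made of all its words.
    n = max(len(words) - k + 1, 1)
    # crc32 is an assumed stand-in: a stable hash from string to integer.
    return {zlib.crc32(" ".join(words[i:i + k]).encode("utf-8")) for i in range(n)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=32, seed=0):
    """MinHash signature using hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)  # fixed seed: every document gets the same functions
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    if not shingle_set:
        return [PRIME] * num_hashes
    return [min((a * x + b) % PRIME for x in shingle_set) for a, b in params]


def lsh_candidates(docs, k=3, num_hashes=32, bands=8):
    """Return pairs of document ids that share at least one LSH bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text, k), num_hashes)
        for band in range(bands):
            # Documents whose signatures agree on a whole band collide here.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "an entirely unrelated sentence about hashing documents",
}
print(jaccard(shingles(docs["a"]), shingles(docs["b"])))  # exact comparison
print(lsh_candidates(docs))  # candidate pairs only; verify them with jaccard()
```

The direct comparison is exact but needs one Jaccard computation per pair of documents, which grows quadratically with the corpus. The LSH variant only returns candidate pairs, so it may miss some similar documents or report dissimilar ones; the `bands` and `num_hashes` values above are arbitrary and control that trade-off.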