
Commit 1867c76

Update README.md
1 parent c75e957 commit 1867c76

File tree

1 file changed (+1, -1 lines changed)


README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 To compare a corpus of texts and find the most similar documents we could try many approaches. One is to create k-shingles from the words, hash them to numbers, and compare these sets of shingles directly to compute the Jaccard similarity (https://en.wikipedia.org/wiki/Jaccard_index). Another approach, usually used with big datasets, is locality-sensitive hashing (LSH): we look for documents that hash to the same value and consider them similar.
 
-LSH is implemented by computing minhashes over the shingle set, cutting the minhash signature into bands of r integers, and hashing each band into a bucket. When two documents hash into the same bucket in any of the bands, we take this as evidence of actual similarity between the documents. The probability that two documents with similarity s hash into at least one common bucket is ![equat](https://latex.codecogs.com/gif.latex?1%20-%20%281%20-%20s%5Er%29%5Eb), where b is the number of bands and r the number of integers per band. For more about LSH, see Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman, chapter 3.4 (http://www.mmds.org/).
+LSH is implemented by computing minhashes over the shingle set, cutting the minhash signature into bands of r integers, and hashing each band into a bucket. When two documents hash into the same bucket in any of the bands, we take this as evidence of actual similarity between the documents. The probability that two documents with similarity s hash into at least one common bucket is: ![equat](https://latex.codecogs.com/gif.latex?1%20-%20%281%20-%20s%5Er%29%5Eb), where b is the number of bands and r the number of integers per band. For more about LSH, see Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman, chapter 3.4 (http://www.mmds.org/).
 
 This method was tested on the Reuters-21578 dataset, which is a standard machine learning dataset. The documents are a collection of 22 files, each containing about 1000 documents in SGML format. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
 
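The shingling and banding scheme described in the README can be sketched in a few lines of Python. This is only an illustrative example and not the repository's implementation: the shingle length k, the 100 hash functions, and the b = 20 bands of r = 5 rows are assumed values, and the helper names (`shingles`, `minhash_signature`, `lsh_candidates`) are hypothetical.

```python
"""Minimal sketch of shingling, minhash and LSH banding (illustrative only)."""
import random
from collections import defaultdict

def shingles(text, k=5):
    """Character k-shingles of a text, hashed to integers."""
    return {hash(text[i:i + k]) for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, num_hashes=100, seed=0):
    """Minhash signature: one minimum per random universal hash h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in shingle_set) for a, b in params]

def lsh_candidates(signatures, bands=20, rows=5):
    """Candidate pairs: documents whose signatures agree on all rows of some band.

    With similarity s, the chance of becoming candidates is 1 - (1 - s^rows)^bands.
    """
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for docs in buckets.values():
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add(tuple(sorted((docs[i], docs[j]))))
    return pairs

if __name__ == "__main__":
    texts = {
        "doc_a": "the quick brown fox jumps over the lazy dog",
        "doc_b": "the quick brown fox jumped over the lazy dog",
        "doc_c": "a completely unrelated sentence about shingling text",
    }
    shingle_sets = {name: shingles(t) for name, t in texts.items()}
    sigs = {name: minhash_signature(s) for name, s in shingle_sets.items()}
    print(lsh_candidates(sigs))  # doc_a/doc_b are almost surely reported as candidates
    print(jaccard(shingle_sets["doc_a"], shingle_sets["doc_b"]))  # their true Jaccard similarity
```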
