mattilyra / LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
☆282 · Updated last year
Related projects
Alternatives and complementary repositories for LSH
- Example Python code for comparing documents using MinHash ☆250 · Updated 5 years ago
- A Locality Sensitive Hashing (LSH) library with an emphasis on large, highly-dimensional datasets. ☆144 · Updated 2 months ago
- Simhash and near-duplicate detection ☆411 · Updated last year
- Calculates Word Mover's Distance Insanely Fast ☆460 · Updated last year
- Quickly extract multi-word phrases from a corpus ☆191 · Updated 4 years ago
- Tool for interactive embeddings visualization ☆300 · Updated 3 months ago
- Fast, DB Backed pretrained word embeddings for natural language processing. ☆223 · Updated last year
- Compute Sentence Embeddings Fast! ☆618 · Updated last year
- Self-Supervision for Named Entity Disambiguation at the Tail ☆214 · Updated 2 years ago
- Counter-fitting Word Vectors to Linguistic Constraints ☆144 · Updated 4 years ago
- Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe… ☆170 · Updated 2 years ago
- Textpipe: clean and extract metadata from text ☆299 · Updated 3 years ago
- LexRank algorithm for text summarization ☆229 · Updated 7 months ago
- A pure python implementation of locality sensitive hashing for text documents ☆87 · Updated 9 years ago
- Various Algorithms for Short Text Mining ☆467 · Updated this week
- Twitter named entity extraction for WNUT 2016 http://noisy-text.github.io/2016/ner-shared-task.html ☆139 · Updated 2 years ago
- Concatenated Power Mean Embeddings as Universal Cross-Lingual Sentence Representations ☆186 · Updated 3 years ago
- ☆165 · Updated 5 months ago
- Palmetto is a quality measuring tool for topics ☆215 · Updated 9 months ago
- Text Mining and Topic Modeling Toolkit for Python with parallel processing power ☆193 · Updated last year
- Flexible classic and NeurAl Retrieval Toolkit ☆214 · Updated 4 months ago
- scikit-learn inspired API for CRFsuite ☆426 · Updated last year
- 💙 Emoji handling and meta data for spaCy with custom extension attributes ☆180 · Updated last year
- Evaluation software used in the Text Retrieval Conference ☆241 · Updated 3 weeks ago
- A fast implementation of GloVe, with optional retrofitting ☆243 · Updated last year
- Word Embeddings for Information Retrieval ☆226 · Updated last year
- pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance. ☆243 · Updated 6 months ago
- Sentence2vec by Rock ☆314 · Updated 2 years ago
- Document ranking via sentence modeling using BERT ☆143 · Updated last year
- An efficient simhash implementation for python ☆125 · Updated 5 years ago