mattilyra / LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
☆285 · Updated last year
Alternatives and similar repositories for LSH:
Users who are interested in LSH compare it to the libraries listed below.
- Example Python code for comparing documents using MinHash ☆251 · Updated 6 years ago
- ☆188 · Updated 9 months ago
- Simhash and near-duplicate detection ☆413 · Updated last year
- Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive… ☆767 · Updated 2 years ago
- A Locality Sensitive Hashing (LSH) library with an emphasis on large, highly-dimensional datasets. ☆146 · Updated 6 months ago
- A comprehensive and scalable set of string tokenizers and similarity measures in Python ☆136 · Updated 8 months ago
- A fast Python implementation of locality sensitive hashing. ☆661 · Updated 4 years ago
- Quickly extract multi-word phrases from a corpus ☆191 · Updated 4 years ago
- Fast, DB Backed pretrained word embeddings for natural language processing. ☆222 · Updated last year
- Self-Supervision for Named Entity Disambiguation at the Tail ☆215 · Updated 2 years ago
- Word Embeddings for Information Retrieval ☆225 · Updated last year
- MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW ☆2,661 · Updated 9 months ago
- Compute Sentence Embeddings Fast! ☆621 · Updated 2 years ago
- Calculates Word Mover's Distance Insanely Fast ☆461 · Updated last year
- Textpipe: clean and extract metadata from text ☆302 · Updated 3 years ago
- Named Entity Recognition based on dictionaries ☆242 · Updated 6 years ago
- EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings (official implementation) ☆435 · Updated last year
- Palmetto is a quality measuring tool for topics ☆216 · Updated last year
- scikit-learn inspired API for CRFsuite ☆426 · Updated last year
- LASER multilingual sentence embeddings as a pip package ☆224 · Updated last year
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18 ☆168 · Updated 3 years ago
- An efficient simhash implementation for python ☆124 · Updated 5 years ago
- ☆91 · Updated 8 years ago
- A PyTorch implementation of Paragraph Vectors (doc2vec). ☆414 · Updated 2 years ago
- A python binding for crfsuite ☆772 · Updated 5 months ago
- Twitter named entity extraction for WNUT 2016 http://noisy-text.github.io/2016/ner-shared-task.html ☆138 · Updated 2 years ago
- Document ranking via sentence modeling using BERT ☆144 · Updated 2 years ago
- Tools and recipes to train deep learning models and build services for NLP tasks such as text classification, semantic search ranking and… ☆460 · Updated 6 years ago
- ☆130 · Updated 3 years ago
- Abydos NLP/IR library for Python ☆185 · Updated 2 years ago