hybridtheory / floc-simhash
A fast Python implementation of the SimHash algorithm.
☆27Updated 4 years ago
Alternatives and similar repositories for floc-simhash
Users who are interested in floc-simhash are comparing it to the libraries listed below.
- ☆30Updated 3 years ago
- ☆68Updated 3 years ago
- Abydos NLP/IR library for Python☆194Updated 3 years ago
- Fuzzy matching and more functionality for spaCy.☆258Updated last year
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata☆170Updated 3 years ago
- Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents☆292Updated 2 years ago
- ☆70Updated 3 years ago
- Train a model, and detect gibberish strings with it.☆68Updated 3 years ago
- Multi-Language Identification☆28Updated last year
- Detect and visualize text reuse☆119Updated last year
- Python3 bindings for the Compact Language Detector v3 (CLD3)☆155Updated 2 years ago
- code and data used to build a training dataset for dragnet models☆10Updated 5 years ago
- A Flexible Deep Learning Approach to Fuzzy String Matching☆150Updated last year
- A comprehensive and scalable set of string tokenizers and similarity measures in Python☆142Updated last year
- Lightning Fast Language Prediction 🚀☆167Updated 5 months ago
- Sentence transformers models for SpaCy☆108Updated 2 years ago
- An index data structure for approximate string search.☆23Updated 6 years ago
- An efficient algorithm for k-bounded (Damerau-)Levenshtein distance☆16Updated 7 years ago
- An efficient simhash implementation for python☆127Updated 6 years ago
- Language detection using Spacy and Fasttext☆57Updated 2 years ago
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18☆170Updated 4 years ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency☆182Updated 7 months ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Updated 5 years ago
- Boolean text search in Python☆46Updated 7 months ago
- Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing☆76Updated last week
- 80x faster and 95% accurate language identification with Fasttext☆164Updated 2 years ago
- Find strings/words in text; convenience and C speed☆126Updated 3 years ago
- Load embeddings and featurize your sentences.☆31Updated last year
- A Fuzzy Matching Approach for Clustering Strings☆30Updated 2 years ago
- Extract text from HTML☆134Updated last week