hybridtheory / floc-simhash
A fast Python implementation of the SimHash algorithm.
☆27Updated 4 years ago
Alternatives and similar repositories for floc-simhash
Users who are interested in floc-simhash are comparing it to the libraries listed below.
- ☆69Updated 3 years ago
 - Abydos NLP/IR library for Python☆191Updated 2 years ago
 - A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata☆165Updated 2 years ago
 - Fuzzy matching and more functionality for spaCy.☆258Updated last year
 - code and data used to build a training dataset for dragnet models☆10Updated 4 years ago
 - ☆30Updated 3 years ago
 - Boolean text search in Python☆46Updated 4 months ago
 - Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents☆291Updated 2 years ago
 - Python3 bindings for the Compact Language Detector v3 (CLD3)☆154Updated 2 years ago
 - Multi-Language Identification☆28Updated last year
 - Language detection using spaCy and fastText☆57Updated last year
 - A comprehensive and scalable set of string tokenizers and similarity measures in Python☆141Updated last year
 - A Flexible Deep Learning Approach to Fuzzy String Matching☆148Updated last year
 - Sentence transformers models for SpaCy☆107Updated 2 years ago
 - Information extraction from English and German texts based on predicate logic☆139Updated 2 years ago
 - A machine learning tool for fishing entities☆263Updated 5 months ago
 - Recon NER, Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality …☆106Updated last year
 - Annotation Management for Prodigy that supports multiple users working in many projects☆15Updated 6 years ago
 - Text tokenization and sentence segmentation (segtok v2)☆206Updated 3 years ago
 - 📂 Additional lookup tables and data resources for spaCy☆112Updated 5 months ago
 - A spaCy wrapper of OpenTapioca for named entity linking on Wikidata☆94Updated 2 years ago
 - A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆186Updated this week
 - Extracts a latent knowledge graph from text and index/query it in elasticsearch or solr☆21Updated 3 years ago
 - DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Updated 5 years ago
 - Simple multilingual lemmatizer for Python, especially useful for speed and efficiency☆178Updated 4 months ago
 - Find strings/words in text; convenience and C speed☆127Updated 3 years ago
 - Library for unit extraction - fork of quantulum for python3☆143Updated last year
 - Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18☆169Updated 4 years ago
 - A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test…☆71Updated 2 weeks ago
 - Google USE (Universal Sentence Encoder) for spaCy☆184Updated 2 years ago