hybridtheory / floc-simhash
A fast Python implementation of the SimHash algorithm.
☆27 Updated 3 years ago
Alternatives and similar repositories for floc-simhash:
Users who are interested in floc-simhash are comparing it to the libraries listed below:
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- Extract networks of entities from journalistic reporting ☆48 Updated last year
- ☆68 Updated 2 years ago
- Hidden alignment conditional random field for classifying string pairs. ☆24 Updated 4 months ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆34 Updated 4 years ago
- ☆30 Updated 2 years ago
- Python 3 library for reading and writing warc files ☆20 Updated 7 years ago
- Algorithms for "schema matching" ☆26 Updated 8 years ago
- An efficient simhash implementation for python ☆124 Updated 5 years ago
- A browser user interface for manual labeling of record pairs. ☆44 Updated last year
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata ☆93 Updated last year
- Python package for deduplication/entity resolution using active learning ☆76 Updated 5 months ago
- Language detection using Spacy and Fasttext ☆55 Updated last year
- Extract city and country mentions from text, like GeoText, but using FlashText (an Aho-Corasick implementation) instead of regex. ☆60 Updated this week
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- An index data structure for approximate string search. ☆23 Updated 5 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆157 Updated 2 years ago
- Trying to generate name synonyms from wikidata ☆32 Updated 4 years ago
- Recon NER, Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality … ☆106 Updated 11 months ago
- The NLP Bias Identification Toolkit ☆36 Updated last year
- Scripts and microservice to feed an ElasticSearch with Wikidata and Inventaire entities, and keep those up-to-date ☆41 Updated 4 years ago
- Library for unit extraction - fork of quantulum for python3 ☆136 Updated 7 months ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- A simple ElasticSearch plugin wrapping around the search endpoint to provide Rocchio query expansion ☆17 Updated 7 years ago
- A Named-Entity Recogniser based on Grobid. ☆50 Updated 5 months ago
- sumgram is a tool that summarizes a collection of text documents by generating the most frequent sumgrams (conjoined ngrams) ☆56 Updated 6 months ago
- Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing ☆68 Updated 2 weeks ago
- spaCy entry points for Curated Transformers ☆26 Updated 4 months ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆51 Updated 4 years ago
- Information extraction from English and German texts based on predicate logic ☆135 Updated last year