Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
Alternatives and similar repositories for deduplication
Users that are interested in deduplication are comparing it to the libraries listed below
- Generate random date(time) in Python. ☆10 Updated last year
- Fast Neural Machine Translation in C++ - development repository ☆19 Updated last year
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 Updated 2 years ago
- Indri search implementation on top of Lucene search engine ☆34 Updated last year
- Custom Python functions for working with SQLite FTS4 ☆22 Updated 2 years ago
- CLI based diff viewer ☆23 Updated 3 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 Updated last year
- Tool to extract the text from web article URLs and get word frequencies, entity recognition, an automatic summary, and more ☆20 Updated 6 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 Updated 4 years ago
- Faster, modernized fork of the language identification tool langid.py ☆55 Updated 5 months ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 Updated 7 years ago
- Commons of stupid, simple Python micro functions. Pull requests very welcome. ☆19 Updated last month
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- Python module (C extension and plain python) implementing DAWG ☆20 Updated 3 years ago
- Python wrapper for Ferret ☆41 Updated 3 years ago
- An index data structure for approximate string search. ☆23 Updated 6 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- Pluggable DSL that uses pipes to perform a series of linear transformations to extract data ☆16 Updated 10 months ago
- A file utility for accessing both local and remote files through a unified interface. ☆41 Updated last week
- Generate a SQLite database from Wikipedia & Wikidata dumps. ☆35 Updated last year
- Sentence Embedding as a Service ☆15 Updated last year
- A CLI tool for managing OpenAI batch processing jobs with ease. ☆35 Updated 2 weeks ago
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆40 Updated 2 weeks ago
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 3 years ago
- Neural Elastic Inference and Search ☆19 Updated 5 years ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆35 Updated 4 years ago
- Cortex-compatible model server for Python and TensorFlow ☆17 Updated 2 years ago
- A Python library for variable type checker/validator/converter at a run time. ☆16 Updated 4 months ago
- Python module to generate all regular expression matches ☆18 Updated 2 years ago