Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 · Updated last year
Alternatives and similar repositories for deduplication:
Users that are interested in deduplication are comparing it to the libraries listed below
- code and data used to build a training dataset for dragnet models ☆10 · Updated 4 years ago
- Loadable spellfix1 extension for sqlite as python package ☆26 · Updated 11 months ago
- Indri search implementation on top of Lucene search engine ☆34 · Updated last year
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆23 · Updated 4 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 · Updated 3 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 · Updated last year
- An efficient data structure for fast string similarity searches ☆22 · Updated 4 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 · Updated last year
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 · Updated 7 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆25 · Updated 7 months ago
- extract difference between two html pages ☆32 · Updated 6 years ago
- Generate random date(time) in Python. ☆10 · Updated last year
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆44 · Updated 10 months ago
- An index data structure for approximate string search. ☆23 · Updated 5 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 · Updated 2 years ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆35 · Updated 4 years ago
- Tool that extracts the text from web article URLs and gets word frequencies, entity recognition, automatic summaries, and more ☆20 · Updated 6 years ago
- Custom Python functions for working with SQLite FTS4 ☆22 · Updated 2 years ago
- Find duplicate text files. ☆14 · Updated 2 months ago
- framework for making streamcorpus data ☆11 · Updated 8 years ago
- ☆20 · Updated 3 years ago
- Commons of stupid, simple Python micro functions. Pull requests very welcome. ☆19 · Updated 2 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 · Updated 9 months ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 · Updated last year
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 · Updated 2 years ago
- A file utility for accessing both local and remote files through a unified interface. ☆38 · Updated 2 weeks ago
- Transform Oracle PL/SQL Code to Python ☆11 · Updated 11 years ago
- ☆12 · Updated 10 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆17 · Updated 2 years ago
- Neural Elastic Inference and Search ☆19 · Updated 5 years ago