Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
Alternatives and similar repositories for deduplication:
Users who are interested in deduplication are comparing it to the libraries listed below:
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆14 Updated last year
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆25 Updated 5 months ago
- CLI based diff viewer ☆23 Updated 3 years ago
- Python module (C extension and plain python) implementing DAWG ☆20 Updated 3 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 Updated 7 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 Updated 2 years ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL ☆29 Updated 2 years ago
- Datasette plugin adding a llm_embed(model_id, text) SQL function ☆12 Updated 10 months ago
- My dot files in one place - extensively edited over time. Your mileage may vary ☆2 Updated 8 years ago
- A simple semantic search engine for scientific papers. ☆27 Updated last year
- Indri search implementation on top of Lucene search engine ☆34 Updated 10 months ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆32 Updated last year
- Python context manager to communicate with a subprocess using iterables: for when data is too big to fit in memory and has to be streamed ☆7 Updated 3 months ago
- Generate random date(time) in Python. ☆10 Updated 10 months ago
- Docker Compose based system for running remote browsers (including Flash and Java support) connected to web archives ☆14 Updated 3 years ago
- YAML-formatted plain-text file based models for Flask backed by Flask-SQLAlchemy ☆23 Updated this week
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 Updated 5 years ago
- Loadable spellfix1 extension for sqlite as python package ☆25 Updated 8 months ago
- Datasette plugin for authenticating access using API tokens ☆11 Updated 4 months ago
- A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀 ☆29 Updated 2 years ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 Updated 7 years ago
- An efficient data structure for fast string similarity searches ☆22 Updated 3 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 Updated 4 years ago
- Custom Python functions for working with SQLite FTS4 ☆22 Updated 2 years ago
- Scripts supporting the development and serving the Roots Search Tool - https://hf.co/spaces/bigscience-data/roots-search ☆10 Updated last year
- Transform Oracle PL/SQL Code to Python ☆11 Updated 11 years ago