Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
★16 Updated last year
Related projects
Alternatives and complementary repositories for deduplication
- A curated list of awesome open source tools and commercial products to catalog, version, and manage data ★25 Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ★23 Updated 3 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ★16 Updated 2 years ago
- Datasette plugin for searching all searchable tables at once ★19 Updated 2 months ago
- A file utility for accessing both local and remote files through a unified interface. ★35 Updated 3 months ago
- Scripts supporting the development and serving the Roots Search Tool - https://hf.co/spaces/bigscience-data/roots-search ★10 Updated last year
- Neural Elastic Inference and Search ★19 Updated 4 years ago
- Code examples for Google Natural Language API. ★13 Updated 5 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ★11 Updated 5 months ago
- Generate random date(time) in Python. ★10 Updated 8 months ago
- Code and data used to build a training dataset for dragnet models ★10 Updated 3 years ago
- Transform Oracle PL/SQL code to Python ★11 Updated 11 years ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ★51 Updated last week
- CLI-based diff viewer ★24 Updated 3 years ago
- Repository to allow collaboration within the Cycle Labs Cloud community, in support of the community. ★9 Updated 2 years ago
- Interpretable feature construction from taxonomies for text classification ★18 Updated 2 years ago
- Custom Python functions for working with SQLite FTS4 ★22 Updated 2 years ago
- A utility for labeling clusters of text data. ★28 Updated 3 years ago
- gRPC server for hnswlib ★14 Updated last year
- A simple library for training named entity recognition models from partially annotated data ★21 Updated 11 months ago
- Fine-tune OpenAI models for text classification, question answering, and more ★16 Updated last year
- Datamallet is a Python library which contains several helper functions and modules for the common tasks in a typical data science workflow… ★11 Updated 2 years ago
- Loadable spellfix1 extension for SQLite as a Python package ★25 Updated 6 months ago
- My dot files in one place - extensively edited over time. Your mileage may vary ★2 Updated 8 years ago
- Go library that provides easy-to-use interfaces and tools for TensorFlow users, in particular allowing to train existing TF models on .ta… ★14 Updated 7 months ago
- Various Jupyter notebooks about Common Crawl data ★46 Updated 2 years ago
- Train a model, and detect gibberish strings with it. ★59 Updated 2 years ago
- Python bindings for Google's FarmHash ★37 Updated 2 months ago
- Cortex-compatible model server for Python and TensorFlow ★16 Updated last year
- Datasette enrichment for analyzing row data using OpenAI's GPT models ★19 Updated 5 months ago