Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆19 Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- ☆15 Updated last year
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆34 Updated 2 years ago
- An example of graph embeddings for wikipedia page recommendations ☆11 Updated 4 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 4 years ago
- Yet another tool to search through your (exported) ChatGPT conversations ☆13 Updated last month
- Faster, modernized fork of the language identification tool langid.py ☆60 Updated last year
- A natural language date parser. (Python version of chrono.js) ☆25 Updated 7 months ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆66 Updated 2 weeks ago
- A language detection software ☆67 Updated 8 years ago
- Fast fuzzy text search ☆11 Updated 2 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆17 Updated 2 years ago
- Small python package to measure OCR quality and other related metrics. ☆26 Updated last year
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆23 Updated 5 years ago
- Neural search engine for discovering semantically similar Python repositories on GitHub ☆29 Updated last year
- Rust python bindings for symspell ☆21 Updated 2 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆28 Updated 2 months ago
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development ☆20 Updated 2 years ago
- Indri search implementation on top of Lucene search engine ☆35 Updated last year
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆12 Updated 4 years ago
- LLM plugin for embeddings using sentence-transformers ☆74 Updated 9 months ago
- Python based Wikidata framework for easy dataframe extraction ☆45 Updated 2 years ago
- Scripts to parse arxiv documents for NLP tasks ☆19 Updated 2 years ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 5 years ago
- convert epub file to txt ☆94 Updated 5 years ago
- Sentence Embedding as a Service ☆15 Updated 6 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆18 Updated 3 years ago
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆80 Updated this week
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- Tools to construct and process Common Crawl webgraphs ☆104 Updated last month
- This repository provides various Python methods for finding and aggregating synonyms for an individual word or a list of words. ☆36 Updated 2 years ago