Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- ☆13 Updated last year
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆23 Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- From Dataset Labeling, Entity Extraction to production Knowledge Graph Deployment: The Power of NLP and LLMs Combined. ☆12 Updated last year
- Indri search implementation on top of Lucene search engine ☆35 Updated last year
- This project is based on the original code of the inteoryx / twitter-video-dl project, which allows users to download Twitter videos as M… ☆15 Updated 5 months ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 Updated 2 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆63 Updated 7 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- A workflow system for Natural Language Processing. ☆21 Updated 5 years ago
- Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target. ☆52 Updated 4 months ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆33 Updated last year
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development ☆20 Updated 2 years ago
- Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT ☆23 Updated 2 years ago
- Small python package to measure OCR quality and other related metrics. ☆25 Updated last year
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆25 Updated 4 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 3 years ago
- Documentation effort for the BookCorpus dataset ☆34 Updated 4 years ago
- Fast fuzzy text search ☆11 Updated 2 years ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆61 Updated this week
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- A simple semantic search engine for scientific papers. ☆28 Updated last year
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆190 Updated 3 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 Updated 3 weeks ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 Updated 2 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆17 Updated 3 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆61 Updated 4 years ago
- Reproducing "Writing with Transformer" demo, using aitextgen/FastAPI in backend, Quill/React in frontend ☆28 Updated 4 years ago
- convert epub file to txt ☆92 Updated 5 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago