Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆23 Updated 5 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 4 years ago
- Indri search implementation on top of Lucene search engine ☆35 Updated last year
- ☆92 Updated 3 years ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 Updated 2 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- ☆70 Updated 4 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆45 Updated last year
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆65 Updated 10 months ago
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development ☆20 Updated 2 years ago
- A simple semantic search engine for scientific papers. ☆28 Updated 2 years ago
- Tools to construct and process Common Crawl webgraphs ☆101 Updated 2 weeks ago
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆76 Updated last week
- An example of graph embeddings for wikipedia page recommendations ☆11 Updated 4 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆24 Updated 3 months ago
- A robust web archive analytics toolkit ☆120 Updated last month
- Library for extracting text and timestamps from multiple subtitle files (.ass, .ssa, .srt, .sub, .txt). ☆53 Updated last year
- Scripts to parse arxiv documents for NLP tasks ☆18 Updated 2 years ago
- Faster, modernized fork of the language identification tool langid.py ☆61 Updated last year
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆35 Updated last year
- Seed Machine Translation Data ☆33 Updated last year
- A natural language date parser. (Python version of chrono.js) ☆25 Updated 5 months ago
- A cost estimator for OpenAI API calls in tqdm loops. ☆20 Updated 11 months ago
- Fast edit distance Python extension written in Cython/C++. Supports Levenshtein distance and Damerau Optimal String Alignment (OSA) dista… ☆24 Updated 5 months ago
- Documentation effort for the BookCorpus dataset ☆34 Updated 4 years ago
- Boolean text search in Python ☆46 Updated 4 months ago
- Prebuilt .whl files for MacOS + Linux of the Facebook FAISS library ☆56 Updated 3 years ago
- Open source library for few shot NLP ☆78 Updated 2 years ago