zyocum / dedup
Find duplicate text files.
★14 · Updated 3 months ago
Alternatives and similar repositories for dedup:
Users who are interested in dedup are comparing it to the repositories listed below.
- Semantically distinct key phrase extraction using Hilbert hashes. ★48 · Updated 3 years ago
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼 ★22 · Updated 3 months ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc. ★18 · Updated last year
- Generate multiple choice fill-in-the-blank questions from any article. ★13 · Updated 2 years ago
- Scripts for building a geo-located web corpus using Common Crawl data ★11 · Updated 2 weeks ago
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ★10 · Updated last year
- A library to extract a publication date from a web page, along with a measure of the accuracy. ★41 · Updated 5 years ago
- Post-processing OCR errors with seq2seq models ★28 · Updated 4 years ago
- BERT models for many languages created from Wikipedia texts ★33 · Updated 4 years ago
- Code for extracting parallel corpora from pmindia ★16 · Updated 5 years ago
- In this project, we need to find out commercial products listed on Google that refer to the same entity across Amazon by comparing the si… ★11 · Updated 8 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ★61 · Updated 4 years ago
- Survey on machine learning. ★14 · Updated 4 years ago
- Transformer based Trigram Blocking implementation in Tensorflow ★11 · Updated 5 years ago
- An end-to-end event extraction and summarization system. ★21 · Updated 4 years ago
- Text classification automl ★21 · Updated 3 years ago
- ★13 · Updated 3 years ago
- Process Caltech Archives' digital documents and photos, and annotate each page or image with information about its contents ★12 · Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ★23 · Updated 3 years ago
- A python library to generate highly realistic typos (fuzz-testing) ★11 · Updated last month
- Deep neural parser for database query ★18 · Updated 2 years ago
- An index data structure for approximate string search. ★23 · Updated 5 years ago
- RxNLP APIs for clustering sentences, extracting topics, counting words & n-grams, extracting text from html or URL, computing similarity … ★15 · Updated 5 years ago
- An optimized Transformer based abstractive summarization model with Tensorflow ★16 · Updated 2 years ago
- Extracting narrative timelines (i.e. order and timing of events) from text ★20 · Updated 6 years ago
- Code for "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (https://arxiv.org/abs/2… ★13 · Updated 2 years ago
- Keyword extraction with spaCy ★31 · Updated 3 years ago
- NLG Best Practices for Data-Efficient Modeling: How to Train Production-Ready Models with Little Data ★10 · Updated 3 years ago
- No Teacher BART distillation experiment for NLI tasks ★26 · Updated 4 years ago
- Robust Cross-lingual Embeddings from Parallel Sentences ★22 · Updated 4 years ago