zyocum / dedup
Find duplicate text files.
☆15 Updated 8 months ago
Alternatives and similar repositories for dedup
Users who are interested in dedup are comparing it to the repositories listed below.
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆11 Updated last year
- A Python library to generate highly realistic typos (fuzz-testing) ☆12 Updated 6 months ago
- Post-processing OCR errors with seq2seq models ☆28 Updated 5 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spaCy provide state of … ☆61 Updated 5 years ago
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 Updated last month
- 📄 Neural Sentential Paraphrase Generation to Augment Chatbot Training Dataset ☆21 Updated 2 years ago
- Semantically distinct key phrase extraction using Hilbert hashes. ☆50 Updated 3 years ago
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 Updated 3 years ago
- ☆30 Updated 3 years ago
- Custom Natural Language Processing with big and small models 🌲🌱 ☆67 Updated 4 years ago
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼 ☆22 Updated 8 months ago
- CoreNLG is an easy to use and productivity oriented Python library for Natural Language Generation. It aims to provide the essential tool… ☆27 Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- Topic Inference with Zeroshot models ☆61 Updated 2 years ago
- Transformer based Trigram Blocking implementation in TensorFlow ☆11 Updated 5 years ago
- BERT models for many languages created from Wikipedia texts ☆33 Updated 5 years ago
- Training a model without a dataset for natural language inference (NLI) ☆25 Updated 5 years ago
- Analyze XML extracted from PDFs (e.g. from TET or PDFMiner) ☆20 Updated 7 years ago
- A tidy and complete archive of metadata for papers on arxiv.org, 1993-2019 ☆28 Updated 5 years ago
- A simple neural truecaser written in PyTorch and AllenNLP. ☆33 Updated last year
- Convert CoNLL output of a dependency parser into a LaTeX or Graphviz tree ☆12 Updated 5 years ago
- Python script to assemble individual Tweets from a public Twitter stream (either Gnip activity-streams format or original Twitter API for… ☆12 Updated 9 years ago
- Simple and clean Python implementation of TextRank as per seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot… ☆11 Updated 4 years ago
- Multilingual Language Modeling Toolkit ☆11 Updated 8 years ago
- On Generating Extended Summaries of Long Documents ☆78 Updated 4 years ago
- Extract annotated misspellings from MIMIC-III. ☆13 Updated 4 years ago
- Tool for sentiment analysis annotation ☆13 Updated 6 months ago
- Build intelligent data-driven applications with minimal effort. Sentence Clustering, Topics Extraction, Text Similarity, Opinion Summariz… ☆41 Updated 5 years ago
- Extract dates from text ☆65 Updated 4 years ago
- Code for "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (https://arxiv.org/abs/2… ☆14 Updated 2 years ago