zyocum / dedup
Find duplicate text files.
☆15 Updated 6 months ago
Alternatives and similar repositories for dedup
Users interested in dedup compare it to the repositories listed below
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆11 Updated last year
- A python library to generate highly realistic typos (fuzz-testing) ☆12 Updated 4 months ago
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 Updated 3 months ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆61 Updated 4 years ago
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 Updated 2 years ago
- 📄 Neural Sentential Paraphrase Generation to Augment Chatbot Training Dataset ☆21 Updated 2 years ago
- BERT models for many languages created from Wikipedia texts ☆33 Updated 5 years ago
- Text preprocessing tools in python. ☆27 Updated 7 years ago
- Tool for sentiment analysis annotation ☆13 Updated 4 months ago
- Correction of spaces with character-based neural language models. ☆13 Updated 2 years ago
- This repo contains the code used to generate the French Wikipedia sample used in the QA annotation project PIAF ☆11 Updated 4 years ago
- ☆30 Updated 3 years ago
- A collection of over 1.5 Million tweets data translated to French, with their sentiment. ☆35 Updated 8 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- A simple neural truecaser written in pytorch and allennlp. ☆33 Updated last year
- Text Similarity Search Application using Modern NLP and Elasticsearch ☆30 Updated 5 years ago
- Fast edit distance Python extension written in Cython/C++. Supports Levenshtein distance and Damerau Optimal String Alignment (OSA) dista… ☆24 Updated 2 months ago
- This is a prototype of a multi-lingual suite for named-entity recognition in Python. ☆21 Updated last year
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values … ☆15 Updated 6 years ago
- Loan Risk Prediction Neural Network and API ☆17 Updated 4 years ago
- Transformer based Trigram Blocking implementation in Tensorflow ☆11 Updated 5 years ago
- Custom Natural Language Processing with big and small models 🌲🌱 ☆68 Updated 3 years ago
- Post-processing OCR errors with seq2seq models ☆28 Updated 5 years ago
- Convert CoNLL output of a dependency parser into a latex or graphviz tree ☆12 Updated 5 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆45 Updated last year
- MinHash implementation in Python ☆11 Updated 11 months ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆35 Updated 5 years ago
- Code for "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (https://arxiv.org/abs/2… ☆14 Updated 2 years ago
- ☆17 Updated 2 years ago
- Code for extracting parallel corpora from pmindia ☆16 Updated 5 years ago