zyocum / dedup
Find duplicate text files.
☆12 · Updated this week
Alternatives and similar repositories for dedup:
Users who are interested in dedup are comparing it to the libraries listed below.
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼 ☆23 · Updated this week
- This is a prototype of a multi-lingual suite for named-entity recognition in Python. ☆21 · Updated 8 months ago
- Simple and clean Python implementation of TextRank as per seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot… ☆11 · Updated 3 years ago
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆10 · Updated 11 months ago
- Post-processing OCR errors with seq2seq models ☆28 · Updated 4 years ago
- Sequence tagging with spaCy and crfsuite ☆18 · Updated last year
- A simple neural truecaser written in PyTorch and AllenNLP. ☆32 · Updated 7 months ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spaCy provide state of … ☆61 · Updated 4 years ago
- Classify a job description (or noisy job title) into an ONET job title ☆18 · Updated 8 years ago
- MinHash implementation in Python ☆11 · Updated 4 months ago
- Python package for performing deduplication using flexible text matching and cleaning in a pandas DataFrame ☆25 · Updated 4 years ago
- Tool for sentiment analysis annotation ☆11 · Updated 3 months ago
- 📄 Neural Sentential Paraphrase Generation to Augment Chatbot Training Dataset ☆21 · Updated 2 years ago
- Generic Environment for Context-Aware Correction of Orthography ☆22 · Updated 2 years ago
- Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through use… ☆27 · Updated last year
- Extracting narrative timelines (i.e. order and timing of events) from text ☆20 · Updated 5 years ago
- This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the re… ☆12 · Updated 4 months ago
- BERT models for many languages created from Wikipedia texts ☆34 · Updated 4 years ago
- Semantically distinct key phrase extraction using Hilbert hashes. ☆48 · Updated 2 years ago
- Extract dates from text ☆64 · Updated 3 years ago
- An efficient data structure for fast string similarity searches ☆22 · Updated 3 years ago
- Code and data for Teddy https://arxiv.org/abs/2001.05171. ☆15 · Updated 2 years ago
- Generate multiple choice fill-in-the-blank questions from any article. ☆13 · Updated 2 years ago
- Annotation management for Prodigy that supports multiple users working in many projects ☆15 · Updated 6 years ago
- Watasense: an Unsupervised WSD System for Under-Resourced Languages. ☆11 · Updated 2 years ago
- This is a document concerning Data Readiness in the context of machine learning and Natural Language Processing. ☆11 · Updated 3 years ago
- Rust Python bindings for SymSpell ☆18 · Updated last year
- Code for the paper "Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition" ☆21 · Updated 2 years ago