zyocum / dedup
Find duplicate text files.
☆11 Updated 6 months ago
Related projects
Alternatives and complementary repositories for dedup
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆11 Updated 9 months ago
- Extracting narrative timelines (i.e. order and timing of events) from text ☆20 Updated 5 years ago
- Summarize. is a Streamlit application that performs automatic text summarization using both extractive and abstractive models. ☆15 Updated 3 years ago
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 Updated 3 weeks ago
- Aho-Corasick string replacement utility ☆23 Updated 4 years ago
- A Python library to generate highly realistic typos (fuzz-testing) ☆11 Updated 6 years ago
- Post-processing OCR errors with seq2seq models ☆28 Updated 4 years ago
- Text preprocessing tools in Python. ☆26 Updated 6 years ago
- A Python package to get useful information from documents using the TopicRank algorithm. ☆16 Updated last year
- Code and data for Teddy (https://arxiv.org/abs/2001.05171). ☆15 Updated 2 years ago
- Sequence tagging with spaCy and crfsuite ☆18 Updated last year
- Code for extracting parallel corpora from PMIndia ☆16 Updated 4 years ago
- This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the re… ☆12 Updated 2 months ago
- Arabic News Stance Corpus ☆10 Updated 3 years ago
- TopicScan: Visualization and validation interface for NMF Topic Modeling ☆23 Updated 4 years ago
- Tool for sentiment analysis annotation ☆11 Updated last month
- Simple and clean Python implementation of TextRank as per the seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot… ☆11 Updated 3 years ago
- Transformer-based Trigram Blocking implementation in TensorFlow ☆11 Updated 4 years ago
- Tools and services for evaluating topic models ☆15 Updated 8 years ago
- Generic Environment for Context-Aware Correction of Orthography ☆22 Updated 2 years ago
- Pre-production releases for spaCy in Catalan ☆14 Updated 2 years ago
- Text processing library for sentiment analysis and related tasks ☆27 Updated 6 years ago
- This is a document concerning Data Readiness in the context of machine learning and Natural Language Processing. ☆11 Updated 3 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 Updated 5 years ago
- This project provides spell-checking help for Urdu-to-Hindi transliteration. The spelling errors in our case mostly comprise errors… ☆10 Updated 5 years ago
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 Updated 2 years ago
- This repository includes all the code and data for the paper ELiDi (End2end Entity Linking and Disambiguation) ☆14 Updated 3 years ago
- This repo contains the code used to generate the French Wikipedia sample used in the QA annotation project PIAF ☆11 Updated 3 years ago
- An alternative approach for probabilistic topic modeling based on agglomerative clustering of topics (not documents) ☆12 Updated 3 years ago
- Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through use… ☆27 Updated last year