zyocum / dedup
Find duplicate text files.
☆14Updated 5 months ago
Alternatives and similar repositories for dedup
Users who are interested in dedup are comparing it to the repositories listed below
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values …☆15Updated 6 years ago
- Code and data for Teddy https://arxiv.org/abs/2001.05171.☆15Updated 3 years ago
- A Python package to get useful information from documents using the TopicRank algorithm.☆16Updated last year
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼☆22Updated 5 months ago
- Tool for sentiment analysis annotation☆12Updated 3 months ago
- A set of methods for finding an appropriate number of topics in a text collection☆16Updated 2 months ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆18Updated last year
- Scripts for building a geo-located web corpus using Common Crawl data☆11Updated 2 months ago
- An index data structure for approximate string search.☆23Updated 6 years ago
- An open-source NLP library: fast text cleaning and preprocessing☆23Updated 3 years ago
- Easy formatted text extraction from images using Google Vision API☆42Updated 4 years ago
- In this project, we need to find out commercial products listed on Google that refer to the same entity across Amazon by comparing the si…☆11Updated 8 years ago
- Comparing state of the art models for text summary generation☆19Updated 2 years ago
- Text preprocessing tools in Python.☆27Updated 7 years ago
- Labeled segmentation for the document structure of printed books☆13Updated 7 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy.☆41Updated 5 years ago
- Annotation management for Prodigy that supports multiple users working on many projects☆15Updated 6 years ago
- Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames☆25Updated 4 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.☆44Updated last year
- Tokenization across languages. Useful as preprocessing for subword tokenization.☆22Updated 2 years ago
- This script uses an ensemble of multiple methods: RAKE, TF-IDF and Automatic Keyword Extraction to obtain top keywords in Reddit posts. P…☆12Updated 8 years ago
- A Python library to generate highly realistic typos (fuzz-testing)☆11Updated 3 months ago
- An ongoing series of notebooks aimed at helping fellow NLP enthusiasts think about applying new tools and techniques to practical tasks.☆18Updated 4 years ago
- Text classification AutoML☆21Updated 3 years ago
- An optimized Transformer-based abstractive summarization model with TensorFlow☆17Updated 2 years ago
- Topic modelling with spaCy, Gensim and Textacy☆19Updated 7 years ago
- An example of how to use spaCy for extremely large files without running into memory issues☆36Updated 2 years ago
- Topic inference with zero-shot models☆61Updated 2 years ago
- ☆18Updated 3 years ago
- Code for "Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking" (https://arxiv.org/abs/2…☆14Updated 2 years ago