zyocum / dedup
Find duplicate text files.
⭐11 · Updated 4 months ago
Related projects:
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼 ⭐23 · Updated 4 months ago
- Post-processing OCR errors with seq2seq models ⭐28 · Updated 4 years ago
- This is a prototype of a multi-lingual suite for named-entity recognition in Python. ⭐21 · Updated 4 months ago
- An example of how to use spaCy for extremely large files without running into memory issues ⭐36 · Updated 2 years ago
- sequence tagging with spaCy and crfsuite ⭐18 · Updated last year
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ⭐11 · Updated 7 months ago
- Simple and clean Python implementation of TextRank as per the seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot… ⭐11 · Updated 3 years ago
- Tool for sentiment analysis annotation ⭐11 · Updated 6 years ago
- Scripts for building a geo-located web corpus using Common Crawl data ⭐9 · Updated 6 months ago
- Generate multiple choice fill-in-the-blank questions from any article. ⭐13 · Updated last year
- Code for extracting parallel corpora from pmindia ⭐16 · Updated 4 years ago
- Extracting narrative timelines (i.e. order and timing of events) from text ⭐20 · Updated 5 years ago
- Text preprocessing tools in python. ⭐26 · Updated 6 years ago
- BERT models for many languages created from Wikipedia texts ⭐34 · Updated 4 years ago
- This project lets you automatically extract glossary words and their definitions from a given piece of text using NLP techniques ⭐29 · Updated 4 years ago
- Arabic News Stance Corpus ⭐10 · Updated 3 years ago
- ⭐15 · Updated 3 years ago
- No Teacher BART distillation experiment for NLI tasks ⭐25 · Updated 4 years ago
- Text pattern search using marisa-trie ⭐18 · Updated 3 years ago
- CoreNLG is an easy to use and productivity oriented Python library for Natural Language Generation. It aims to provide the essential tool… ⭐27 · Updated 3 years ago
- The code shows how to load fastText vectors into spaCy ⭐18 · Updated 3 years ago
- TopicScan: Visualization and validation interface for NMF Topic Modeling ⭐23 · Updated 4 years ago
- Neural Sentential Paraphrase Generation to Augment Chatbot Training Dataset ⭐22 · Updated last year
- Using BERT for Conditional Natural Language Generation by fine-tuning pre-trained BERT on a custom dataset. ⭐40 · Updated 4 years ago
- sequence tagging for NER for ULMFiT ⭐21 · Updated 3 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ⭐42 · Updated 5 years ago
- Text Similarity Search Application using Modern NLP and Elasticsearch ⭐29 · Updated 4 years ago
- Interpretable feature construction from taxonomies for text classification ⭐18 · Updated 2 years ago
- This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the re… ⭐12 · Updated this week
- This repository includes all the code and data for the paper ELiDi (End2end Entity Linking and Disambiguation) ⭐14 · Updated 3 years ago
- An end-to-end event extraction and summarization system. ⭐21 · Updated 3 years ago