jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆9Updated 6 months ago
Related projects:
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages☆11Updated 7 months ago
- An example of how to use spaCy for extremely large files without running into memory issues☆36Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy.☆42Updated 5 years ago
- Finds linguistic patterns effortlessly☆31Updated last year
- Use spaCy for NLP and output to the FoLiA XML format.☆12Updated 6 months ago
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼☆23Updated 4 months ago
- Analyze Argumentation and Rhetorical Aspects in Scientific Writing.☆19Updated last year
- A Named-Entity Recogniser based on Grobid.☆48Updated last week
- OpenNeuroSpell contains parts of NeuroSpell (http://neurospell.com/en.php) released as open-source. More code will be published as soon a…☆20Updated 2 years ago
- Generic Environment for Context-Aware Correction of Orthography☆22Updated 2 years ago
- ☆29Updated 2 years ago
- Sequence tagging with spaCy and crfsuite☆18Updated last year
- A Python module to process data for Frame Semantic Parsing☆23Updated 3 years ago
- Text readability metrics in Python.☆12Updated 11 years ago
- Featurize words into orthographic and phonological vectors.☆39Updated last year
- AfroLID, a powerful neural toolkit for African language identification that covers 517 African languages.☆27Updated last year
- List of corpora annotated for coreference for different languages☆16Updated last month
- MinScIE is an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations.☆15Updated 5 years ago
- spaCy match and replace, maintaining conjugation☆34Updated last year
- TopicScan: Visualization and validation interface for NMF Topic Modeling☆23Updated 4 years ago
- Converter from UD-trees to BART representation☆37Updated 6 months ago
- GC4LM: A Colossal (Biased) language model for German☆13Updated 3 years ago
- Wikidata embedding☆50Updated last month
- [COLING2020] A challenge dataset for Person SenTiment analysis in news domain.☆10Updated 2 years ago
- An implementation of GrASP (Shnarch et al., 2017)☆21Updated 2 years ago
- ☆16Updated 5 years ago
- BERT models for many languages created from Wikipedia texts☆34Updated 4 years ago
- This repository includes all the code and data for the paper ELiDi (End2end Entity Linking and Disambiguation)☆14Updated 3 years ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst…☆19Updated 2 years ago
- Python Multilingual Ucrel Semantic Analysis System☆29Updated last month