jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11, updated 3 weeks ago
Alternatives and similar repositories for common_crawl_corpus:
Users who are interested in common_crawl_corpus are comparing it to the libraries listed below:
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages (☆10, updated last year)
- An example of how to use spaCy for extremely large files without running into memory issues (☆36, updated 2 years ago)
- Use spaCy for NLP and output to the FoLiA XML format. (☆12, updated last year)
- Finds linguistic patterns effortlessly (☆36, updated last year)
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… (☆52, updated 4 years ago)
- Polyglot is a language identifier for detecting text documents containing text written in more than one language, and for identifying the… (☆33, updated 8 years ago)
- Bagpipes spaCy is a collection of custom spaCy pipeline components designed to enhance text processing capabilities. (☆16, updated 7 months ago)
- Analyze Argumentation and Rhetorical Aspects in Scientific Writing. (☆19, updated 2 years ago)
- A Python library to generate highly realistic typos (fuzz-testing) (☆11, updated 2 weeks ago)
- FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g… (☆112, updated 2 months ago)
- Python Multilingual Ucrel Semantic Analysis System (☆31, updated 7 months ago)
- spaCy-to-naf converter (☆21, updated 9 months ago)
- bin files (☆13, updated 2 months ago)
- Featurize words into orthographic and phonological vectors. (☆40, updated last year)
- A library to extract a publication date from a web page, along with a measure of the accuracy. (☆41, updated 5 years ago)
- Converter from UD-trees to BART representation (☆36, updated last year)
- sequence tagging with spaCy and crfsuite (☆19, updated 2 years ago)
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… (☆22, updated 3 years ago)
- ParaNames: A multilingual resource for parallel names (☆31, updated 10 months ago)
- Wikidata embedding (☆50, updated 4 months ago)
- SeqScore: Scoring for named entity recognition and other sequence labeling tasks (☆23, updated 3 weeks ago)
- LeBLEU: Levenshtein/Letter-edit BLEU, N-gram-based Translation Evaluation Score for Morphologically Complex Languages (☆10, updated 4 years ago)
- Arabic News Stance Corpus (☆10, updated 4 years ago)
- Tool for the Automatic Assessment of Lexical Diversity (☆11, updated 4 years ago)
- Transform TMX to text (☆28, updated 2 years ago)
- linguistic converter / merging tool for multi-level annotated corpora. graph-based (using Python and NetworkX). (☆51, updated last year)
- Parser for KAF NAF files written in Python (☆16, updated 3 years ago)
- Searching in-memory corpus with Corpus Query Language (CQL) (☆19, updated 4 months ago)