DavidNemeskey / cc_corpus
Tools for compiling corpora from Common Crawl
☆14 · Updated 5 months ago
Alternatives and similar repositories for cc_corpus:
Users who are interested in cc_corpus are comparing it to the repositories listed below.
- BERT and ELECTRA models trained on Europeana Newspapers ☆38 · Updated 3 years ago
- ☆64 · Updated 2 years ago
- Linguistic and stylistic complexity measures for (literary) texts ☆80 · Updated last year
- Sentence transformers models for SpaCy ☆107 · Updated 2 years ago
- The home repository of the NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens. ☆15 · Updated last year
- Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do… ☆80 · Updated 9 months ago
- coFR: COreference resolution tool for FRench (and singletons). ☆24 · Updated 4 years ago
- Natural language processing resources for multiple languages, with an eye towards use for digital humanities. ☆126 · Updated 3 years ago
- Dutch coreference resolution & dialogue analysis using deterministic rules ☆21 · Updated last year
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at… ☆22 · Updated 8 months ago
- An implementation of GrASP (Shnarch et al., 2017) ☆21 · Updated 2 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆161 · Updated 2 years ago
- A High-level Library for Named Entity Recognition in Python. ☆23 · Updated last year
- REMERGE - Multi-Word Expression discovery algorithm ☆14 · Updated 2 years ago
- Language Models for Zalando's flair library ☆61 · Updated 5 years ago
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata ☆94 · Updated 2 years ago
- A minimal, pure Python library to interface with CoNLL-U format files. ☆151 · Updated last year
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang… ☆122 · Updated last year
- A monolingual and cross-lingual meta-embedding generation and evaluation framework ☆80 · Updated 2 years ago
- These are lists for a variety of languages containing words that are distinctive to each language. ☆38 · Updated 3 years ago
- A module to compute textual lexical richness (aka lexical diversity). ☆106 · Updated last year
- Bilingual term extractor ☆53 · Updated last year
- Natural language understanding benchmarks for Norwegian ☆14 · Updated last year
- A spaCy custom component that extracts and normalizes temporal expressions ☆54 · Updated 2 years ago
- 🧪 Cutting-edge experimental spaCy components and features ☆98 · Updated last year
- A character-level BERT for Ancient Greek ☆10 · Updated last year
- Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴) ☆15 · Updated 3 years ago
- spaCy + UDPipe ☆161 · Updated 3 years ago
- This is a simple Python package for calculating a variety of lexical diversity indices ☆75 · Updated last year
- A Python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer). ☆14 · Updated 10 months ago