DavidNemeskey / cc_corpus
Tools for compiling corpora from Common Crawl
☆14 Updated 6 months ago
Alternatives and similar repositories for cc_corpus
Users who are interested in cc_corpus are comparing it to the libraries listed below.
- The home repository of the NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens. ☆15 Updated last year
- A python module for word inflections designed for use with spaCy. ☆92 Updated 5 years ago
- Linguistic and stylistic complexity measures for (literary) texts ☆81 Updated last year
- Tool for parsing and converting various span encoding schemes. ☆23 Updated last year
- coFR: COreference resolution tool for FRench (and singletons). ☆24 Updated 5 years ago
- spaCy + UDPipe ☆161 Updated 3 years ago
- BERT and ELECTRA models trained on Europeana Newspapers ☆38 Updated 3 years ago
- ☆64 Updated 2 years ago
- Sentence transformers models for SpaCy ☆107 Updated 2 years ago
- Alignment and annotation for comparable documents. ☆22 Updated 6 years ago
- A spaCy custom component that extracts and normalizes temporal expressions ☆54 Updated 2 years ago
- Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do … ☆80 Updated 11 months ago
- Easier Automatic Sentence Simplification Evaluation ☆161 Updated last year
- A Word Sense Disambiguation system integrating implicit and explicit external knowledge. ☆69 Updated 3 years ago
- classy is a simple-to-use library for building high-performance Machine Learning models in NLP. ☆87 Updated 2 months ago
- List of corpora annotated for coreference for different languages ☆17 Updated 10 months ago
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at… ☆22 Updated 10 months ago
- DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models ☆157 Updated 2 years ago
- An easy-to-use library to extract indices from texts. ☆29 Updated 3 years ago
- Searching in-memory corpus with Corpus Query Language (CQL) ☆19 Updated 6 months ago
- A tokenizer and sentence splitter for German and English web and social media texts. ☆145 Updated 5 months ago
- A set of utility scripts to process Wikipedia related data ☆38 Updated 2 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆162 Updated 2 years ago
- Automatically detect errors in annotated corpora. ☆47 Updated last year
- Automatic extraction of edited sentences from text edition histories. ☆83 Updated 3 years ago
- 🧪 Cutting-edge experimental spaCy components and features ☆99 Updated last year
- An easy-to-use API for analyzing INCEpTION annotation projects. ☆17 Updated last year
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang… ☆122 Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆67 Updated 2 years ago
- Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models ☆65 Updated 2 years ago