DavidNemeskey / cc_corpus
Tools for compiling corpora from Common Crawl
☆14 · Updated 5 months ago
Alternatives and similar repositories for cc_corpus
Users who are interested in cc_corpus are comparing it to the libraries listed below
- The home repository of the NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens. ☆15 · Updated last year
- Linguistic and stylistic complexity measures for (literary) texts ☆81 · Updated last year
- Tower Parse: Low-Resource Dependency Parsing via Hierarchical Source Selection ☆15 · Updated 3 years ago
- ☆64 · Updated 2 years ago
- A module to compute textual lexical richness (aka lexical diversity). ☆106 · Updated last year
- A spaCy custom component that extracts and normalizes temporal expressions ☆54 · Updated 2 years ago
- MAGPIE: A sense-annotated corpus of potentially idiomatic expressions ☆27 · Updated 4 years ago
- This is a simple Python package for calculating a variety of lexical diversity indices ☆77 · Updated last year
- Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do… ☆80 · Updated 10 months ago
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang… ☆122 · Updated last year
- A python true casing utility that restores case information for texts ☆88 · Updated 2 years ago
- coFR: COreference resolution tool for FRench (and singletons). ☆24 · Updated 4 years ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆131 · Updated 5 months ago
- BERT and ELECTRA models trained on Europeana Newspapers ☆38 · Updated 3 years ago
- Python 3 library for processing historical English ☆67 · Updated 9 months ago
- Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models ☆65 · Updated 2 years ago
- Visualise, evaluate, and manage annotated data ☆33 · Updated 2 years ago
- ☆161 · Updated 11 months ago
- MultiLexNorm 2021 competition system from ÚFAL ☆15 · Updated 3 years ago
- ☆47 · Updated 9 months ago
- Language Models for Zalando's flair library ☆61 · Updated 5 years ago
- ☆34 · Updated 7 months ago
- 🧪 Cutting-edge experimental spaCy components and features ☆98 · Updated last year
- Alignment and annotation for comparable documents. ☆22 · Updated 6 years ago
- 🖋 Resource and Tool for Writing System Identification -- LREC 2024 ☆14 · Updated 11 months ago
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio… ☆44 · Updated last year
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 · Updated last year
- SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages ☆9 · Updated last year
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆161 · Updated 2 years ago
- Neural CRF Model for Sentence Alignment in Text Simplification ☆67 · Updated 4 months ago