t-systems-on-site-services-gmbh / german-wikipedia-text-corpus
This is a German text corpus from Wikipedia. It is cleaned, preprocessed, and split into sentences. Its purpose is to train NLP embeddings such as fastText or ELMo deep contextualized word representations.
☆22Updated 2 years ago
Related projects
Alternatives and complementary repositories for german-wikipedia-text-corpus
- OpusFilter - Parallel corpus processing toolkit☆102Updated 3 months ago
- Wikipedia text corpus for self-supervised NLP model training☆40Updated 2 years ago
- BERT and ELECTRA models trained on Europeana Newspapers☆36Updated 2 years ago
- MT Evaluation in Many Languages via Zero-Shot Paraphrasing☆102Updated 3 months ago
- Alignment and annotation for comparable documents.☆22Updated 6 years ago
- Plan and train German transformer models.☆23Updated 3 years ago
- Efficient Low-Memory Aligner☆139Updated 2 months ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆150Updated 5 months ago
- coFR: COreference resolution tool for FRench (and singletons).☆24Updated 4 years ago
- Curriculum training☆16Updated 2 months ago
- Automatic extraction of edited sentences from text edition histories.☆81Updated 2 years ago
- Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do…☆76Updated 4 months ago
- Datasets for the task of tracing diachronic semantic shifts in Russian for two large-scale time period pairs (from pre-Soviet to Soviet t…☆14Updated 6 months ago
- Disambiguate is a tool for training and using state of the art neural WSD models☆58Updated 2 years ago
- Poetry Corpora Annotated on Aesthetic Emotions☆11Updated 2 years ago
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/☆82Updated last month
- Transformer based translation quality estimation☆107Updated last year
- GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge (EMNLP 2019)☆92Updated 2 years ago
- A minimal, pure Python library to interface with CoNLL-U format files.☆149Updated last year
- A tool that locates, downloads, and extracts machine translation corpora☆147Updated 5 months ago
- Linguistic and stylistic complexity measures for (literary) texts☆77Updated 9 months ago
- Scripts to preprocess training and test data and to run fast_align and giza☆109Updated 3 years ago
- a tool for calculating character n-gram F score☆67Updated last year
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at…☆22Updated 3 months ago
- This is a German ELMo deep contextualized word representation. It is trained on a special German Wikipedia Text Corpus.☆28Updated 4 years ago
- NTREX -- News Test References for MT Evaluation☆75Updated 5 months ago