GermanT5/wikipedia2corpus

Wikipedia text corpus for self-supervised NLP model training

☆47

Alternatives and similar repositories for wikipedia2corpus

Users who are interested in wikipedia2corpus are comparing it to the repositories listed below.

oliverguhr / german-sentiment
A dataset and model for German sentiment classification.
☆70 · Jul 17, 2026 · Updated last week
sueddeutsche / political-german-word-embeddings
German word embeddings computed from a corpus of parliamentary transcripts (2017-2019)
☆15 · Mar 5, 2020 · Updated 6 years ago
LEL-A / GerAlpacaDataCleaned
German Alpaca Dataset (Cleaned + Translated)
☆26 · Apr 6, 2023 · Updated 3 years ago
dennlinger / klexikon
Klexikon: A German Dataset for Joint Summarization and Simplification
☆17 · Oct 5, 2022 · Updated 3 years ago
stefan-it / europeana-bert
BERT and ELECTRA models trained on Europeana Newspapers
☆39 · Dec 14, 2021 · Updated 4 years ago
sobamchan / xscitldr
X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents (JCDL 2022)
☆14 · Jul 22, 2022 · Updated 4 years ago
malteos / clp-transfer
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
☆30 · Jan 25, 2023 · Updated 3 years ago
ghpaetzold / massalign
Alignment and annotation for comparable documents.
☆22 · Oct 16, 2018 · Updated 7 years ago
t-systems-on-site-services-gmbh / german-elmo-model
This is a German ELMo deep contextualized word representation. It is trained on a special German Wikipedia text corpus.
☆28 · Dec 15, 2019 · Updated 6 years ago
uds-lsv / TOKEN-is-a-MASK
Code for our TSD paper "TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models"
☆14 · Aug 19, 2022 · Updated 3 years ago
nec-research / st_tau
This repository contains code for the paper "Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs" (Wang, Lawrence…
☆17 · Mar 8, 2021 · Updated 5 years ago
klimzaporojets / DWIE
DWIE (Deutsche Welle corpus for Information Extraction) dataset. Introduced in our "DWIE: an entity-centric dataset for multi-task docume…
☆51 · Jul 23, 2023 · Updated 3 years ago
German-NLP-Group / german-transformer-training
Plan and train German transformer models.
☆23 · Feb 22, 2021 · Updated 5 years ago
UniversalDependencies / UD_Polish-PDB
Polish data.
☆13 · May 6, 2026 · Updated 2 months ago
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 8 months ago
HKUST-KnowComp / SubeventWriter
Official code repository for the main conference paper in EMNLP 2022: SubeventWriter: Iterative Sub-event Sequence Generation with Cohere…
☆11 · Oct 16, 2022 · Updated 3 years ago
valentinhofmann / flota
☆18 · Feb 1, 2023 · Updated 3 years ago
lorelupo / divide-and-rule
☆12 · Oct 17, 2022 · Updated 3 years ago
nlpaueb / multi-eurlex
MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer
☆40 · Jun 7, 2022 · Updated 4 years ago
vid-koci / KBCtransferlearning
Code accompanying the paper "Knowledge Base Completion Meets Transfer Learning"
☆15 · Feb 21, 2024 · Updated 2 years ago
tsproisl / SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
☆153 · Dec 9, 2024 · Updated last year
stefan-it / ukrainian-electra
Ukrainian ELECTRA model
☆12 · Mar 11, 2023 · Updated 3 years ago
lauhaide / clads
XWikisCorpus, cross-lingual summarisation, multi-lingual summarisation, pre-trained language models, zero-shot and few-shot summarisation…
☆10 · Nov 4, 2022 · Updated 3 years ago
tingofurro / shuffle_test
Codebase, data and models for the Re-Thinking the Shuffle Test paper at ACL2021
☆10 · Oct 14, 2022 · Updated 3 years ago
JMMackenzie / CC-News-Tools
Tools relating to the CC-News-En Collection
☆20 · Dec 8, 2023 · Updated 2 years ago
JungHoyoun / PromptCompressor
☆12 · Apr 29, 2024 · Updated 2 years ago
stopwords-iso / stopwords-de
German stopwords collection
☆88 · Oct 6, 2022 · Updated 3 years ago
pawel-bujnowski / smiler
SMiLER - Samsung MultiLingual Entity and Relation Extraction dataset
☆18 · Feb 11, 2021 · Updated 5 years ago
ieg-dhr / NLP-Course4Humanities_2024
This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and appl…
☆21 · Jun 5, 2025 · Updated last year
martiansideofthemoon / longeval-summarization
Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…
☆45 · Aug 10, 2024 · Updated last year
jacobkrantz / lstm-syllabify
Breaks a word into syllables using an LSTM-based neural network.
☆20 · Aug 14, 2023 · Updated 2 years ago
masakhane-io / masakhanePreprocessor
Building an effective preprocessing tool for African languages
☆13 · Jan 24, 2024 · Updated 2 years ago
recap-utr / arguebuf-python
Create and analyze argument graphs and serialize them via Protobuf
☆10 · Updated this week
hanxiao / demo-poems-ir
Poems retrieval demo built with GNES framework
☆14 · Oct 3, 2019 · Updated 6 years ago
julmaxi / Abstractive-Timeline-Summarization
☆11 · Dec 8, 2022 · Updated 3 years ago
telekom / mltb2
Machine Learning Toolbox 2
☆13 · Nov 22, 2025 · Updated 8 months ago
DevSinghSachan / syntax-augmented-bert
Source code of the paper "Do Syntax Trees Help Pre-trained Transformers Extract Information?" (EACL 2021)
☆75 · Dec 29, 2021 · Updated 4 years ago
stefan-it / german-gpt2
German GPT-2 model
☆32 · Aug 17, 2021 · Updated 4 years ago
Kaleidophon / awesome-experimental-standards-deep-learning
Repository collecting resources and best practices to improve experimental rigour in deep learning research.
☆27 · Mar 30, 2023 · Updated 3 years ago