Wikipedia text corpus for self-supervised NLP model training
☆46Jul 17, 2022Updated 3 years ago
Alternatives and similar repositories for wikipedia2corpus
Users who are interested in wikipedia2corpus are comparing it to the libraries listed below.
- This is a German text corpus from Wikipedia. It is cleaned, preprocessed, and sentence-split. Its purpose is to train NLP embeddings l…☆23Feb 22, 2022Updated 4 years ago
- A data set and model for German sentiment classification.☆69May 30, 2025Updated 10 months ago
- German word embeddings computed from a corpus of parliamentary transcripts (2017-2019)☆15Mar 5, 2020Updated 6 years ago
- German Alpaca Dataset (Cleaned + Translated)☆26Apr 6, 2023Updated 3 years ago
- Brave is a simple visualisation library for NLP information extraction, built on top of embedded BRAT.☆15Dec 25, 2019Updated 6 years ago
- X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents (JCDL 2022)☆14Jul 22, 2022Updated 3 years ago
- Klexikon: A German Dataset for Joint Summarization and Simplification☆17Oct 5, 2022Updated 3 years ago
- Repo for the simplified text alignment tools.☆21Dec 4, 2020Updated 5 years ago
- This is a German ELMo deep contextualized word representation. It is trained on a special German Wikipedia text corpus.☆28Dec 15, 2019Updated 6 years ago
- Code for our TSD paper "TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models"☆14Aug 19, 2022Updated 3 years ago
- A dataset for realistic evaluation of noisy label methods☆14Dec 3, 2023Updated 2 years ago
- [ACL 20] Probing Linguistic Features of Sentence-level Representations in Neural Relation Extraction☆13Apr 21, 2020Updated 5 years ago
- DWIE (Deutsche Welle corpus for Information Extraction) dataset. Introduced in our "DWIE: an entity-centric dataset for multi-task docume…☆52Jul 23, 2023Updated 2 years ago
- Plan and train German transformer models.☆23Feb 22, 2021Updated 5 years ago
- Polish data.☆13Nov 12, 2025Updated 5 months ago
- ☆18Feb 1, 2023Updated 3 years ago
- ☆12Oct 17, 2022Updated 3 years ago
- ☆14Mar 31, 2024Updated 2 years ago
- A tokenizer and sentence splitter for German and English web and social media texts.☆153Dec 9, 2024Updated last year
- MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer☆41Jun 7, 2022Updated 3 years ago
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- ☆10Mar 29, 2021Updated 5 years ago
- ML pipeline and web app for classifying disaster response messages.☆10Oct 6, 2018Updated 7 years ago
- ☆12Apr 29, 2024Updated last year
- German stopwords collection☆88Oct 6, 2022Updated 3 years ago
- Python project to fetch twitter data for some interesting analyses☆13Dec 7, 2020Updated 5 years ago
- This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and appl…☆19Jun 5, 2025Updated 10 months ago
- Curriculum training☆22Jun 25, 2025Updated 9 months ago
- Process your massive word2vec binary model file as a readable stream of records.☆11Jan 28, 2018Updated 8 years ago
- ☆24Jun 12, 2023Updated 2 years ago
- Compound splitter for the German language ("Komposita-Zerlegung") based on a large dictionary combined with highly efficient multi-pattern stri…☆35Jul 7, 2022Updated 3 years ago
- A small and fast S3 client without the clutter.☆39Apr 7, 2026Updated last week
- Functions for easily making publication-quality figures with matplotlib.☆19Jan 20, 2024Updated 2 years ago
- XWikisCorpus, cross-lingual summarisation, multi-lingual summarisation, pre-trained language models, zero-shot and few-shot summarisation…☆10Nov 4, 2022Updated 3 years ago
- Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…☆44Aug 10, 2024Updated last year
- GraphOfDocs: Representing multiple documents as a single graph☆21Jun 22, 2022Updated 3 years ago
- Data for discourse connective prediction.☆12May 3, 2018Updated 7 years ago
- Building an effective preprocessing tool for African languages☆12Jan 24, 2024Updated 2 years ago
- ☆11Dec 8, 2022Updated 3 years ago