Wikipedia text corpus for self-supervised NLP model training
☆47Jul 17, 2022Updated 3 years ago
Alternatives and similar repositories for wikipedia2corpus
Users that are interested in wikipedia2corpus are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- This is a german text corpus from Wikipedia. It is cleaned, preprocessed and sentence splitted. It's purpose is to train NLP embeddings l…☆23Feb 22, 2022Updated 4 years ago
- German word embeddings computed from a corpus of parliamentary transcripts (2017-2019)☆15Mar 5, 2020Updated 6 years ago
- German Alpaca Dataset (Cleaned + Translated)☆26Apr 6, 2023Updated 3 years ago
- Brave is a simple visualisation library for NLP information extraction, built on top of embedded BRAT.☆15Dec 25, 2019Updated 6 years ago
- BERT and ELECTRA models trained on Europeana Newspapers☆39Dec 14, 2021Updated 4 years ago
- Bare Metal GPUs on DigitalOcean Gradient AI • AdPurpose-built for serious AI teams training foundational models, running large-scale inference, and pushing the boundaries of what's possible.
- X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents (JCDL 2022)☆14Jul 22, 2022Updated 3 years ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning☆30Jan 25, 2023Updated 3 years ago
- Klexikon: A German Dataset for Joint Summarization and Simplification☆16Oct 5, 2022Updated 3 years ago
- Repo for the simplified text alignment tools.☆21Dec 4, 2020Updated 5 years ago
- This is a german ELMo deep contextualized word representation. It is trained on a special German Wikipedia Text Corpus.☆28Dec 15, 2019Updated 6 years ago
- Code for our TSD paper "TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models"☆14Aug 19, 2022Updated 3 years ago
- This repository contains code for the paper "Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs" (Wang, Lawrence…☆17Mar 8, 2021Updated 5 years ago
- [ACL 20] Probing Linguistic Features of Sentence-level Representations in Neural Relation Extraction☆13Apr 21, 2020Updated 6 years ago
- DWIE (Deutsche Welle corpus for Information Extraction) dataset. Introduced in our "DWIE: an entity-centric dataset for multi-task docume…☆52Jul 23, 2023Updated 2 years ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- Codebase, data and models for the Re-Thinking the Shuffle Test paper at ACL2021☆10Oct 14, 2022Updated 3 years ago
- Plan and train German transformer models.☆23Feb 22, 2021Updated 5 years ago
- Mirror of Apache OpenNLP Add-ons☆19Jun 8, 2026Updated last week
- ☆18Feb 1, 2023Updated 3 years ago
- Python code to automatically produce a summary of a piece of text.☆11Sep 8, 2016Updated 9 years ago
- Code accompanying the paper "Knowledge Base Completion Meets Transfer Learning"☆15Feb 21, 2024Updated 2 years ago
- ☆14Mar 31, 2024Updated 2 years ago
- MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer☆40Jun 7, 2022Updated 4 years ago
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- Tools relating to the CC-News-En Collection☆20Dec 8, 2023Updated 2 years ago
- ☆12Apr 29, 2024Updated 2 years ago
- This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and appl…☆20Jun 5, 2025Updated last year
- Curriculum training☆22Jun 25, 2025Updated 11 months ago
- ☆24Jun 12, 2023Updated 3 years ago
- This repo is meant to serve as a guide for Machine Learning/AI technical interviews.☆11Mar 5, 2024Updated 2 years ago
- Official code repository for the main conference paper in EMNLP 2022: SubeventWriter: Iterative Sub-event Sequence Generation with Cohere…☆11Oct 16, 2022Updated 3 years ago
- XWikisCorpus, cross-lingual summarisation, multi-lingual summarisation, pre-trained language models, zero-shot and few-shot summarisation…☆10Nov 4, 2022Updated 3 years ago
- Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…☆45Aug 10, 2024Updated last year
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- GraphOfDocs: Representing multiple documents as a single graph☆21Jun 22, 2022Updated 3 years ago
- Building an effective preprocessing tool for African languages☆13Jan 24, 2024Updated 2 years ago
- Create and analyze argument graphs and serialize them via Protobuf☆10Updated this week
- SMiLER - Samsung MultiLingual Entity and Relation Extraction dataset☆18Feb 11, 2021Updated 5 years ago
- The NLPStatTest project☆12Mar 12, 2022Updated 4 years ago
- ☆11Dec 8, 2022Updated 3 years ago
- Python text classification package with a focus on medical text.☆17Jun 28, 2021Updated 4 years ago