Wikipedia text corpus for self-supervised NLP model training
☆46Jul 17, 2022Updated 3 years ago
Alternatives and similar repositories for wikipedia2corpus
Users who are interested in wikipedia2corpus are comparing it to the libraries listed below.
- This is a German text corpus from Wikipedia. It is cleaned, preprocessed and sentence-split. Its purpose is to train NLP embeddings l…☆23Feb 22, 2022Updated 4 years ago
- A data set and model for German sentiment classification.☆68May 30, 2025Updated 9 months ago
- BERT and ELECTRA models trained on Europeana Newspapers☆39Dec 14, 2021Updated 4 years ago
- X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents (JCDL 2022)☆14Jul 22, 2022Updated 3 years ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning☆30Jan 25, 2023Updated 3 years ago
- Klexikon: A German Dataset for Joint Summarization and Simplification☆17Oct 5, 2022Updated 3 years ago
- Repo for the simplified text alignment tools.☆21Dec 4, 2020Updated 5 years ago
- This repository contains code for the paper "Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs" (Wang, Lawrence…☆17Mar 8, 2021Updated 5 years ago
- A dataset for realistic evaluation of noisy label methods☆14Dec 3, 2023Updated 2 years ago
- [ACL 20] Probing Linguistic Features of Sentence-level Representations in Neural Relation Extraction☆13Apr 21, 2020Updated 5 years ago
- DWIE (Deutsche Welle corpus for Information Extraction) dataset. Introduced in our "DWIE: an entity-centric dataset for multi-task docume…☆52Jul 23, 2023Updated 2 years ago
- Crossword puzzle generator and publisher using the Constraint Satisfaction Problem (CSP) technique, with minimal backtracking.☆19Mar 29, 2019Updated 6 years ago
- Plan and train German transformer models.☆23Feb 22, 2021Updated 5 years ago
- Polish data.☆13Nov 12, 2025Updated 4 months ago
- ☆18Feb 1, 2023Updated 3 years ago
- Code accompanying the paper "Knowledge Base Completion Meets Transfer Learning"☆15Feb 21, 2024Updated 2 years ago
- A tokenizer and sentence splitter for German and English web and social media texts.☆153Dec 9, 2024Updated last year
- MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer☆41Jun 7, 2022Updated 3 years ago
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- Tools relating to the CC-News-En Collection☆20Dec 8, 2023Updated 2 years ago
- ☆10Mar 29, 2021Updated 4 years ago
- ☆12Apr 29, 2024Updated last year
- PyTorch implementation of Google TCAV☆10Jan 11, 2019Updated 7 years ago
- This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and appl…☆19Jun 5, 2025Updated 9 months ago
- ☆17Nov 23, 2021Updated 4 years ago
- ☆24Jun 12, 2023Updated 2 years ago
- Compound splitter for German language ("Komposita-Zerlegung") based on large dictionary combined with highly efficient multi-pattern stri…☆35Jul 7, 2022Updated 3 years ago
- Official code repository for the main conference paper in EMNLP 2022: SubeventWriter: Iterative Sub-event Sequence Generation with Cohere…☆11Oct 16, 2022Updated 3 years ago
- Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…☆44Aug 10, 2024Updated last year
- GraphOfDocs: Representing multiple documents as a single graph☆21Jun 22, 2022Updated 3 years ago
- Create and analyze argument graphs and serialize them via Protobuf☆10Mar 18, 2026Updated last week
- Building an effective preprocessing tool for African languages☆13Jan 24, 2024Updated 2 years ago
- SMiLER - Samsung MultiLingual Entity and Relation Extraction dataset☆18Feb 11, 2021Updated 5 years ago
- The NLPStatTest project☆12Mar 12, 2022Updated 4 years ago
- ☆11Dec 8, 2022Updated 3 years ago
- German GPT-2 model☆32Aug 17, 2021Updated 4 years ago
- This is a diacritization model for the Arabic language. This model was built/trained using the Tashkeela: the Arabic diacritization corpus on…☆45Sep 10, 2023Updated 2 years ago
- Source code of the paper "Do Syntax Trees Help Pre-trained Transformers Extract Information?" (EACL 2021)☆75Dec 29, 2021Updated 4 years ago
- CAIPI turns LIMEs into trust!☆12May 30, 2020Updated 5 years ago