GermanT5 / wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
☆43Updated 2 years ago
Alternatives and similar repositories for wikipedia2corpus:
Users who are interested in wikipedia2corpus are comparing it to the repositories listed below.
- A Word Sense Disambiguation system integrating implicit and explicit external knowledge.☆68Updated 3 years ago
- A survey of corpora for Germanic low-resource languages and dialects☆25Updated 3 months ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language☆71Updated last year
- CD20200004 from 01/01/2021 to 31/12/2023 - LIG UGA - Python Notebook and Models for the MT Lab @ ALPS 2022☆13Updated 11 months ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst…☆22Updated 3 years ago
- Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do…☆80Updated 8 months ago
- Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also implements pred…☆69Updated 2 years ago
- BERT and ELECTRA models trained on Europeana Newspapers☆37Updated 3 years ago
- Dutch coreference resolution & dialogue analysis using deterministic rules☆21Updated last year
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning☆29Updated 2 years ago
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/☆83Updated 3 weeks ago
- OpusFilter - Parallel corpus processing toolkit☆104Updated last week
- XL-AMR is a sequence-to-graph cross-lingual AMR parser that exploits transfer learning (EMNLP2020).☆17Updated 7 months ago
- A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations☆55Updated 2 years ago
- UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them☆37Updated 2 years ago
- Build a dialog dataset from online books in many languages☆72Updated 2 years ago
- Alignment and annotation for comparable documents.☆22Updated 6 years ago
- GC4LM: A Colossal (Biased) language model for German☆13Updated 3 years ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023☆99Updated 10 months ago
- ParaNames: A multilingual resource for parallel names☆30Updated 9 months ago
- X-SRL Dataset. Including the code for the SRL annotation projection tool and an out-of-the-box word alignment tool based on Multilingual …☆15Updated 3 years ago
- PropSegmEnt is an annotated dataset for segmenting English text into propositions, and recognizing proposition-level entailment relations…☆19Updated 2 years ago
- Neural CRF Model for Sentence Alignment in Text Simplification☆66Updated last month
- A library of translation-based text similarity measures☆25Updated last year
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.☆77Updated 5 months ago
- This repo contains a set of neural transducers, e.g. sequence-to-sequence models, focusing on character-level tasks.☆72Updated last year
- This is a German text corpus from Wikipedia. It is cleaned, preprocessed and sentence-split. Its purpose is to train NLP embeddings l…☆24Updated 3 years ago
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio…☆44Updated last year