GermanT5 / wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
☆44 Updated 2 years ago
Alternatives and similar repositories for wikipedia2corpus
Users who are interested in wikipedia2corpus compare it to the repositories listed below.
- A survey of corpora for Germanic low-resource languages and dialects ☆25 Updated 6 months ago
- A Word Sense Disambiguation system integrating implicit and explicit external knowledge. ☆69 Updated 3 years ago
- OpusFilter - Parallel corpus processing toolkit ☆104 Updated this week
- Framework for unified summarisation and evaluation of English documents using state-of-the-art models and measures. ☆32 Updated last year
- CD20200004 from 01/01/2021 to 31/12/2023 - LIG UGA - Python Notebook and Models for the MT Lab @ ALPS 2022 ☆13 Updated last year
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆22 Updated 3 years ago
- This is a German text corpus from Wikipedia. It is cleaned, preprocessed and sentence-split. Its purpose is to train NLP embeddings l… ☆24 Updated 3 years ago
- ParaNames: A multilingual resource for parallel names ☆34 Updated last year
- This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' pu… ☆40 Updated 3 years ago
- GC4LM: A Colossal (Biased) language model for German ☆13 Updated 4 years ago
- MAGPIE: A sense-annotated corpus of potentially idiomatic expressions ☆27 Updated 5 years ago
- This dataset contains human judgements about answer equivalence. The data is based on SQuAD (Stanford Question Answering Dataset), and co… ☆25 Updated 2 years ago
- A tiny BERT for low-resource monolingual models ☆31 Updated 9 months ago
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/ ☆87 Updated last month
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio… ☆44 Updated last year
- Curriculum training ☆18 Updated this week
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆72 Updated last year
- XED multilingual emotion datasets ☆61 Updated 2 years ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning ☆30 Updated 2 years ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆103 Updated last year
- A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations ☆56 Updated 2 years ago
- Wikipedia based dataset to train relationship classifiers and fact extraction models ☆25 Updated 4 years ago
- A general-purpose library for cross-document NLP modelling and analysis ☆11 Updated last year
- UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them ☆38 Updated 3 years ago
- Alignment and annotation for comparable documents. ☆22 Updated 6 years ago
- Code for the CRAC 2021 paper "On Generalization in Coreference Resolution" (Best short paper award) ☆35 Updated last year