ClimSocAna / tecb-de
German Text Embedding Clustering Benchmark
☆15Updated 7 months ago
Related projects
Alternatives and complementary repositories for tecb-de
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.☆52Updated 3 months ago
- ☆27Updated 2 months ago
- Efficiently find the best-suited language model (LM) for your NLP task☆14Updated this week
- Evaluate language models using multiple choice items☆12Updated last month
- Data for the HIPE 2022 shared task.☆15Updated 11 months ago
- GC4LM: A Colossal (Biased) language model for German☆13Updated 3 years ago
- Software for transferring pre-trained English models to foreign languages☆18Updated last year
- Tools for managing datasets for governance and training.☆77Updated 2 weeks ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning☆29Updated last year
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets.☆11Updated 11 months ago
- German Alpaca Dataset (Cleaned + Translated)☆23Updated last year
- A spaCy custom component that extracts and normalizes temporal expressions☆52Updated last year
- ☆28Updated 11 months ago
- A Python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).☆13Updated 5 months ago
- ☆14Updated 5 months ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language☆70Updated 8 months ago
- Master thesis: Exploring bias in German NLG (GPT-3 & GerPT-2). Applies regard classification and bias mitigation triggers.☆14Updated last month
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/☆81Updated last month
- PropSegmEnt is an annotated dataset for segmenting English text into propositions, and recognizing proposition-level entailment relations…☆18Updated last year
- Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.☆50Updated last year
- This repository provides the source code used to automatically generate the book summarization datasets described in the paper titled "Ec…☆11Updated last year
- Code for SaGe subword tokenizer (EACL 2023)☆22Updated last month
- Codebase, data and models for the Keep it Simple paper at ACL2021☆36Updated last year
- Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models☆65Updated last year
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.☆75Updated 2 months ago
- Evaluation of language models on mono- or multilingual tasks.☆74Updated this week
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset.☆92Updated last year
- Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)☆63Updated 2 years ago
- ☆19Updated 3 years ago
- The corresponding code for our paper: "Exploring the Challenges of Open Domain Multi-Document Summarization". Do not hesitate to open an …☆31Updated last year