kutvonenaki / cc100-sentencepiece
Pretrained SentencePiece tokenizers for English and Japanese, trained on Common Crawl data at various vocabulary sizes. Also includes a development environment for adding further languages.
★10 · Updated 3 years ago
Alternatives and similar repositories for cc100-sentencepiece:
Users who are interested in cc100-sentencepiece are comparing it to the libraries listed below.
- Helper scripts and notes that were used while porting various NLP models ★46 · Updated 3 years ago
- ★21 · Updated 3 years ago
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ★58 · Updated 2 years ago
- Repository for Findings of EMNLP 2020 "Context-aware Stand-alone Neural Spelling Correction" ★18 · Updated 4 years ago
- Large-scale query-focused multi-document Summarization dataset ★10 · Updated 3 years ago
- Observe the slow deterioration of my mental sanity in the github commit history ★12 · Updated last year
- ★21 · Updated 2 months ago
- This repo contains code for the paper "Psychologically-informed chain-of-thought prompts for metaphor understanding in large language mod… ★14 · Updated last year
- LTG-Bert ★31 · Updated last year
- Corresponding code repo for the paper at COLING 2020 - ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset" ★54 · Updated 3 years ago
- A web application that interfaces two GEC systems. [web instance is down] ★31 · Updated 7 months ago
- Consists of the largest (10K) human annotated code-switched semantic parsing dataset & 170K generated utterance using the CST5 augmentati… ★37 · Updated 2 years ago
- Using short models to classify long texts ★21 · Updated 2 years ago
- Data and code accompanying the paper "Intent Detection with WikiHow" ★10 · Updated 3 years ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ★93 · Updated 2 years ago
- Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages ★13 · Updated 2 years ago
- A repo for code based language models ★18 · Updated 4 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ★30 · Updated 2 years ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ★24 · Updated 2 years ago
- ★44 · Updated 4 months ago
- Experiments for XLM-V Transformers Integration ★13 · Updated 2 years ago
- Generative Retrieval Transformer ★28 · Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ★100 · Updated 11 months ago
- LAReQA is a challenging benchmark for evaluating language agnostic answer retrieval from a multilingual candidate pool. This repository c… ★14 · Updated 4 years ago
- Code for Paper "Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition" ★21 · Updated 2 years ago
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ★26 · Updated 2 years ago
- [ICLR 2022] Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators ★24 · Updated last year
- Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023) ★22 · Updated last year
- Code for running the experiments in Deep Subjecthood: Higher Order Grammatical Features in Multilingual BERT ★17 · Updated last year
- Implementation of "SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages" paper, accepted to E… ★25 · Updated 2 years ago