kutvonenaki / cc100-sentencepiece

Common Crawl pretrained SentencePiece tokenizers for English and Japanese at various vocabulary sizes. Also includes a development environment for adding further languages.

☆10

Alternatives and similar repositories for cc100-sentencepiece

Users who are interested in cc100-sentencepiece are comparing it to the repositories listed below.

gsarti / t5-flax-gcp
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP
☆58 Updated 2 years ago
stas00 / porting
Helper scripts and notes that were used while porting various nlp models
☆46 Updated 3 years ago
odelliab / HowSumm
Large-scale query-focused multi-document Summarization dataset
☆10 Updated 3 years ago
styfeng / GenAug
Code for GenAug: Data Augmentation for Finetuning Text Generators.
☆28 Updated 3 years ago
wangcongcong123 / ttt
A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+
☆38 Updated 4 years ago
zharry29 / wikihow-intent
Data and code accompanying the paper "Intent Detection with WikiHow"
☆10 Updated 4 years ago
krandiash / gpt3-nli
Training a model without a dataset for natural language inference (NLI)
☆25 Updated 4 years ago
SAP / software-documentation-data-set-for-machine-translation
A parallel evaluation data set of SAP software documentation with document structure annotation
☆11 Updated 2 months ago
thiborose / gecko-app
A web application that interfaces two GEC systems. [web instance is down]
☆31 Updated 11 months ago
voidmagic / parameter-differentiation
☆21 Updated 3 years ago
Hellisotherpeople / DebateSum
Corresponding code repo for the paper at COLING 2020 - ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset"
☆54 Updated 3 years ago
bltlab / seqscore
SeqScore: Scoring for named entity recognition and other sequence labeling tasks
☆23 Updated 4 months ago
thunlp / VERNet
Source codes of Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction
☆43 Updated 4 years ago
jacklxc / StandAloneSpellingCorrection
Repository for Findings of EMNLP 2020 "Context-aware Stand-alone Neural Spelling Correction"
☆18 Updated 4 years ago
google-research-datasets / NewsQuizQA
NewsQuizQA is a quiz-style question-answer dataset used for generating quiz questions about the news
☆35 Updated 4 years ago
salesforce / TaiChi
Open source library for few shot NLP
☆78 Updated 2 years ago
allenai / zest
Code and dataset "ZEST" from "Learning from task descriptions", Weller et al, EMNLP 2020
☆17 Updated 4 years ago
lucidrains / marge-pytorch
Implementation of Marge, Pre-training via Paraphrasing, in Pytorch
☆76 Updated 4 years ago
Geotrend-research / smaller-transformers
Load What You Need: Smaller Multilingual Transformers for Pytorch and TensorFlow 2.0.
☆103 Updated 3 years ago
BBN-E / ZS4IE
ZS4IE: A Toolkit for Zero-Shot Information Extraction with Simple Verbalizations
☆28 Updated 3 years ago
gentaiscool / few-shot-lm
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)
☆53 Updated 3 years ago
octanove / grammartagger
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning
☆28 Updated 4 years ago
MilaNLProc / language-invariant-properties
☆22 Updated 3 years ago
google-research-datasets / lareqa
LAReQA is a challenging benchmark for evaluating language agnostic answer retrieval from a multilingual candidate pool. This repository c…
☆14 Updated 5 years ago
prakhar21 / Writing-with-BERT
Using BERT for doing the task of Conditional Natural Language Generation by fine-tuning pre-trained BERT on custom dataset.
☆41 Updated 5 years ago
megagonlabs / SubjQA
A question-answering dataset with a focus on subjective information
☆45 Updated last year
dev-chauhan / PQG-pytorch
Paraphrase Generation model using pair-wise discriminator loss
☆45 Updated 4 years ago
esdurmus / Wikilingua
Multilingual abstractive summarization dataset extracted from WikiHow.
☆92 Updated 4 months ago
Yarkona / TOF
Code for Paper "Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition"
☆21 Updated 2 years ago
cindyxinyiwang / expand-via-lexicon-based-adaptation
Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation"
☆30 Updated 3 years ago