kutvonenaki / cc100-sentencepiece
Common Crawl pretrained SentencePiece tokenizers for English and Japanese in various vocabulary sizes. Also a development environment for further languages.
☆10 Updated 3 years ago
Alternatives and similar repositories for cc100-sentencepiece
Users that are interested in cc100-sentencepiece are comparing it to the libraries listed below
- Helper scripts and notes that were used while porting various NLP models ☆46 Updated 3 years ago
- All my experiments with the various transformers and various transformer frameworks available ☆14 Updated 4 years ago
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ☆58 Updated 2 years ago
- Repository for Findings of EMNLP 2020 "Context-aware Stand-alone Neural Spelling Correction" ☆18 Updated 4 years ago
- Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages ☆13 Updated 2 years ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- Large-scale query-focused multi-document Summarization dataset ☆10 Updated 3 years ago
- Code for "CyberWallE at SemEval-2020 Task 11: An Analysis of Feature Engineering for Ensemble Models for Propaganda Detection" (V. Blasch… ☆9 Updated 4 years ago
- Using short models to classify long texts ☆21 Updated 2 years ago
- Data and code accompanying the paper "Intent Detection with WikiHow" ☆10 Updated 4 years ago
- zero-vocab or low-vocab embeddings ☆18 Updated 2 years ago
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ☆26 Updated 2 years ago
- A package for fine-tuning Transformers with TPUs, written in TensorFlow 2.0+ ☆38 Updated 4 years ago
- LTG-Bert ☆33 Updated last year
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆39 Updated last year
- A tiny BERT for low-resource monolingual models ☆31 Updated 8 months ago
- Implementation of Z-BERT-A: a zero-shot pipeline for unknown intent detection. ☆40 Updated last year
- A Benchmark Dataset for Understanding Disfluencies in Question Answering ☆63 Updated 3 years ago
- Code for GenAug: Data Augmentation for Finetuning Text Generators. ☆27 Updated 3 years ago
- Code for Paper "Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition" ☆21 Updated 2 years ago
- The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021) ☆52 Updated 2 years ago
- Embedding Recycling for Language models ☆38 Updated last year
- Fast whitespace correction with Transformers ☆16 Updated 3 weeks ago