kutvonenaki / cc100-sentencepiece
SentencePiece tokenizers pretrained on Common Crawl for English and Japanese, at various vocabulary sizes. Also includes a development environment for adding further languages.
☆10 · Updated 3 years ago
Alternatives and similar repositories for cc100-sentencepiece:
Users who are interested in cc100-sentencepiece are comparing it to the repositories listed below:
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ☆58 · Updated 2 years ago
- Helper scripts and notes that were used while porting various nlp models ☆46 · Updated 3 years ago
- ☆21 · Updated 3 years ago
- A web application that interfaces two GEC systems. [web instance is down] ☆31 · Updated 8 months ago
- LTG-Bert ☆32 · Updated last year
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ☆26 · Updated 2 years ago
- Large-scale query-focused multi-document Summarization dataset ☆10 · Updated 3 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆61 · Updated 4 years ago
- Data and code accompanying the paper "Intent Detection with WikiHow" ☆10 · Updated 3 years ago
- Implementation of Z-BERT-A: a zero-shot pipeline for unknown intent detection. ☆39 · Updated last year
- Repository for Findings of EMNLP 2020 "Context-aware Stand-alone Neural Spelling Correction" ☆18 · Updated 4 years ago
- An official implementation of "BPE-Dropout: Simple and Effective Subword Regularization" algorithm. ☆51 · Updated 4 years ago
- Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages ☆13 · Updated 2 years ago
- This repository contains the implementation of the paper: "Span Classification with Structured Information for Disfluency Detection in Sp… ☆12 · Updated last year
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆39 · Updated last year
- ☆9 · Updated last year
- ☆38 · Updated 2 years ago
- Benchmarking various Deep Learning models such as BERT, ALBERT, BiLSTMs on the task of sentence entailment using two datasets - MultiNLI … ☆28 · Updated 4 years ago
- Code for "CyberWallE at SemEval-2020 Task 11: An Analysis of Feature Engineering for Ensemble Models for Propaganda Detection" (V. Blasch… ☆9 · Updated 4 years ago
- ☆30 · Updated 4 years ago
- Open source library for few shot NLP ☆78 · Updated last year
- A Benchmark Dataset for Understanding Disfluencies in Question Answering ☆62 · Updated 3 years ago
- ☆34 · Updated 4 years ago
- The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models ☆24 · Updated 3 years ago
- zero-vocab or low-vocab embeddings ☆18 · Updated 2 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ☆30 · Updated 3 years ago
- Source codes of Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction ☆43 · Updated 3 years ago
- ☆12 · Updated last year
- Bilingual sentence similarity classifier using Tensorflow ☆21 · Updated 5 years ago
- BERT models for many languages created from Wikipedia texts ☆33 · Updated 4 years ago