kutvonenaki / cc100-sentencepiece
SentencePiece tokenizers pretrained on Common Crawl data for English and Japanese, in various vocabulary sizes. Also includes a development environment for further languages.
☆10 Updated 3 years ago
Alternatives and similar repositories for cc100-sentencepiece
Users who are interested in cc100-sentencepiece are comparing it to the libraries listed below.
- Repository for Findings of EMNLP 2020 "Context-aware Stand-alone Neural Spelling Correction" ☆18 Updated 4 years ago
- Large-scale query-focused multi-document Summarization dataset ☆10 Updated 3 years ago
- ZS4IE: A Toolkit for Zero-Shot Information Extraction with Simple Verbalizations ☆28 Updated 3 years ago
- Code for Paper "Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition" ☆21 Updated 2 years ago
- Code for GenAug: Data Augmentation for Finetuning Text Generators. ☆27 Updated 3 years ago
- The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021) ☆53 Updated 3 years ago
- Data and code accompanying the paper "Intent Detection with WikiHow" ☆10 Updated 4 years ago
- Helper scripts and notes that were used while porting various nlp models ☆46 Updated 3 years ago
- Using short models to classify long texts ☆21 Updated 2 years ago
- Wikipedia based dataset to train relationship classifiers and fact extraction models ☆25 Updated 4 years ago
- A tiny BERT for low-resource monolingual models ☆31 Updated 9 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- LAReQA is a challenging benchmark for evaluating language agnostic answer retrieval from a multilingual candidate pool. This repository c… ☆14 Updated 5 years ago
- A repository for our AAAI-2020 Cross-lingual-NER paper. Code will be updated shortly. ☆47 Updated 2 years ago
- Code for "CyberWallE at SemEval-2020 Task 11: An Analysis of Feature Engineering for Ensemble Models for Propaganda Detection" (V. Blasch… ☆9 Updated 4 years ago
- benchmarks for evaluating MT models ☆12 Updated last year
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ☆30 Updated 3 years ago
- SeqScore: Scoring for named entity recognition and other sequence labeling tasks ☆23 Updated 3 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆21 Updated 4 months ago
- A parallel evaluation data set of SAP software documentation with document structure annotation ☆11 Updated last month
- Multilingual Compositional Wikidata Questions (MCWQ) ☆18 Updated 2 years ago
- Training T5 to perform numerical reasoning. ☆24 Updated 4 years ago
- Code and data for the IWSLT 2022 shared task on Formality Control for SLT ☆21 Updated 2 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆60 Updated 4 years ago
- zero-vocab or low-vocab embeddings ☆18 Updated 2 years ago
- A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ ☆38 Updated 4 years ago
- Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems ☆22 Updated 4 years ago