explosion / curated-tokenizers
Lightweight piece tokenization library
☆12 Updated last year
Alternatives and similar repositories for curated-tokenizers
Users who are interested in curated-tokenizers are comparing it to the libraries listed below.
- Library for fast text representation and classification. ☆31 Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆27 Updated last year
- zero-vocab or low-vocab embeddings ☆18 Updated 3 years ago
- Using short models to classify long texts ☆21 Updated 2 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆22 Updated 7 months ago
- LTG-Bert ☆34 Updated 2 years ago
- GLADIS: A General and Large Acronym Disambiguation Benchmark (EACL 23) ☆18 Updated last year
- QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning P… ☆35 Updated 2 years ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆76 Updated 2 weeks ago
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets. ☆13 Updated 2 years ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆49 Updated 2 years ago
- ☆27 Updated 11 months ago
- Efficient few-shot learning with cross-encoders. ☆62 Updated last year
- Source code and data for Like a Good Nearest Neighbor ☆30 Updated last year
- Pre-train Static Word Embeddings ☆94 Updated 5 months ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/… ☆28 Updated last year
- Augmenty is an augmentation library based on spaCy for augmenting texts. ☆156 Updated last year
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any Hugging Face text dataset. ☆96 Updated 3 years ago
- Load What You Need: Smaller Multilingual Transformers for PyTorch and TensorFlow 2.0. ☆105 Updated 3 years ago
- A tiny BERT for low-resource monolingual models ☆31 Updated last month
- ☆132 Updated 2 weeks ago
- A Python library aimed at dissecting and augmenting NER training data. ☆60 Updated 2 years ago
- Train Hugging Face models on top of Prodigy annotations ☆21 Updated last year
- Execute arbitrary SQL queries on 🤗 Datasets ☆32 Updated 2 years ago
- A spaCy custom component that extracts and normalizes temporal expressions ☆56 Updated 2 years ago
- 🛠️ Tools for Transformers compression using PyTorch Lightning ⚡ ☆85 Updated last week
- PyTorch-IE: State-of-the-art Information Extraction in PyTorch ☆77 Updated 4 months ago
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval" ☆33 Updated 4 months ago
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/ ☆90 Updated this week
- Temporarily remove unused tokens during training to save RAM and improve speed. ☆23 Updated 7 months ago