Fast bare-bones BPE for modern tokenizer training
☆176Jun 23, 2025Updated 8 months ago
Alternatives and similar repositories for bpeasy
Users who are interested in bpeasy are comparing it to the libraries listed below
- The official PyTorch implementation of Google's Gemma models☆5,606May 30, 2025Updated 9 months ago
- Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation"☆22Feb 14, 2024Updated 2 years ago
- JAX implementation of ViT-VQGAN☆63Jul 23, 2022Updated 3 years ago
- UNet diffusion model in pure CUDA☆657Jun 28, 2024Updated last year
- My hybrid TTS network that combines VALL-E, VoiceBox, SpeechFlow, Seamless and TortoiseTTS into one☆26Aug 5, 2024Updated last year
- GPT for FACodec☆13Mar 25, 2024Updated last year
- Visualize multi-model embedding spaces. The first goal is to quickly get a lay of the land of any embedding space. Then be able to scroll…☆27May 16, 2024Updated last year
- ☆16Apr 4, 2022Updated 3 years ago
- ☆19Sep 16, 2025Updated 5 months ago
- A benchmark to evaluate language models on questions I've previously asked them to solve.☆1,042Apr 27, 2025Updated 10 months ago
- RuLES: a benchmark for evaluating rule-following in language models☆249Feb 24, 2025Updated last year
- ☆16Dec 31, 2021Updated 4 years ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.☆10,358Jul 1, 2024Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.☆595Aug 12, 2025Updated 6 months ago
- BPE modification that removes intermediate tokens during tokenizer training.☆26Nov 25, 2024Updated last year
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…☆28Apr 17, 2024Updated last year
- Contains the code associated with the ICLR submission for our text-to-speech diffusion model☆57Oct 31, 2023Updated 2 years ago
- Pure Python version of the mlabwrap Python to Matlab bridge☆31Nov 21, 2019Updated 6 years ago
- ☆10Oct 2, 2024Updated last year
- Fast, free, easy, and object-agnostic video anonymization☆11Dec 12, 2020Updated 5 years ago
- Using large language models to maintain AI_CHANGELOG.md☆14Jul 15, 2024Updated last year
- Experimental CUDA kernel framework unifying typed dimensions, NVRTC JIT specialization, and ML‑guided tuning.☆46Feb 9, 2026Updated last month
- Supervoice diffusion enhance☆28Jul 15, 2024Updated last year
- ScriptBots is an Open Source Evolutionary Artificial Life Simulation of Predator-Prey dynamics, written by Andrej Karpathy.☆62Feb 18, 2011Updated 15 years ago
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o…☆159Jul 14, 2025Updated 7 months ago
- [NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search☆17Jan 24, 2026Updated last month
- BH hackathon☆14Apr 4, 2024Updated last year
- ☆16Feb 18, 2024Updated 2 years ago
- Generating Summaries with Controllable Readability Levels (EMNLP 2023)☆15Aug 6, 2025Updated 7 months ago
- 0-Shot Tokenizer Transplant☆14May 16, 2025Updated 9 months ago
- Using Fourier interpolation to merge large language models☆11Jan 6, 2026Updated 2 months ago
- Code for the examples presented in the talk "Training a Llama in your backyard: fine-tuning very large models on consumer hardware" given…☆15Oct 16, 2023Updated 2 years ago
- FINALLY: Fast and universal speech enhancement model delivering studio-quality audio for a wide range of recordings.☆25Dec 11, 2025Updated 2 months ago
- ☆13May 30, 2024Updated last year
- ☆12Feb 22, 2024Updated 2 years ago
- TTS Text Analyzer☆31Jul 20, 2023Updated 2 years ago
- Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"☆28Oct 3, 2021Updated 4 years ago
- Due to the huge vocabulary size (151,936) of Qwen models, the Embedding and LM Head weights are excessively heavy. Therefore, this projec…☆34Jan 6, 2026Updated 2 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling☆952Nov 16, 2025Updated 3 months ago