FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference.
☆336, updated Nov 2, 2025
Alternatives and similar repositories for flex-nano-vllm
Users interested in flex-nano-vllm are comparing it to the repositories listed below.
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code (☆10, updated Aug 29, 2023)
- (no description) (☆21, updated Mar 3, 2025)
- Minimalistic 4D-parallelism distributed training framework for educational purposes (☆2,104, updated Aug 26, 2025)
- Simple MPI implementation for prototyping or learning (☆304, updated Aug 6, 2025)
- A PyTorch-native platform for training generative AI models (☆5,111, updated this week)
- The simplest, fastest repository for training/finetuning medium-sized GPTs (☆188, updated Jan 19, 2026)
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation (☆33, updated Feb 26, 2026)
- Custom Triton kernels for training Karpathy's nanoGPT (☆19, updated Oct 21, 2024)
- (no description) (☆52, updated May 19, 2025)
- The simplest, fastest repository for training/finetuning small-sized VLMs (☆4,685, updated Oct 27, 2025)
- Fast and memory-efficient exact k-means (☆140, updated Feb 18, 2026)
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling (☆474, updated May 17, 2025)
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates (☆373, updated Feb 26, 2026)
- (no description) (☆20, updated Jul 12, 2023)
- Implementing DeepSeek R1's GRPO algorithm from scratch (☆1,787, updated Apr 18, 2025)
- NanoGPT (124M) in 2 minutes (☆4,734, updated Feb 27, 2026)
- Learn CUDA with PyTorch (☆244, updated this week)
- Helpful tools and examples for working with flex-attention (☆1,140, updated Feb 8, 2026)
- [Developmental] Quarto extension to enable Google Colaboratory links in Quarto documents (☆15, updated May 18, 2025)
- Official repository for the paper "Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi…" (☆23, updated Oct 1, 2025)
- Utilities for batched LLM calls with retries (☆47, updated this week)
- Tile primitives for speedy kernels (☆3,202, updated Feb 24, 2026)
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning (☆88, updated Nov 29, 2025)
- Our library for RL environments + evals (☆3,877, updated this week)
- UNet diffusion model in pure CUDA (☆657, updated Jun 28, 2024)
- Single-file, single-GPU, from-scratch, efficient, full-parameter tuning library for "RL for LLMs" (☆599, updated Oct 7, 2025)
- Cookbook of SGLang recipes (☆93, updated this week)
- The simplest implementation of recent sparse-attention patterns for efficient LLM inference (☆91, updated Jul 17, 2025)
- Official repository for our work on micro-budget training of large-scale diffusion models (☆1,549, updated Jan 12, 2025)
- Code accompanying our ICML 2020 paper on choice-set optimization in group decision-making (☆11, updated Jun 27, 2020)
- Offline Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits (☆10, updated Oct 21, 2024)
- JAX implementation of the GPTQ quantization algorithm (☆10, updated Jul 19, 2023)
- JAX implementation of ViT-VQGAN (☆10, updated Jan 25, 2024)
- ICLR 2023: Learning to Extrapolate: A Transductive Approach (☆11, updated Aug 15, 2023)
- An implementation of the PSGD Kron second-order optimizer for PyTorch (☆98, updated Jul 24, 2025)
- Optimized primitives for collective multi-GPU communication (☆10, updated May 8, 2024)
- Logit lens for VGGT (☆26, updated Dec 2, 2025)
- (no description) (☆12, updated Apr 26, 2024)
- TritonBench is a collection of PyTorch custom operators with example inputs to measure their performance (☆329, updated this week)