apple / ml-recurrent-drafter
☆218 · Updated 11 months ago
Alternatives and similar repositories for ml-recurrent-drafter
Users interested in ml-recurrent-drafter are comparing it to the libraries listed below.
- Ship correct and fast LLM kernels to PyTorch (☆132, updated this week)
- Scalable and robust tree-based speculative decoding algorithm (☆366, updated 11 months ago)
- Fast low-bit matmul kernels in Triton (☆423, updated last month)
- Applied AI experiments and examples for PyTorch (☆312, updated 4 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆267, updated last month)
- Official implementation for Training LLMs with MXFP4 (☆116, updated 8 months ago)
- 👷 Build compute kernels (☆213, updated this week)
- TPU inference for vLLM, with unified JAX and PyTorch support (☆213, updated this week)
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (☆278, updated last month)
- LM Engine is a library for pretraining/finetuning LLMs (☆110, updated last week)
- A safetensors extension to efficiently store sparse quantized tensors on disk (☆233, updated last week)
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components (☆218, updated this week)
- Triton-based implementation of Sparse Mixture of Experts (☆260, updated 3 months ago)
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model (☆259, updated 7 months ago)
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (☆279, updated 2 years ago)
- KV cache compression for high-throughput LLM inference (☆149, updated 11 months ago)
- Ring-attention experiments (☆161, updated last year)
- Boosting 4-bit inference kernels with 2:4 sparsity (☆91, updated last year)
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference (☆79, updated last month)
- Efficient LLM inference over long sequences (☆393, updated 6 months ago)
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) (☆269, updated this week)
- Write a fast kernel and run it on Discord. See how you compare against the best! (☆67, updated this week)
- Experiments on speculative sampling with Llama models (☆127, updated 2 years ago)
- PyTorch-native distributed training library for LLMs/VLMs with out-of-the-box Hugging Face support (☆245, updated this week)
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… (☆146, updated last year)
- Tree Attention: topology-aware decoding for long-context attention on GPU clusters (☆131, updated last year)
- Load compute kernels from the Hub (☆376, updated this week)
- This repository contains the experimental PyTorch-native float8 training UX (☆227, updated last year)
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM (☆190, updated last week)
- ArcticInference: vLLM plugin for high-throughput, low-latency inference (☆368, updated last week)
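ml-recurrent-drafter and several entries above (tree-based speculative decoding, speculative sampling with Llama models, the vLLM speculative-decoding library) all build on the same draft-then-verify loop. A minimal greedy sketch of that loop, using toy callable models in place of real draft/target LLMs (all names here are hypothetical, not taken from any of the repositories listed):

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One speculative-decoding step: draft k tokens cheaply, verify with the target.

    draft_model / target_model are stand-ins: callables mapping a token list
    to the next token (greedy). Real implementations batch the verify pass.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: the target model checks each drafted position in order.
    # (In practice this is a single batched forward pass over all k positions.)
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)   # draft agreed with the target: keep it
            ctx.append(tok)
        else:
            accepted.append(expected)  # first mismatch: take the target's token, stop
            return accepted

    # All k drafted tokens accepted: the target grants one extra "bonus" token.
    accepted.append(target_model(ctx))
    return accepted
```

The key property is that output is identical to decoding with the target model alone; the draft model only changes how many target-accepted tokens each step yields. For example, with a toy target `lambda c: (c[-1] + 1) % 10` and a draft model that agrees with it, `speculative_step([0], draft, target, k=3)` returns four tokens (three accepted plus the bonus) from one verify pass.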