apple / ml-recurrent-drafter
★219 · Updated 11 months ago
Alternatives and similar repositories for ml-recurrent-drafter
Users interested in ml-recurrent-drafter are comparing it to the libraries listed below.
- A high-throughput and memory-efficient inference and serving engine for LLMs (★267 · Updated 3 weeks ago)
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (★275 · Updated last month)
- Fast low-bit matmul kernels in Triton (★413 · Updated last week)
- Official implementation for Training LLMs with MXFP4 (★116 · Updated 8 months ago)
- Build compute kernels (★195 · Updated last week)
- Ship correct and fast LLM kernels to PyTorch (★127 · Updated last week)
- Scalable and robust tree-based speculative decoding algorithm (★366 · Updated 11 months ago)
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… (★146 · Updated last year)
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) (★263 · Updated this week)
- Load compute kernels from the Hub (★352 · Updated last week)
- A safetensors extension to efficiently store sparse quantized tensors on disk (★225 · Updated this week)
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (★279 · Updated 2 years ago)
- TPU inference for vLLM, with unified JAX and PyTorch support (★202 · Updated this week)
- Boosting 4-bit inference kernels with 2:4 Sparsity (★90 · Updated last year)
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components (★217 · Updated 2 weeks ago)
- PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support (★214 · Updated this week)
- KV cache compression for high-throughput LLM inference (★148 · Updated 10 months ago)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters (★131 · Updated last year)
- Triton-based implementation of Sparse Mixture of Experts (★259 · Updated 2 months ago)
- Ring-attention experiments (★160 · Updated last year)
- Applied AI experiments and examples for PyTorch (★311 · Updated 4 months ago)
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model (★255 · Updated 7 months ago)
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (★249 · Updated 10 months ago)
- Experiments on speculative sampling with Llama models (★127 · Updated 2 years ago)
- Simple & Scalable Pretraining for Neural Architecture Research (★305 · Updated 3 weeks ago)
- Experimental PyTorch-native float8 training UX (★227 · Updated last year)
- Efficient LLM Inference over Long Sequences (★394 · Updated 6 months ago)
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference (★79 · Updated last week)
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM (★174 · Updated last week)
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (★135 · Updated last year)