apple / ml-recurrent-drafter
☆219 · Updated 10 months ago
Alternatives and similar repositories for ml-recurrent-drafter
Users interested in ml-recurrent-drafter are comparing it to the repositories listed below.
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" — ☆278 · Updated 2 years ago
- A scalable and robust tree-based speculative decoding algorithm — ☆363 · Updated 10 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk — ☆210 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton — ☆401 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs — ☆267 · Updated last year
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… — ☆271 · Updated last week
- Official implementation of "Training LLMs with MXFP4" — ☆110 · Updated 7 months ago
- Ship correct and fast LLM kernels to PyTorch — ☆124 · Updated 2 weeks ago
- KV cache compression for high-throughput LLM inference — ☆145 · Updated 9 months ago
- PyTorch-native distributed training library for LLMs/VLMs with out-of-the-box Hugging Face support — ☆187 · Updated last week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components — ☆217 · Updated last week
- TPU inference for vLLM, with unified JAX and PyTorch support — ☆170 · Updated this week
- Load compute kernels from the Hub — ☆337 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity — ☆86 · Updated last year
- ArcticTraining, a framework designed to simplify and accelerate post-training for large language models (LLMs) — ☆254 · Updated last week
- Cold Compress, a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods, built on top of… — ☆146 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters — ☆130 · Updated last year
- ArcticInference, a vLLM plugin for high-throughput, low-latency inference — ☆327 · Updated this week
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model — ☆252 · Updated 6 months ago
- Ring-attention experiments — ☆160 · Updated last year
- 👷 Build compute kernels — ☆190 · Updated this week
- Applied AI experiments and examples for PyTorch — ☆307 · Updated 3 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" — ☆249 · Updated 10 months ago
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference — ☆78 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences — ☆392 · Updated 5 months ago
- Experiments on speculative sampling with Llama models — ☆127 · Updated 2 years ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding" (ACL 2024) — ☆347 · Updated 7 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models — ☆406 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts — ☆253 · Updated 2 months ago
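Several of the entries above (ml-recurrent-drafter itself, the tree-based speculative decoder, LayerSkip, and the Llama speculative-sampling experiments) revolve around draft-then-verify decoding. A minimal sketch of the idea, using hypothetical toy models in place of real draft/target LLMs, and the simpler greedy-verification variant rather than the rejection-sampling scheme from the papers:

```python
# Toy draft-then-verify speculative decoding (greedy-verification variant).
# target_model and draft_model are hypothetical stand-ins for a large and a
# small LLM; each maps a context (list of token ids) to the next token id.

def target_model(context):
    # Deterministic toy "LLM": next token is a function of the last one.
    return (context[-1] * 31 + 7) % 100

def draft_model(context):
    # Cheaper model that agrees with the target most of the time.
    tok = target_model(context)
    return tok if tok % 10 != 0 else (tok + 1) % 100

def speculative_step(context, k=4):
    """One round: the draft proposes k tokens autoregressively, then the
    target verifies them. Matching prefix tokens are accepted; at the
    first mismatch the target's own token is substituted and the round
    ends. In a real system the k verification calls are a single batched
    forward pass of the target model, which is where the speedup comes
    from — the output is identical to decoding with the target alone."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        truth = target_model(ctx)
        if tok != truth:
            accepted.append(truth)  # target overrides the bad draft token
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

if __name__ == "__main__":
    context = [1]
    for _ in range(3):
        context += speculative_step(context, k=4)
    print(context)  # same tokens greedy decoding with target_model would emit
```

Methods like the recurrent drafter or tree-based decoders differ mainly in how the draft tokens are produced and verified (a lightweight recurrent head, or a tree of candidate continuations checked in one pass), not in this accept/reject skeleton.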