apple / ml-recurrent-drafter
☆218 · Updated 9 months ago
Alternatives and similar repositories for ml-recurrent-drafter
Users interested in ml-recurrent-drafter are comparing it to the libraries listed below.
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆266, updated last year)
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (☆270, updated 2 months ago)
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components (☆215, updated last week)
- Fast low-bit matmul kernels in Triton (☆381, updated 3 weeks ago)
- Scalable and robust tree-based speculative decoding algorithm (☆360, updated 8 months ago)
- ArcticInference: vLLM plugin for high-throughput, low-latency inference (☆283, updated this week)
- KV cache compression for high-throughput LLM inference (☆142, updated 8 months ago)
- Load compute kernels from the Hub (☆304, updated last week)
- Build compute kernels (☆163, updated this week)
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (☆277, updated last year)
- How to ensure correctness and ship LLM-generated kernels in PyTorch (☆66, updated last week)
- Efficient LLM inference over long sequences (☆390, updated 3 months ago)
- Ring-attention experiments (☆154, updated last year)
- Boosting 4-bit inference kernels with 2:4 sparsity (☆83, updated last year)
- Official implementation for training LLMs with MXFP4 (☆97, updated 5 months ago)
- Triton-based implementation of Sparse Mixture of Experts (☆246, updated 3 weeks ago)
- Applied AI experiments and examples for PyTorch (☆299, updated 2 months ago)
- Tree Attention: topology-aware decoding for long-context attention on GPU clusters (☆130, updated 10 months ago)
- ArcticTraining: a framework designed to simplify and accelerate post-training for large language models (LLMs) (☆227, updated last week)
- A safetensors extension to efficiently store sparse quantized tensors on disk (☆180, updated this week)
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) (☆420, updated this week)
- Reverse engineering Gemma 3n, Google's new edge-optimized language model (☆247, updated 4 months ago)
- ☆222, updated 3 weeks ago
- Write a fast kernel and run it on Discord. See how you compare against the best! (☆58, updated last week)
- Simple and scalable pretraining for neural architecture research (☆297, updated 2 months ago)
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 (☆344, updated 5 months ago)
- [ICLR 2025] Breaking the throughput-latency trade-off for long sequences with speculative decoding (☆130, updated 10 months ago)
- Cold Compress: a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods built on top of… (☆147, updated last year)
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (☆248, updated 8 months ago)
- ☆240, updated this week