apple / ml-recurrent-drafter

☆218

Alternatives and similar repositories for ml-recurrent-drafter

Users who are interested in ml-recurrent-drafter are comparing it to the libraries listed below.

neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266Updated last year
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆270Updated 2 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215Updated last week
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆381Updated 3 weeks ago
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆360Updated 8 months ago
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆283Updated this week
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142Updated 8 months ago
huggingface / kernels
Load compute kernels from the Hub
☆304Updated last week
huggingface / kernel-builder
👷 Build compute kernels
☆163Updated this week
IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277Updated last year
meta-pytorch / BackendBench
How to ensure correctness and ship LLM generated kernels in PyTorch
☆66Updated last week
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390Updated 3 months ago
gpu-mode / ring-attention
ring-attention experiments
☆154Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83Updated last year
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆97Updated 5 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246Updated 3 weeks ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299Updated 2 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
snowflakedb / ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
☆227Updated last week
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆180Updated this week
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆420Updated this week
antimatter15 / reverse-engineering-gemma-3n
Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model
☆247Updated 4 months ago
huggingface / picotron_tutorial
☆222Updated 3 weeks ago
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆58Updated last week
microsoft / ArchScale
Simple & Scalable Pretraining for Neural Architecture Research
☆297Updated 2 months ago
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆344Updated 5 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆147Updated last year
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆248Updated 8 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆240Updated this week