ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆67 · Updated 6 months ago
Alternatives and similar repositories for mla-experiments:
Users interested in mla-experiments are comparing it to the libraries listed below.
- Transformers components but in Triton ☆31 · Updated 3 months ago
- 🔥 A minimal training framework for scaling FLA models ☆59 · Updated this week
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆40 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆57 · Updated 3 weeks ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆157 · Updated 7 months ago
- ring-attention experiments ☆123 · Updated 4 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆59 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆77 · Updated this week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated 2 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆78 · Updated 2 months ago
- Linear Attention Sequence Parallelism (LASP) ☆77 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 5 months ago
- ☆111 · Updated this week
- ☆77 · Updated last year
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆148 · Updated 3 weeks ago
- Triton-based implementation of Sparse Mixture of Experts. ☆196 · Updated 2 months ago
- Vocabulary Parallelism ☆17 · Updated 3 months ago
- Estimate MFU for DeepSeekV3 ☆16 · Updated last month
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆48 · Updated 7 months ago
- Here we will test various linear attention designs. ☆58 · Updated 9 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆58 · Updated this week
- ☆99 · Updated 11 months ago
- ☆22 · Updated last year
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry ☆40 · Updated last year
- The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆42 · Updated 4 months ago
- ☆125 · Updated last year
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆53 · Updated 8 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 11 months ago