ifromeast / AI_analysis
Analyse problems of AI with Math and Code
☆13 · Updated this week
Alternatives and similar repositories for AI_analysis
Users interested in AI_analysis are comparing it to the libraries listed below.
- ☆19 · Updated 4 months ago
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression · ☆11 · Updated 4 months ago
- Implementations of several LLM KV cache sparsity methods · ☆32 · Updated 11 months ago
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification · ☆53 · Updated 2 months ago
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache · ☆34 · Updated 3 weeks ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference · ☆47 · Updated 5 months ago
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference · ☆74 · Updated 3 months ago
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models · ☆19 · Updated 7 months ago
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference · ☆37 · Updated 11 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention · ☆37 · Updated last month
- ATC23 AE · ☆45 · Updated 2 years ago
- Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference · ☆19 · Updated 2 months ago
- ☆82 · Updated last week
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) · ☆58 · Updated last month
- LLM Inference with Microscaling Format · ☆22 · Updated 6 months ago
- Explore Inter-layer Expert Affinity in MoE Model Inference · ☆9 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ☆81 · Updated last month
- [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The Truth Is Rarely Pure and Never Simple · ☆24 · Updated 3 weeks ago
- ☆49 · Updated 5 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs · ☆99 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference · ☆36 · Updated last month
- ☆54 · Updated last year
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs · ☆44 · Updated last month
- Quantized Attention on GPU · ☆45 · Updated 5 months ago
- Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact" · ☆44 · Updated 11 months ago
- [ICLR 2025] DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference · ☆20 · Updated last month
- ☆58 · Updated 3 weeks ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…" · ☆55 · Updated 10 months ago
- ☆125 · Updated 2 weeks ago
- 16-fold memory access reduction with nearly no loss · ☆94 · Updated last month