PKULab1806 / Fairy-plus-minus-i
Fairy±i (iFairy): Complex-valued Quantization Framework for Large Language Models
☆111 · Updated last month
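As its name suggests, Fairy±i quantizes weights to the four fourth roots of unity, {+1, −1, +i, −i}, so each weight costs 2 bits plus a shared scale. Below is a minimal round-to-nearest sketch of that idea; `quantize_fairy`, `dequantize_fairy`, and the per-tensor scaling are illustrative assumptions, not the repo's actual API or algorithm:

```python
import numpy as np

# Codebook: the four fourth roots of unity, {+1, -1, +i, -i}; each
# weight is stored as a 2-bit index into this table plus one shared
# per-tensor scale. Illustrative scheme only, not the repo's exact method.
CODEBOOK = np.array([1 + 0j, -1 + 0j, 0 + 1j, 0 - 1j])

def quantize_fairy(w: np.ndarray):
    """Round each complex weight to its nearest codebook entry."""
    scale = np.mean(np.abs(w)) + 1e-12               # per-tensor scale (assumption)
    dists = np.abs(w[..., None] / scale - CODEBOOK)  # distance to each code point
    idx = dists.argmin(axis=-1).astype(np.uint8)     # 2-bit indices
    return idx, scale

def dequantize_fairy(idx: np.ndarray, scale: float) -> np.ndarray:
    return CODEBOOK[idx] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    idx, scale = quantize_fairy(w)
    print("reconstruction error:", np.abs(w - dequantize_fairy(idx, scale)).mean())
```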
Alternatives and similar repositories for Fairy-plus-minus-i
Users interested in Fairy-plus-minus-i are comparing it to the repositories listed below:
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆96 · Updated 3 weeks ago
- PyTorch implementation of DeepSeek Native Sparse Attention ☆111 · Updated 3 weeks ago
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆65 · Updated last year
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆96 · Updated 3 weeks ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆67 · Updated last year
- [ISCA'25] Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting ☆70 · Updated 8 months ago
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆83 · Updated last month
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆270 · Updated last month
- NVIDIA cuTile learn ☆147 · Updated last month
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆214 · Updated 3 months ago
- Analysing AI problems with math and code ☆27 · Updated 5 months ago
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆96 · Updated this week
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆304 · Updated 7 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆144 · Updated 2 weeks ago (see the speculative-decoding sketch after this list)
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆242 · Updated last month
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆264 · Updated last month
- A lightweight reinforcement learning framework that integrates seamlessly into your codebase, empowering developers to focus on algorithm… ☆96 · Updated 4 months ago
- Estimate MFU for DeepSeekV3 ☆26 · Updated last year
- A simple calculation for LLM MFU ☆58 · Updated 4 months ago (see the worked MFU example after this list)
- Multi-Level Triton Runner supporting Python, IR, PTX, and cubin. ☆81 · Updated this week
- [ICLR 2025 Oral] Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" ☆160 · Updated 2 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆273 · Updated 2 months ago
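Several entries above (PEARL, SpecEE) build on speculative decoding: a small draft model proposes a few tokens and the target model verifies them, keeping the longest matching prefix. A minimal greedy-verification sketch; `draft_next` and `target_next` are hypothetical callables returning argmax tokens, and real systems verify all draft tokens in one batched target forward pass:

```python
from typing import Callable, List

def speculative_decode_greedy(
    draft_next: Callable[[List[int]], int],   # hypothetical: draft model's argmax token
    target_next: Callable[[List[int]], int],  # hypothetical: target model's argmax token
    prompt: List[int],
    draft_len: int = 4,
    max_new: int = 32,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes draft_len tokens,
    the target verifies them left to right; the first mismatch is replaced
    by the target's own token and the rest of the draft is discarded."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes a short continuation.
        proposal, ctx = [], list(seq)
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies (a single batched forward pass in practice).
        n_accepted, mismatch = 0, None
        for i, t in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if expected == t:
                n_accepted += 1
            else:
                mismatch = expected
                break
        seq.extend(proposal[:n_accepted])
        if mismatch is not None:
            seq.append(mismatch)             # every round makes >= 1 token of progress
    return seq[: len(prompt) + max_new]      # trim any overshoot from the last round

if __name__ == "__main__":
    # Toy demo: both "models" continue a counting sequence, so all drafts are accepted.
    count = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_decode_greedy(count, count, [0, 1, 2], max_new=8))
```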
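Two entries above estimate MFU (Model FLOPs Utilization): achieved FLOP/s divided by hardware peak FLOP/s, where achieved training FLOP/s is commonly approximated as 6 × parameters × tokens/s for a dense decoder-only Transformer. A worked example with illustrative, not measured, numbers; whether a given repo also counts attention FLOPs or MoE sparsity varies:

```python
def mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    """Model FLOPs Utilization, using the standard 6*N*T approximation
    for dense decoder-only Transformer training (roughly 2*N FLOPs per
    token forward plus 4*N per token backward)."""
    achieved_flops = 6.0 * params * tokens_per_s
    return achieved_flops / peak_flops

# Illustrative numbers only: a 7B-parameter model training at
# 4,000 tokens/s per GPU, on a GPU with 312 TFLOP/s peak BF16 throughput.
print(f"MFU = {mfu(7e9, 4_000, 312e12):.1%}")   # -> MFU = 53.8%
```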