PKULab1806 / Fairy-plus-minus-i
Fairy±i (iFairy): Complex-valued Quantization Framework for Large Language Models
☆103 · Updated 2 weeks ago
Alternatives and similar repositories for Fairy-plus-minus-i
Users interested in Fairy-plus-minus-i are comparing it to the repositories listed below.
- ☆29 · Updated 3 months ago
- siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Agent Systems ☆214 · Updated this week
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆71 · Updated 3 months ago
- A minimal, easy-to-read PyTorch reimplementation of Qwen3 and Qwen2.5 VL with a fancy CLI ☆146 · Updated 3 weeks ago
- Triton documentation in Simplified Chinese / Triton 中文文档 ☆82 · Updated 5 months ago
- DeepSeek Native Sparse Attention PyTorch implementation ☆96 · Updated last month
- ☆143 · Updated 2 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆194 · Updated last week
- 🤖FFPA: Extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large headdim; 1.8x~3x speedup🎉 vs SDPA EA. ☆218 · Updated last month
- ☆97 · Updated 4 months ago
- Implementation of FP8/INT8 rollout for RL training without performance drop ☆242 · Updated this week
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆264 · Updated 3 months ago
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆181 · Updated last month
- ☆425 · Updated last month
- A simple calculation for LLM MFU ☆45 · Updated 2 weeks ago
- Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA'25) ☆57 · Updated 5 months ago
- ☆72 · Updated 10 months ago
- ☆107 · Updated last month
- A lightweight reinforcement learning framework that integrates seamlessly into your codebase, empowering developers to focus on algorithm… ☆68 · Updated last month
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆81 · Updated 3 months ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ☆466 · Updated 2 weeks ago
- ☆55 · Updated last year
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆59 · Updated 8 months ago
- Official implementation of "DPad: Efficient Diffusion Language Models with Suffix Dropout" ☆40 · Updated 3 weeks ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆154 · Updated this week
- PyTorch library for cost-effective, fast, and easy serving of MoE models ☆238 · Updated 2 months ago
- Triton multi-level runner, including IR/PTX/cubin ☆54 · Updated this week
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆59 · Updated 10 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆303 · Updated 7 months ago
- ☆50 · Updated 4 months ago
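One entry above is an MFU calculator. As context for what such a tool computes, here is a minimal sketch of the standard Model FLOPs Utilization formula; the function name and example numbers are illustrative assumptions, not taken from that repository's code:

```python
def training_mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """MFU = achieved FLOPs/s divided by hardware peak FLOPs/s.

    For dense transformer training, a common approximation is
    ~6 FLOPs per parameter per token (2 forward + 4 backward).
    """
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# Hypothetical example: a 7B-parameter model training at 4,000 tokens/s
# on a GPU with a 312 TFLOP/s BF16 peak (A100-class hardware).
mfu = training_mfu(7e9, 4_000, 312e12)
print(f"MFU ≈ {mfu:.1%}")  # → MFU ≈ 53.8%
```

Inference-side calculators typically replace the factor of 6 with ~2 (forward pass only); the repository itself may use a more detailed per-layer FLOP count.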