PKULab1806 / Fairy-plus-minus-i
Fairy±i (iFairy): Complex-valued Quantization Framework for Large Language Models
☆116 · updated 2 months ago
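For context, the headline repo's technique, complex-valued quantization, plausibly amounts to snapping weights onto a tiny complex codebook. Below is a minimal, hypothetical sketch assuming a 2-bit codebook of the fourth roots of unity {+1, +i, −1, −i} with a per-tensor scale; the pairing scheme, function name, and scale choice are illustrative assumptions, not iFairy's documented method.

```python
import torch

def quantize_to_fourth_roots(w: torch.Tensor):
    """Hypothetical sketch of 2-bit complex-valued quantization:
    pair consecutive real weights into complex numbers, then snap
    each to the nearest scaled fourth root of unity {+1, +i, -1, -i}.
    Illustrative only; not the repo's actual algorithm or API."""
    # Pair even/odd entries along the last dim (assumed even-sized)
    # into complex values.
    wc = torch.complex(w[..., 0::2], w[..., 1::2])
    # One shared magnitude so the dequantized energy roughly matches.
    scale = wc.abs().mean()
    # Nearest phase among 0, pi/2, pi, 3pi/2 -> 2-bit codes {0,1,2,3}.
    codes = torch.round(torch.angle(wc) / (torch.pi / 2)).long() % 4
    roots = torch.tensor([1, 1j, -1, -1j], dtype=torch.complex64)
    return codes, scale * roots[codes]

codes, wq = quantize_to_fourth_roots(torch.randn(4, 8))
print(codes.shape, wq.dtype)  # torch.Size([4, 4]) torch.complex64
```

Each complex weight then needs only 2 bits (its phase index) plus the shared scale, which is what makes such a codebook attractive for LLM compression.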
Alternatives and similar repositories for Fairy-plus-minus-i
Users interested in Fairy-plus-minus-i are comparing it to the repositories listed below
- Simplified Chinese translation of the Triton documentation / Triton 中文文档 ☆99 · updated last month
- ☆29 · updated 7 months ago
- PyTorch implementation of DeepSeek Native Sparse Attention ☆111 · updated last month
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆97 · updated last month
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆223 · updated 2 weeks ago
- A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-perfo… ☆51 · updated last week
- A lightweight reinforcement learning framework that integrates seamlessly into your codebase, empowering developers to focus on algorithm… ☆98 · updated 5 months ago
- ☆128 · updated 5 months ago
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆67 · updated last year
- ☆117 · updated 8 months ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆187 · updated 4 months ago
- qwen-nsa ☆87 · updated 3 months ago
- [ICML 2025 Oral] Mixture of Lookup Experts ☆68 · updated last month
- ☆129 · updated 7 months ago
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆274 · updated 2 months ago
- ☆152 · updated 6 months ago
- 🤖 FFPA: extends FlashAttention-2 with Split-D, reaching ~O(1) SRAM complexity for large head dims and a 1.8x~3x speedup 🎉 over SDPA EA ☆248 · updated last week
- NVIDIA cuTile learn ☆154 · updated last month
- ☆116 · updated 4 months ago
- Implementing custom operators in PyTorch with CUDA/C++ ☆76 · updated 3 years ago
- ☆48 · updated 6 months ago
- A minimal PyTorch re-implementation of Qwen3 VL with a fancy CLI ☆313 · updated last month
- A collection of tricks and tools to speed up transformer models ☆194 · updated last month
- Analyzing AI problems with math and code ☆27 · updated 6 months ago
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆87 · updated 2 months ago
- Homepage of the OneBit model quantization framework ☆200 · updated 11 months ago
- ☆449 · updated 5 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆148 · updated last month
- Fast and memory-efficient exact k-means ☆136 · updated 2 months ago
- Paper list on efficient Mixture of Experts for LLMs ☆160 · updated 4 months ago