SJTU-IPADS / SmallThinker
☆48 · Updated 6 months ago
Alternatives and similar repositories for SmallThinker
Users interested in SmallThinker are comparing it to the repositories listed below.
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆223 · Updated 2 weeks ago
- ☆29 · Updated 7 months ago
- ☆64 · Updated 8 months ago
- Cookbook of SGLang - Recipe ☆63 · Updated this week
- ☆117 · Updated 8 months ago
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆97 · Updated last month
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆403 · Updated 3 weeks ago
- ☆83 · Updated 9 months ago
- Fairy±i (iFairy): Complex-valued Quantization Framework for Large Language Models ☆116 · Updated 2 months ago
- ☆74 · Updated 8 months ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Updated this week
- ☆166 · Updated last month
- [Archived] For the latest updates and community contributions, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆13 · Updated 2 weeks ago
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆102 · Updated this week
- Based on Nano-vLLM, a simple replication of vLLM with self-contained paged attention and flash attention implementations ☆243 · Updated last week
- ☆52 · Updated 8 months ago
- ☆47 · Updated 9 months ago
- 🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆115 · Updated 2 months ago
- A lightweight reinforcement learning framework that integrates seamlessly into your codebase, empowering developers to focus on algorithm… ☆98 · Updated 5 months ago
- QeRL enables RL for 32B LLMs on a single H100 GPU. ☆477 · Updated 2 months ago
- ☆449 · Updated 5 months ago
- ☆128 · Updated 5 months ago
- ☆96 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆110 · Updated last week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆85 · Updated this week
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆105 · Updated this week
- mHC kernels implemented in CUDA ☆243 · Updated 2 weeks ago
- Block Diffusion for Ultra-Fast Speculative Decoding ☆432 · Updated last week
- An early research-stage expert-parallel load balancer for MoE models based on linear programming ☆491 · Updated 2 months ago
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆154 · Updated 2 weeks ago