☆94 · Updated 2 weeks ago (Feb 11, 2026)
Alternatives and similar repositories for infllmv2_cuda_impl
Users interested in infllmv2_cuda_impl are comparing it to the libraries listed below.
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆269 · Updated Jul 6, 2025
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Updated Aug 9, 2025
- ☆48 · Updated Dec 13, 2025
- ☆65 · Updated Apr 26, 2025
- A Triton JIT runtime and ffi provider in C++ ☆31 · Updated this week
- DLBlas: clean and efficient kernels ☆33 · Updated this week
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆20 · Updated Jan 24, 2025
- [AAAI 2026] SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries ☆37 · Updated Jan 14, 2026
- ☆16 · Updated Jul 29, 2025
- 🌟Official code of our AAAI26 paper 🔍WebFilter ☆37 · Updated Nov 9, 2025
- ☆38 · Updated Aug 7, 2025
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆161 · Updated Oct 13, 2025
- An efficient hierarchical Graph-based RAG ☆32 · Updated Nov 27, 2025
- ☆20 · Updated Oct 18, 2021
- ☆52 · Updated May 19, 2025
- Distributed IO-aware Attention algorithm ☆24 · Updated Sep 24, 2025
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆60 · Updated Mar 25, 2025
- ☆118 · Updated May 19, 2025
- Triton adapter for Ascend. Mirror of https://gitcode.com/ascend/triton-ascend ☆110 · Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆341 · Updated Feb 23, 2025
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆101 · Updated Dec 15, 2025
- ☆43 · Updated Mar 15, 2025
- ☆34 · Updated Feb 3, 2025
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling ☆472 · Updated May 17, 2025
- Official Implementation of APB (ACL 2025 main Oral) and Spava. ☆34 · Updated Jan 30, 2026
- An LLM-based AI agent that writes correct and efficient GPU kernels automatically. ☆68 · Updated this week
- ☆95 · Updated Apr 2, 2025
- ☆87 · Updated this week
- qwen-nsa ☆87 · Updated Oct 14, 2025
- Building the Virtuous Cycle for AI-driven LLM Systems ☆186 · Updated Feb 19, 2026
- Code for the preprint "Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?" ☆48 · Updated Jul 29, 2025
- In our implementation of Qwen-Image-Edit, we employ block causal attention to improve inference speed. ☆37 · Updated Feb 16, 2026
- The RedStone repository includes code for preparing extensive datasets used in training large language models. ☆156 · Updated Jan 22, 2026
- Official implementation of paper "Reason4Rec: Large Language Models for Recommendation with Deliberative User Preference Alignment" ☆41 · Updated Apr 10, 2025
- Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI. ☆257 · Updated Oct 4, 2025
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆816 · Updated Mar 6, 2025
- Linear Attention Sequence Parallelism (LASP) ☆88 · Updated Jun 4, 2024
- ☆97 · Updated Mar 26, 2025
- PyTorch bindings for CUTLASS grouped GEMM. ☆184 · Updated Feb 19, 2026