☆101 · Feb 11, 2026 · updated 2 months ago
Alternatives and similar repositories for infllmv2_cuda_impl
Users interested in infllmv2_cuda_impl are comparing it to the libraries listed below.
- ☆48 · Dec 13, 2025 · updated 4 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆277 · Jul 6, 2025 · updated 9 months ago
- Ongoing research project for code & math LLMs ☆31 · Jul 4, 2025 · updated 9 months ago
- Distributed IO-aware Attention algorithm ☆24 · Sep 24, 2025 · updated 7 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆275 · May 23, 2025 · updated 11 months ago
- Official implementation of APB (ACL 2025 main, Oral) and Spava (ACL 2026 main). ☆37 · Apr 6, 2026 · updated 3 weeks ago
- High-performance FP8 GEMM kernels for SM89 and later GPUs. ☆21 · Jan 24, 2025 · updated last year
- Sequence-level 1F1B schedule for LLMs. ☆19 · Jun 4, 2024 · updated last year
- qwen-nsa ☆87 · Oct 14, 2025 · updated 6 months ago
- ☆38 · Aug 7, 2025 · updated 8 months ago
- A Triton JIT runtime and FFI provider in C++ ☆33 · updated this week
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆168 · Oct 13, 2025 · updated 6 months ago
- ☆52 · May 19, 2025 · updated 11 months ago
- DLBlas: clean and efficient kernels ☆39 · Apr 24, 2026 · updated last week
- ☆18 · Jun 3, 2024 · updated last year
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs ☆15 · Feb 10, 2026 · updated 2 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆46 · Jun 11, 2025 · updated 10 months ago
- ☆11 · Aug 4, 2024 · updated last year
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆344 · Feb 23, 2025 · updated last year
- So, I trained a 130M Llama architecture I coded from the ground up to build a small instruct model from scratch. Trained on the FineWeb dataset… ☆17 · Mar 26, 2025 · updated last year
- Triton adapter for Ascend. Mirror of https://gitcode.com/ascend/triton-ascend ☆120 · updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… ☆1,210 · Apr 8, 2026 · updated 3 weeks ago
- [ICCV 2025] Preacher: Paper-to-Video Agentic System ☆48 · Sep 1, 2025 · updated 8 months ago
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆296 · May 1, 2025 · updated last year
- [ICLR24] The open-source repo of THU-KEG's KoLA benchmark. ☆56 · Sep 28, 2023 · updated 2 years ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆64 · Mar 25, 2025 · updated last year
- ☆244 · Nov 19, 2025 · updated 5 months ago
- ☆22 · Jun 5, 2025 · updated 10 months ago
- LongAttn: Selecting Long-context Training Data via Token-level Attention ☆15 · Jul 16, 2025 · updated 9 months ago
- ☆119 · May 19, 2025 · updated 11 months ago
- ☆33 · Feb 3, 2025 · updated last year
- From MHA, MQA, GQA to MLA, by 苏剑林 (Su Jianlin), with code ☆47 · Feb 19, 2025 · updated last year
- ☆44 · Sep 15, 2025 · updated 7 months ago
- Trainable, fast, and memory-efficient sparse attention ☆632 · updated this week
- DICE: Detecting In-distribution Data Contamination with LLM's Internal State ☆11 · Sep 21, 2024 · updated last year
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆294 · Dec 1, 2025 · updated 5 months ago
- Small utility tools for Linux, written in Python ☆13 · Jun 20, 2017 · updated 8 years ago
- A lightweight inference engine built for block diffusion models ☆44 · Apr 12, 2026 · updated 3 weeks ago
- KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion ☆12 · Oct 21, 2021 · updated 4 years ago