☆101 · Feb 11, 2026 · Updated 3 months ago
Alternatives and similar repositories for infllmv2_cuda_impl
Users that are interested in infllmv2_cuda_impl are comparing it to the libraries listed below.
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆277 · Jul 6, 2025 · Updated 10 months ago
- Ongoing research project for code&math LLMs ☆31 · Jul 4, 2025 · Updated 10 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆240 · Jan 14, 2026 · Updated 4 months ago
- Distributed IO-aware Attention algorithm ☆24 · Sep 24, 2025 · Updated 7 months ago
- Efficient triton implementation of Native Sparse Attention. ☆277 · May 23, 2025 · Updated last year
- Official Implementation of APB (ACL 2025 main Oral) and Spava (ACL 2026 main). ☆37 · Apr 6, 2026 · Updated last month
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆21 · Jan 24, 2025 · Updated last year
- qwen-nsa ☆87 · Oct 14, 2025 · Updated 7 months ago
- ☆37 · Aug 7, 2025 · Updated 9 months ago
- A Triton JIT runtime and ffi provider in C++ ☆35 · Updated this week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆168 · Oct 13, 2025 · Updated 7 months ago
- ☆66 · Apr 26, 2025 · Updated last year
- ☆13 · Oct 19, 2023 · Updated 2 years ago
- ☆52 · May 19, 2025 · Updated last year
- QRHead: Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking ☆38 · Jan 20, 2026 · Updated 4 months ago
- DLBlas: clean and efficient kernels ☆40 · Updated this week
- ☆18 · Jun 3, 2024 · Updated last year
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs ☆16 · Feb 10, 2026 · Updated 3 months ago
- ☆11 · Aug 4, 2024 · Updated last year
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆344 · Feb 23, 2025 · Updated last year
- So, I trained a Llama a 130M architecture I coded from ground up to build a small instruct model from scratch. Trained on FineWeb dataset… ☆17 · Mar 26, 2025 · Updated last year
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,212 · Apr 8, 2026 · Updated last month
- Heuristic filtering framework for RefineCode ☆85 · Mar 13, 2025 · Updated last year
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆299 · May 1, 2025 · Updated last year
- [ICLR24] The open-source repo of THU-KEG's KoLA benchmark. ☆56 · Sep 28, 2023 · Updated 2 years ago
- ☆248 · Nov 19, 2025 · Updated 6 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆65 · Mar 25, 2025 · Updated last year
- LongAttn :Selecting Long-context Training Data via Token-level Attention ☆15 · Jul 16, 2025 · Updated 10 months ago
- Triton adapter for Ascend. Mirror of https://gitcode.com/ascend/triton-ascend ☆125 · Updated this week
- ☆33 · Feb 3, 2025 · Updated last year
- ☆46 · Sep 15, 2025 · Updated 8 months ago
- ☆121 · May 19, 2025 · Updated last year
- from MHA, MQA, GQA to MLA by 苏剑林, with code ☆48 · Feb 19, 2025 · Updated last year
- DICE: Detecting In-distribution Data Contamination with LLM's Internal State ☆11 · Sep 21, 2024 · Updated last year
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆297 · Dec 1, 2025 · Updated 5 months ago
- Trainable fast and memory-efficient sparse attention ☆676 · May 16, 2026 · Updated last week
- some minitools for linux os that are program with python ☆13 · Jun 20, 2017 · Updated 8 years ago
- A lightweight Inference Engine built for block diffusion models ☆46 · Apr 12, 2026 · Updated last month
- KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion ☆12 · Oct 21, 2021 · Updated 4 years ago