RightNow-AI/qwen3.5-triton

Pure Triton kernels for Qwen3.5-27B inference on NVIDIA B200

☆121

Alternatives and similar repositories for qwen3.5-triton

Users interested in qwen3.5-triton are comparing it to the repositories listed below.

RightNow-AI / AutoMegaKernel
An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch…
☆120 · Jun 29, 2026 · Updated 3 weeks ago
RightNow-AI / rightnow-cli
Claude Code for CUDA. Free AI assistant that actually understands GPU architecture
☆112 · Oct 10, 2025 · Updated 9 months ago
danielvegamyhre / gemm
☆19 · Mar 29, 2026 · Updated 3 months ago
RightNow-AI / autokernel
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
☆1,479 · Mar 19, 2026 · Updated 4 months ago
guoqingbao / attention.rs
LLM Kernel Library for Rust
☆30 · Updated this week
stanford-oval / sliders
Repository for paper: Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
☆27 · Apr 27, 2026 · Updated 2 months ago
RightNow-AI / tiny-tpu
Minimal TPU implementation with 8x8 systolic array and PyTorch integration
☆66 · Jan 26, 2026 · Updated 5 months ago
yao-jz / intra-kernel-profiler
Region-level profiling for CUDA kernels with trace, NVBit, CUPTI, NSys, and an interactive Explorer.
☆122 · Apr 17, 2026 · Updated 3 months ago
NVlabs / KernelBlaster
A framework for in-context learning for code optimization
☆59 · Mar 14, 2026 · Updated 4 months ago
Infatoshi / MegaQwen
Qwen3-0.6B megakernel: 527 tok/s decode on RTX 3090 (3.8x faster than PyTorch)
☆121 · Feb 10, 2026 · Updated 5 months ago
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆211 · Jul 15, 2026 · Updated last week
andrewargatkiny / dense-attention
This is the repo for DenseAttention and DANet, a fast and conceptually simple modification of standard attention and the Transformer
☆20 · Apr 6, 2026 · Updated 3 months ago
RightNow-AI / RightNow-Tile
Open-source transpiler for CUDA Tile (13.1) migration
☆19 · Dec 9, 2025 · Updated 7 months ago
inclusionAI / cuLA
CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.
☆535 · Updated this week
poad42 / cuda-fp8-ampere
IMMA-based **FP8-as-storage** GEMM experiments for Ampere (sm_86 / RTX 3090 Ti).
☆24 · Jan 30, 2026 · Updated 5 months ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆52 · Dec 6, 2024 · Updated last year
RightNow-AI / StreamIndex
Memory-bounded compressed sparse attention via streaming top-k. Triton kernels for the DeepSeek-V4 lightning indexer. 32x regime extensio…
☆22 · May 5, 2026 · Updated 2 months ago
lightseekorg / tokenspeed
TokenSpeed is a speed-of-light LLM inference engine.
☆1,658 · Updated this week
QwenLM / FlashQLA
high-performance linear attention kernel library built on TileLang
☆607 · Updated this week
serdes21 / flashtile
FlashTile is a CUDA Tile IR compiler that is compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs.
☆61 · Feb 6, 2026 · Updated 5 months ago
Dogacel / auto-gpu-kernel
Winner 🏆 (Agent-only) MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average…
☆148 · Jun 10, 2026 · Updated last month
m87-labs / kestrel
☆80 · Updated this week
shlokgpu / 100-days-cuda
This repository documents my 100-day journey of learning and writing CUDA kernels.
☆35 · Mar 29, 2026 · Updated 3 months ago
ai-dynamo / flextensor
FlexTensor is a tensor offloading and management library for PyTorch that enables running large models on limited GPU memory by intellige…
☆109 · Jun 3, 2026 · Updated last month
Deveraux-Parker / GPT-OSS-MONKEY-WRENCHES
☆27 · Aug 17, 2025 · Updated 11 months ago
Mog9 / gpt2-inference
A GPT-2 inference engine written from scratch in CUDA and C++. Implements custom CUDA kernels for tiled matrix multiplication, LayerNorm,…
☆43 · May 17, 2026 · Updated 2 months ago
AMDResearch / intelliperf
Automated bottleneck detection and solution orchestration
☆23 · Feb 24, 2026 · Updated 5 months ago
daemyung / practice-triton
Triangle in practice! Triton
☆16 · Feb 15, 2024 · Updated 2 years ago
tile-ai / tilelang-puzzles
Learning TileLang with 10 puzzles!
☆349 · May 28, 2026 · Updated last month
catswe / flash-attention-residuals
Triton kernels and PyTorch ops for Block Attention Residuals (AttnRes)
☆86 · May 29, 2026 · Updated last month
Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆29 · Jun 4, 2025 · Updated last year
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆462 · Jul 14, 2026 · Updated last week
huggingface / hf-rocm-kernels
☆24 · May 26, 2026 · Updated 2 months ago
catid / lllm
Latent Large Language Models
☆19 · Aug 24, 2024 · Updated last year
luongthecong123 / fp8-quant-matmul
Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.
☆19 · Feb 9, 2026 · Updated 5 months ago
yiakwy-xpu-ml-framework-team / flash-float-jit-kernels
☆24 · Updated this week
huggingface / kernels
Build compute kernels and load them from the Hub.
☆715 · Updated this week
MoonshotAI / FlashKDA
FlashKDA: high-performance Kimi Delta Attention kernels
☆468 · May 26, 2026 · Updated 2 months ago
phymhan / S2D2
☆16 · Jun 17, 2026 · Updated last month