HanGuo97/hilt

HanGuo97 / hilt

☆40

Alternatives and similar repositories for hilt

Users who are interested in hilt are comparing it to the libraries listed below.

TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
KuangjuX / cu-x
🎉My Collections of CUDA Kernels~
☆11 · Jun 25, 2024 · Updated last year
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆13 · Nov 23, 2024 · Updated last year
ColfaxResearch / layout-categories
This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".
☆134 · Sep 24, 2025 · Updated 8 months ago
mitkotak / fast_flops
FLOPS counter for all your GPU benchmarking needs
☆13 · Aug 8, 2024 · Updated last year
daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆29 · Aug 12, 2025 · Updated 9 months ago
ROCm / FlyDSL
FlyDSL is the Python front‑end of the project: Flexible LaYout DSL.
☆192 · Updated this week
Dao-AILab / gemm-cublas
☆22 · May 5, 2025 · Updated last year
matrix97317 / OneNeuralNetwork
This is a cross-chip platform collection of operators and a unified neural network library.
☆17 · Nov 3, 2023 · Updated 2 years ago
LeiWang1999 / TVM.CMakeExtend
Tutorials of Extending and importing TVM with CMAKE Include dependency.
☆16 · Oct 11, 2024 · Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆155 · May 10, 2025 · Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆65 · Mar 25, 2025 · Updated last year
ademeure / cuda-side-boost
☆57 · Feb 24, 2026 · Updated 3 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆109 · Jun 28, 2025 · Updated 10 months ago
HydraQYH / expert_specialization_moe
Expert Specialization MoE Solution based on CUTLASS
☆27 · Apr 14, 2026 · Updated last month
flashinfer-ai / debug-print
Debug print operator for cudagraph debugging
☆15 · Aug 2, 2024 · Updated last year
vudaoanhtuan / vietnamese-tone-prediction
restore tone for missing tone sentences
☆13 · Jul 29, 2019 · Updated 6 years ago
OpenBitSys / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆89 · May 14, 2026 · Updated last week
nicolaswilde / amx-gemm-handwritten
Handwritten GEMM using Intel AMX (Advanced Matrix Extension)
☆17 · Jan 11, 2025 · Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆86 · May 5, 2025 · Updated last year
pzhao-eng / FlashMLA
☆65 · Feb 15, 2026 · Updated 3 months ago
belindal / state-tracking
Code and data for paper "(How) do Language Models Track State?"
☆22 · Mar 31, 2025 · Updated last year
InternLM / turbomind
☆98 · Mar 26, 2025 · Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆181 · Nov 11, 2025 · Updated 6 months ago
simveit / effective_transpose
Effective transpose on Hopper GPU
☆28 · Sep 6, 2025 · Updated 8 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆516 · Jan 20, 2026 · Updated 4 months ago
tile-ai / AttentionEngine
☆52 · May 19, 2025 · Updated last year
tensara / problems
Tensara's GPU programming problems
☆20 · Apr 23, 2026 · Updated last month
hazan-lab / flash-stu
PyTorch implementation of the Flash Spectral Transform Unit.
☆22 · Sep 19, 2024 · Updated last year
lemyx / tilelang-dsa
DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang
☆44 · Nov 19, 2025 · Updated 6 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆107 · Dec 17, 2024 · Updated last year
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆39 · Feb 3, 2025 · Updated last year
yester31 / Cutlass_EX
study of cutlass
☆22 · Nov 10, 2024 · Updated last year
HuyNguyen-hust / flash-attn-101
☆22 · Sep 3, 2024 · Updated last year
erfanzar / jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/…
☆34 · Mar 4, 2025 · Updated last year
dsl-learn / LeetGPU
LeetGPU Solutions
☆118 · Oct 9, 2025 · Updated 7 months ago
gerkone / painn-jax
PaiNN in jax
☆11 · Jan 14, 2025 · Updated last year
antgroup / DeepXTrace
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆97 · Jan 16, 2026 · Updated 4 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆100 · Sep 19, 2025 · Updated 8 months ago