causalflow-ai/petit-kernel

Optimized FP16/BF16 x FP4 GPU kernels for AMD GPUs

☆58

Alternatives and similar repositories for petit-kernel

Users who are interested in petit-kernel are comparing it to the libraries listed below.

Jiacheng / honeycomb-osdi23-ae
☆17 • May 22, 2023 • Updated 3 years ago
Anonymous1252022 / fp4-all-the-way
☆50 • May 20, 2025 • Updated last year
huanchenz / index-microbench
☆16 • Feb 19, 2017 • Updated 9 years ago
Repeerc / flash-attention-v2-RDNA3-minimal
a simple Flash Attention v2 implementation with ROCM (RDNA3 GPU, roc wmma), mainly used for stable diffusion(ComfyUI) in Windows ZLUDA en…
☆52 • Aug 25, 2024 • Updated last year
maxi-k / costoptimal-model
Model implementation and explorative UI for the paper "Towards Cost-Optimal Query Processing in the Cloud". Slides: https://bit.ly/37ZfeP…
☆17 • Sep 17, 2025 • Updated 9 months ago
lightseekorg / TorchSpec
A PyTorch native library for training speculative decoding models
☆187 • Updated this week
Deep-Learning-Profiling-Tools / triton-samples
☆14 • Mar 8, 2025 • Updated last year
wu-kan / wuk_cupti_wrapper
a simple API to use CUPTI
☆10 • Aug 19, 2025 • Updated 10 months ago
awslabs / hybrid-model-factory
Open-source toolkit for training, Priming, and serving next generation Hybrid architectures
☆72 • Jun 29, 2026 • Updated last week
conanhujinming / flash_hash_join
A flash implementation of hash join in C++
☆33 • Sep 15, 2025 • Updated 9 months ago
Idein / onnigiri
☆13 • Jun 10, 2026 • Updated 3 weeks ago
hpides / vectorized-hash-tables
Code and results for our paper "Analyzing Vectorized Hash Tables Across CPU Architectures" @ VLDB '23.
☆31 • Feb 2, 2024 • Updated 2 years ago
AbhilashaRavichander / information-probing
☆11 • May 18, 2025 • Updated last year
bdhirsh / pytorch_open_registration_example
Example of using pytorch's open device registration API
☆31 • Oct 14, 2022 • Updated 3 years ago
BenjaminDHorne / The-NELA-Toolkit
The News Landscape Toolkit (NELA)
☆16 • Oct 14, 2020 • Updated 5 years ago
perladoubinsky / SemAug
[WAVC 2024] Official implementation of the paper: Semantic Generative Augmentations for Few-shot Counting
☆13 • May 1, 2024 • Updated 2 years ago
tile-ai / AttentionEngine
☆52 • May 19, 2025 • Updated last year
rmascarenhas / foppl
First-Order Probabilistic Programming Language
☆29 • Jun 3, 2019 • Updated 7 years ago
tanzelin430 / libsmctrl
A reproduction of the libsmctrl paper, with an added Python interface that can be called flexibly from Python to allocate compute resources
☆12 • May 21, 2024 • Updated 2 years ago
bauman / python-idzip
Seekable, gzip compatible, compression format
☆16 • Nov 4, 2025 • Updated 8 months ago
thomaschlt / mla.c
Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.
☆18 • Jan 15, 2025 • Updated last year
peichenxie / FPRev
☆25 • May 9, 2025 • Updated last year
kcyu2014 / multi-model-forgetting
ICML2019 Accepted Paper. Overcoming Multi-Model Forgetting
☆14 • Jun 5, 2019 • Updated 7 years ago
luongthecong123 / fp8-quant-matmul
Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.
☆19 • Feb 9, 2026 • Updated 4 months ago
elikbelik / scholar_alters
Parse unread emails from Google Scholar alerts and sort publication by relevance
☆13 • Nov 23, 2019 • Updated 6 years ago
ahmedheakl / CASS
[ACL 2026 🔥] CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
☆34 • Apr 20, 2026 • Updated 2 months ago
ademeure / QuickRunCUDA
☆20 • May 30, 2026 • Updated last month
lhb8125 / Megatron-LM
Ongoing research training transformer models at scale
☆19 • Updated this week
oscar-project / oscar-website
The website of the Oscar Project
☆11 • Mar 27, 2025 • Updated last year
CentML / lorafusion
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
☆28 • Updated this week
UWHustle / Efficiently-Searching-In-Memory-Sorted-Arrays
Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?
☆33 • May 31, 2021 • Updated 5 years ago
kavishgambhir / xy-cut-tree
Segmenting a given document using recursive xy-cut algorithm.
☆12 • Oct 9, 2018 • Updated 7 years ago
Heidelberg-NLP / xsrl_mbert_aligner
X-SRL Dataset. Including the code for the SRL annotation projection tool and an out-of-the-box word alignment tool based on Multilingual …
☆15 • Apr 22, 2021 • Updated 5 years ago
huggingface / hf-rocm-kernels
☆24 • May 26, 2026 • Updated last month
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20 • Aug 3, 2025 • Updated 11 months ago
rosinality / melgan-pytorch
MelGAN and Tacotron 2 in PyTorch
☆11 • Oct 22, 2019 • Updated 6 years ago
alibaba / SRDiffusion
Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation
☆20 • Jun 11, 2025 • Updated last year
dholroyd / lowly
Low-latency live streaming PoC
☆11 • Jul 30, 2019 • Updated 6 years ago
DeepLink-org / DLSlime
Composable and Embeddable Communication Runtime for Distributed AI Services
☆102 • Jun 5, 2026 • Updated last month