iclementine/optimize_softmax

iclementine / optimize_softmax

Optimize softmax in triton in many cases

☆24

Alternatives and similar repositories for optimize_softmax

Users that are interested in optimize_softmax are comparing it to the libraries listed below.

flagos-ai / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆1,053 · Updated this week
jundaf2 / CUDA-INT8-GEMM
CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API
☆37 · Sep 15, 2023 · Updated 2 years ago
weishengying / cute_gemm
☆22 · Aug 14, 2024 · Updated last year
initzhang / DUCATI_SIGMOD
Accepted paper of SIGMOD 2023, DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
☆16 · Dec 15, 2023 · Updated 2 years ago
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆23 · Nov 15, 2024 · Updated last year
ppppppppig / lite_lang
A lightweight inference framework for large models
☆23 · May 26, 2025 · Updated last year
zeroine / cutlass-cute-sample
☆49 · Apr 15, 2024 · Updated 2 years ago
wu-kan / wuk_cupti_wrapper
A simple API for using CUPTI
☆10 · Aug 19, 2025 · Updated 11 months ago
cherichy / tilecute
☆32 · Jul 2, 2025 · Updated last year
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++
☆64 · Nov 8, 2024 · Updated last year
uwsampl / paper-agents
☆13 · Dec 9, 2024 · Updated last year
RC4ML / Legion
GPU-initiated Large-scale GNN System [ATC 23]
☆19 · Oct 30, 2024 · Updated last year
madsys-dev / deepseekv2-profile
☆156 · Mar 4, 2025 · Updated last year
Tianxinhuang / PCLossNet
Code for ECCV'22: Learning to Train a Point Cloud Reconstruction Network without Matching
☆10 · Nov 16, 2022 · Updated 3 years ago
weishengying / tiny-flash-attention
A simplified version of flash-attention implemented with cutlass, intended for teaching
☆59 · Aug 12, 2024 · Updated last year
derhuerst / casket
casket is an easy-to-use web file storage.
☆13 · Jul 31, 2021 · Updated 4 years ago
wangzyon / trt_learn
TensorRT encapsulation, learn, rewrite, practice.
☆30 · Oct 19, 2022 · Updated 3 years ago
MITIBMxGraph / SALIENT_artifact
Artifact evaluation of the paper "Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining"
☆23 · Mar 7, 2022 · Updated 4 years ago
ConsciousML / img-processing-cuda
Implementation from scratch in CUDA C++ of image processing algorithms.
☆24 · Oct 26, 2020 · Updated 5 years ago
CalebDu / Awesome-Cute
☆121 · May 16, 2025 · Updated last year
tongzhou80 / nanoPyC
☆69 · Mar 19, 2023 · Updated 3 years ago
JunHao-Zhu / FusionQuery
[VLDB 2024] Source code for FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data
☆11 · Mar 11, 2025 · Updated last year
shenh10 / DeepSeek_Simulator
☆100 · Apr 2, 2025 · Updated last year
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆136 · Nov 30, 2025 · Updated 7 months ago
tile-ai / tvm
Open deep learning compiler stack for CPU, GPU and specialized accelerators
☆20 · Jul 13, 2026 · Updated last week
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44 · Dec 11, 2023 · Updated 2 years ago
huweim / dataflow_architecture
Research about dataflow architecture
☆15 · Nov 30, 2023 · Updated 2 years ago
chhzh123 / ptc-tutorial
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆17 · Mar 13, 2023 · Updated 3 years ago
xxcclong / GNN-Computing
Artifact for PPoPP20 "Understanding and Bridging the Gaps in Current GNN Performance Optimizations"
☆42 · Nov 16, 2021 · Updated 4 years ago
tpof314 / AACalculator
AA (bill-splitting) calculator
☆16 · Jun 11, 2020 · Updated 6 years ago
nicksypark / rope-triton
☆15 · Mar 30, 2024 · Updated 2 years ago
Fanerst / artensor
Generate contraction orders and perform numerical contractions for arbitrary tensor networks
☆19 · May 15, 2026 · Updated 2 months ago
JackonYang / hands-on-tvm
Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU.
☆51 · Jun 15, 2023 · Updated 3 years ago
Dinghow / Dinghow_DigitalArt_Notes
Dinghow's notes about Digital Art, School of Software Engineering, Tongji University
☆12 · Sep 23, 2019 · Updated 6 years ago
raymond1123 / hgemm
☆30 · Nov 16, 2024 · Updated last year
ChenCVer / python_cpp_extension
C++ and CUDA extensions for Python/PyTorch and GPU Accelerated Augmentation.
☆34 · Nov 30, 2022 · Updated 3 years ago
kriskrisliu / NoisyQuant
An official implementation of the CVPR 2023 paper - NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
☆27 · Mar 13, 2024 · Updated 2 years ago
sanagno / neurips_2022_statistics
☆17 · Oct 22, 2022 · Updated 3 years ago
actypedef / ARCQuant
[ACL 2026 Main] Code for the paper "ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs"
☆28 · Jun 1, 2026 · Updated last month