hyhieu / easy_pybindLinks

☆32

Alternatives and similar repositories for easy_pybind

Users that are interested in easy_pybind are comparing it to the libraries listed below

Sorting:

vdesai2014 / inference-optimization-blog-post
☆89Updated last year
unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆145Updated last year
cloneofsimo / min-fsdp
☆91Updated last year
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 11 months ago
kklemon / FlashPerceiver
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
☆30Updated 11 months ago
HanGuo97 / log-linear-attention
☆252Updated 4 months ago
NX-AI / flashrnn
FlashRNN - Fast RNN Kernels with I/O Awareness
☆103Updated last week
kuterd / opal_ptx
Experimental GPU language with meta-programming
☆23Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80Updated last year
gpu-mode / profiling-cuda-in-torch
☆174Updated last year
HazyResearch / train-tk
train with kittens!
☆63Updated last year
amorehead / jvp_flash_attention
Flash Attention Triton kernel with support for second-order derivatives
☆106Updated last week
idiap / sigma-gpt
σ-GPT: A New Approach to Autoregressive Models
☆68Updated last year
dvruette / barrel-rec-pytorch
☆53Updated last year
fal-ai / diffusion-speedrun
Focused on fast experimentation and simplicity
☆75Updated 10 months ago
zaydzuhri / softpick-attention
Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax"
☆85Updated last month
lucidrains / spline-based-transformer
Implementation of the proposed Spline-Based Transformer from Disney Research
☆104Updated 11 months ago
lucidrains / nGPT-pytorch
Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI
☆291Updated 4 months ago
deepreinforce-ai / CUDA-L1
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
☆195Updated last week
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆192Updated 11 months ago
gpu-mode / ring-attention
ring-attention experiments
☆155Updated last year
cloneofsimo / min-max-gpt
Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training
☆132Updated last year
alexzhang13 / Triton-Puzzles-Solutions
Personal solutions to the Triton Puzzles
☆20Updated last year
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆107Updated 9 months ago
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆91Updated 5 months ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27Updated last year
riverstone496 / awesome-second-order-optimization
☆28Updated last month
cloneofsimo / scaling-guide
WIP
☆93Updated last year
IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆209Updated this week
nshepperd / flash_attn_jax
JAX bindings for Flash Attention v2
☆97Updated last week