YconquestY / Needle
Imperative deep learning framework with customized GPU and CPU backend
☆30 · Updated last year
Alternatives and similar repositories for Needle:
Users interested in Needle are comparing it to the libraries listed below:
- ☆82 · Updated last month
- LLM theoretical performance analysis tools supporting params, FLOPs, memory, and latency analysis. ☆85 · Updated 3 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ☆73 · Updated 3 weeks ago
- DeeperGEMM: crazy optimized version. ☆67 · Updated 3 weeks ago
- Implement Flash Attention using CuTe. ☆76 · Updated 4 months ago
- High-performance Transformer implementation in C++. ☆118 · Updated 3 months ago
- Cataloging released Triton kernels. ☆217 · Updated 3 months ago
- An easy-to-understand TensorOp matmul tutorial. ☆342 · Updated 7 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …). ☆162 · Updated 9 months ago
- ☆166 · Updated last year
- ☆92 · Updated 7 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- ☆55 · Updated 2 weeks ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆296 · Updated 4 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆181 · Updated 2 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆36 · Updated 3 weeks ago
- Examples of CUDA implementations using CUTLASS CuTe. ☆159 · Updated 2 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆206 · Updated last year
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters. ☆44 · Updated 9 months ago
- 📚FFPA (Split-D): Yet another faster Flash Attention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA. ☆169 · Updated 2 weeks ago
- Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS. ☆334 · Updated 3 months ago
- A simplified flash-attention implementation using CUTLASS, intended for teaching. ☆39 · Updated 8 months ago
- ☆60 · Updated this week
- ☆235 · Updated 2 months ago
- Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu (T…). ☆67 · Updated 4 years ago
- Code base and slides for ECE408: Applied Parallel Programming on GPU. ☆122 · Updated 3 years ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆81 · Updated 5 months ago
- DGEMM on KNL, achieving 75% of MKL performance. ☆17 · Updated 2 years ago
- A PyTorch-like deep learning framework. Just for fun. ☆153 · Updated last year
- Codes & examples for "CUDA - From Correctness to Performance". ☆96 · Updated 6 months ago
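One entry above mentions theoretical performance analysis tools for LLMs (params, FLOPs, memory, latency). A minimal sketch of that kind of back-of-the-envelope analysis, using standard approximations (≈2 FLOPs per parameter per token for the forward pass; KV cache sized as keys plus values per layer, head, and position) — all function names and the model config are illustrative, not taken from any listed repository:

```python
# Hedged sketch (not from any listed repo): rough transformer cost analysis.

def param_count(layers, d_model, d_ff, vocab):
    """Rough parameter count: attention (4*d^2) + MLP (2*d*d_ff) per layer,
    plus token embeddings. Ignores norms, biases, and gated-MLP variants."""
    per_layer = 4 * d_model ** 2 + 2 * d_model * d_ff
    return layers * per_layer + vocab * d_model

def forward_flops_per_token(n_params):
    """Common approximation: ~2 FLOPs per parameter per token
    (one multiply + one add per weight)."""
    return 2 * n_params

def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache: keys + values, per layer, per KV head, per position
    (fp16/bf16 by default)."""
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    # Hypothetical 7B-class config (illustrative numbers only).
    n = param_count(layers=32, d_model=4096, d_ff=11008, vocab=32000)
    print(f"params: {n / 1e9:.2f} B")
    print(f"fwd FLOPs/token: {forward_flops_per_token(n) / 1e9:.1f} GFLOPs")
    print(f"KV cache @ 4k ctx: {kv_cache_bytes(32, 32, 128, 4096) / 2**30:.2f} GiB")
```

Estimates like these bound the roofline: FLOPs per token against peak compute give a latency floor for prefill, while KV-cache size against memory bandwidth bounds decoding throughput.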