YconquestY / Needle
Imperative deep learning framework with custom GPU and CPU backends
☆29 · Updated last year
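Needle is a define-by-run ("imperative") framework: operators execute eagerly and the autograd graph is recorded as a side effect of running the forward code, rather than being compiled ahead of time. Below is a minimal sketch of that idea; the `Scalar` class and its methods are illustrative only and are not Needle's actual API.

```python
# Minimal define-by-run autograd sketch (illustrative; not Needle's real API).
# Every operation computes its result immediately and records how to
# backpropagate through it.

class Scalar:
    def __init__(self, data, parents=()):
        self.data = data          # value computed eagerly (imperative style)
        self.grad = 0.0
        self._parents = parents   # inputs that produced this node
        self._backward_fn = lambda: None

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Scalar(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the recorded graph, then sweep in reverse.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward_fn()

x, w, b = Scalar(2.0), Scalar(3.0), Scalar(1.0)
y = x * w + b      # evaluated immediately; the graph is built as a side effect
y.backward()
print(y.data, x.grad, w.grad)  # 7.0 3.0 2.0
```

The same pattern scales to tensors and to pluggable GPU/CPU backends: only the `data` payload and the per-op kernels change, while the eager graph recording stays the same.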
Alternatives and similar repositories for Needle
Users interested in Needle are comparing it to the repositories listed below
- High-performance Transformer implementation in C++. ☆122 · Updated 4 months ago
- ☆84 · Updated last month
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ⚡️ ☆76 · Updated last week
- DGEMM on KNL, achieving 75% of MKL performance. ☆17 · Updated 2 years ago
- Code base and slides for ECE408: Applied Parallel Programming on GPU. ☆123 · Updated 3 years ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆36 · Updated last month
- ☆68 · Updated 3 weeks ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (see the roofline sketch after this list). ☆100 · Updated last year
- Implements Flash Attention using CuTe. ☆82 · Updated 5 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …). ☆174 · Updated 2 weeks ago
- A lightweight design for computation-communication overlap. ☆113 · Updated last week
- A PyTorch-like deep learning framework. Just for fun. ☆154 · Updated last year
- LLM theoretical performance analysis tools, supporting parameter-count, FLOPs, memory, and latency analysis. ☆88 · Updated 4 months ago
- Personal Notes for Learning HPC & Parallel Computation [actively adding new content]. ☆66 · Updated 2 years ago
- Examples of CUDA implementations using CUTLASS CuTe. ☆177 · Updated 3 months ago
- ☆168 · Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (T… ☆66 · Updated 4 years ago
- ☆32 · Updated last year
- A curated list of awesome projects and papers for distributed training or inference. ☆233 · Updated 7 months ago
- ☆58 · Updated 3 weeks ago
- Code release for the book "Efficient Training in PyTorch". ☆65 · Updated last month
- Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios. ☆36 · Updated 2 months ago
- Learning material for CMU 10-714: Deep Learning Systems. ☆248 · Updated last year
- ☆119 · Updated 5 months ago
- Cataloging released Triton kernels. ☆221 · Updated 4 months ago
- A simplified flash-attention implementation using CUTLASS, intended for teaching. ☆41 · Updated 9 months ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. ☆61 · Updated 8 months ago
- ☆109 · Updated last week
- DeeperGEMM: a heavily optimized version. ☆69 · Updated last week
- Optimized softmax implementations in Triton covering many cases (see the Triton sketch after this list). ☆20 · Updated 8 months ago
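On the roofline-model comparison item above: the roofline model bounds attainable throughput by the minimum of the compute roof and the bandwidth roof at a given arithmetic intensity. A small sketch follows; the spec numbers are assumptions (roughly A100-class: 312 TFLOP/s FP16 Tensor Core peak, 2 TB/s HBM), not figures taken from that repository.

```python
# Roofline-model sketch with assumed, roughly A100-class hardware numbers.
# Attainable FLOP/s = min(compute roof, bandwidth roof at this intensity).

PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s (assumed)
PEAK_BW    = 2.0e12   # HBM bandwidth, bytes/s (assumed)

def attainable_flops(arith_intensity):
    """arith_intensity: FLOPs performed per byte moved from memory."""
    return min(PEAK_FLOPS, PEAK_BW * arith_intensity)

# LLM decoding with batch size B reads each FP16 weight once (2 bytes) and
# does roughly 2*B FLOPs with it, so intensity is about B FLOP/byte; small
# batches therefore sit on the memory-bound side of the roofline.
for batch in (1, 8, 64, 512):
    intensity = float(batch)  # rough decode-stage estimate, FLOP/byte
    roof = attainable_flops(intensity)
    bound = "memory-bound" if roof < PEAK_FLOPS else "compute-bound"
    print(f"B={batch:>3}: intensity={intensity:>5.1f} FLOP/B, "
          f"roof={roof/1e12:6.1f} TFLOP/s ({bound})")
```

Under these assumed numbers the ridge point sits at 156 FLOP/byte, which is why decode-stage inference only becomes compute-bound at large batch sizes.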
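And for the Triton softmax item: below is a minimal fused row-wise softmax kernel in the style of Triton's own fused-softmax tutorial, not code from the listed repository. It assumes a contiguous row-major input whose row length fits within one block.

```python
# Minimal fused softmax in Triton (tutorial-style sketch; assumes a
# contiguous row-major input and n_cols <= BLOCK_SIZE).
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                 # one program per row
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    # Pad out-of-range lanes with -inf so they vanish under exp().
    x = tl.load(in_ptr + row * n_cols + offs, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)              # subtract row max for stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offs, num / den, mask=mask)

def softmax(x):
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out

x = torch.randn(4, 1000, device="cuda")
torch.testing.assert_close(softmax(x), torch.softmax(x, dim=-1))
```

The whole row stays in registers between the max, exp, and sum steps, which is the fusion that makes a single-pass Triton softmax faster than three separate memory-bound kernels.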