YconquestY / Needle
Imperative deep learning framework with custom GPU and CPU backends
☆30 · Updated last year
Alternatives and similar repositories for Needle:
Users interested in Needle are comparing it to the libraries listed below.
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆29 · Updated 3 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …). ☆139 · Updated 7 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS; see the online-softmax sketch after this list. ☆262 · Updated last month
- LLM theoretical performance analysis tools supporting parameter, FLOPs, memory, and latency analysis; see the FLOPs estimate after this list. ☆78 · Updated last month
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance⚡️. ☆53 · Updated 2 weeks ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks; see the roofline sketch after this list. ☆93 · Updated 11 months ago
- ☆156 · Updated last year
- High-performance Transformer implementation in C++. ☆103 · Updated last month
- Summary of some awesome work on optimizing LLM inference. ☆57 · Updated 2 weeks ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆236 · Updated 2 months ago
- A PyTorch-like deep learning framework. Just for fun. ☆142 · Updated last year
- ☆104 · Updated 7 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆35 · Updated 5 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆199 · Updated last year
- 📚FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, 1.8x~3x↑🎉 vs SDPA EA. ☆107 · Updated this week
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆129 · Updated last year
- Implements Flash Attention using CuTe. ☆69 · Updated 2 months ago
- ☆67 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆290 · Updated this week
- ☆201 · Updated 3 months ago
- Codes & examples for "CUDA - From Correctness to Performance"☆80Updated 3 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆175 · Updated 3 weeks ago
- A curated collection of papers on MoE model inference. ☆69 · Updated this week
- All homeworks for TinyML and Efficient Deep Learning Computing, 6.5940 • Fall 2023 • https://efficientml.ai. ☆156 · Updated last year
- Learning how CUDA works. ☆201 · Updated 6 months ago
- Examples of CUDA implementations with CUTLASS CuTe. ☆139 · Updated 2 weeks ago
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆221 · Updated 2 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving. ☆19 · Updated this week
- Materials for learning SGLang. ☆275 · Updated 2 weeks ago
- LLaMA INT4 CUDA inference with AWQ. ☆50 · Updated last month
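
For readers new to the flash attention entries above, here is a minimal NumPy sketch of the online-softmax trick those kernels implement: iterate over K/V blocks while carrying a running max and normalizer, so the full attention matrix is never materialized. The block size, shapes, and function name are illustrative assumptions, not taken from any listed repository.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Tiled attention with online softmax over K/V blocks."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of the logits
    l = np.zeros(n)           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Q @ Kb.T) * scale                  # logits for this block
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])          # block probabilities
        correction = np.exp(m - m_new)          # rescale old accumulators
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against the naive reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
s = (Q @ K.T) / np.sqrt(32)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref)
```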
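The theoretical performance analysis entry boils down to arithmetic like the following back-of-the-envelope sketch. The dense-decoder formulas (ignoring normalization, biases, and gated MLPs) and the roughly-7B configuration are assumptions for illustration only.

```python
def decoder_stats(layers, d_model, vocab, d_ff=None):
    """Approximate parameter count and per-token decode FLOPs
    for a dense transformer decoder."""
    d_ff = d_ff or 4 * d_model
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff  # attn + MLP weights
    params = layers * per_layer + vocab * d_model    # + embedding table
    flops_per_token = 2 * params                     # ~2 FLOPs per weight
    return params, flops_per_token

params, flops = decoder_stats(layers=32, d_model=4096, vocab=32000)
print(f"{params/1e9:.2f}B params, {flops/1e9:.1f} GFLOPs per decoded token")
```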
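Likewise, the roofline comparison entry rests on one formula: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, assuming made-up A100-like peak numbers (312 fp16 TFLOP/s, 2039 GB/s):

```python
def roofline(flops, bytes_moved, peak_tflops, bw_gbs):
    """Return arithmetic intensity (FLOP/byte) and the roofline-bound
    attainable throughput in FLOP/s."""
    intensity = flops / bytes_moved
    attainable = min(peak_tflops * 1e12, bw_gbs * 1e9 * intensity)
    return intensity, attainable

# Single-token decode GEMV on a 4096x4096 fp16 weight: 2*N*N FLOPs,
# ~2*N*N bytes of weights read -> intensity ~2, firmly memory bound.
i, perf = roofline(2 * 4096**2, 2 * 4096**2, peak_tflops=312, bw_gbs=2039)
print(f"intensity {i:.1f} FLOP/B -> {perf/1e12:.2f} TFLOP/s attainable")
```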