YconquestY / Needle
Imperative deep learning framework with customized GPU and CPU backend
☆30 · Updated last year
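An "imperative" (define-by-run) framework like Needle builds the computation graph as operations execute, then runs backpropagation over the recorded graph. Below is a minimal scalar sketch of that idea; the `Value` class and its methods are illustrative stand-ins, not Needle's actual API.

```python
class Value:
    """A scalar that records the ops applied to it, for reverse-mode autodiff."""
    def __init__(self, data, parents=(), backward_fn=lambda g: ()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = backward_fn  # maps upstream grad to parent grads

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1, so the upstream grad passes through
        return Value(self.data + other.data, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        # Topologically order the recorded graph, then push gradients back.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, g in zip(v._parents, v._backward_fn(v.grad)):
                parent.grad += g  # accumulate, since a node may fan out

# Usage: y = x*x + x at x = 3, so dy/dx = 2x + 1 = 7
x = Value(3.0)
y = x * x + x
y.backward()
print(x.grad)  # 7.0
```

Real frameworks extend the same recording trick to tensors and dispatch the forward ops to their GPU/CPU backends; the graph-taping logic is unchanged.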
Alternatives and similar repositories for Needle:
Users interested in Needle are comparing it to the libraries listed below.
- LLM theoretical performance analysis tool supporting params, FLOPs, memory, and latency analysis. ☆75 · Updated 2 weeks ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak performance.⚡️ ☆43 · Updated last week
- Materials for learning SGLang. ☆176 · Updated this week
- ☆26 · Updated 8 months ago
- High-performance Transformer implementation in C++. ☆98 · Updated this week
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆132 · Updated 6 months ago
- Learning material for CMU 10-714: Deep Learning Systems. ☆229 · Updated 8 months ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆205 · Updated last month
- A PyTorch-like deep learning framework. Just for fun. ☆141 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆273 · Updated last month
- CUDA Matrix Multiplication Optimization. ☆153 · Updated 6 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆27 · Updated 2 months ago
- DGEMM on KNL, achieving 75% of MKL performance. ☆16 · Updated 2 years ago
- ☆55 · Updated last month
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap… ☆205 · Updated last month
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS. ☆250 · Updated 2 weeks ago
- ☆151 · Updated last year
- ☆199 · Updated 2 months ago
- Summary of some awesome work for optimizing LLM inference. ☆50 · Updated 3 weeks ago
- All homeworks for TinyML and Efficient Deep Learning Computing (6.5940, Fall 2023): https://efficientml.ai ☆148 · Updated last year
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C's level of abstraction for processing tiles. ☆174 · Updated 2 months ago
- 📖A curated list of Awesome Diffusion Inference Papers with code, covering sampling, caching, multi-GPU, etc. 🎉🎉 ☆169 · Updated this week
- Learning how CUDA works. ☆189 · Updated 5 months ago
- Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆34 · Updated 4 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆196 · Updated last year
- A low-latency & high-throughput serving engine for LLMs. ☆296 · Updated 4 months ago
- ☆79 · Updated 4 months ago
- A large-scale simulation framework for LLM inference. ☆312 · Updated 2 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆126 · Updated last year
- Implement Flash Attention using CuTe. ☆65 · Updated last month