YconquestY / Needle
Imperative deep learning framework with custom GPU and CPU backends
☆30 · Updated last year
Alternatives and similar repositories for Needle:
Users interested in Needle are comparing it to the libraries listed below.
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆29 · Updated 3 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …). ☆139 · Updated 7 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS; see the online-softmax sketch after this list. ☆262 · Updated last month
- LLM theoretical performance analysis tools supporting parameter, FLOPs, memory, and latency analysis; see the FLOPs estimate after this list. ☆78 · Updated last month
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance⚡️. ☆53 · Updated 2 weeks ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks; see the roofline sketch after this list. ☆93 · Updated 11 months ago
- ☆156 · Updated last year
- High-performance Transformer implementation in C++. ☆103 · Updated last month
- Summary of some awesome work on optimizing LLM inference. ☆57 · Updated 2 weeks ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆236 · Updated 2 months ago
- A PyTorch-like deep learning framework. Just for fun. ☆142 · Updated last year
- ☆104 · Updated 7 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆35 · Updated 5 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆199 · Updated last year
- 📚FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, 1.8x~3x↑🎉 vs SDPA EA. ☆107 · Updated this week
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆129 · Updated last year
- Implements Flash Attention using CuTe. ☆69 · Updated 2 months ago
- ☆67 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆290 · Updated this week
- ☆201 · Updated 3 months ago
- Codes & examples for "CUDA - From Correctness to Performance"☆80Updated 3 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆175 · Updated 3 weeks ago
- A curated collection of papers on MoE model inference. ☆69 · Updated this week
- All homeworks for TinyML and Efficient Deep Learning Computing, 6.5940 • Fall 2023 • https://efficientml.ai. ☆156 · Updated last year
- Learning how CUDA works. ☆201 · Updated 6 months ago
- Examples of CUDA implementations with CUTLASS CuTe. ☆139 · Updated 2 weeks ago
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆221 · Updated 2 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving. ☆19 · Updated this week
- Materials for learning SGLang. ☆275 · Updated 2 weeks ago
- LLaMA INT4 CUDA inference with AWQ. ☆50 · Updated last month
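
For readers new to the flash attention entries above, here is a minimal NumPy sketch of the online-softmax trick those kernels implement: iterate over K/V blocks while carrying a running max and normalizer, so the full attention matrix is never materialized. The block size, shapes, and function name are illustrative assumptions, not taken from any listed repository.

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Tiled attention with online softmax over K/V blocks."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of the logits
    l = np.zeros(n)           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Q @ Kb.T) * scale                  # logits for this block
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])          # block probabilities
        correction = np.exp(m - m_new)          # rescale old accumulators
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against the naive reference.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
s = (Q @ K.T) / np.sqrt(32)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention(Q, K, V), ref)
```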
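The theoretical performance analysis entry boils down to arithmetic like the following back-of-the-envelope sketch. The dense-decoder formulas (ignoring normalization, biases, and gated MLPs) and the roughly-7B configuration are assumptions for illustration only.

```python
def decoder_stats(layers, d_model, vocab, d_ff=None):
    """Approximate parameter count and per-token decode FLOPs
    for a dense transformer decoder."""
    d_ff = d_ff or 4 * d_model
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff  # attn + MLP weights
    params = layers * per_layer + vocab * d_model    # + embedding table
    flops_per_token = 2 * params                     # ~2 FLOPs per weight
    return params, flops_per_token

params, flops = decoder_stats(layers=32, d_model=4096, vocab=32000)
print(f"{params/1e9:.2f}B params, {flops/1e9:.1f} GFLOPs per decoded token")
```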
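Likewise, the roofline comparison entry rests on one formula: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, assuming made-up A100-like peak numbers (312 fp16 TFLOP/s, 2039 GB/s):

```python
def roofline(flops, bytes_moved, peak_tflops, bw_gbs):
    """Return arithmetic intensity (FLOP/byte) and the roofline-bound
    attainable throughput in FLOP/s."""
    intensity = flops / bytes_moved
    attainable = min(peak_tflops * 1e12, bw_gbs * 1e9 * intensity)
    return intensity, attainable

# Single-token decode GEMV on a 4096x4096 fp16 weight: 2*N*N FLOPs,
# ~2*N*N bytes of weights read -> intensity ~2, firmly memory bound.
i, perf = roofline(2 * 4096**2, 2 * 4096**2, peak_tflops=312, bw_gbs=2039)
print(f"intensity {i:.1f} FLOP/B -> {perf/1e12:.2f} TFLOP/s attainable")
```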