harleyszhang / lite_llamaLinks

A light llama-like llm inference framework based on the triton kernel.

☆158

Alternatives and similar repositories for lite_llama

Users that are interested in lite_llama are comparing it to the libraries listed below

Sorting:

zjhellofss / triton_course
☆36Updated 5 months ago
ifromeast / cuda_learning
learning how CUDA works
☆330Updated 7 months ago
zjhellofss / KuiperLLama
校招、秋招、春招、实习好项目，带你从零动手实现支持LLama2/3和Qwen2.5的大模型推理框架。
☆439Updated 3 months ago
RussWong / CUDATutorial
A CUDA tutorial to make people learn CUDA program from 0
☆258Updated last year
caiwanxianhust / FasterLLaMA
使用 CUDA C++ 实现的 llama 模型推理框架
☆62Updated 11 months ago
harleyszhang / llm_counts
llm theoretical performance analysis tools and support params, flops, memory and latency analysis.
☆108Updated 3 months ago
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆458Updated last year
RussWong / LLM-engineering
☆25Updated 2 months ago
AyakaGEMM / Hands-on-GEMM
☆141Updated last year
zjhellofss / kuiperdatawhale
☆301Updated last year
zjhellofss / KuiperCourse
b站上的课程
☆76Updated 2 years ago
YuxueYang1204 / CudaDemo
Implement custom operators in PyTorch with cuda/c++
☆71Updated 2 years ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆244Updated 3 months ago
CalvinXKY / BasicCUDA
A tutorial for CUDA&PyTorch
☆158Updated 9 months ago
LDLINGLINGLING / nano_vllm_note
注释的nano_vllm仓库，并且完成了MiniCPM4的适配以及注册新模型的功能
☆81Updated 2 months ago
Tongkaio / CUDA_Kernel_Samples
CUDA 算子手撕与面试指南
☆659Updated 2 months ago
Qwesh157 / conv_op_optimization
This project is about convolution operator optimization on GPU, include GEMM based (Implicit GEMM) convolution.
☆39Updated 3 weeks ago
iclementine / optimize_softmax
Optimize softmax in triton in many cases
☆21Updated last year
HuPengsheet / EasyNN
EasyNN是一个面向教学而开发的神经网络推理框架，旨在让大家0基础也能自主完成推理框架编写！
☆33Updated last year
sunkx109 / My-Torch-Extension
A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.
☆35Updated last year
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆226Updated 2 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆437Updated 5 months ago
OpenPPL / ppl.pmx
☆59Updated 11 months ago
KarhouTam / cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
☆27Updated 2 months ago
xgqdut2016 / cuda_code
easy cuda code
☆87Updated 10 months ago
AdvancedCompiler / AdvancedCompiler
先进编译实验室的个人主页
☆152Updated last week
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆703Updated this week
OpenPPL / ppl.llm.kernel.cuda
☆150Updated 9 months ago
mrzhuzhe / riven
CPU Memory Compiler and Parallel programing
☆26Updated 11 months ago
MAhaitao999 / CUDA_Programming
《CUDA编程基础与实践》一书的代码
☆139Updated 3 years ago