sunkx109 / llama
Inference code for LLaMA models
☆128 · Updated Aug 13, 2023 (2 years ago)
Alternatives and similar repositories for llama
Users interested in llama are comparing it to the repositories listed below.
- Llama 2 inference ☆43 · Updated Nov 4, 2023 (2 years ago)
- A Docker image for One Student One Chip's debug exam ☆10 · Updated Sep 22, 2023 (2 years ago)
- Inference deployment of the Llama 3 model ☆11 · Updated Apr 21, 2024 (last year)
- The simplest online-softmax notebook for explaining Flash Attention (see the sketch after this list) ☆15 · Updated Jan 27, 2026 (2 weeks ago)
- Quantize yolov5 using pytorch_quantization 🚀🚀🚀 ☆14 · Updated Oct 24, 2023 (2 years ago)
- Tutorials for writing high-performance GPU operators in AI frameworks ☆136 · Updated Aug 12, 2023 (2 years ago)
- 🐱 ncnn int8 model quantization evaluation ☆14 · Updated Oct 10, 2022 (3 years ago)
- ☆119 · Updated Apr 2, 2025 (10 months ago)
- ☆117 · Updated Jan 10, 2026 (last month)
- Summary notes for Li Jianzhong's "Design Patterns" course ☆15 · Updated Jan 9, 2020 (6 years ago)
- Hardware-independent code for learning some NCCL mechanisms ☆25 · Updated Apr 19, 2024 (last year)
- A fork of Xiangshan for AI ☆36 · Updated Feb 6, 2026 (last week)
- ☆79 · Updated May 16, 2023 (2 years ago)
- Linux on RISC-V on FPGA (LOROF): RV64GC Sv39 Quad-Core Superscalar Out-of-Order Virtual Memory CPU ☆15 · Updated this week
- This repository contains all (Python 3) code and libraries required for the 2022-2023 Notre Dame Rocketry Team (NDRT) Apogee Control Syst… ☆10 · Updated Apr 30, 2023 (2 years ago)
- ☆141 · Updated Apr 23, 2024 (last year)
- Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU ☆49 · Updated Jun 15, 2023 (2 years ago)
- Build a mini Linux for your own RISC-V emulator! ☆24 · Updated Sep 11, 2024 (last year)
- An easy-to-use package for implementing SmoothQuant for LLMs (see the scaling sketch after this list) ☆110 · Updated Apr 7, 2025 (10 months ago)
- A beginner's tutorial on model compression ☆22 · Updated Jul 7, 2024 (last year)
- A parametric RTL code generator for an efficient integer MxM systolic-array implementation on Xilinx FPGAs ☆31 · Updated Aug 28, 2025 (5 months ago)
- Hands-on LLM deployment: TensorRT-LLM, Triton Inference Server, vLLM ☆27 · Updated Feb 26, 2024 (last year)
- Summary of the specs of GPUs commonly used for LLM training and inference ☆75 · Updated Aug 12, 2025 (6 months ago)
- Train an LLM from scratch on a single 24 GB GPU ☆56 · Updated Jul 9, 2025 (7 months ago)
- Some HPC projects for learning ☆26 · Updated Aug 28, 2024 (last year)
- How to learn PyTorch and OneFlow ☆482 · Updated Mar 22, 2024 (last year)
- ☆27 · Updated May 27, 2024 (last year)
- Code and notes for the six major CUDA parallel-computing patterns ☆61 · Updated Jul 30, 2020 (5 years ago)
- ☆12 · Updated Aug 12, 2022 (3 years ago)
- ☆1,047 · Updated Mar 13, 2024 (last year)
- Library for modelling the performance costs of different neural-network workloads on NPU devices ☆34 · Updated this week
- Cycle-accurate C++ & SystemC simulator for the RISC-V GPGPU Ventus ☆31 · Updated Dec 24, 2025 (last month)
- FP8 Flash Attention implemented on the Ada architecture using the CUTLASS repository ☆78 · Updated Aug 12, 2024 (last year)
- 🤖 FFPA: extends FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA ☆250 · Updated Feb 5, 2026 (last week)
- An exact algorithm for the maximum clique problem (MCP) which improves over state-of-the-art approaches in some cases by orders of magnit… ☆14 · Updated Nov 15, 2025 (2 months ago)
- ☆175 · Updated May 7, 2025 (9 months ago)
- A lightweight llama-like LLM inference framework based on Triton kernels ☆171 · Updated Jan 5, 2026 (last month)
- QQQ: an innovative, hardware-optimized W4A8 quantization solution for LLMs ☆154 · Updated Aug 21, 2025 (5 months ago)
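
The online-softmax notebook listed above explains the streaming recurrence that Flash Attention relies on: keep a running maximum and a running denominator, and rescale the denominator whenever the maximum grows. A minimal NumPy sketch of that recurrence (the names are illustrative, not taken from the notebook):

```python
import numpy as np

def online_softmax(x):
    """Single-pass softmax statistics: running max m and running
    denominator d, with d rescaled whenever m increases."""
    m, d = -np.inf, 0.0
    for xi in x:
        m_new = max(m, float(xi))
        d = d * np.exp(m - m_new) + np.exp(float(xi) - m_new)
        m = m_new
    # Second pass only to produce the full vector for checking;
    # Flash Attention instead folds m and d into its output accumulator.
    return np.exp(np.asarray(x) - m) / d

x = np.random.randn(16)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```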
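
The SmoothQuant entry names a specific technique: migrate activation outliers into the weights with a per-channel scale so both sides become easier to quantize. A minimal NumPy sketch of the core scaling rule, assuming the paper's default α = 0.5 (function and variable names are illustrative, not the package's API):

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing factors:
    s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).
    Dividing activations and multiplying weights by s leaves X @ W
    unchanged while shrinking activation outliers."""
    act_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-channel weight range
    return act_max ** alpha / w_max ** (1 - alpha)

# Toy check: smoothing is mathematically equivalent before quantization.
X, W = np.random.randn(4, 8), np.random.randn(8, 3)
s = smoothquant_scales(X, W)
assert np.allclose((X / s) @ (W * s[:, None]), X @ W)
```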