yifanlu0227 / LLaMA2-7B-on-laptop
Lab 5 project of MIT-6.5940, deploying LLaMA2-7B-chat on one's laptop with TinyChatEngine.
★17 · Updated last year
Alternatives and similar repositories for LLaMA2-7B-on-laptop:
Users interested in LLaMA2-7B-on-laptop are comparing it to the repositories listed below.
- Automatically update circult-eda-mlsys-tinyml papers daily using GitHub Actions (updated every 8 hours). ★10 · Updated this week
- Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU. ★47 · Updated last year
- All homeworks for TinyML and Efficient Deep Learning Computing (6.5940, Fall 2023): https://efficientml.ai ★165 · Updated last year
- List of papers on Vision Transformer quantization and hardware acceleration from recent AI conferences and journals. ★83 · Updated 10 months ago
- Code release for AdapMoE, accepted at ICCAD 2024. ★19 · Updated last month
- Assignments from Cornell Tech's ECE 5545 (Machine Learning Hardware and Systems), Spring 2023. ★28 · Updated last year
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. ★42 · Updated last month
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. ★33 · Updated 3 weeks ago
- [DAC 2024] EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive La… ★52 · Updated 9 months ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) in deep learning on Tensor Cores. ★86 · Updated 2 years ago
- Summary of awesome work on optimizing LLM inference. ★69 · Updated last week
- Large Language Model (LLM) serving paper and resource list. ★21 · Updated 7 months ago
- A repository of Binary General Matrix Multiply (BGEMM) implemented with customized CUDA kernels. Thanks to FP6-LLM for the groundwork! ★14 · Updated 7 months ago
- Examples of CUDA implementations using CUTLASS CuTe. ★159 · Updated 2 months ago
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA '24). ★14 · Updated 9 months ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS '24). ★36 · Updated 4 months ago
- Code repository for Evaluating Quantized Large Language Models. ★121 · Updated 7 months ago
- Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning". ★118 · Updated last year
- Optimized softmax kernels in Triton covering many cases. ★20 · Updated 7 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. ★104 · Updated last week