yifanlu0227 / LLaMA2-7B-on-laptop
Lab 5 project of MIT-6.5940, deploying LLaMA2-7B-chat on one's laptop with TinyChatEngine.
☆11 · Updated 11 months ago
Related projects
Alternatives and complementary repositories for LLaMA2-7B-on-laptop
- Summary of some awesome work for optimizing LLM inference ☆37 · Updated this week
- ☆144 · Updated last year
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆25 · Updated 11 months ago
- All homeworks for TinyML and Efficient Deep Learning Computing (6.5940, Fall 2023) • https://efficientml.ai ☆137 · Updated 11 months ago
- List of papers related to Vision Transformer quantization and hardware acceleration in recent AI conferences and journals. ☆55 · Updated 5 months ago
- ☆80 · Updated last year
- Examples of CUDA implementations using CUTLASS CuTe ☆101 · Updated last week
- Tutorials on extending and importing TVM as a CMake include dependency. ☆11 · Updated last month
- ☆52 · Updated 2 weeks ago
- ☆26 · Updated 3 weeks ago
- A simplified flash-attention implementation using CUTLASS, intended for teaching purposes ☆32 · Updated 3 months ago
- Puzzles for learning Triton, play it with minimal environment configuration! ☆123 · Updated last week
- ☆21 · Updated last year
- ☆123 · Updated last year
- Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU. ☆41 · Updated last year
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs. ☆85 · Updated 6 months ago
- ☆82 · Updated this week
- Homework solutions for CMU 10-414/714 – Deep Learning Systems: Algorithms and Implementation ☆41 · Updated last year
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24) ☆44 · Updated 5 months ago
- ☆12 · Updated 8 months ago
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of papers… ☆175 · Updated this week
- This repo contains the Assignments from Cornell Tech's ECE 5545 - Machine Learning Hardware and Systems offered in Spring 2023 ☆19 · Updated last year
- FP8 flash attention for the Ada architecture, implemented with the CUTLASS library ☆52 · Updated 3 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆181 · Updated last year
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. ☆50 · Updated 2 months ago
- ☆131 · Updated 4 months ago
- Code repository of "Evaluating Quantized Large Language Models" ☆103 · Updated 2 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆83 · Updated 3 months ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models" ☆53 · Updated 8 months ago