yifanlu0227 / LLaMA2-7B-on-laptop
Lab 5 project of MIT-6.5940, deploying LLaMA2-7B-chat on one's laptop with TinyChatEngine.
★17 · Updated last year
Alternatives and similar repositories for LLaMA2-7B-on-laptop:
Users interested in LLaMA2-7B-on-laptop are comparing it to the repositories listed below.
- Automatically update circult-eda-mlsys-tinyml papers daily using GitHub Actions (updated every 8 hours). ★10 · Updated this week
- Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU. ★47 · Updated last year
- All homeworks for TinyML and Efficient Deep Learning Computing (6.5940, Fall 2023): https://efficientml.ai ★165 · Updated last year
- List of papers on Vision Transformer quantization and hardware acceleration from recent AI conferences and journals. ★83 · Updated 10 months ago
- Code release for AdapMoE, accepted at ICCAD 2024. ★19 · Updated last month
- Assignments from Cornell Tech's ECE 5545 (Machine Learning Hardware and Systems), Spring 2023. ★28 · Updated last year
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. ★42 · Updated last month
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. ★33 · Updated 3 weeks ago
- [DAC 2024] EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive La… ★52 · Updated 9 months ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) in deep learning on Tensor Cores. ★86 · Updated 2 years ago
- Summary of awesome work on optimizing LLM inference. ★69 · Updated last week
- Large Language Model (LLM) serving paper and resource list. ★21 · Updated 7 months ago
- A repository of Binary General Matrix Multiply (BGEMM) implemented with customized CUDA kernels. Thanks to FP6-LLM for the groundwork! ★14 · Updated 7 months ago
- Examples of CUDA implementations using CUTLASS CuTe. ★159 · Updated 2 months ago
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA '24). ★14 · Updated 9 months ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS '24). ★36 · Updated 4 months ago
- Code repository for Evaluating Quantized Large Language Models. ★121 · Updated 7 months ago
- Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning". ★118 · Updated last year
- Optimized softmax kernels in Triton covering many cases. ★20 · Updated 7 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. ★104 · Updated last week