aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pre-trained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to make the key parts of the architecture easy to understand.
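For orientation, here is a minimal PyTorch sketch of the kind of decoder block such an implementation typically contains: RMSNorm, rotary position embeddings (RoPE), and a SwiGLU feed-forward. All class names, sizes, and parameters below are illustrative assumptions, not this repository's actual API.

```python
# Hypothetical sketch of one LLaMA-2-style decoder block (not this repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Normalize by root-mean-square only: no mean subtraction, no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim). Rotate channel pairs by position-dependent
    # angles; q and k share the transform, so their dot products encode relative position.
    b, s, h, d = x.shape
    freqs = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs      # (seq, d/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(dim, dim, bias=False) for _ in range(4))
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        hidden = int(8 * dim / 3)  # rough SwiGLU sizing used by LLaMA-family models
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, s, d = x.shape
        # Pre-normalization: norm is applied before attention/FFN, residual added after.
        h = self.attn_norm(x)
        q = rope(self.wq(h).view(b, s, self.n_heads, self.head_dim))
        k = rope(self.wk(h).view(b, s, self.n_heads, self.head_dim))
        v = self.wv(h).view(b, s, self.n_heads, self.head_dim)
        attn = F.scaled_dot_product_attention(        # causal self-attention
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        ).transpose(1, 2).reshape(b, s, d)
        x = x + self.wo(attn)
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

block = DecoderBlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

Note the LLaMA-family conventions the sketch mirrors: pre-normalization rather than post-normalization, and no bias terms in any of the linear layers.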
☆54 · Updated last year
Related projects
Alternatives and complementary repositories for LLaMA2
- Official PyTorch implementation of QA-LoRA ☆117 · Updated 8 months ago
- The official implementation of the EMNLP 2023 paper "LLM-FP4" ☆167 · Updated 11 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆81 · Updated 8 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆254 · Updated 2 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆211 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2 (see the sketch after this list) ☆89 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- For releasing code related to compression methods for transformers, accompanying our publications ☆372 · Updated last month
- Low-bit optimizers for PyTorch ☆119 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs ☆307 · Updated 7 months ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆241 · Updated last month
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆134 · Updated 5 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆262 · Updated last year
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to onnx/onnx-runtime ☆149 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code) ☆135 · Updated last month
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) ☆199 · Updated 6 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆138 · Updated 2 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆188 · Updated 3 weeks ago
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge ☆67 · Updated last year
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆358 · Updated last month
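Several of the repositories above implement speculative sampling, DeepMind's draft-then-verify decoding scheme. As a rough illustration of the acceptance rule, here is a self-contained NumPy toy: `toy_model`, the vocabulary size, and the draft length `k` are all made up for the example, and a real system batches the target model's verification into a single forward pass.

```python
# Toy speculative sampling (after Chen et al., 2023); the "models" are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def toy_model(seed):
    """A fake LM: deterministically maps a token sequence to a distribution over VOCAB."""
    def probs(seq):
        r = np.random.default_rng([seed, *seq[-8:], len(seq)])
        logits = r.standard_normal(VOCAB)
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return probs

draft_probs = toy_model(1)    # small, cheap model q(x | seq)
target_probs = toy_model(2)   # large, expensive model p(x | seq)

def speculative_step(seq, k=4):
    """Extend `seq` by up to k+1 tokens: k draft proposals plus one bonus token."""
    # 1) Draft model proposes k tokens autoregressively.
    drafted, q_dists, s = [], [], list(seq)
    for _ in range(k):
        q = draft_probs(s)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append(t); q_dists.append(q); s.append(t)
    # 2) Target model scores all k+1 prefixes (one batched pass in practice).
    p_dists = [target_probs(list(seq) + drafted[:i]) for i in range(k + 1)]
    # 3) Accept draft token i with prob min(1, p/q); on rejection, resample
    #    from the residual distribution max(p - q, 0) and stop.
    out = list(seq)
    for i, t in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # 4) Every draft accepted: sample one extra token from the target distribution.
    out.append(int(rng.choice(VOCAB, p=p_dists[k])))
    return out

seq = [1]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```

The key property of this acceptance/resampling rule is that the output tokens are distributed exactly as if sampled from the target model alone, so the draft model affects only decoding speed, not output quality.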