aju22 / LLaMA2
This repository contains an implementation of LLaMA 2 (Large Language Model Meta AI), a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process; the code is restructured and heavily commented so the key parts of the architecture are easy to follow.
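One of the architectural choices that distinguishes LLaMA-family models from the original Transformer is RMSNorm in place of LayerNorm. As a rough illustration of that piece of the architecture, here is a minimal NumPy sketch; the function name and shapes are illustrative and not taken from this repository:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square over the feature dimension,
    # then apply a learned per-feature scale (no mean subtraction,
    # no bias -- unlike LayerNorm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Toy input: one token with a 4-dimensional hidden state.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
weight = np.ones(4)
out = rms_norm(x, weight)
```

After normalization the mean squared activation is approximately 1, which is the invariant RMSNorm maintains regardless of the input scale.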
Alternatives and similar repositories for LLaMA2:
Users interested in LLaMA2 are comparing it to the repositories listed below.
- Simple implementation of Speculative Sampling in NumPy for GPT-2.
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind
- Explorations into some recent techniques surrounding speculative decoding
- Official PyTorch implementation of QA-LoRA
- The official implementation of the EMNLP 2023 paper LLM-FP4
- Code for compression methods for transformers, accompanying the authors' publications
- Experiments on speculative sampling with Llama models
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)
- Code for studying the super weight in LLMs
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed"
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods.
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
- An extension of the nanoGPT repository for training small MoE models
- PB-LLM: Partially Binarized Large Language Models
- Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference for large language models
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024)
- Easy and Efficient Quantization for Transformers
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
- REST: Retrieval-Based Speculative Decoding (NAACL 2024)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
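Several of the repositories above implement speculative sampling from the DeepMind paper. The accept/reject rule at its core can be sketched in a few lines of NumPy; the helper name and toy distributions below are illustrative, not code from any listed repository:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_token, p_target, p_draft):
    # Accept the draft model's token with probability min(1, p/q),
    # where p and q are the target and draft probabilities of that token.
    p, q = p_target[draft_token], p_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token
    # On rejection, resample from the normalized residual max(0, p - q),
    # which keeps the overall output distribution equal to the target's.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))
```

When the draft and target distributions agree, min(1, p/q) is 1 and every draft token is accepted, which is why a well-matched draft model yields large speedups.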