aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to make the key parts of the architecture easy to follow.
☆64 · Updated last year
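As context for the architecture the repository walks through, here is a minimal sketch of the RMSNorm layer that LLaMA-family models use in place of standard LayerNorm. It assumes PyTorch and is illustrative only, not the repository's own code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, as used in LLaMA-family models."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its root mean square,
        # then apply the learned gain; unlike LayerNorm, there is no
        # mean-centering and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```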
Alternatives and similar repositories for LLaMA2:
Users interested in LLaMA2 are comparing it to the libraries listed below.
- Explorations into some recent techniques surrounding speculative decoding ☆250 · Updated 3 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆203 · Updated 2 weeks ago
- Official PyTorch implementation of QA-LoRA ☆129 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆92 · Updated last year
- ☆220 · Updated 9 months ago
- ☆125 · Updated last year
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (see the sketch after this list) ☆91 · Updated last year
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆278 · Updated 3 weeks ago
- ☆145 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆192 · Updated last year
- Experiments on speculative sampling with Llama models ☆125 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆158 · Updated 8 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆337 · Updated 7 months ago
- Code for studying the super weight in LLM ☆94 · Updated 3 months ago
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge. ☆78 · Updated last year
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed". ☆165 · Updated this week
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) ☆205 · Updated 10 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆301 · Updated 9 months ago
- ☆184 · Updated 6 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆154 · Updated 9 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆198 · Updated 4 months ago
- Awesome list for LLM quantization ☆190 · Updated 3 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆180 · Updated 2 months ago
- ☆253 · Updated last year
- An extension of the nanoGPT repository for training small MoE models. ☆109 · Updated 3 weeks ago
- ☆195 · Updated 3 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆141 · Updated 6 months ago
- ☆122 · Updated last month
- LLaMA 2 implemented from scratch in PyTorch ☆311 · Updated last year
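Several of the repositories above implement speculative sampling. As a rough illustration of the core accept/reject rule from the DeepMind paper, here is a minimal NumPy sketch: the distributions `p` (target model) and `q` (draft model) are assumed given, and the full multi-token drafting loop is omitted.

```python
import numpy as np

def speculative_step(p: np.ndarray, q: np.ndarray, rng: np.random.Generator) -> int:
    """One accept/reject step of speculative sampling.

    p: the target model's next-token distribution (1-D, sums to 1)
    q: the draft model's next-token distribution (1-D, sums to 1)
    """
    x = rng.choice(len(q), p=q)                # token proposed by the cheap draft model
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with probability min(1, p(x)/q(x))
        return x
    residual = np.maximum(p - q, 0.0)          # on rejection, resample from the
    return rng.choice(len(p), p=residual / residual.sum())  # normalized residual
```

The accept/reject rule guarantees that the returned token is distributed exactly according to the target model's `p`, while letting the draft model propose most tokens cheaply.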