aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to make the key parts of the architecture easy to follow.
☆68 · Updated last year
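For orientation, below is a minimal sketch (not taken from this repository; module names and dimensions are assumptions) of two components that distinguish the LLaMA architecture from a vanilla GPT: RMSNorm in place of LayerNorm, and a SwiGLU gated feed-forward block.

```python
# Minimal sketch of two LLaMA-style building blocks (RMSNorm and SwiGLU FFN).
# Names and dimensions are illustrative assumptions, not the repository's API.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS instead of mean/variance."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

# Usage: normalize a batch of hidden states, then run the gated FFN.
x = torch.randn(2, 16, 512)                       # (batch, seq_len, dim) -- assumed sizes
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))  # hidden_dim ~ 8/3 * dim, an assumed choice
print(y.shape)                                     # torch.Size([2, 16, 512])
```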
Alternatives and similar repositories for LLaMA2
Users interested in LLaMA2 are comparing it to the libraries listed below.
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (the core accept/reject rule is sketched after this list)☆98 · Updated last year
- ☆223 · Updated last year
- Official PyTorch implementation of QA-LoRA☆138 · Updated last year
- A family of compressed models obtained via pruning and knowledge distillation☆344 · Updated 8 months ago
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge.☆81 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2.☆95 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding☆272 · Updated 6 months ago
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed".☆174 · Updated 3 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)☆225 · Updated 4 months ago
- An extension of the nanoGPT repository for training small MoE models.☆162 · Updated 4 months ago
- ☆199 · Updated 7 months ago
- ☆127 · Updated last year
- Code for studying the super weight in LLMs☆113 · Updated 7 months ago
- Code related to compression methods for transformers, accompanying our publications☆435 · Updated 6 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆167 · Updated last year
- Easy and Efficient Quantization for Transformers☆198 · Updated 3 weeks ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"☆304 · Updated 4 months ago
- Spherical merge of PyTorch/HF-format language models with minimal feature loss.☆132 · Updated last year
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"☆147 · Updated last year
- Low-bit optimizers for PyTorch☆130 · Updated last year
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024)☆205 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)☆159 · Updated 3 months ago
- Experiments on speculative sampling with Llama models☆128 · Updated 2 years ago
- PB-LLM: Partially Binarized Large Language Models☆152 · Updated last year
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024☆320 · Updated 2 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024☆205 · Updated 7 months ago
- Layer-Condensed KV cache with 10x larger batch size, fewer params, and less computation. Dramatic speed-up with better task performance…☆151 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs☆265 · Updated 9 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.☆198 · Updated last year
- LoRA and DoRA from Scratch Implementations☆206 · Updated last year
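Several of the repositories above implement speculative sampling or decoding. As referenced in the first list item, here is a toy sketch of the acceptance rule at the heart of that technique; the draft and target models, and the drafting step itself, are assumed to exist elsewhere, and all names here are illustrative.

```python
# Toy sketch of the speculative-sampling accept/reject loop (NumPy).
# draft_probs[i] and target_probs[i] are the draft and target models'
# distributions over the vocabulary at drafted position i (assumed inputs).
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(drafted_tokens, draft_probs, target_probs):
    """Return the tokens kept from one round of speculative sampling."""
    accepted = []
    for i, tok in enumerate(drafted_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):            # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized,
            # and stop after the first rejected position.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All drafted tokens accepted; the full algorithm would also sample one
    # bonus token from the target model at the next position.
    return accepted

# Tiny example with a 4-token vocabulary and 2 drafted tokens.
draft = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.25, 0.25, 0.25, 0.25])]
target = [np.array([0.4, 0.3, 0.2, 0.1]), np.array([0.1, 0.6, 0.2, 0.1])]
print(speculative_accept([0, 1], draft, target))
```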