aju22 / LLaMA2
This repository contains an implementation of LLaMA 2 (Large Language Model Meta AI), a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process; the code is restructured and heavily commented to make the key parts of the architecture easy to follow.
☆64 · Updated last year
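Since the repository's focus is the LLaMA 2 architecture itself, a minimal sketch of two of its signature components may help orient readers: RMSNorm and rotary position embeddings (RoPE). This is an illustrative PyTorch sketch, not code from this repository; all names, shapes, and defaults below are assumptions.

```python
# Minimal sketch of two LLaMA-2 building blocks: RMSNorm and rotary
# position embeddings (RoPE). Illustrative only; names and shapes are
# assumptions, not this repository's actual code.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: normalizes by RMS, with no mean-centering."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


def precompute_rope(head_dim: int, seq_len: int, base: float = 10000.0):
    """Complex rotation factors e^{i*theta} for each (position, frequency) pair."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)                   # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)


def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); pair adjacent dims as complex numbers.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rot = x_c * freqs_cis[None, :, None, :]          # broadcast over batch and heads
    return torch.view_as_real(x_rot).flatten(-2).type_as(x)


if __name__ == "__main__":
    q = torch.randn(1, 8, 4, 64)                       # (batch, seq, heads, head_dim)
    freqs = precompute_rope(head_dim=64, seq_len=8)
    print(apply_rope(q, freqs).shape)                  # torch.Size([1, 8, 4, 64])
    print(RMSNorm(64)(torch.randn(1, 8, 64)).shape)    # torch.Size([1, 8, 64])
```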
Alternatives and similar repositories for LLaMA2:
Users interested in LLaMA2 are comparing it to the repositories listed below.
- ☆197 · Updated 4 months ago
- Code for studying the super weight in LLMs ☆98 · Updated 4 months ago
- ☆125 · Updated last year
- Official PyTorch implementation of QA-LoRA ☆131 · Updated last year
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆207 · Updated last month
- ☆219 · Updated 10 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) ☆205 · Updated 11 months ago
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge ☆79 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆259 · Updated 4 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆152 · Updated last week
- Simple implementation of Speculative Sampling in NumPy for GPT-2 ☆93 · Updated last year
- ☆147 · Updated last year
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (a toy NumPy sketch of the acceptance rule appears after this list) ☆93 · Updated last year
- ☆122 · Updated 2 months ago
- A pipeline for LLM knowledge distillation ☆100 · Updated 3 weeks ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆135 · Updated 9 months ago
- Spherical merging of PyTorch/HF-format language models with minimal feature loss ☆120 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆72 · Updated 7 months ago
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆103 · Updated last year
- 🔥 A minimal training framework for scaling FLA models ☆107 · Updated 2 weeks ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆198 · Updated 9 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (a toy quantization round-trip appears after this list) ☆340 · Updated 8 months ago
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed" ☆167 · Updated 3 weeks ago
- Experiments on speculative sampling with Llama models ☆125 · Updated last year
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods ☆181 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs ☆159 · Updated 9 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆158 · Updated 10 months ago
- An extension of the nanoGPT repository for training small MoE models ☆131 · Updated last month
- For releasing code related to compression methods for transformers, accompanying our publications ☆424 · Updated 3 months ago
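Several entries above (the NumPy GPT-2 implementation and the reimplementation of the DeepMind paper) center on speculative sampling. For orientation, here is a toy NumPy sketch of the paper's accept/resample rule, using stand-in distributions rather than real draft/target model logits; it is illustrative only and not taken from any of the listed repositories.

```python
# Toy NumPy sketch of the speculative-sampling acceptance rule from
# "Accelerating Large Language Model Decoding with Speculative Sampling".
# The distributions are stand-ins; a real system would use draft/target
# model logits. Everything below is illustrative, not any repo's API.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target: np.ndarray, q_draft: np.ndarray) -> int:
    """Accept a draft token with prob min(1, p/q); else resample the residual."""
    x = rng.choice(len(q_draft), p=q_draft)            # token proposed by draft model
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x                                       # accepted
    residual = np.maximum(p_target - q_draft, 0.0)     # rejected: resample residual
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

# Sanity check: the accept/resample scheme reproduces the target distribution.
p = np.array([0.6, 0.3, 0.1])    # target model's next-token distribution
q = np.array([0.3, 0.3, 0.4])    # draft model's (worse) distribution
samples = [speculative_step(p, q) for _ in range(20000)]
print(np.bincount(samples) / len(samples))             # ~ [0.6, 0.3, 0.1]
```

The point of the scheme is that the combined accept/resample step samples exactly from the target distribution, so drafting costs speed nothing in output quality.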
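The KVQuant and GEAR entries revolve around compressing the KV cache. Below is a toy round-trip of uniform 4-bit quantization with one scale/zero-point per cached vector, a simplified stand-in for the general idea (the actual papers add outlier isolation, non-uniform codes, and error correction); all names and shapes here are assumptions for illustration.

```python
# Toy per-token 4-bit KV-cache quantization round-trip: one asymmetric
# scale/zero-point per cached key vector. A simplified stand-in for
# KVQuant/GEAR-style methods, not their actual implementation.
import torch

def quantize_int4(x: torch.Tensor):
    """Map x to 4-bit integers, reducing over the channel (last) dimension."""
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / 15.0     # 2**4 - 1 levels
    q = torch.round((x - x_min) / scale).clamp(0, 15).to(torch.uint8)
    return q, scale, x_min

def dequantize_int4(q, scale, x_min):
    return q.float() * scale + x_min

k_cache = torch.randn(2, 16, 128)                      # (heads, seq, head_dim)
q, s, z = quantize_int4(k_cache)
err = (dequantize_int4(q, s, z) - k_cache).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")     # small but nonzero
```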