aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to make the key parts of the architecture easy to follow.
☆64 · Updated last year
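For a rough sense of what an architecture-focused LLaMA 2 implementation covers: the model replaces LayerNorm with RMSNorm and applies rotary position embeddings (RoPE) to the query/key projections. The PyTorch sketch below illustrates those two building blocks only; it is not code from this repository, and the function names and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (not this repository's code) of two LLaMA 2 building blocks:
# RMSNorm and rotary position embeddings (RoPE).
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer norm, used in LLaMA in place of LayerNorm."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the feature dimension, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    """Complex phases exp(i * t * freq) for every position t and frequency pair."""
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    return torch.polar(torch.ones(seq_len, head_dim // 2), torch.outer(t, freqs))


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors by position-dependent phases.

    Assumed shapes: x is (batch, seq_len, n_heads, head_dim),
    freqs_cis is (seq_len, head_dim // 2).
    """
    # Treat consecutive pairs of features as complex numbers and rotate them.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rot = x_c * freqs_cis.view(1, x_c.shape[1], 1, x_c.shape[-1])
    return torch.view_as_real(x_rot).flatten(-2).type_as(x)
```

In a full decoder block these would be applied to the query and key projections before the scaled dot-product attention; see the repository's commented source for the actual wiring.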
Alternatives and similar repositories for LLaMA2:
Users interested in LLaMA2 are comparing it to the libraries listed below.
- Official PyTorch implementation of QA-LoRA · ☆129 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 · ☆192 · Updated last year
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) · ☆203 · Updated 3 weeks ago
- Explorations into some recent techniques surrounding speculative decoding · ☆250 · Updated 3 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (a minimal sketch of the acceptance rule follows after this list) · ☆91 · Updated last year
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" · ☆155 · Updated 9 months ago
- Code for studying the super weight in LLM · ☆94 · Updated 4 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2 · ☆92 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) · ☆150 · Updated 3 months ago
- Low-bit optimizers for PyTorch · ☆125 · Updated last year
- PB-LLM: Partially Binarized Large Language Models · ☆152 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… · ☆124 · Updated 7 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" · ☆278 · Updated 3 weeks ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".☆272Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity☆71Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization☆337Updated 7 months ago
- Prune transformer layers☆68Updated 10 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods · ☆180 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" · ☆155 · Updated 5 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" · ☆129 · Updated 8 months ago
- A family of compressed models obtained via pruning and knowledge distillation · ☆331 · Updated 4 months ago
- 🔥 A minimal training framework for scaling FLA models · ☆92 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆158 · Updated 8 months ago
- Spherical merging of PyTorch/HF-format language models with minimal feature loss · ☆119 · Updated last year
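Several entries above revolve around speculative sampling/decoding. As a rough illustration (not any listed repository's actual code), the NumPy toy below sketches the acceptance rule from the DeepMind paper: accept a draft token x with probability min(1, p_target(x) / p_draft(x)), otherwise resample from the normalized residual max(0, p_target - p_draft). The distributions and function names here are stand-ins, not real model outputs.

```python
# Toy sketch of the speculative sampling acceptance rule (Chen et al., 2023).
# The probability vectors stand in for real draft/target model outputs.
import numpy as np

rng = np.random.default_rng(0)


def sample(probs: np.ndarray) -> int:
    """Draw one token index from a categorical distribution."""
    return int(rng.choice(len(probs), p=probs))


def speculative_step(p_draft: np.ndarray, p_target: np.ndarray) -> tuple[int, bool]:
    """Propose one token from the draft distribution and accept or resample it."""
    x = sample(p_draft)                                   # cheap draft proposal
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True                                    # accepted
    residual = np.maximum(p_target - p_draft, 0.0)        # rejected: resample residual
    return sample(residual / residual.sum()), False


# Toy example with a 4-token vocabulary and mismatched draft/target distributions.
p_d = np.array([0.5, 0.2, 0.2, 0.1])
p_t = np.array([0.3, 0.4, 0.2, 0.1])
print(speculative_step(p_d, p_t))
```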