aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to make the key parts of the architecture easy to follow.
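As a quick orientation, here is a minimal, self-contained sketch of two components that distinguish the LLaMA architecture from the original GPT: RMSNorm in place of LayerNorm, and rotary position embeddings (RoPE) in place of learned absolute positions. This is illustrative code, not code from this repo; PyTorch and all names below are assumptions.

```python
# Minimal sketch (not from this repo) of two LLaMA-specific components:
# RMSNorm and rotary position embeddings (RoPE). PyTorch is assumed.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: rescales by the RMS of the activations,
    with a learned gain and no mean-centering or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (batch, seq, n_heads, head_dim): the two
    halves of each head are rotated by position-dependent angles."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(1, 8, 4, 64)                   # (batch, seq, heads, head_dim)
print(rotary_embedding(RMSNorm(64)(x)).shape)  # torch.Size([1, 8, 4, 64])
```

In the full model these sit inside every decoder block: RMSNorm is applied before the attention and feed-forward sublayers, and RoPE is applied to the queries and keys.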
☆68 · Updated last year
Alternatives and similar repositories for LLaMA2
Users interested in LLaMA2 are comparing it to the repositories listed below.
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (a NumPy sketch of the accept/reject rule appears after this list) ☆98 · Updated last year
- Code for studying the super weight in LLM ☆107 · Updated 6 months ago
- Official PyTorch implementation of QA-LoRA ☆137 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆269 · Updated 6 months ago
- ☆126 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2 ☆95 · Updated last year
- ☆130 · Updated 4 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆163 · Updated 11 months ago
- ☆198 · Updated 6 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆300 · Updated 3 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆222 · Updated 3 months ago
- ☆151 · Updated 2 years ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (a minimal quantize/dequantize sketch appears after this list) ☆359 · Updated 10 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆277 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆137 · Updated 10 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆149 · Updated 2 months ago
- A family of compressed models obtained via pruning and knowledge distillation ☆343 · Updated 7 months ago
- ☆114 · Updated 3 weeks ago
- ☆223 · Updated last year
- Experiments on speculative sampling with Llama models ☆128 · Updated 2 years ago
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge ☆80 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆152 · Updated last year
- ☆194 · Updated last month
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆163 · Updated last year
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆203 · Updated last year
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆137 · Updated 11 months ago
- ☆45 · Updated last year
- This repository contains the training code of ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆80 · Updated 3 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆310 · Updated 11 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆57 · Updated last year
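Several entries above center on speculative sampling. As a hedged illustration, here is a NumPy sketch of the accept/reject rule from the DeepMind paper; `draft_probs` and `target_probs` are toy stand-ins introduced only for this example, not anything from the listed repos.

```python
# Minimal NumPy sketch (illustrative, not code from the repos above) of the
# accept/reject rule in "Accelerating Large Language Model Decoding with
# Speculative Sampling". draft_probs / target_probs are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_probs(context):
    """Toy stand-in for a small, fast draft model."""
    logits = np.sin(np.arange(VOCAB) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(context):
    """Toy stand-in for the large target model."""
    logits = np.cos(np.arange(VOCAB) + 0.5 * len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(context, k=4):
    """Draft k tokens, accept each with prob min(1, p_target/p_draft);
    on the first rejection, resample from the residual max(0, p - q)."""
    drafted, q_list, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = rng.choice(VOCAB, p=q)
        drafted.append(tok); q_list.append(q); ctx.append(tok)
    accepted = []
    for i, tok in enumerate(drafted):
        p = target_probs(context + accepted)
        if rng.random() < min(1.0, p[tok] / q_list[i][tok]):
            accepted.append(tok)
        else:
            residual = np.maximum(p - q_list[i], 0.0)
            accepted.append(rng.choice(VOCAB, p=residual / residual.sum()))
            return accepted
    # All k drafts accepted: sample one extra token from the target for free.
    p = target_probs(context + accepted)
    accepted.append(rng.choice(VOCAB, p=p))
    return accepted

print(speculative_step([1, 2, 3]))
```

With a real draft/target pair, each call to `speculative_step` costs a single target-model forward pass (scoring the k drafted positions in parallel) and yields between 1 and k+1 tokens whose distribution matches sampling from the target alone, which is where the speedup comes from.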
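Several other entries (GEAR, KVQuant, Cold Compress) revolve around compressing the KV cache. A minimal sketch of the shared core idea, uniform low-bit quantization of cached keys/values with a per-token scale and zero-point; everything here is illustrative and not taken from those repos, which each refine this baseline differently.

```python
# Minimal NumPy sketch (assumptions, not code from the repos above) of the
# core idea behind KV cache quantization: store keys/values as low-bit
# integers plus a per-token scale/zero-point, and dequantize on read.
import numpy as np

def quantize_kv(kv: np.ndarray, bits: int = 8):
    """Asymmetric uniform quantization, one scale/offset per token row."""
    qmax = 2**bits - 1
    lo = kv.min(axis=-1, keepdims=True)
    hi = kv.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((kv - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

kv = np.random.randn(2, 16, 64).astype(np.float32)  # (heads, seq, head_dim)
q, scale, lo = quantize_kv(kv)
err = np.abs(dequantize_kv(q, scale, lo) - kv).max()
print(f"max abs error: {err:.4f}, memory: {kv.nbytes} -> {q.nbytes} bytes")
```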