aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process; the code is restructured and heavily commented so the key parts of the architecture are easy to follow.
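For orientation, below is a minimal sketch of two building blocks that distinguish LLaMA-style architectures from the original GPT: RMSNorm in place of LayerNorm, and rotary position embeddings (RoPE) in place of learned absolute positions. This is an illustrative sketch assuming PyTorch; the names `RMSNorm` and `rotary_embedding` are not the repository's own identifiers, and in a real model RoPE is applied per attention head to queries and keys rather than to the full hidden state.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm, as used in LLaMA-style models."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS of the activations, then rescale.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, seq, dim) tensor."""
    _, seqlen, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seqlen, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(rotary_embedding(RMSNorm(64)(x)).shape)  # torch.Size([2, 16, 64])
```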
☆67 · Updated last year
Alternatives and similar repositories for LLaMA2
Users interested in LLaMA2 are comparing it to the libraries listed below.
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (see the sketch after this list) ☆96 · Updated last year
- Code for studying the super weight in LLM ☆104 · Updated 6 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆202 · Updated last year
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆287 · Updated 3 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆95 · Updated last year
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆217 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆163 · Updated 10 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆268 · Updated 5 months ago
- Official PyTorch implementation of QA-LoRA ☆135 · Updated last year
- Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge. ☆80 · Updated last year
- Experiments on speculative sampling with Llama models ☆126 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆305 · Updated 11 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆161 · Updated 11 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆136 · Updated 10 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆357 · Updated 9 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆275 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆204 · Updated last year
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed". ☆173 · Updated 2 months ago
- A family of compressed models obtained via pruning and knowledge distillation ☆342 · Updated 6 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆185 · Updated 4 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆156 · Updated last month
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆103 · Updated 2 years ago
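Several of the repositories above implement speculative sampling. For context, here is a minimal sketch of the accept/reject loop from the DeepMind paper (Chen et al., 2023), assuming PyTorch; `target_logits_fn`, `draft_logits_fn`, and their calling convention (each maps a 1-D token tensor to per-position next-token logits) are illustrative assumptions, not any listed repository's actual API.

```python
import torch

@torch.no_grad()
def speculative_step(target_logits_fn, draft_logits_fn, prefix: torch.Tensor, k: int = 4):
    """One round of speculative sampling: draft k tokens cheaply, verify with
    a single expensive target-model pass, and keep a prefix of proposals."""
    tokens = prefix.clone()
    draft_probs = []
    # 1. Draft model proposes k tokens autoregressively (cheap).
    for _ in range(k):
        q = torch.softmax(draft_logits_fn(tokens)[-1], dim=-1)
        nxt = torch.multinomial(q, 1)
        draft_probs.append(q)
        tokens = torch.cat([tokens, nxt])
    # 2. Target model scores all k proposals in one pass (expensive).
    #    Row j of the output is the distribution over the token at position j+1.
    p_all = torch.softmax(target_logits_fn(tokens), dim=-1)
    n = prefix.numel()
    accepted = prefix.clone()
    for i in range(k):
        tok = tokens[n + i]
        p, q = p_all[n + i - 1], draft_probs[i]
        # 3. Accept with prob min(1, p(tok)/q(tok)); on rejection, resample
        #    from the residual distribution max(0, p - q) and stop.
        if torch.rand(()) < (p[tok] / q[tok]).clamp(max=1.0):
            accepted = torch.cat([accepted, tok.view(1)])
        else:
            residual = (p - q).clamp(min=0)
            return torch.cat([accepted, torch.multinomial(residual / residual.sum(), 1)])
    # 4. All k accepted: sample one bonus token from the target model.
    return torch.cat([accepted, torch.multinomial(p_all[-1], 1)])
```

The residual-resampling step is what makes the procedure exact: the accepted-or-resampled token is distributed exactly as if it had been sampled from the target model alone, so the speedup comes purely from accepting multiple cheap draft tokens per target forward pass.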