aju22 / LLaMA2
This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process, and the code is restructured and heavily commented to make the key parts of the architecture easy to follow.
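As a rough illustration of the kind of building block such an implementation walks through, here is a minimal NumPy sketch of RMSNorm, the normalization LLaMA 2 uses in place of standard LayerNorm. The function name and shapes are illustrative assumptions, not code taken from the repository:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Root-mean-square norm as in LLaMA 2: no mean subtraction,
    # just rescale so the per-row RMS is ~1, then apply a learned gain.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

# Example: normalize a (batch, dim) activation with a unit gain
x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)
out = rms_norm(x, w)
```

Unlike LayerNorm, RMSNorm skips the mean-centering step, which saves a reduction per token while working comparably well in practice.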
Related projects:
- Official PyTorch implementation of QA-LoRA
- Explorations into some recent techniques surrounding speculative decoding
- Simple implementation of Speculative Sampling in NumPy for GPT-2.
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs (ICLR 2024)
- The official implementation of the EMNLP 2023 paper LLM-FP4
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind
- A pipeline to improve the skills of large language models
- For releasing code related to compression methods for transformers, accompanying our publications
- Easy and Efficient Quantization for Transformers
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"
- Low-bit optimizers for PyTorch
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods.
- ring-attention experiments
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models"
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
- Multipack distributed sampler for fast padding-free training of LLMs
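Several of the projects above implement speculative sampling. The core accept/reject step from the DeepMind paper can be sketched in NumPy roughly as follows; the function name, shapes, and variable names are illustrative assumptions, not taken from any of the listed repositories:

```python
import numpy as np

def speculative_accept(p, q, drafted, rng):
    # p: target-model probabilities, shape (k, vocab)
    # q: draft-model probabilities, shape (k, vocab)
    # drafted: k token ids proposed by the draft model
    # Accept each drafted token with prob min(1, p/q); on the first
    # rejection, resample from the residual distribution max(p - q, 0).
    accepted = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i, tok] / q[i, tok]):
            accepted.append(tok)
        else:
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted
    return accepted

rng = np.random.default_rng(0)
p = np.full((2, 4), 0.25)  # target agrees with the draft exactly,
q = np.full((2, 4), 0.25)  # so every drafted token is accepted
tokens = speculative_accept(p, q, [0, 3], rng)  # -> [0, 3]
```

This sketch omits the bonus token the full algorithm samples from the target model when all k drafted tokens are accepted; the accept/reject rule shown is what makes the combined procedure produce exactly the target model's output distribution.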