hkproj / pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
☆303 Updated last year
Alternatives and similar repositories for pytorch-llama:
Users who are interested in pytorch-llama are comparing it to the libraries listed below.
- Notes about LLaMA 2 model ☆54 Updated last year
- ☆131 Updated 2 months ago
- This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT)… ☆63 Updated last year
- LORA: Low-Rank Adaptation of Large Language Models implemented using PyTorch ☆99 Updated last year
- A family of compressed models obtained via pruning and knowledge distillation ☆329 Updated 4 months ago
- Notes and commented code for RLHF (PPO) ☆74 Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆246 Updated 2 months ago
- Fast inference from large language models via speculative decoding ☆678 Updated 6 months ago
- Ring attention implementation with flash attention ☆707 Updated 2 weeks ago
- Reference implementation of Mistral AI 7B v0.1 model. ☆28 Updated last year
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆193 Updated 5 months ago
- LLM KV cache compression made easy ☆428 Updated last week
- Large Context Attention ☆687 Updated last month
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) ☆1,021 Updated 3 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆594 Updated last week
- Scalable toolkit for efficient model alignment ☆740 Updated this week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆594 Updated last year
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆227 Updated last week
- Cataloging released Triton kernels. ☆185 Updated 2 months ago
- LoRA and DoRA from Scratch Implementations ☆198 Updated last year
- Llama from scratch, or How to implement a paper without crying ☆547 Updated 9 months ago
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆637 Updated this week
- Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models" ☆275 Updated last week
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆331 Updated last week
- Advanced Quantization Algorithm for LLMs/VLMs. ☆388 Updated this week
- Official PyTorch implementation of QA-LoRA ☆127 Updated last year
- For releasing code related to compression methods for transformers, accompanying our publications ☆413 Updated last month
- Collection of kernels written in Triton language ☆110 Updated 3 weeks ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆152 Updated 8 months ago
- Implementation of FlashAttention in PyTorch ☆137 Updated 2 months ago