hkproj / pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
☆328 · Updated last year
Alternatives and similar repositories for pytorch-llama
Users interested in pytorch-llama are comparing it to the repositories listed below.
- ☆168 · Updated 5 months ago
- LoRA: Low-Rank Adaptation of Large Language Models implemented using PyTorch ☆104 · Updated last year
- This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT)… ☆67 · Updated last year
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ ☆755 · Updated last week
- Fast inference from large language models via speculative decoding ☆745 · Updated 9 months ago
- Ring attention implementation with flash attention ☆771 · Updated last week
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ☆611 · Updated last year
- Notes about the LLaMA 2 model ☆59 · Updated last year
- LLM KV cache compression made easy ☆493 · Updated 3 weeks ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆429 · Updated 4 months ago
- ☆322 · Updated last year
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆95 · Updated last year
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆287 · Updated 2 months ago
- Awesome list for LLM quantization ☆223 · Updated 5 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆266 · Updated 5 months ago
- Tutorial for how to build BERT from scratch ☆93 · Updated last year
- LoRA and DoRA from Scratch Implementations ☆203 · Updated last year
- Efficient LLM Inference over Long Sequences ☆376 · Updated this week
- A family of compressed models obtained via pruning and knowledge distillation ☆341 · Updated 6 months ago
- A simple and effective LLM pruning approach ☆752 · Updated 9 months ago
- Official PyTorch implementation of QA-LoRA ☆135 · Updated last year
- Notes and commented code for RLHF (PPO) ☆94 · Updated last year
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug and play, no complex CUDA kernels ☆105 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆831 · Updated 8 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods ☆185 · Updated 4 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,249 · Updated 2 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆471 · Updated 8 months ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆269 · Updated last month
- All homeworks for TinyML and Efficient Deep Learning Computing 6.5940, Fall 2023, https://efficientml.ai ☆170 · Updated last year
- Distributed trainer for LLMs ☆575 · Updated last year
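
Several repositories above implement LoRA from scratch. As a rough illustration of the underlying idea only (not code from any of the listed repos), here is a minimal NumPy sketch of a LoRA-adapted linear layer; the class name, shapes, and hyperparameter defaults are all illustrative assumptions:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: adapt a frozen weight W as W + (alpha/r) * B @ A.

    A (r x in_dim) and B (out_dim x r) are the only trainable parameters;
    B is zero-initialized so the adapter is a no-op before any training.
    """

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.standard_normal((r, in_dim)) * 0.01  # small random init
        self.B = np.zeros((out_dim, r))                   # zero init => identity adapter
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus scaled low-rank update; in training, gradients
        # would flow only into A and B, never into W.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the layer reproduces the base model's output exactly until the adapter is trained, which is what makes LoRA safe to bolt onto a pretrained checkpoint.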