pytorch-labs / gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
☆5,912 · Updated last month
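gpt-fast's core idea is that a competitive decode loop needs little more than a compiled per-step forward pass, a static KV cache, and quantized weights. As a minimal, hypothetical sketch of that compile-the-decode-step style (not gpt-fast's actual code; the toy model below is a placeholder):

```python
import torch

@torch.no_grad()
def generate(model: torch.nn.Module, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Greedy decoding with a compiled forward pass (illustrative only)."""
    # dynamic=True lets the growing sequence length avoid per-step recompiles.
    step = torch.compile(model, dynamic=True)
    for _ in range(max_new_tokens):
        logits = step(tokens)  # assumed output shape: [batch, seq, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens

if __name__ == "__main__":
    # Toy stand-in for a causal LM so the sketch runs end to end.
    vocab = 128
    toy = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
    print(generate(toy, torch.randint(0, vocab, (1, 8)), 4).shape)  # torch.Size([1, 12])
```

The repo itself layers a static KV cache, int8/int4 weight quantization, speculative decoding, and tensor parallelism on top of a loop like this.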
Alternatives and similar repositories for gpt-fast:
Users interested in gpt-fast are comparing it to the libraries listed below.
- PyTorch native post-training library ☆5,073 · Updated this week
- A PyTorch native library for large model training ☆3,562 · Updated this week
- Tools for merging pretrained large language models. ☆5,544 · Updated this week
- ☆4,076 · Updated 10 months ago
- Accessible large language models via k-bit quantization for PyTorch (see the 4-bit loading sketch after this list). ☆6,901 · Updated this week
- A framework for few-shot evaluation of language models. ☆8,595 · Updated this week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆2,912 · Updated 2 weeks ago
- [ICLR 2024] Efficient Streaming Language Models with Attention Sinks ☆6,848 · Updated 9 months ago
- Go ahead and axolotl questions ☆9,075 · Updated this week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. ☆3,108 · Updated this week
- An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm. ☆4,802 · Updated 3 weeks ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡ ☆2,168 · Updated 6 months ago
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,813 · Updated last year
- Modeling, training, eval, and inference code for OLMo ☆5,476 · Updated this week
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,489 · Updated 9 months ago
- 20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale. ☆11,948 · Updated this week
- Run Mixtral-8x7B models in Colab or consumer desktops ☆2,303 · Updated last year
- TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. ☆10,173 · Updated this week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ☆2,077 · Updated this week
- High-speed Large Language Model Serving for Local Deployment ☆8,168 · Updated last month
- Train transformer language models with reinforcement learning. ☆13,166 · Updated this week
- Fast and memory-efficient exact attention (see the attention sketch after this list) ☆16,835 · Updated this week
- Tile primitives for speedy kernels ☆2,251 · Updated this week
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ☆1,539 · Updated 5 months ago
- The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens. ☆8,380 · Updated 11 months ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆3,923 · Updated 3 weeks ago
- The official PyTorch implementation of Google's Gemma models ☆5,414 · Updated 3 weeks ago
- Large Language Model Text Generation Inference ☆9,992 · Updated this week
- Training LLMs with QLoRA + FSDP ☆1,466 · Updated 5 months ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference. ☆2,346 · Updated this week
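Several entries above (bitsandbytes, AutoGPTQ, llm-awq, AutoAWQ) center on weight quantization. A minimal sketch of the k-bit loading path that bitsandbytes enables through Hugging Face transformers; the model id is a placeholder assumption, and any causal LM on the Hub should behave the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes; the model id below is a
# placeholder assumption, substitute any causal LM from the Hub.
model_id = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("gpt-fast is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```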
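The flash-attention entry is also the kernel behind PyTorch's fused scaled_dot_product_attention. A small sketch that pins SDPA to the FlashAttention backend (assumes PyTorch ≥ 2.3, a CUDA GPU, and fp16 inputs; it illustrates the technique, not the flash-attention repo's own API):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes are [batch, heads, seq, head_dim]; fp16 on CUDA is required
# for the FlashAttention backend to be eligible.
B, H, T, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, T, D, device="cuda", dtype=torch.float16) for _ in range(3))

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # [B, H, T, D]
print(out.shape)
```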