lyogavin / airllm
AirLLM 70B inference with single 4GB GPU
☆5,913 · Updated last week
Alternatives and similar repositories for airllm
Users interested in airllm are comparing it to the libraries listed below.
- Tools for merging pretrained large language models. ☆6,275 · Updated 3 weeks ago
- High-speed Large Language Model Serving for Local Deployment ☆8,329 · Updated last month
- [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which ach… ☆5,407 · Updated 6 months ago
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,309 · Updated 3 weeks ago
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. ☆4,942 · Updated 5 months ago
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆3,417 · Updated 3 months ago
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. ☆7,039 · Updated this week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ☆2,247 · Updated 4 months ago
- Run Mixtral-8x7B models in Colab or consumer desktops ☆2,319 · Updated last year
- [ICLR 2024] Efficient Streaming Language Models with Attention Sinks ☆7,038 · Updated last year
- QLoRA: Efficient Finetuning of Quantized LLMs ☆10,656 · Updated last year
- PyTorch native post-training library ☆5,484 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆7,567 · Updated this week
- SGLang is a fast serving framework for large language models and vision language models. ☆17,823 · Updated this week
- The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens. ☆8,741 · Updated last year
- Tensor library for machine learning ☆13,134 · Updated this week
- Chat language model that can use tools and interpret the results ☆1,581 · Updated last month
- Go ahead and axolotl questions ☆10,405 · Updated this week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,587 · Updated this week
- A blazing fast inference solution for text embeddings models ☆3,986 · Updated last week
- Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization ☆1,354 · Updated 9 months ago
- g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains ☆4,218 · Updated this week
- A Next-Generation Training Engine Built for Ultra-Large MoE Models ☆4,823 · Updated this week
- LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve spee… ☆3,069 · Updated 3 months ago
- Python bindings for llama.cpp ☆9,566 · Updated last month
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,878 · Updated last week
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. ☆2,898 · Updated last year
- Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali ☆2,434 · Updated last week
- LLMs built upon Evol Instruct: WizardLM, WizardCoder, WizardMath ☆9,458 · Updated 3 months ago
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,852 · Updated last year
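Several entries above center on low-bit weight quantization (the GPTQ-, AWQ-, k-bit-, and QLoRA-related projects). As a rough, hypothetical illustration of the shared underlying idea — not any of those libraries' actual algorithms, which additionally minimize layer output error — per-group symmetric 4-bit round-to-nearest quantization can be sketched as:

```python
import numpy as np

def quantize_4bit(weights, group_size=8):
    """Per-group symmetric (absmax) 4-bit round-to-nearest quantization.
    A toy sketch of the idea behind low-bit weight compression; the real
    libraries listed above use more sophisticated, error-aware methods."""
    w = weights.reshape(-1, group_size)
    # One scale per group, mapping the group's absmax to the int4 range [-7, 7].
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover an approximate float tensor from int4 codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
print(float(np.abs(w - w_hat).max()))  # small per-element reconstruction error
```

Stored as int4 codes plus one scale per group, the weights shrink to roughly a quarter of their fp16 size, at the cost of the small rounding error printed above.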
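One entry above is a calculator for token throughput and GPU memory requirements. The weight-memory part of such an estimate is simple arithmetic; a back-of-the-envelope sketch (the 20% overhead factor for KV-cache and activations is an assumption of this sketch, not the linked tool's formula) might look like:

```python
def estimate_vram_gb(params_billion, bits_per_param, overhead=1.2):
    """Back-of-the-envelope VRAM estimate for inference: weight bytes
    times an assumed overhead factor for KV-cache and activations
    (the 1.2 is illustrative, not a measured constant)."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 2**30

# e.g. a 70B model at fp16 vs. 4-bit quantized:
print(round(estimate_vram_gb(70, 16), 1), round(estimate_vram_gb(70, 4), 1))
```

Even at 4 bits per weight, a 70B model is far larger than 4GB of VRAM, which is why airllm's approach of loading and running one layer at a time is needed to fit such models on a single 4GB GPU.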