intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

☆2,177

Alternatives and similar repositories for intel-extension-for-transformers

Users that are interested in intel-extension-for-transformers are comparing it to the libraries listed below.

intel / neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, …
☆2,684 · Updated this week
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆352 · Aug 30, 2024 · Updated last year
intel / intel-extension-for-pytorch
A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
☆2,014 · Mar 30, 2026 · Updated 3 months ago
Tiiny-AI / PowerInfer
High-speed Large Language Model Serving for Local Deployment
☆9,670 · May 11, 2026 · Updated 2 months ago
intel / xFasterTransformer
☆435 · Sep 18, 2025 · Updated 10 months ago
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆10,878 · Mar 21, 2026 · Updated 4 months ago
turboderp-org / exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
☆4,586 · Mar 4, 2026 · Updated 4 months ago
mit-han-lab / streaming-llm
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
☆7,248 · Jul 11, 2024 · Updated 2 years ago
NVIDIA / TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆14,170 · Updated this week
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,592 · Jul 17, 2025 · Updated last year
AutoGPTQ / AutoGPTQ
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,073 · Apr 11, 2025 · Updated last year
meta-pytorch / gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
☆6,228 · Aug 22, 2025 · Updated 10 months ago
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆8,337 · Updated this week
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,639 · May 26, 2026 · Updated last month
mlc-ai / mlc-llm
Universal LLM Deployment Engine with ML Compilation
☆22,974 · Jul 13, 2026 · Updated last week
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
☆2,350 · May 11, 2025 · Updated last year
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆6,442 · Mar 27, 2024 · Updated 2 years ago
ggml-org / ggml
Tensor library for machine learning
☆15,034 · Updated this week
deepspeedai / DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,107 · Jun 30, 2025 · Updated last year
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,757 · Jun 25, 2024 · Updated 2 years ago
jzhang38 / TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
☆9,015 · May 3, 2024 · Updated 2 years ago
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆7,250 · Jun 17, 2026 · Updated last month
huggingface / optimum-intel
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
☆608 · Updated this week
intel / auto-round
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support…
☆1,532 · Updated this week
Lightning-AI / litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
☆13,492 · Updated this week
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,921 · Jan 21, 2024 · Updated 2 years ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,166 · May 8, 2024 · Updated 2 years ago
neuralmagic / deepsparse
Sparsity-aware deep learning inference runtime for CPUs
☆3,160 · Jun 2, 2025 · Updated last year
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆24,502 · Updated this week
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,340 · Mar 6, 2025 · Updated last year
OpenNMT / CTranslate2
Fast inference engine for Transformer models
☆4,579 · Jul 3, 2026 · Updated 2 weeks ago
axolotl-ai-cloud / axolotl
Go ahead and axolotl questions
☆12,222 · Updated this week
flexflow / flexflow-train
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,896 · Jul 1, 2026 · Updated 2 weeks ago
artidoro / qlora
QLoRA: Efficient Finetuning of Quantized LLMs
☆10,964 · Jun 10, 2024 · Updated 2 years ago
huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization…
☆3,448 · Updated this week
marella / ctransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library.
☆1,884 · Jan 28, 2024 · Updated 2 years ago
IST-DASLab / gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,336 · Mar 27, 2024 · Updated 2 years ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,339 · Updated this week
tomaarsen / attention_sinks
Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
☆735 · Apr 10, 2024 · Updated 2 years ago