huggingface / tgi-gaudi
Large Language Model Text Generation Inference on Habana Gaudi
☆26 · Updated this week
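For context, TGI serves models over an HTTP generation API regardless of the backend hardware, and tgi-gaudi keeps that interface. Below is a minimal sketch of querying a running tgi-gaudi server with Python's `requests`; the `localhost:8080` endpoint and the sampling parameters are assumptions for a locally launched container with a model already loaded.

```python
# Minimal sketch: query a running TGI (tgi-gaudi) server over its HTTP API.
# Assumes a server is already listening on localhost:8080 (adjust as needed).
import requests

TGI_URL = "http://localhost:8080/generate"  # assumed local endpoint

payload = {
    "inputs": "What is text generation inference?",
    "parameters": {
        "max_new_tokens": 64,   # cap on generated tokens
        "temperature": 0.7,     # sampling temperature (illustrative value)
        "do_sample": True,
    },
}

resp = requests.post(TGI_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```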
Related projects
Alternatives and complementary repositories for tgi-gaudi
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ☆152 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆41 · Updated this week
- Pretrain, fine-tune and serve LLMs on Intel platforms with Ray ☆101 · Updated last week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆251 · Updated this week
- GenAI components at the microservice level; a GenAI service composer to create mega-services ☆68 · Updated this week
- ☆189 · Updated this week
- Reference models for Intel(R) Gaudi(R) AI Accelerator ☆155 · Updated this week
- Easy and Efficient Quantization for Transformers ☆178 · Updated 3 months ago
- ☆44 · Updated last month
- AMD-related optimizations for transformer models ☆57 · Updated this week
- This repo contains documentation for the OPEA project ☆26 · Updated this week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… ☆57 · Updated 2 months ago
- Materials for learning SGLang ☆75 · Updated this week
- Benchmark suite for LLMs from Fireworks.ai ☆58 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆143 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆43 · Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆57 · Updated last month
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang ☆118 · Updated this week
- ☆156 · Updated last month
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆245 · Updated this week
- ☆109 · Updated 7 months ago
- A low-latency & high-throughput serving engine for LLMs ☆231 · Updated last month
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆174 · Updated 3 months ago
- Dynamic batching library for deep learning inference. Tutorials for LLM, GPT scenarios ☆85 · Updated 2 months ago
- Evaluation, benchmark, and scorecard targeting performance on throughput and latency, accuracy on popular evaluation harnesses, safety… ☆22 · Updated this week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆302 · Updated 2 months ago
- ☆33 · Updated 3 months ago
- Ultra-Fast and Cheaper Long-Context LLM Inference ☆194 · Updated this week
- Applied AI experiments and examples for PyTorch ☆159 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆611 · Updated 2 months ago