friendliai / friendli-model-optimizer
FMO (Friendli Model Optimizer)
☆13 · Updated 9 months ago
Alternatives and similar repositories for friendli-model-optimizer
Users interested in friendli-model-optimizer are comparing it to the libraries listed below.
- ☆48 · Updated last year
- [⛔️ DEPRECATED] Friendli: the fastest serving engine for generative AI ☆48 · Updated 4 months ago
- Welcome to PeriFlow CLI ☁︎ ☆12 · Updated 2 years ago
- FriendliAI Model Hub ☆91 · Updated 3 years ago
- A performance library for machine learning applications. ☆184 · Updated 2 years ago
- ☆103 · Updated 2 years ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆148 · Updated 2 weeks ago
- ☆73 · Updated 5 months ago
- ☆54 · Updated 11 months ago
- ☆25 · Updated 2 years ago
- ☆27 · Updated last year
- Easy and Efficient Quantization for Transformers ☆202 · Updated 4 months ago
- ☆24 · Updated 6 years ago
- Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines. ☆49 · Updated 3 months ago
- ☆15 · Updated 4 years ago
- OwLite is a low-code AI model compression toolkit. ☆50 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆84 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆85 · Updated last year
- Lightweight and Parallel Deep Learning Framework ☆264 · Updated 2 years ago
- PyTorch CoreSIG ☆57 · Updated 10 months ago
- ☆91 · Updated last year
- torchcomms: a modern PyTorch communications API ☆219 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆432 · Updated 5 months ago
- Large Language Model Text Generation Inference on Habana Gaudi ☆34 · Updated 7 months ago
- ☆19 · Updated 11 months ago
- MIST: High-performance IoT Stream Processing ☆17 · Updated 6 years ago
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details. ☆18 · Updated 2 weeks ago
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model. ☆344 · Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆389 · Updated last year