Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
☆553 · Mar 24, 2026 · Updated this week
Alternatives and similar repositories for AngelSlim
Users who are interested in AngelSlim are comparing it to the libraries listed below.
- Multiple GEMM operators are constructed with cutlass to support LLM inference. ☆20 · Aug 3, 2025 · Updated 7 months ago
- ☆11 · Dec 11, 2024 · Updated last year
- ☆13 · Oct 14, 2025 · Updated 5 months ago
- An efficient multi-modal instruction-following data synthesis tool and the official implementation of Oasis https://arxiv.org/abs/2503.08… ☆40 · Jun 4, 2025 · Updated 9 months ago
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations. ☆242 · Sep 30, 2024 · Updated last year
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models. ☆692 · Mar 11, 2026 · Updated 2 weeks ago
- ☆28 · May 24, 2025 · Updated 10 months ago
- PyTorch implementation of "Deep Transferring Quantization" (ECCV2020). ☆18 · Jun 22, 2022 · Updated 3 years ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,246 · Feb 20, 2026 · Updated last month
- ☆14 · Jun 22, 2025 · Updated 9 months ago
- Tencent Hunyuan 7B (short as Hunyuan-7B) is one of the large language dense models of Tencent Hunyuan. ☆72 · Aug 11, 2025 · Updated 7 months ago
- Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge. ☆115 · Jan 30, 2026 · Updated last month
- C++ implementation of "Mobile Vision Transformer-based Visual Object Tracking" (BMVC2023) and "Separable Self and Mixed Attention Transf… ☆12 · Apr 23, 2024 · Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆150 · May 10, 2025 · Updated 10 months ago
- Reading notes on Speculative Decoding papers. ☆27 · Feb 24, 2026 · Updated last month
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆1,074 · Updated this week
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- [ICLR 2025] "GraphRouter: A Graph-based Router for LLM Selections", Tao Feng, Yanzhen Shen, Jiaxuan You. ☆63 · Dec 30, 2025 · Updated 3 months ago
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation. ☆154 · Mar 21, 2025 · Updated last year
- Keyformer proposes KV Cache reduction through key tokens identification and without the need for fine-tuning. ☆57 · Mar 26, 2024 · Updated 2 years ago
- ☆67 · Oct 25, 2025 · Updated 5 months ago
- This repository contains integer operators on GPUs for PyTorch. ☆237 · Sep 29, 2023 · Updated 2 years ago
- ☆171 · Mar 9, 2023 · Updated 3 years ago
- collab-dev - Collaboration Metrics for Code Reviews. ☆23 · May 12, 2025 · Updated 10 months ago
- ☆105 · Sep 9, 2024 · Updated last year
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training. ☆262 · Aug 9, 2025 · Updated 7 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆46 · Jun 11, 2025 · Updated 9 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS). ☆55 · Mar 14, 2025 · Updated last year
- The official PyTorch implementation for Improving Long-Text Alignment for Text-to-Image Diffusion Models (LongAlign). ☆80 · Apr 23, 2025 · Updated 11 months ago
- ☆17 · Apr 11, 2025 · Updated 11 months ago
- The code for Joint Neural Architecture Search and Quantization. ☆14 · Apr 10, 2019 · Updated 6 years ago
- The official PyTorch implementation of the NeurIPS2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer L… ☆49 · Oct 5, 2022 · Updated 3 years ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. ☆1,324 · Mar 6, 2025 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆821 · Mar 6, 2025 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆155 · Aug 21, 2025 · Updated 7 months ago
- Ring-V2 is a reasoning MoE LLM provided and open-sourced by InclusionAI. ☆97 · Oct 23, 2025 · Updated 5 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators. ☆19 · Updated this week
- ☆28 · Aug 13, 2025 · Updated 7 months ago
- ☆26 · Mar 4, 2026 · Updated 3 weeks ago