A minimal implementation of vLLM.
☆68 · Jul 27, 2024 · Updated last year
Alternatives and similar repositories for vllmini
Users interested in vllmini are comparing it to the libraries listed below.
- A Triton-only attention backend for vLLM ☆24 · Feb 11, 2026 · Updated 3 weeks ago
- ☆15 · Nov 10, 2023 · Updated 2 years ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆315 · Jun 10, 2025 · Updated 8 months ago
- Sparsity support for PyTorch ☆38 · Mar 22, 2025 · Updated 11 months ago
- Manages the vllm-nccl dependency ☆17 · Jun 3, 2024 · Updated last year
- ☆87 · Jan 23, 2025 · Updated last year
- LLM inference via Triton (flexible & modular): focused on kernel optimization using CUBIN binaries, starting from the gpt-oss model ☆75 · Oct 18, 2025 · Updated 4 months ago
- Prototype MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism ☆27 · Apr 4, 2025 · Updated 11 months ago
- Accelerating Deep Learning Training Through Transparent Storage Tiering (CCGrid '22) ☆19 · Dec 13, 2022 · Updated 3 years ago
- Repository for remote direct memory introspection. ☆23 · Jun 21, 2023 · Updated 2 years ago
- Lightning In-Memory Object Store ☆46 · Jan 22, 2022 · Updated 4 years ago
- ☆31 · Feb 12, 2026 · Updated 3 weeks ago
- Johnny Cache: the End of DRAM Cache Conflicts (in Tiered Main Memory Systems) ☆20 · Aug 2, 2023 · Updated 2 years ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Jan 26, 2026 · Updated last month
- Supplemental materials for the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning ☆25 · May 12, 2025 · Updated 9 months ago
- GPTQ inference Triton kernel ☆321 · May 18, 2023 · Updated 2 years ago
- Compression for Foundation Models ☆35 · Jul 21, 2025 · Updated 7 months ago
- Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24] ☆24 · Nov 21, 2024 · Updated last year
- [EMNLP 2024 Main] Virtual Personas for Language Models via an Anthology of Backstories ☆36 · Feb 10, 2026 · Updated 3 weeks ago
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆369 · Apr 22, 2025 · Updated 10 months ago
- ☆56 · Jan 25, 2021 · Updated 5 years ago
- An MLIR-based toy DL compiler for TVM Relay. ☆61 · Oct 16, 2022 · Updated 3 years ago
- "Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices", official implementation ☆30 · Feb 4, 2025 · Updated last year
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆30 · Jun 14, 2024 · Updated last year
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. ☆73 · Sep 8, 2024 · Updated last year
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆277 · Aug 31, 2024 · Updated last year
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆283 · Mar 6, 2025 · Updated last year
- Official implementation of APB (ACL 2025 Main, Oral) and Spava. ☆34 · Jan 30, 2026 · Updated last month
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers. ☆32 · Sep 19, 2025 · Updated 5 months ago
- ☆23 · Dec 30, 2025 · Updated 2 months ago
- [Archived] For the latest updates and community contributions, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆13 · Jan 16, 2026 · Updated last month
- Fun project: an LLM-powered RAG Discord bot that runs seamlessly on CPU ☆33 · Nov 12, 2023 · Updated 2 years ago
- Source code for iCache (HPCA '23) ☆50 · Apr 22, 2023 · Updated 2 years ago
- "Efficient Federated Learning for Modern NLP", to appear at MobiCom 2023. ☆34 · Aug 18, 2023 · Updated 2 years ago
- Implementation of the FedPM framework by the authors of the ICLR 2023 paper "Sparse Random Networks for Communication-Efficient Federated… ☆30 · Feb 10, 2023 · Updated 3 years ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆464 · May 30, 2025 · Updated 9 months ago
- Prefix-Aware Attention for LLM Decoding ☆29 · Jan 23, 2026 · Updated last month
- [NeurIPS 2023] Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective ☆41 · Oct 17, 2023 · Updated 2 years ago
- ML Input Data Processing as a Service. This repository contains the source code for Cachew (built on top of TensorFlow). ☆40 · Sep 10, 2024 · Updated last year