snowflakedb / ArcticInference
☆45 · Updated this week
Alternatives and similar repositories for ArcticInference:
Users interested in ArcticInference are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆115 · Updated 5 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 6 months ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆71 · Updated this week
- Code for data-aware compression of DeepSeek models ☆21 · Updated 3 weeks ago
- LLM Serving Performance Evaluation Harness ☆77 · Updated 2 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆36 · Updated last year
- Vocabulary Parallelism ☆19 · Updated last month
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆103 · Updated last month
- Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs". ☆14 · Updated 7 months ago
- Code for the paper [ICLR2025 Oral] "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" ☆100 · Updated 2 weeks ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆130 · Updated 8 months ago
- ☆50 · Updated 6 months ago
- ☆44 · Updated 2 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆36 · Updated 11 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆91 · Updated this week
- ☆18 · Updated last week
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 10 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆161 · Updated 9 months ago
- ☆13 · Updated last month
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs. ☆83 · Updated last month
- ☆126 · Updated 2 months ago
- ☆45 · Updated 10 months ago
- Example of applying CUDA graphs to LLaMA-v2 ☆12 · Updated last year
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆19 · Updated 10 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- Cascade Speculative Drafting ☆29 · Updated last year
- ☆15 · Updated last month
- Simple extension on vLLM to help you speed up reasoning models without training. ☆148 · Updated this week