Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 · Feb 22, 2026 · Updated last week
Alternatives and similar repositories for IceFormer
Users interested in IceFormer are comparing it to the libraries listed below.
- ☆13 · Jan 7, 2025 · Updated last year
- Longitudinal Evaluation of LLMs via Data Compression ☆33 · May 29, 2024 · Updated last year
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- ☆16 · Mar 13, 2023 · Updated 2 years ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆17 · Jun 3, 2024 · Updated last year
- Whisper in TensorRT-LLM ☆17 · Sep 21, 2023 · Updated 2 years ago
- Official repository of "Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions", ICLR 2024 Sp… ☆21 · Mar 7, 2024 · Updated 2 years ago
- ☆71 · Mar 26, 2025 · Updated 11 months ago
- Running inference on the ZeroSCROLLS benchmark ☆20 · Apr 18, 2024 · Updated last year
- [NAACL 2025] MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning ☆19 · May 31, 2025 · Updated 9 months ago
- Loop Nest - Linear algebra compiler and code generator. ☆20 · Oct 22, 2022 · Updated 3 years ago
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts (EMNLP 2023)" ☆44 · Updated this week
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆54 · Mar 27, 2024 · Updated last year
- SGEMM optimization with CUDA, step by step ☆21 · Mar 23, 2024 · Updated last year
- The source code and dataset mentioned in the paper Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmar… ☆53 · Nov 5, 2024 · Updated last year
- ☆302 · Jul 10, 2025 · Updated 7 months ago
- ☆21 · Mar 22, 2021 · Updated 4 years ago
- Repo for the "Smart Word Suggestions" (SWS) task and benchmark ☆20 · Dec 4, 2023 · Updated 2 years ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆23 · Mar 15, 2024 · Updated last year
- ☆12 · Mar 21, 2024 · Updated last year
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆65 · Sep 28, 2024 · Updated last year
- This repository contains the code for the paper SirLLM: Streaming Infinite Retentive LLM ☆60 · May 28, 2024 · Updated last year
- Dataset Reset Policy Optimization ☆31 · Apr 12, 2024 · Updated last year
- Contrastive Chain-of-Thought Prompting ☆68 · Nov 18, 2023 · Updated 2 years ago
- ☆34 · Feb 3, 2025 · Updated last year
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆122 · Jul 4, 2025 · Updated 8 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆405 · Aug 13, 2024 · Updated last year
- FP8 flash attention for the Ada architecture, implemented with the cutlass library ☆79 · Aug 12, 2024 · Updated last year
- LLMs as Collaboratively Edited Knowledge Bases ☆46 · Feb 8, 2026 · Updated 3 weeks ago
- ☆35 · Jun 15, 2023 · Updated 2 years ago
- ☆85 · Apr 18, 2025 · Updated 10 months ago
- Continual Resilient (CoRe) Optimizer for PyTorch ☆11 · Jun 10, 2024 · Updated last year
- A hierarchically decoupled deep learning inference engine ☆79 · Feb 17, 2025 · Updated last year
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (EMNLP'24) ☆145 · Sep 20, 2024 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆358 · Nov 20, 2025 · Updated 3 months ago
- Code and Model for NeurIPS 2024 Spotlight Paper "Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training… ☆44 · Oct 16, 2024 · Updated last year
- FocusLLM: Scaling LLM’s Context by Parallel Decoding ☆44 · Dec 8, 2024 · Updated last year
- Multi-Candidate Speculative Decoding ☆39 · Apr 22, 2024 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Jan 15, 2024 · Updated 2 years ago