yuzhenmao / IceFormer

Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).

☆25

Alternatives and similar repositories for IceFormer

Users who are interested in IceFormer are comparing it to the libraries listed below.

li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆42 Updated last week
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40 Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70 Updated last year
facebookresearch / Ternary_Binary_Transformer
ACL 2023
☆39 Updated 2 years ago
BBuf / flash-rwkv
☆31 Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 7 months ago
linxihui / dkernel
☆20 Updated 2 months ago
LiqunMa / FBI-LLM
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation
☆49 Updated last year
dame-cell / Triformer
Transformers components but in Triton
☆34 Updated 2 months ago
ziplab / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆27 Updated last year
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆89 Updated 7 months ago
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48 Updated 2 years ago
microsoft / AttentionEngine
☆74 Updated last month
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26 Updated 2 years ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large model
☆57 Updated last year
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆85 Updated last year
megvii-research / IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
☆18 Updated last year
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆33 Updated last year
mobiusml / low-rank-llama2
Low-Rank Llama Custom Training
☆23 Updated last year
TianjinYellow / StableSPAM
☆22 Updated 3 months ago
Dao-AILab / grouped-latent-attention
☆119 Updated last month
Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 Updated last year
JarvisPei / CMoE
Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
☆22 Updated 4 months ago
tile-ai / AttentionEngine
☆49 Updated last month
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆79 Updated last week
INT-FlashAttention2024 / INT-FlashAttention
☆77 Updated 5 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆38 Updated last month
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆31 Updated last year
ModelTC / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆39 Updated last year