yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆22 · Updated 9 months ago
Alternatives and similar repositories for IceFormer:
Users interested in IceFormer are comparing it to the repositories listed below.
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- ☆30 · Updated 10 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆26 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- Benchmark tests supporting the TiledCUDA library. ☆15 · Updated 4 months ago
- Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference ☆16 · Updated 3 weeks ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- ☆51 · Updated 2 weeks ago
- ☆22 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆46 · Updated last year
- ☆19 · Updated this week
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 2 weeks ago
- ☆23 · Updated 8 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆63 · Updated last week
- Transformers components, but in Triton ☆32 · Updated last week
- GPTQ inference TVM kernel ☆39 · Updated 11 months ago
- Official code for Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM ☆14 · Updated last year
- Low-Rank Llama Custom Training ☆22 · Updated last year
- ☆28 · Updated last year
- ☆25 · Updated last year
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated this week
- GPU operators for sparse tensor operations ☆31 · Updated last year
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆62 · Updated last week
- ACL 2023 ☆39 · Updated last year
- ☆65 · Updated 2 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆80 · Updated 4 months ago
- Using FlexAttention to compute attention with different masking patterns ☆42 · Updated 6 months ago