yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆22 · Updated 8 months ago
Alternatives and similar repositories for IceFormer:
Users interested in IceFormer are comparing it to the libraries listed below.
- ☆30 · Updated 9 months ago
- Benchmark tests supporting the TiledCUDA library. ☆15 · Updated 3 months ago
- Quantized Attention on GPU ☆45 · Updated 3 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆26Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 8 months ago
- Transformers components but in Triton ☆32 · Updated 3 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆30 · Updated 9 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- ACL 2023 ☆39 · Updated last year
- ☆46 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆47 · Updated 8 months ago
- Decoding Attention is specifically optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆34 · Updated this week
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- ☆22 · Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…☆35Updated last year
- Framework to reduce autotuning overhead to zero for well-known deployments. ☆62 · Updated 2 weeks ago
- Linear Attention Sequence Parallelism (LASP) ☆79 · Updated 9 months ago
- Official code for Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM ☆14 · Updated last year
- GPTQ inference TVM kernel ☆39 · Updated 10 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆46 · Updated last year
- ☆35 · Updated 4 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆29 · Updated last week
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated 3 weeks ago
- LLM Inference with Microscaling Format ☆19 · Updated 4 months ago
- TensorRT LLM Benchmark Configuration ☆13 · Updated 7 months ago