NVlabs / COAT
☆52 · Updated last week
Alternatives and similar repositories for COAT:
Users interested in COAT are comparing it to the libraries listed below.
- Patch convolution to avoid large GPU memory usage of Conv2D ☆81 · Updated 7 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆75 · Updated this week
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- An auxiliary project analyzing the characteristics of the KV cache in DiT attention. ☆23 · Updated last month
- Source code for "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆32 · Updated 5 months ago
- [ICML 2024] SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models ☆17 · Updated 7 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆93 · Updated 3 months ago
- An algorithm for static activation quantization of LLMs ☆107 · Updated this week
- Quantized Attention on GPU ☆34 · Updated last month
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆35 · Updated 10 months ago
- ☆107 · Updated 3 months ago
- PyTorch implementation of our ICML 2024 paper "CaM: Cache Merging for Memory-efficient LLMs Inference" ☆29 · Updated 6 months ago
- The code repository of "MBQ: Modality-Balanced Quantization for Large Vision-Language Models" ☆26 · Updated 2 weeks ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆27 · Updated 3 months ago
- ☆30 · Updated 5 months ago
- ACL 2023 ☆38 · Updated last year
- ☆55 · Updated 3 months ago
- PyTorch code for Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers ☆36 · Updated 4 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆58 · Updated 2 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆44 · Updated last year
- ☆52 · Updated last month
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆76 · Updated last month
- Triton implementation of FlashAttention2 that adds support for custom masks. ☆88 · Updated 5 months ago
- ☆31 · Updated 7 months ago
- 16-fold memory access reduction with nearly no loss ☆63 · Updated 2 months ago
- ☆25 · Updated 2 months ago
- ☆27 · Updated 9 months ago