graphcore-research / unit-scaling-demo
Unit Scaling demo and experimentation code
☆16 · Updated 7 months ago
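For context: unit scaling (introduced by Graphcore Research in the paper "Unit Scaling: Out-of-the-Box Low-Precision Training") inserts fixed, analytically derived scale factors into each op so that activations, weights, and gradients all have roughly unit variance at initialization, keeping values inside the narrow dynamic range of FP16/FP8. The sketch below is a minimal illustration of that idea, not this repo's actual API; `scaled_matmul` is a hypothetical helper, and the full recipe also applies separate scale factors on the backward pass via a custom autograd function.

```python
import torch

def scaled_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Forward-scaled matmul (illustrative): if x and w each have unit
    variance, the raw product has variance fan_in, so dividing by
    sqrt(fan_in) restores ~unit variance at the output."""
    fan_in = x.shape[-1]
    return (x @ w) / fan_in**0.5

x = torch.randn(1024, 256)  # unit-variance activations
w = torch.randn(256, 512)   # unit-variance weights (no fan-in init scaling)
y = scaled_matmul(x, w)
print(y.std())  # ~1.0, i.e. safely representable in low-precision formats
```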
Related projects
Alternatives and complementary repositories for unit-scaling-demo
- Repository for CPU Kernel Generation for LLM Inference ☆24 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 · Updated 9 months ago
- Triton implementation of Flash Attention 2.0 ☆22 · Updated last year
- Flexible mixed-precision and number-format simulator for LLMs and vision transformers ☆43 · Updated last year
- FlexAttention with FlashAttention-3 support ☆26 · Updated last month
- Odysseus: Playground of LLM Sequence Parallelism ☆53 · Updated 4 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆57 · Updated this week
- QuIP quantization ☆46 · Updated 7 months ago
- An algorithm for static activation quantization of LLMs ☆67 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 3 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆49 · Updated 2 weeks ago
- GPTQ inference TVM kernel ☆35 · Updated 6 months ago
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆54 · Updated this week
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆20 · Updated 7 months ago
- PyTorch bindings for CUTLASS grouped GEMM ☆51 · Updated last week
- Fast Hadamard transform in CUDA, with a PyTorch interface (a plain-NumPy reference sketch of the transform follows this list) ☆107 · Updated 5 months ago
- ACL 2023 ☆38 · Updated last year
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆85 · Updated 3 weeks ago
- Extensible collectives library in Triton ☆61 · Updated last month
- CUDA and Triton implementations of Flash Attention with SoftmaxN ☆66 · Updated 5 months ago
- GPU operators for sparse tensor operations ☆29 · Updated 7 months ago
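The fast Hadamard transform entry above is a building block for several rotation-based quantization schemes in this list (Hadamard rotations spread activation outliers across channels, making tensors easier to quantize). As a plain-NumPy reference for what that CUDA kernel computes, here is an orthonormal fast Walsh-Hadamard transform; `fwht` is a hypothetical name, not the repo's PyTorch interface:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform via butterflies, O(n log n).
    Input length must be a power of two."""
    x = x.astype(np.float64).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling: fwht is its own inverse

v = np.random.randn(8)
u = fwht(v)
print(np.allclose(fwht(u), v))                           # True: involution
print(np.isclose(np.linalg.norm(u), np.linalg.norm(v)))  # True: norm-preserving
```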