apd10 / RzLinear
A compressed alternative to matrix multiplication using state-of-the-art ROBE-Z compression
☆9 · Updated last year
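The idea behind a hashed-compression linear layer can be sketched as follows. This is a minimal illustrative sketch, not the actual RzLinear API: it assumes a ROBE-style scheme in which every weight entry is looked up in a small shared parameter array through a deterministic hash (here simulated with a seeded random index map), so the full weight matrix is never stored as independent parameters. All names below (`robe_style_linear`, `params`) are hypothetical.

```python
import numpy as np

def robe_style_linear(x, params, out_features, seed=0):
    """Sketch of a hashed compressed linear layer (illustrative, not RzLinear).

    Each (i, j) entry of the virtual out_features x in_features weight
    matrix is fetched from the small shared array `params` via a
    deterministic index map, with a random sign to decorrelate collisions.
    """
    in_features = x.shape[-1]
    rng = np.random.default_rng(seed)  # stands in for the hash function
    idx = rng.integers(0, params.size, size=(out_features, in_features))
    sign = rng.choice([-1.0, 1.0], size=(out_features, in_features))
    W = sign * params[idx]  # materialized view of the compressed weights
    return x @ W.T

# A 64-parameter array stands in for a 32x16 = 512-entry weight matrix,
# an 8x compression of the layer's memory footprint.
params = np.random.default_rng(1).normal(size=64)
y = robe_style_linear(np.ones((2, 16)), params, out_features=32)
```

In a real implementation the index map is computed on the fly by a cheap hash inside the kernel rather than materialized, which is where the memory savings come from; gradients flow into the shared `params` array, so colliding entries are trained jointly.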
Alternatives and similar repositories for RzLinear:
Users interested in RzLinear are comparing it to the repositories listed below.
- ☆15 · Updated 3 years ago
- ☆32 · Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 · Updated last year
- ☆104 · Updated 8 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 9 months ago
- Research and development for optimizing transformers ☆126 · Updated 4 years ago
- Unit Scaling demo and experimentation code ☆16 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆73 · Updated 8 months ago
- Code for the paper: https://arxiv.org/pdf/2309.06979.pdf ☆19 · Updated 9 months ago
- ☆21 · Updated last year
- ☆158 · Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated last week
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆48 · Updated last year
- Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops ☆28 · Updated last year
- ☆143 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆44 · Updated last year
- sigma-MoE layer ☆18 · Updated last year
- ☆22 · Updated last year
- Inference framework for MoE layers based on TensorRT with Python bindings ☆41 · Updated 3 years ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆122 · Updated this week
- Fast sparse deep learning on CPUs ☆53 · Updated 2 years ago
- FlexAttention with FlashAttention-3 support ☆26 · Updated 7 months ago
- QuIP quantization ☆52 · Updated last year
- Make Triton easier ☆47 · Updated 10 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 11 months ago
- Extensible collectives library in Triton ☆86 · Updated last month
- Repository for CPU kernel generation for LLM inference ☆26 · Updated last year
- GPU operators for sparse tensor operations ☆32 · Updated last year
- ☆20 · Updated 11 months ago
- Here we will test various linear attention designs. ☆60 · Updated last year