satabios / sconce
Model Compression/Inference Made Easy
☆38 Updated 3 weeks ago
Related projects:
- Notes on quantization in neural networks ☆54 Updated 9 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆73 Updated 3 weeks ago
- Collection of autoregressive model implementations ☆62 Updated 2 weeks ago
- A single repo with all scripts and utils to train / fine-tune the Mamba model with or without FIM ☆46 Updated 5 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆28 Updated 4 months ago
- Prune transformer layers ☆60 Updated 3 months ago
- ML/DL Math and Method notes ☆56 Updated 9 months ago
- ☆24 Updated last year
- A highly efficient compression algorithm for the N1 implant (Neuralink's compression challenge) ☆45 Updated 3 months ago
- ☆38 Updated 8 months ago
- ☆124 Updated 7 months ago
- Simple and fast low-bit matmul kernels in CUDA ☆48 Updated this week
- Repository for CPU Kernel Generation for LLM Inference ☆25 Updated last year
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs. ☆68 Updated 2 months ago
- The official repository for HyperZ⋅Z⋅W Operator Connects Slow-Fast Networks for Full Context Interaction. ☆29 Updated this week
- ☆30 Updated 2 months ago
- ☆59 Updated last week
- PB-LLM: Partially Binarized Large Language Models ☆143 Updated 10 months ago
- Evaluation code repository for the paper "ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers". (2023… ☆11 Updated 9 months ago
- Attention in SRAM on Tenstorrent Grayskull ☆22 Updated 2 months ago
- ☆27 Updated 2 months ago
- ☆40 Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆94 Updated 2 weeks ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆36 Updated 8 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆155 Updated 2 months ago
- A minimal, clean implementation of the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C. ☆21 Updated 2 months ago
- GPU benchmark ☆35 Updated 2 weeks ago
- ☆25 Updated this week
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆109 Updated 8 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆66 Updated 3 months ago