thevasudevgupta / gpt-triton
Triton implementation of GPT/LLAMA
☆15 · Updated 2 months ago
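For context on what a Triton implementation of a GPT/LLaMA model involves, below is a minimal sketch of a row-wise softmax kernel written in the Triton language. It is an illustrative example only, not code from gpt-triton or any repository listed below; the names `softmax_kernel` and `softmax` are hypothetical, and the sketch assumes a contiguous 2-D CUDA tensor.

```python
# Minimal illustrative Triton kernel (row-wise softmax) -- not taken from gpt-triton.
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program instance normalizes one row of a contiguous (n_rows, n_cols) matrix.
    row = tl.program_id(0)
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    x = tl.load(in_ptr + row * n_cols + offsets, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offsets, out, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical wrapper: launch one program per row of a 2-D CUDA tensor."""
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # tl.arange needs a power-of-2 length
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out


# Usage (requires a CUDA GPU): softmax(torch.randn(4, 1024, device="cuda"))
```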
Related projects
Alternatives and complementary repositories for gpt-triton
- Cataloging released Triton kernels. ☆133 · Updated 2 months ago
- ring-attention experiments ☆96 · Updated 3 weeks ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆104 · Updated last month
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆103 · Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- Understand and test language model architectures on synthetic tasks. ☆161 · Updated 6 months ago
- Prune transformer layers ☆64 · Updated 5 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆140 · Updated this week
- extensible collectives library in triton ☆65 · Updated last month
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆70 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts. ☆184 · Updated last month
- Applied AI experiments and examples for PyTorch ☆160 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- code for training & evaluating Contextual Document Embedding models ☆93 · Updated this week
- Learn CUDA with PyTorch ☆14 · Updated this week
- Collection of kernels written in Triton language ☆63 · Updated 2 weeks ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆84 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆66 · Updated 5 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆206 · Updated 2 weeks ago
- Minimal (400 LOC) implementation, Maximum (multi-node, FSDP) GPT training ☆112 · Updated 6 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆86 · Updated 3 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 weeks ago
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆56 · Updated this week
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at DeepMind ☆111 · Updated 2 months ago