stanford-cs336 / assignment2-systems
Student version of Assignment 2 for Stanford CS336 - Language Modeling From Scratch
☆111 · Updated 3 months ago
Alternatives and similar repositories for assignment2-systems
Users interested in assignment2-systems are comparing it to the libraries listed below.
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference. ☆302 · Updated 2 weeks ago
- Making the official Triton tutorials actually comprehensible. ☆61 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences. ☆390 · Updated 4 months ago
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation. ☆240 · Updated 11 months ago
- The evaluation framework for training-free sparse attention in LLMs. ☆102 · Updated last month
- An extension of the nanoGPT repository for training small MoE models. ☆210 · Updated 8 months ago
- Ring-attention experiments. ☆155 · Updated last year
- Cataloging released Triton kernels. ☆265 · Updated 2 months ago
- 🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆98 · Updated last week
- LLM KV cache compression made easy. ☆680 · Updated last week
- KernelBench: Can LLMs Write GPU Kernels? Benchmark with Torch -> CUDA (+ more DSLs). ☆655 · Updated last week
- An efficient implementation of the NSA (Native Sparse Attention) kernel. ☆124 · Updated 4 months ago
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models. ☆223 · Updated last week
- [ICLR 2025] Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule. ☆367 · Updated 2 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection. ☆145 · Updated 8 months ago
- Asynchronous pipeline-parallel optimization. ☆18 · Updated 5 months ago
- Survey: A collection of AWESOME papers and resources on the latest research in Mixture of Experts. ☆138 · Updated last year
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR). ☆80 · Updated 7 months ago
- Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆147 · Updated last year
- JAX backend for SGL. ☆163 · Updated this week
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression". ☆150 · Updated 4 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆247 · Updated 5 months ago
- A minimal cache manager for PagedAttention, on top of llama3. ☆125 · Updated last year