thunlp / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆27 · Updated 3 weeks ago
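For context, a rough sketch of the baseline 1F1B (one-forward-one-backward) pipeline schedule that Seq1F1B refines at sequence level. This is an illustrative assumption of how a per-stage 1F1B op order is typically derived, not the repository's actual code; Seq1F1B additionally splits each microbatch along the sequence dimension, which this sketch omits.

```python
# Illustrative 1F1B schedule generator (hypothetical, not Seq1F1B's code).
# Each pipeline stage runs a few warmup forwards, then alternates one
# forward with one backward (steady state), then drains the backwards.

def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Return the op order for one stage as ('F', i) / ('B', i) tuples."""
    # Earlier stages need more in-flight forwards before the first
    # backward can arrive from the last stage.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = []
    fwd = bwd = 0
    for _ in range(warmup):               # warmup: forwards only
        ops.append(("F", fwd)); fwd += 1
    while fwd < num_microbatches:         # steady state: 1F then 1B
        ops.append(("F", fwd)); fwd += 1
        ops.append(("B", bwd)); bwd += 1
    while bwd < num_microbatches:         # cooldown: drain backwards
        ops.append(("B", bwd)); bwd += 1
    return ops

# The last stage interleaves immediately; the first stage warms up first.
print(one_f_one_b_schedule(3, 4, 4))
print(one_f_one_b_schedule(0, 4, 4))
```

The point of the 1F1B ordering is that each stage holds activations for at most `num_stages` microbatches at once, instead of all of them as in GPipe-style scheduling.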
Alternatives and similar repositories for Seq1F1B
Users interested in Seq1F1B are comparing it to the libraries listed below.
- ☆96 · Updated 9 months ago
- Implement Flash Attention using CuTe. ☆87 · Updated 6 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆55 · Updated 10 months ago
- A lightweight design for computation-communication overlap. ☆141 · Updated last week
- A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache. ☆47 · Updated last week
- ☆60 · Updated last month
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. ☆48 · Updated 2 months ago
- FP8 flash attention implemented with the CUTLASS repository on the Ada architecture. ☆70 · Updated 10 months ago
- ☆54 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆99 · Updated 3 weeks ago
- ☆74 · Updated 4 years ago
- nnScaler: Compiling DNN models for Parallel Training. ☆113 · Updated this week
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆127 · Updated 5 months ago
- DeeperGEMM: crazy optimized version. ☆69 · Updated last month
- ☆147 · Updated 11 months ago
- ☆77 · Updated last month
- ☆86 · Updated 2 months ago
- ☆77 · Updated 2 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models. ☆57 · Updated last year
- ☆141 · Updated 3 months ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ☆79 · Updated last month
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆100 · Updated last year
- ☆75 · Updated 5 months ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS '24). ☆40 · Updated 6 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆38 · Updated 3 months ago
- High-performance grouped GEMM in PyTorch. ☆30 · Updated 3 years ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. ☆66 · Updated 3 months ago
- ☆103 · Updated 7 months ago
- 16-fold memory access reduction with nearly no loss. ☆99 · Updated 2 months ago