dlsyscourse / hw1
☆13 · Updated 2 months ago
Alternatives and similar repositories for hw1
Users interested in hw1 are comparing it to the libraries listed below.
- A simple calculation for LLM MFU (see the rough MFU sketch after this list). ☆50 · Updated 3 months ago
- ☆51 · Updated 3 months ago
- LLM theoretical performance analysis tools, supporting parameter, FLOPs, memory, and latency analysis. ☆113 · Updated 5 months ago
- Flash Attention from Scratch on CUDA Ampere ☆84 · Updated 3 months ago
- Machine Learning Compiler Road Map ☆45 · Updated 2 years ago
- Systems for GenAI ☆148 · Updated 7 months ago
- ☆70 · Updated 2 weeks ago
- A minimal implementation of vllm. ☆62 · Updated last year
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆133 · Updated 2 years ago
- ☆214 · Updated last year
- High-performance Transformer implementation in C++. ☆143 · Updated 10 months ago
- LLM training technologies developed by kwai ☆66 · Updated 2 weeks ago
- Code base and slides for ECE408: Applied Parallel Programming On GPU. ☆141 · Updated 4 years ago
- A curated list of awesome projects and papers for distributed training or inference ☆258 · Updated last year
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆146 · Updated 2 months ago
- ☆44 · Updated last year
- From Minimal GEMM to Everything ☆84 · Updated last month
- A PyTorch-like deep learning framework. Just for fun. ☆157 · Updated 2 years ago
- Codes & examples for "CUDA - From Correctness to Performance" ☆117 · Updated last year
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆227 · Updated 2 years ago
- Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio… ☆72 · Updated 3 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Updated last year
- GPTQ inference TVM kernel ☆40 · Updated last year
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆43 · Updated 3 years ago
- ☆97 · Updated 8 months ago
- A practical way of learning Swizzle ☆36 · Updated 10 months ago
- ☆21 · Updated last year
- gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling ☆51 · Updated last week
- An annotated nano_vllm repository, with MiniCPM4 adaptation completed and new-model registration functionality added. ☆114 · Updated 4 months ago
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆75 · Updated 4 years ago
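
Several entries above revolve around the same back-of-the-envelope arithmetic. As a rough illustration of what an MFU calculator like the first item computes, here is a minimal sketch assuming the common estimate of ~6 × N_params training FLOPs per token; the function name and every number are illustrative assumptions, not code or figures from that repository.

```python
# Minimal MFU (Model FLOPs Utilization) sketch, assuming the common
# training estimate of ~6 * N_params FLOPs per token (forward + backward,
# attention term ignored). Hypothetical, not taken from the linked repo.

def training_mfu(n_params: float, tokens_per_step: float,
                 step_time_s: float, peak_flops_per_s: float) -> float:
    """Fraction of hardware peak FLOPs actually sustained by a training step."""
    model_flops = 6.0 * n_params * tokens_per_step      # ~6N FLOPs per token
    achieved_flops_per_s = model_flops / step_time_s    # sustained throughput
    return achieved_flops_per_s / peak_flops_per_s


if __name__ == "__main__":
    # Illustrative numbers only: a 7B-parameter model, 4M tokens per step,
    # 20 s per step, on hardware with an aggregate peak of 2e16 FLOPs/s
    # (roughly 20 accelerators at ~1 PFLOP/s each).
    print(f"MFU ~= {training_mfu(7e9, 4e6, 20.0, 2e16):.2%}")
```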