fw-ai / llama-cuda-graph-example
Example of applying CUDA graphs to LLaMA-v2
☆12 · Updated last year
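
For context, "applying CUDA graphs" here means capturing a model's fixed-shape decode step once and then replaying it, so the whole step launches with a single call instead of one launch per kernel. Below is a minimal, hypothetical sketch of that pattern using PyTorch's documented `torch.cuda.graph` API; the tiny MLP is a stand-in for the LLaMA-v2 forward pass, not this repository's actual code.

```python
import torch

# Hypothetical stand-in for a fixed-shape decode step: CUDA graph capture
# requires static input shapes and static tensor addresses.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256)
).cuda().eval()

static_input = torch.randn(1, 256, device="cuda")

# Warm up on a side stream so one-time allocations (cuBLAS workspaces, etc.)
# happen outside the capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture a single forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: refill the captured input buffer in place, then relaunch all
# captured kernels with one call; static_output is updated in place.
static_input.copy_(torch.randn(1, 256, device="cuda"))
graph.replay()
torch.cuda.synchronize()
print(static_output.norm().item())
```

Replay skips Python dispatch and per-kernel launch overhead, which is why CUDA graphs help most in small-batch autoregressive decoding, where launch latency dominates compute.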
Alternatives and similar repositories for llama-cuda-graph-example
Users interested in llama-cuda-graph-example are comparing it to the repositories listed below.
- ☆69 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆124 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆73 · Updated 8 months ago
- Make Triton easier ☆47 · Updated 11 months ago
- ☆26 · Updated last year
- Extensible collectives library in Triton ☆86 · Updated last month
- Code for data-aware compression of DeepSeek models ☆24 · Updated last month
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆36 · Updated last year
- Transformers components, but in Triton ☆33 · Updated last week
- Framework to reduce autotune overhead to zero for well-known deployments ☆70 · Updated this week
- Repository for sparse finetuning of LLMs via a modified version of MosaicML's llmfoundry ☆41 · Updated last year
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 5 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 11 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated 2 weeks ago
- ☆104 · Updated 8 months ago
- ☆79 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆109 · Updated last week
- ☆70 · Updated last week
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- ☆58 · Updated 3 weeks ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline ☆109 · Updated 10 months ago
- Ring-attention experiments ☆142 · Updated 6 months ago
- ☆59 · Updated this week
- Elixir: Train a Large Language Model on a Small GPU Cluster ☆14 · Updated last year
- Repository for CPU kernel generation for LLM inference ☆26 · Updated last year
- A minimal implementation of vLLM ☆40 · Updated 9 months ago
- ☆49 · Updated last year
- Load compute kernels from the Hub ☆119 · Updated last week
- ☆43 · Updated last year