abacusai / gh200-llm
Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer fine-tuning
☆26 · Updated this week
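For reference, here is a minimal sketch of the kind of vLLM serving workload this image targets, using vLLM's standard offline-inference API. The model id is an illustrative placeholder, not something shipped with gh200-llm.

```python
# Minimal vLLM offline-inference sketch. Assumes a CUDA-capable GPU
# (e.g. the Hopper GPU in a GH200) and vLLM installed in the container.
from vllm import LLM, SamplingParams

# Illustrative placeholder model; any Hugging Face model id vLLM supports works.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(
    ["The NVIDIA GH200 pairs a Grace CPU with a Hopper GPU, so"],
    sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)
```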
Alternatives and similar repositories for gh200-llm:
Users interested in gh200-llm are comparing it to the repositories listed below.
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆99 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆66 · Updated this week
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- Example of applying CUDA graphs to LLaMA-v2 ☆10 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆113 · Updated last month
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆80 · Updated last week
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆56 · Updated this week
- train with kittens! ☆53 · Updated 3 months ago
- Comprehensive analysis of the performance differences between QLoRA, LoRA, and full fine-tuning ☆82 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆112 · Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆17 · Updated this week
- Benchmark suite for LLMs from Fireworks.ai ☆64 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 months ago
- The code for the paper "ROUTERBENCH: A Benchmark for Multi-LLM Routing System" ☆101 · Updated 7 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆88 · Updated this week
- RWKV-7: Surpassing GPT ☆73 · Updated 2 months ago
- Make Triton easier ☆44 · Updated 7 months ago
- Experiments on speculative sampling with Llama models ☆123 · Updated last year
- ☆192 · Updated last week
- ☆49 · Updated 10 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆221 · Updated last week
- Experiments with inference on Llama ☆104 · Updated 7 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆263 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 4 months ago
- Fast low-bit matmul kernels in Triton ☆199 · Updated last week
- ☆74 · Updated last year
- ☆41 · Updated last year
- Experiment of using Tangent to autodiff Triton ☆74 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆105 · Updated last month
- ☆66 · Updated 6 months ago