groq / mlagility
Machine Learning Agility (MLAgility) benchmark and benchmarking tools
☆38 · Updated 2 months ago
Alternatives and similar repositories for mlagility:
Users interested in mlagility also compare it to the libraries listed below.
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆38 · Updated 9 months ago
- Intel Gaudi's Megatron-DeepSpeed for training large language models ☆13 · Updated 2 months ago
- Example of applying CUDA graphs to LLaMA-v2 ☆11 · Updated last year
- Code and materials for speeding up LLM inference using token merging ☆35 · Updated 9 months ago
- Python package for rocm-smi-lib ☆20 · Updated 4 months ago
- GroqFlow provides an automated tool flow for compiling machine learning and linear algebra workloads into Groq programs and executing tho… ☆106 · Updated 2 months ago
- High-performance SGEMM on CUDA devices ☆74 · Updated 3 weeks ago
- Sparse finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- ☆59 · Updated 2 weeks ago
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and Hugging Face Trainer finetuning ☆34 · Updated this week
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs ☆88 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆75 · Updated this week
- Example ML projects that use the Determined library ☆26 · Updated 5 months ago
- CPU kernel generation for LLM inference ☆25 · Updated last year
- AMD-related optimizations for transformer models ☆67 · Updated 3 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆25 · Updated 3 months ago
- A minimal implementation of vLLM ☆33 · Updated 6 months ago
- Open-source projects from Pallas Lab ☆20 · Updated 3 years ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆102 · Updated 4 months ago
- Attention in SRAM on the Tenstorrent Grayskull ☆31 · Updated 7 months ago
- LLM training in simple, raw C/CUDA ☆91 · Updated 9 months ago
- Train, tune, and run inference with the Bamba model ☆84 · Updated last month
- ☆25 · Updated last year
- ☆34 · Updated this week
- TORCH_LOGS parser for PT2 ☆32 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆18 · Updated this week
- ☆21 · Updated last week
- Explore training for quantized models ☆15 · Updated last month
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆37 · Updated 6 months ago