foundation-model-stack / fms-extrasLinks

☆24

Alternatives and similar repositories for fms-extras

Users that are interested in fms-extras are comparing it to the libraries listed below

Sorting:

foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆269Updated 2 months ago
dust-tt / llama-ssp
Experiments on speculative sampling with Llama models
☆126Updated 2 years ago
apple / ml-recurrent-drafter
☆218Updated 8 months ago
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆96Updated 5 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆214Updated this week
IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277Updated last year
jaymody / speculative-sampling
Simple implementation of Speculative Sampling in NumPy for GPT-2.
☆96Updated 2 years ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆167Updated this week
huggingface / kernels
Load compute kernels from the Hub
☆293Updated last week
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆265Updated last year
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆83Updated this week
huggingface / kernel-builder
👷 Build compute kernels
☆155Updated this week
FasterDecoding / BitDelta
☆201Updated 10 months ago
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆155Updated last year
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆155Updated 6 months ago
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆203Updated 3 months ago
HanGuo97 / lq-lora
☆127Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆382Updated last year
vdesai2014 / inference-optimization-blog-post
☆89Updated last year
abacusai / gh200-llm
Docker image NVIDIA GH200 machines - optimized for vllm serving and hf trainer finetuning
☆50Updated 7 months ago
stanford-futuredata / stk
☆113Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆242Updated last week
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆45Updated last year
anyscale / llm-continuous-batching-benchmarks
☆121Updated last year