foundation-model-stack / fms-extras
☆24 · Updated last year
Alternatives and similar repositories for fms-extras
Users interested in fms-extras are comparing it to the libraries listed below.
- ☆218 · Updated 9 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆183 · Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆130 · Updated 11 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… · ☆270 · Updated 3 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ☆248 · Updated 9 months ago
- PB-LLM: Partially Binarized Large Language Models · ☆156 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX · ☆223 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … · ☆60 · Updated last year
- Official implementation for Training LLMs with MXFP4 · ☆101 · Updated 6 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. · ☆216 · Updated this week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆42 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. · ☆45 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆266 · Updated last year
- Make triton easier · ☆48 · Updated last year
- Scalable and robust tree-based speculative decoding algorithm · ☆361 · Updated 9 months ago
- Experiments on speculative sampling with Llama models · ☆125 · Updated 2 years ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. · ☆73 · Updated last year
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". · ☆277 · Updated 2 years ago
- Triton-based implementation of Sparse Mixture of Experts. · ☆247 · Updated last month
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… · ☆147 · Updated last year
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" · ☆154 · Updated last year
- ring-attention experiments · ☆155 · Updated last year
- ☆112 · Updated last year
- ☆202 · Updated 10 months ago
- QuIP quantization · ☆59 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. · ☆98 · Updated 2 years ago
- ☆71 · Updated 7 months ago
- ☆120 · Updated last year
- Token Omission Via Attention · ☆127 · Updated last year
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer finetuning · ☆50 · Updated 8 months ago