mlc-ai / package
☆14 · Updated last month
Alternatives and similar repositories for package
Users interested in package are comparing it to the libraries listed below.
- AMD-related optimizations for transformer models ☆97 · Updated 3 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆225 · Updated 3 weeks ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆93 · Updated this week
- ☆120 · Updated last year
- ☆172 · Updated last week
- Inference of Mamba and Mamba2 models in pure C ☆196 · Updated 2 weeks ago
- xllamacpp - a Python wrapper of llama.cpp ☆73 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆238 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Updated 2 months ago
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks ☆64 · Updated last year
- Fused Qwen3 MoE layer for faster training, compatible with Transformers, LoRA, bnb 4-bit quant, Unsloth. Also possible to train LoRA over… ☆231 · Updated this week
- Comparison of language model inference engines ☆239 · Updated last year
- An educational Rust project for exporting and running inference on the Qwen3 LLM family ☆40 · Updated 6 months ago
- RWKV models and examples powered by candle ☆24 · Updated 3 weeks ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆74 · Updated last year
- Estimating hardware and cloud costs of LLM and transformer projects ☆20 · Updated 3 weeks ago
- Thin wrapper around GGML to make life easier ☆42 · Updated 3 months ago
- A memory-efficient DLRM training solution using ColossalAI ☆105 · Updated 3 years ago
- An innovative library for efficient LLM inference via low-bit quantization ☆352 · Updated last year
- Python bindings for ggml ☆147 · Updated last year
- AirLLM 70B inference with a single 4GB GPU ☆17 · Updated 7 months ago
- EdgeInfer enables efficient edge intelligence by running small AI models, including embeddings and OnnxModels, on resource-constrained de… ☆50 · Updated last year
- Bamboo-7B large language model ☆93 · Updated last year
- LLM inference in C/C++ ☆104 · Updated 2 weeks ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and easy export to onnx/onnx-runtime ☆184 · Updated 10 months ago
- Minimal C implementation of speculative decoding based on llama2.c ☆25 · Updated last year
- Hackable and optimized Transformers building blocks, supporting composable construction ☆34 · Updated 3 weeks ago
- [ICLR'25] Fast inference of MoE models with CPU-GPU orchestration ☆260 · Updated last year
- General-purpose GPU compute framework built on Vulkan to support 1000s of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). … ☆51 · Updated 11 months ago
- A converter and basic tester for RWKV ONNX ☆43 · Updated 2 years ago