philipturner/metal-flash-attention
FlashAttention (Metal Port)
☆ 453 · Updated 5 months ago
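For context, the operation this repository ports is standard scaled dot-product attention, softmax(QKᵀ/√d)V; FlashAttention computes the same result in tiles so the full seq_len × seq_len score matrix never materializes in memory. The sketch below is a plain NumPy reference of that operation, included only as an illustration; it is not the repo's Metal/Swift API.

```python
# Minimal NumPy sketch of the scaled dot-product attention that
# FlashAttention kernels accelerate. Reference illustration only,
# not metal-flash-attention's actual Metal/Swift interface.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, computed naively.

    FlashAttention produces the same output but processes K/V in
    blocks, carrying running softmax maxima and sums, so memory use
    stays linear in sequence length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (seq_len, d)

# Example: 8 tokens with head dimension 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4), dtype=np.float32) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```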
Alternatives and similar repositories for metal-flash-attention:
Users interested in metal-flash-attention are comparing it to the libraries listed below.
- Python bindings for ggml (☆ 140 · updated 6 months ago)
- Large Language Model (LLM) applications and tools running on Apple Silicon in real time with Apple MLX (☆ 426 · updated last month)
- On-device Diffusion Models for Apple Silicon (☆ 598 · updated 3 months ago)
- Official implementation of Half-Quadratic Quantization (HQQ) (☆ 765 · updated this week)
- Inference of Mamba models in pure C (☆ 186 · updated last year)
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (☆ 272 · updated last year)
- Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA (☆ 166 · updated 4 months ago)
- Fast parallel LLM inference for MLX (☆ 174 · updated 8 months ago)
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" (☆ 362 · updated last year)
- CLIP inference in plain C/C++ with no extra dependencies (☆ 486 · updated 7 months ago)
- Start a server from the MLX library (☆ 182 · updated 7 months ago)
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" (☆ 154 · updated 5 months ago)
- A scalable and robust tree-based speculative decoding algorithm (☆ 337 · updated last month)
- GGUF implementation in C as a library and a CLI tool (☆ 261 · updated 2 months ago)
- Efficient framework-agnostic data loading (☆ 409 · updated 3 weeks ago)
- Run transformers (incl. LLMs) on the Apple Neural Engine (☆ 58 · updated last year)
- C API for MLX (☆ 101 · updated last week)
- Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon (☆ 262 · updated 6 months ago)
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (☆ 763 · updated 6 months ago)
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model (☆ 328 · updated 9 months ago)
- FastMLX: a high-performance, production-ready API for hosting MLX models (☆ 272 · updated last week)
- An innovative library for efficient LLM inference via low-bit quantization (☆ 351 · updated 6 months ago)
- Port of Andrej Karpathy's nanoGPT to the Apple MLX framework (☆ 105 · updated last year)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆ 262 · updated 5 months ago)
- A 1.58-bit LLM on Apple Silicon using MLX (☆ 192 · updated 10 months ago)
- A simple Streamlit-based web UI/frontend for MLX mlx-lm (☆ 245 · updated last month)