philipturner / metal-flash-attention
FlashAttention (Metal Port)
☆476 · Updated 6 months ago
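The repository ports FlashAttention's tiled attention kernels to Metal. As a baseline for what those kernels compute — not this repository's API — here is a naive, O(n²)-memory scaled dot-product attention reference in NumPy (the function name is illustrative):

```python
import numpy as np

def attention_reference(Q, K, V):
    """Naive scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

# Tiny example: 4 queries/keys, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention_reference(Q, K, V)
print(out.shape)  # (4, 8)
```

FlashAttention produces the same result but never materializes the full n×n score matrix, streaming K/V tiles through fast on-chip memory (threadgroup memory, in Metal's case) instead.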
Alternatives and similar repositories for metal-flash-attention:
Users interested in metal-flash-attention are comparing it to the libraries listed below.
- ☆530 · Updated 5 months ago
- Benchmark of Apple MLX operations on all Apple Silicon chips (GPU, CPU) + MPS and CUDA. ☆171 · Updated last week
- On-device Image Generation for Apple Silicon ☆612 · Updated this week
- ☆207 · Updated 2 months ago
- Efficient framework-agnostic data loading ☆418 · Updated 2 weeks ago
- Run transformers (incl. LLMs) on the Apple Neural Engine. ☆61 · Updated last year
- Fast parallel LLM inference for MLX ☆179 · Updated 9 months ago
- CLI to demonstrate running a large language model (LLM) on Apple Neural Engine. ☆96 · Updated 3 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆784 · Updated last week
- Large Language Model (LLM) applications and tools running on Apple Silicon in real time with Apple MLX. ☆435 · Updated 2 months ago
- Python bindings for ggml ☆140 · Updated 7 months ago
- FastMLX is a high-performance, production-ready API for hosting MLX models. ☆288 · Updated 3 weeks ago
- SiLLM simplifies training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework. ☆262 · Updated last week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆362 · Updated last year
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ☆1,155 · Updated this week
- 1.58-bit LLM on Apple Silicon using MLX ☆195 · Updated 11 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆273 · Updated last year
- LLM-based code completion engine ☆182 · Updated 2 months ago
- An implementation of bucketMul LLM inference ☆216 · Updated 9 months ago
- Phi-3.5 for Mac: locally run Vision and Language Models for Apple Silicon ☆265 · Updated 7 months ago
- Apple MLX engine for LM Studio ☆506 · Updated this week
- Inference of Mamba models in pure C ☆187 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆529 · Updated last month
- Run LLMs with MLX ☆394 · Updated this week
- Start a server from the MLX library. ☆182 · Updated 8 months ago
- GPTQ inference Triton kernel ☆299 · Updated last year
- LLM training in simple, raw C/Metal Shading Language ☆50 · Updated 11 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 6 months ago
- A simple example of using MLX for a RAG application running locally on your Apple Silicon device. ☆168 · Updated last year
- ☆55 · Updated 2 years ago