philipturner / metal-flash-attention
FlashAttention (Metal Port)
☆389 · Updated 2 months ago
Related projects
Alternatives and complementary repositories for metal-flash-attention
- Inference of Mamba models in pure C ☆178 · Updated 8 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆262 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ☆704 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆350 · Updated 8 months ago
- 1.58 Bit LLM on Apple Silicon using MLX ☆148 · Updated 6 months ago
- Efficient framework-agnostic data loading ☆380 · Updated 2 months ago
- An implementation of bucketMul LLM inference ☆214 · Updated 4 months ago
- C API for MLX ☆79 · Updated this week
- Python bindings for ggml ☆132 · Updated 2 months ago
- Run transformers (incl. LLMs) on the Apple Neural Engine. ☆53 · Updated last year
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last month
- A scalable and robust tree-based speculative decoding algorithm ☆318 · Updated 3 months ago
- Large Language Model (LLM) applications and tools running in real time on Apple Silicon with Apple MLX. ☆351 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆483 · Updated 3 weeks ago
- GGUF implementation in C as a library and a CLI tool ☆244 · Updated 4 months ago
- Inference of Vision Transformer (ViT) models in plain C/C++ with ggml ☆233 · Updated 7 months ago
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024) ☆173 · Updated 7 months ago
- On-device Inference of Diffusion Models for Apple Silicon ☆510 · Updated 3 weeks ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆631 · Updated 7 months ago
- CLIP inference in plain C/C++ with no extra dependencies ☆460 · Updated 3 months ago
- GPTQ inference Triton kernel ☆284 · Updated last year
- LLM-based code completion engine ☆175 · Updated last year
- Fast parallel LLM inference for MLX ☆149 · Updated 4 months ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆649 · Updated 3 months ago
- Apple GPU microarchitecture ☆474 · Updated 2 months ago
- Experimental BitNet Implementation ☆61 · Updated 8 months ago
- SiLLM simplifies training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework. ☆228 · Updated this week