AuleTechnologies / Aule-Attention
High-performance FlashAttention-2 for AMD, Intel, and Apple GPUs. Drop-in replacement for PyTorch SDPA. Triton backend for ROCm (MI300X, RDNA3), Vulkan backend for consumer GPUs. No CUDA required.
☆134 · Updated 2 weeks ago
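A minimal sketch of what "drop-in replacement for PyTorch SDPA" typically means in practice. The `aule_attention` import and its `scaled_dot_product_attention` entry point are assumptions for illustration, not the library's confirmed API; only the stock PyTorch calls are guaranteed.

```python
import torch
import torch.nn.functional as F

# Baseline: stock PyTorch scaled dot-product attention.
q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out_ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Hypothetical drop-in usage: a library like this would expose an
# SDPA-compatible function (name assumed here) so existing model code
# only changes its import, not its call sites.
try:
    from aule_attention import scaled_dot_product_attention as aule_sdpa  # assumed entry point
    out = aule_sdpa(q, k, v, is_causal=True)
    # Loose tolerance: fused attention kernels differ in floating-point rounding.
    torch.testing.assert_close(out, out_ref, rtol=2e-2, atol=2e-2)
except ImportError:
    out = out_ref  # fall back to stock SDPA when the package isn't installed
```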
Alternatives and similar repositories for Aule-Attention
Users interested in Aule-Attention are comparing it to the libraries listed below.
- ☆62 · Updated 6 months ago
- Efficient non-uniform quantization with GPTQ for GGUF ☆57 · Updated 4 months ago
- Sparse inferencing for transformer-based LLMs ☆217 · Updated 5 months ago
- High-throughput tensor loading for PyTorch ☆219 · Updated last month
- Run multiple resource-heavy large models (LMs) on the same machine with a limited amount of VRAM/other resources by exposing them on differe… ☆87 · Updated this week
- AnyModal is a flexible multimodal language model framework for PyTorch ☆103 · Updated last year
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ☆106 · Updated 7 months ago
- Liquid Audio - speech-to-speech audio models by Liquid AI ☆356 · Updated last week
- InferX: Inference-as-a-Service platform ☆146 · Updated this week
- Super simple Python connectors for llama.cpp, including vision models (Gemma 3, Qwen2-VL). Compile llama.cpp and run! ☆29 · Updated last month
- Thin wrapper around GGML to make life easier ☆42 · Updated 2 months ago
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining ☆47 · Updated 2 months ago
- AirLLM 70B inference with a single 4GB GPU ☆14 · Updated 6 months ago
- ☆15 · Updated last month
- llama.cpp fork with additional SOTA quants and improved performance ☆44 · Updated this week
- Kyutai with an "eye" ☆233 · Updated 9 months ago
- A self-hosted HuggingFace alternative ☆151 · Updated 2 months ago
- Automated LLM coding tournaments. There can be only one (winning code solution from the competing AIs) ☆44 · Updated 9 months ago
- ☆69 · Updated 6 months ago
- Smart proxy for LLM APIs that enables model-specific parameter control, automatic mode switching (like Qwen3's /think and /no_think), and… ☆50 · Updated 7 months ago
- ☆101 · Updated last year
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX ☆99 · Updated 6 months ago
- The heart of the Pulsar app: fast, secure, and shared inference with a modern UI ☆59 · Updated last year
- Automatically quantize GGUF models ☆220 · Updated 3 weeks ago
- Generate a llama-quantize command to copy the quantization parameters of any GGUF ☆29 · Updated 5 months ago
- Montelimar - extract text from anywhere ☆87 · Updated 3 months ago
- Distributed inference for MLX LLMs ☆100 · Updated last year
- Optimizing causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆59 · Updated 2 months ago
- Lightweight package that tracks and summarizes code changes using large language models (LLMs) ☆34 · Updated 10 months ago
- Simple high-throughput inference library ☆155 · Updated 8 months ago