kyegomez / FlashMHA
A simple PyTorch implementation of Flash MultiHead Attention
☆14Updated 9 months ago
Related projects
Alternatives and complementary repositories for FlashMHA
- Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant …☆14Updated 8 months ago
- Simple Implementation of TinyGPTV in super simple Zeta lego blocks☆15Updated this week
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation☆45Updated 4 months ago
- My personal implementation of the model from "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities", they haven't rel…☆11Updated 9 months ago
- A simple reproducible template to implement AI research papers☆23Updated 2 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆67Updated this week
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs☆71Updated 4 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"☆91Updated last month
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks.☆28Updated 4 months ago
- The open source implementation of the model from "Scaling Vision Transformers to 22 Billion Parameters"☆25Updated last week
- My fork of Allen AI's OLMo for educational purposes.☆28Updated 6 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"☆36Updated 11 months ago
- Here we collect trick questions and failed tasks for open source LLMs to improve them.☆32Updated last year
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…☆43Updated 3 months ago
- A repository for research on medium sized language models.☆74Updated 5 months ago
- Official PyTorch implementation of "LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging" (ICML'24)☆27Updated 2 months ago
- ☆34Updated 8 months ago
- An algorithm for static activation quantization of LLMs☆67Updated this week
- ☆29Updated 5 months ago
- QuIP quantization☆46Updated 7 months ago
- Repository for CPU Kernel Generation for LLM Inference☆24Updated last year
- Structural Pruning for LLaMA☆54Updated last year
- ☆41Updated 11 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).☆20Updated 4 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs.☆72Updated 3 weeks ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts☆34Updated 8 months ago
- Code for NOLA, an implementation of "NOLA: Compressing LoRA using Linear Combination of Random Basis"☆48Updated 2 months ago
- Lottery Ticket Adaptation☆35Updated last month
- Official PyTorch implementation of Self-emerging Token Labeling☆30Updated 7 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore☆18Updated last month