qwopqwop200 / AutoQuant
☆9 · Updated last year
Alternatives and similar repositories for AutoQuant:
Users interested in AutoQuant are comparing it to the libraries listed below.
- TensorRT LLM Benchmark Configuration ☆12 · Updated 5 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 7 months ago
- Various test models in WNNX format; they can be viewed with `pip install wnetron && wnetron` ☆12 · Updated 2 years ago
- ☆45 · Updated last year
- Make Triton easier ☆42 · Updated 7 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Code for the paper "Accessing higher dimensions for unsupervised word translation" ☆21 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- ONNX Command-Line Toolbox ☆35 · Updated 3 months ago
- ☆21 · Updated last week
- PyTorch implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training" ☆23 · Updated this week
- NASRec: Weight Sharing Neural Architecture Search for Recommender Systems ☆29 · Updated last year
- ☆57 · Updated 7 months ago
- 📚 [WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1) ⚡️ GPU SRAM complexity for headdim > 256, 1.8x~3x↑ 🎉 faster vs SDPA EA ☆49 · Updated this week
- FlexAttention w/ FlashAttention3 support ☆27 · Updated 3 months ago
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆17 · Updated 2 weeks ago
- ☆62 · Updated last month
- ☆25 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- ACL 2023 ☆38 · Updated last year
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- Open-source projects from Pallas Lab ☆20 · Updated 3 years ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆21 · Updated 2 weeks ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆44 · Updated last year
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆22 · Updated 10 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- An object detection codebase based on MegEngine ☆28 · Updated 2 years ago