quic/efficient-transformers
This library lets users seamlessly port pretrained models and checkpoints from the Hugging Face (HF) Hub (built with the HF Transformers library) into inference-ready formats that run efficiently on Qualcomm Cloud AI 100 accelerators.
☆57 · Updated this week
Alternatives and similar repositories for efficient-transformers:
Users interested in efficient-transformers are comparing it to the libraries listed below.
- Qualcomm Cloud AI SDK (Platform and Apps) enables high-performance deep learning inference on Qualcomm Cloud AI platforms, delivering high … ☆55 · Updated 2 months ago
- Model compression for ONNX ☆81 · Updated 2 months ago
- ☆57 · Updated 7 months ago
- A block-oriented training approach for inference-time optimization. ☆32 · Updated 5 months ago
- Fast low-bit matmul kernels in Triton ☆187 · Updated last week
- LLaMA INT4 CUDA inference with AWQ ☆49 · Updated 6 months ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆63 · Updated 2 years ago
- ☆21 · Updated last week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆182 · Updated this week
- ☆131 · Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs for measuring their performance. ☆75 · Updated this week
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 7 months ago
- Nsight Systems in Docker ☆19 · Updated last year
- CUDA Matrix Multiplication Optimization ☆152 · Updated 6 months ago
- ☆56 · Updated 3 months ago
- Cataloging released Triton kernels. ☆156 · Updated last week
- ☆22 · Updated last year
- ☆170 · Updated last week
- High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline. ☆93 · Updated 6 months ago
- A safetensors extension for efficiently storing sparse quantized tensors on disk ☆64 · Updated this week
- Open-source projects from Pallas Lab ☆20 · Updated 3 years ago
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆99 · Updated 4 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 · Updated 10 months ago
- ☆157 · Updated last year
- ☆96 · Updated 4 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆21 · Updated 3 weeks ago
- FlexAttention with FlashAttention-3 support ☆27 · Updated 3 months ago
- ☆25 · Updated last year
- Learning about CUDA by writing PTX code. ☆31 · Updated 10 months ago
- ☆54 · Updated last month