facebookresearch / RLCompOpt
Learning Compiler Pass Orders using Coreset and Normalized Value Prediction. (ICML 2023)
☆19 · Updated last year
Alternatives and similar repositories for RLCompOpt:
Users who are interested in RLCompOpt are comparing it to the repositories listed below:
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated last month
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 5 months ago
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆32 · Updated 3 weeks ago
- ☆102 · Updated 7 months ago
- ☆13 · Updated 3 weeks ago
- ☆46 · Updated last year
- Experimental scripts for researching data adaptive learning rate scheduling. ☆23 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 8 months ago
- ☆63 · Updated this week
- NASRec Weight Sharing Neural Architecture Search for Recommender Systems ☆29 · Updated last year
- Triton kernels for Flux ☆20 · Updated 2 months ago
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆44 · Updated last year
- ☆13 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry ☆40 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- Prototype routines for GPU quantization written using PyTorch. ☆20 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆59 · Updated 5 months ago
- MACTA: A Multi-agent Reinforcement Learning Approach for Cache Timing Attacks and Detection ☆46 · Updated last year
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆78 · Updated 4 months ago
- ☆67 · Updated 2 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆63 · Updated last week
- DPO, but faster 🚀 ☆40 · Updated 3 months ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆135 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆77 · Updated 4 months ago
- DeeperGEMM: crazy optimized version ☆61 · Updated 2 weeks ago