dthuerck / culip
Code for the culip ("CUda for Linear and Integer Programming") project, containing GPU primitives for linear algebra, linear optimization and (someday) integer optimization.
☆19 · Updated 7 years ago
Alternatives and similar repositories for culip
Users interested in culip are comparing it to the libraries listed below.
- benchmarking some transformer deployments ☆26 · Updated last month
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆70 · Updated 9 months ago
- Nod.ai 🦈 version of 👻 . You probably want to start at https://github.com/nod-ai/shark for the product and the upstream IREE repository … ☆107 · Updated last month
- SYCL implementation of Fused MLPs for Intel GPUs ☆51 · Updated 2 months ago
- Loop Nest - Linear algebra compiler and code generator. ☆21 · Updated 3 years ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆46 · Updated last year
- Experimental scripts for researching data-adaptive learning rate scheduling. ☆22 · Updated 2 years ago
- Official code for "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient" ☆149 · Updated 2 years ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆153 · Updated last year
- train with kittens! ☆63 · Updated last year
- [WIP] Better (FP8) attention for Hopper ☆32 · Updated 11 months ago
- A block-oriented training approach for inference-time optimization. ☆34 · Updated last year
- ☆32 · Updated last year
- A tracing JIT compiler for PyTorch ☆13 · Updated 4 years ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆30 · Updated 2 weeks ago
- ☆71 · Updated 10 months ago
- The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM ) ☆230 · Updated 2 months ago
- Inference code for LLaMA models ☆41 · Updated 2 years ago
- ☆32 · Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆93 · Updated this week
- Compression for Foundation Models ☆35 · Updated 6 months ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆182 · Updated last month
- ☆23 · Updated 3 months ago
- ☆137 · Updated last week
- ☆52 · Updated 2 years ago
- PyCUDA-based PyTorch Extension Made Easy ☆26 · Updated last year
- ML model training for edge devices ☆168 · Updated 2 years ago
- ☆55 · Updated last year
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling ☆40 · Updated 2 years ago