NVIDIA / free-threaded-python
No-GIL Python environment featuring NVIDIA Deep Learning libraries.
☆60 Updated last month
Alternatives and similar repositories for free-threaded-python
Users who are interested in free-threaded-python are comparing it to the libraries listed below.
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆45 Updated 2 weeks ago
- ☆13 Updated 4 years ago
- ☆21 Updated 3 months ago
- Collection of scripts to build PyTorch and the domain libraries from source. ☆11 Updated this week
- The CUDA target for Numba ☆128 Updated this week
- FlexAttention w/ FlashAttention3 Support ☆26 Updated 8 months ago
- A user-friendly tool chain that enables the seamless execution of ONNX models using JAX as the backend. ☆110 Updated 3 weeks ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings. ☆47 Updated last week
- Hacks for PyTorch ☆19 Updated 2 years ago
- MLPerf™ logging library ☆36 Updated last month
- Experiment in using Tangent to autodiff Triton ☆79 Updated last year
- Loop Nest - Linear algebra compiler and code generator. ☆22 Updated 2 years ago
- ☆52 Updated 9 months ago
- PyTorch-centric eager mode debugger ☆47 Updated 5 months ago
- A tracing JIT for PyTorch ☆17 Updated 2 years ago
- Benchmarking different models with PyTorch 2.0 ☆21 Updated 2 years ago
- Better bindings for Python ☆17 Updated 2 years ago
- Extensible collectives library in Triton ☆87 Updated 2 months ago
- Worked example of the process from Python source to CUDA kernel execution with Numba ☆40 Updated 8 months ago
- Make Triton easier ☆47 Updated 11 months ago
- A collection of reproducible inference engine benchmarks ☆31 Updated last month
- [WIP] Better (FP8) attention for Hopper ☆30 Updated 3 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆153 Updated this week
- ☆16 Updated 8 months ago
- ☆21 Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆88 Updated this week
- TORCH_LOGS parser for PT2 ☆38 Updated last week
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 Updated 2 weeks ago
- A tracing JIT compiler for PyTorch ☆13 Updated 3 years ago
- NPBench - A Benchmarking Suite for High-Performance NumPy ☆81 Updated 2 weeks ago