zhuohan123 / openmp-for-python

An OpenMP implementation for Python2

☆9

Alternatives and similar repositories for openmp-for-python

Users who are interested in openmp-for-python are comparing it to the libraries listed below.

Oneflow-Inc / oneflow-lite
☆18Updated last year
bcaine / nn_cpp
A minimalistic header only C++11 Neural Network library based on Eigen::Tensor
☆20Updated 7 years ago
jiazhihao / attention_superoptimizer
An Attention Superoptimizer
☆21Updated 4 months ago
bytedance / QSync
Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".
☆20Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
zhuzilin / pytorch-malloc
An external memory allocator example for PyTorch.
☆14Updated 3 years ago
ruipeterpan / torch_profiler
Simple PyTorch profiler that combines DeepSpeed Flops Profiler and TorchInfo
☆11Updated 2 years ago
zheng-ningxin / SparTA
☆9Updated last year
Cjkkkk / KgeN
A TVM-like CUDA/C code generator.
☆9Updated 3 years ago
dlsyscourse / lecture13
☆9Updated 8 months ago
chhzh123 / ptc-tutorial
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆18Updated 2 years ago
Oneflow-Inc / serving
OneFlow Serving
☆20Updated last month
Oneflow-Inc / conda-env
☆12Updated 2 years ago
LeiWang1999 / Stream-k.tvm
☆19Updated 8 months ago
cassiewilliam / cuda_op_benchmark
An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only
☆15Updated 11 months ago
MegEngine / cutlass-bak
modified cutlass
☆14Updated 4 years ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large models
☆57Updated last year
heheda12345 / MagPy
☆39Updated last year
BBuf / megatron-lm-parallel-group-playground
☆16Updated last year
TiledTensor / TiledLower
ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.
☆14Updated 6 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆36Updated 2 months ago
stepbuystep / LightNAS
You Only Search Once: On Lightweight Differentiable Architecture Search for Resource-Constrained Embedded Platforms
☆11Updated 2 years ago
quiver-team / quiver-feature
High performance RDMA-based distributed feature collection component for training GNN models on EXTREMELY large graphs
☆54Updated 2 years ago
caseyfleeter / stanford-cme213.github.io
GitHub page for CME213, Spring 2019
☆21Updated 6 years ago
Linestro / GRACE
Artifact of ASPLOS'23 paper entitled: GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference
☆18Updated 2 years ago
NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆19Updated 10 months ago
zxytim / arithmetic-encoding-compression
☆11Updated 2 years ago
SJTU-IPADS / disb
DISB is a new DNN inference serving benchmark with diverse workloads and models, as well as real-world traces.
☆52Updated 9 months ago
billmuch / matmul_perf_test
☆14Updated 3 years ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆16Updated 6 months ago