Eddie-Wang1120 / Eddie-Wang-Hackathon2023
Whisper inference with TensorRT-LLM
☆22 · Updated last year
Alternatives and similar repositories for Eddie-Wang-Hackathon2023
Users interested in Eddie-Wang-Hackathon2023 are comparing it to the libraries listed below.
- ☆71 · Updated 2 years ago
- export llama to onnx ☆127 · Updated 6 months ago
- ☆139 · Updated last year
- Transformer related optimization, including BERT, GPT ☆59 · Updated last year
- ☆75 · Updated 3 years ago
- A quantization algorithm for LLM ☆141 · Updated last year
- ☆128 · Updated 6 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios ☆38 · Updated 4 months ago
- LLaMa/RWKV onnx models, quantization and testcase ☆363 · Updated last year
- Serving Inside Pytorch ☆160 · Updated 2 weeks ago
- symmetric int8 gemm ☆66 · Updated 5 years ago
- ASR client for Triton ASR Service ☆29 · Updated 6 months ago
- simplify >2GB large onnx model ☆59 · Updated 6 months ago
- Transformer related optimization, including BERT, GPT ☆17 · Updated last year
- ☆58 · Updated 7 months ago
- ☆148 · Updated 5 months ago
- ☆195 · Updated last month
- A simplified flash-attention implemented with CUTLASS, for educational purposes ☆42 · Updated 10 months ago
- A Toolkit to Help Optimize Large Onnx Model ☆157 · Updated last year
- llm-export can export llm model to onnx ☆297 · Updated 5 months ago
- ☆26 · Updated last year
- Simple Dynamic Batching Inference ☆145 · Updated 3 years ago
- Transformer related optimization, including BERT, GPT ☆39 · Updated 2 years ago
- A collection of memory efficient attention operators implemented in the Triton language ☆272 · Updated last year
- Inference of quantization aware trained networks using TensorRT ☆82 · Updated 2 years ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆102 · Updated 2 months ago
- ☆124 · Updated last year
- A llama model inference framework implemented in CUDA C++ ☆57 · Updated 7 months ago
- qwen2 and llama3 cpp implementation ☆44 · Updated last year
- ☆37 · Updated 8 months ago
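Several entries above revolve around int8 quantization for LLM inference (symmetric int8 GEMM, SmoothQuant, quantization-aware training with TensorRT). As a point of reference, here is a minimal NumPy sketch of the common pattern behind these projects: symmetric per-tensor int8 quantization with int32-accumulated matrix multiply. This illustrates the general technique only, not any specific repository's implementation.

```python
import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map max |x| to 127, keep zero at zero."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Quantize both operands of a GEMM.
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 4).astype(np.float32)
qa, sa = quantize_symmetric_int8(a)
qb, sb = quantize_symmetric_int8(b)

# Integer GEMM: accumulate in int32 to avoid overflow,
# then rescale the result by the product of the two scales.
acc = qa.astype(np.int32) @ qb.astype(np.int32)
approx = acc.astype(np.float32) * (sa * sb)

ref = a @ b
print("max abs error:", np.max(np.abs(approx - ref)))
```

Real kernels (such as the symmetric int8 GEMM repository listed above) run the int32 accumulation on tensor cores via CUTLASS or cuBLASLt; the quantize/rescale bookkeeping is the same.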