adamydwang / mobilellama

A lightweight C++ LLaMA inference engine for mobile devices

☆13

Alternatives and similar repositories for mobilellama

Users who are interested in mobilellama are comparing it to the libraries listed below.

tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19 Updated this week
ARM-software / kleidiai
This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai
☆51 Updated this week
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆38 Updated 2 weeks ago
leimao / Nsight-Compute-Docker-Image
Nsight Compute In Docker
☆12 Updated last year
TRT2022 / trtllm-llama
☢️ TensorRT 2023 second round: accelerating and optimizing Llama model inference with TensorRT-LLM
☆48 Updated last year
yvonwin / qwen2.cpp
C++ implementation of Qwen2 and Llama 3
☆44 Updated last year
caijixueIT / CUDA_Learning_for_Freshman
☆11 Updated 3 months ago
ShaYeBuHui01 / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆16 Updated last year
BBuf / tensorrt-llm-moe
☆29 Updated 4 months ago
EdVince / llm-cpp
☆32 Updated 11 months ago
StudyingLover / ggml-tutorial
☆31 Updated 9 months ago
hisrg / SNPE
Snapdragon Neural Processing Engine (SNPE) SDK. The Snapdragon Neural Processing Engine (SNPE) is a Qualcomm Snapdragon software accelerate…
☆34 Updated 3 years ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆16 Updated 7 months ago
wangzhaode / onnx-llm
LLM deployment project based on ONNX.
☆42 Updated 8 months ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆20 Updated 4 months ago
MollySophia / rwkv-qualcomm
Inference RWKV v5, v6 and v7 with Qualcomm AI Engine Direct SDK
☆72 Updated last week
li199603 / sgemm_with_cuda
Step-by-step SGEMM optimization with CUDA
☆19 Updated last year
keith2018 / TinyGPT
Tiny C++11 GPT-2 inference implementation from scratch
☆62 Updated last month
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆110 Updated 9 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 7 months ago
SkyworkAI / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆16 Updated last year
lx200916 / ChatBotApp
☆35 Updated 2 months ago
caibucai22 / awesome-cuda
Awesome code, projects, books, etc. related to CUDA
☆17 Updated last week
microsoft / AttentionEngine
☆71 Updated last month
mit-han-lab / tinychat-tutorial
☆69 Updated 7 months ago
catid / bitnet_cpu
Experiments with BitNet inference on CPU
☆54 Updated last year
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆54 Updated this week
eth-easl / deltazip
Compression for Foundation Models
☆32 Updated 3 months ago
OpenPPL / ppl.pmx
☆58 Updated 7 months ago
microsoft / chunk-attention
☆77 Updated 2 months ago