top-yun / SPARK

A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models.

☆19

Alternatives and similar repositories for SPARK

Users who are interested in SPARK are comparing it to the libraries listed below.

eric-ai-lab / ComCLIP
Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"
☆37 Updated last year
NVlabs / STL
Official Pytorch Implementation of Self-emerging Token Labeling
☆35 Updated last year
WeihuangLin / INF-LLaVA
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
☆42 Updated last year
DCDmllm / HyperLLaVA
Pytorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
☆28 Updated last year
shulin16 / MMInA
[ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆47 Updated 8 months ago
locuslab / llava-token-compression
☆44 Updated last year
uvavision / SyViC
[ICCV 2023] Going Beyond Nouns With Vision & Language Models Using Synthetic Data
☆13 Updated 2 years ago
HanSolo9682 / CounterCurate
This is the implementation of CounterCurate, the data curation pipeline of both physical and semantic counterfactual image-caption pairs.
☆18 Updated last year
jialuli-luka / SELMA
Code and Data for Paper: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
☆35 Updated last year
kyegomez / MC-ViT
Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"
☆24 Updated 3 weeks ago
ethanlshen / HierNet
Code for "Are “Hierarchical” Visual Representations Hierarchical?" in NeurIPS Workshop for Symmetry and Geometry in Neural Representation…
☆21 Updated 2 years ago
tianyi-lab / R2-T2
[ICML 2025] Code for "R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts"
☆16 Updated 8 months ago
kaist-ami / BEAF
[ECCV’24] Official repository for "BEAF: Observing Before-AFter Changes to Evaluate Hallucination in Vision-language Models"
☆21 Updated 7 months ago
WangWenhao0716 / PDF-Embedding
[NeurIPS 2024] The official implementation of "Image Copy Detection for Diffusion Models"
☆17 Updated last year
eric-ai-lab / Discffusion
Official repo for the TMLR paper "Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners"
☆30 Updated last year
mbzuai-oryx / VideoMolmo
Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
☆54 Updated 4 months ago
umd-huang-lab / perceptionCLIP
Code for our ICLR 2024 paper "PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts"
☆79 Updated last year
EPFL-VILAB / fm-vision-evals
☆71 Updated 4 months ago
naver-ai / maskris
Official PyTorch implementation of “MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation”
☆18 Updated 11 months ago
Hao840 / ADEM-VL
PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"
☆20 Updated last year
jonkahana / CLIPPR
An official PyTorch implementation for CLIPPR
☆29 Updated 2 years ago
csarron / PuMer
[ACL 2023] PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
☆34 Updated last year
mbzuai-oryx / Agent-X
Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
☆32 Updated this week
Hritikbansal / videocon
☆58 Updated last year
UCSC-VLAA / Sight-Beyond-Text
[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
☆20 Updated 2 years ago
SHI-Labs / VisPer-LM
[NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024
☆64 Updated last month
aszala / VPEval
VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)
☆44 Updated last year
ml-jku / semantic-image-text-alignment
☆25 Updated 2 years ago
OurBluePrint / easy_video
☆20 Updated 8 months ago
mlvlab / RALF
Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection".
☆44 Updated last year