zer0int / CLIP-text-image-interpretability
Get CLIP ViT text tokens about an image, visualize attention as a heatmap.
☆10 · Updated last year
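The one-line description above ("get CLIP ViT text tokens about an image, visualize attention as a heatmap") can be illustrated with a minimal sketch. This is not the repository's own code: it assumes the Hugging Face `transformers` CLIP implementation, `torch`, `matplotlib`, `Pillow`, and a hypothetical local image `example.jpg`, and a fixed prompt stands in for the repository's own procedure for finding text tokens that describe the image.

```python
# Minimal sketch only; not the code from zer0int/CLIP-text-image-interpretability.
# Encodes an image and a text prompt with CLIP ViT-B/32, then plots the last
# vision layer's CLS-token attention over the image patches as a heatmap.
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical input image
texts = ["a photo of a cat"]                 # hypothetical text prompt

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

print("image-text similarity logits:", outputs.logits_per_image)

# Last vision-encoder attention layer: (batch, heads, tokens, tokens).
# Token 0 is the CLS token; the remaining 49 tokens are the 7x7 image patches
# of ViT-B/32 at 224x224 input resolution.
attn = outputs.vision_model_output.attentions[-1]
cls_to_patches = attn[0].mean(dim=0)[0, 1:]  # average heads, take CLS row
heatmap = cls_to_patches.reshape(7, 7).numpy()

plt.imshow(heatmap, cmap="viridis")
plt.colorbar()
plt.title("CLIP ViT-B/32: CLS-token attention over image patches")
plt.savefig("attention_heatmap.png")
```

Note that the 7×7 reshape assumes the ViT-B/32 checkpoint at 224×224 input; a patch-14 model such as `openai/clip-vit-large-patch14` would give a 16×16 patch grid instead.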
Alternatives and similar repositories for CLIP-text-image-interpretability:
Users interested in CLIP-text-image-interpretability are comparing it to the repositories listed below.
- ☆41 · Updated last year
- Official PyTorch implementation of the paper "A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Des…" ☆54 · Updated 6 months ago
- ☆34 · Updated 11 months ago
- OLA-VLM: Elevating Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆45 · Updated last month
- Awesome works based on SSM and Mamba ☆17 · Updated 9 months ago
- Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing ☆22 · Updated last month
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆18 · Updated 2 months ago
- Multimodal Video Understanding Framework (MVU) ☆26 · Updated 8 months ago
- Official code repository for the paper "ExPLoRA: Parameter-Efficient Extended Pre-training to Adapt Vision Transformers under Domain Shifts" ☆28 · Updated 3 months ago
- Official implementation of Attentive Mask CLIP (ICCV 2023, https://arxiv.org/abs/2212.08653) ☆26 · Updated 7 months ago
- Implementation of "the first large-scale multimodal mixture of experts models" from the paper "Multimodal Contrastive Learning with…" ☆25 · Updated 2 months ago
- An interactive demo based on Segment-Anything for stroke-based painting that enables human-like painting. ☆34 · Updated last year
- CLIP GUI - XAI app ~ explainable (and guessable) AI with ViT & ResNet models ☆17 · Updated 4 months ago
- [IJCAI'23] Complete Instances Mining for Weakly Supervised Instance Segmentation ☆37 · Updated 11 months ago
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆14 · Updated 3 months ago
- Simple implementation of TinyGPT-V in super simple Zeta lego blocks ☆15 · Updated 2 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration ☆21 · Updated 3 months ago
- Retrieval-Augmented Personalization ☆12 · Updated last month
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding" … ☆45 · Updated 2 months ago
- [NIPS 2023] Implementation of "Foundation Model is Efficient Multimodal Multitask Model Selector" ☆35 · Updated 10 months ago
- Implementation of the paper "BRAVE: Broadening the visual encoding of vision-language models" ☆22 · Updated this week
- Masked Vision-Language Transformer in Fashion ☆33 · Updated last year
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆40 · Updated 9 months ago
- [ICLR 2024] Official code for the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts" ☆70 · Updated 8 months ago
- [AAAI 2025] ChatterBox: Multi-round Multimodal Referring and Grounding, multimodal multi-round dialogues ☆50 · Updated last month
- Clipora is a powerful toolkit for fine-tuning OpenCLIP models using Low-Rank Adapters (LoRA). ☆19 · Updated 5 months ago
- [ECCV 2024] Soft Prompt Generation for Domain Generalization ☆17 · Updated 3 months ago
- Code release for "SegLLM: Multi-round Reasoning Segmentation" ☆56 · Updated last week