shreydan / VisionGPT2
Combining ViT and GPT-2 for image captioning. Trained on MS-COCO. The model was implemented mostly from scratch.
☆22Updated 11 months ago
Related projects:
- From scratch implementation of a vision language model in pure PyTorch☆149Updated 4 months ago
- a family of highly capable yet efficient large multimodal models☆155Updated 3 weeks ago
- Parameter-efficient finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft.☆48Updated 3 months ago
- (WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, B…☆77Updated last week
- LoRA: Low-Rank Adaptation of Large Language Models implemented using PyTorch☆72Updated last year
- Cerule - A Tiny Mighty Vision Model☆67Updated 2 weeks ago
- Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"☆138Updated last week
- Video+code lecture on building nanoGPT from scratch☆64Updated 3 months ago
- Quick exploration into fine-tuning Florence-2☆250Updated last month
- ☆18Updated last month
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free☆217Updated 6 months ago
- Embed arbitrary modalities (images, audio, documents, etc) into large language models.☆170Updated 5 months ago
- Notes and commented code for RLHF (PPO)☆29Updated 6 months ago
- A pipeline for LLM knowledge distillation☆68Updated last month
- LoRA and DoRA from Scratch Implementations☆179Updated 6 months ago
- A real-time video caption to conversation bot that captures frames, generates captions, and creates conversational responses using a Large …☆118Updated 11 months ago
- ☆92Updated last year
- Implementation of DoRA☆278Updated 3 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks☆123Updated 6 months ago
- Python bindings for ggml☆125Updated 2 weeks ago
- an implementation of Self-Extend, to expand the context window via grouped attention☆117Updated 8 months ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"☆252Updated 3 months ago
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models".☆417Updated 8 months ago
- Small and Efficient Mathematical Reasoning LLMs☆69Updated 7 months ago
- Low-Rank adapter extraction for fine-tuned transformers model☆154Updated 4 months ago
- a simplified version of Meta's Llama 3 model to be used for learning☆26Updated 4 months ago
- The simplest, fastest repository for training/finetuning medium-sized xLSTMs.☆38Updated 3 months ago
- Chat with Phi 3.5/3 Vision LLMs. Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets which includ…☆25Updated last week
- Implementation of the Llama architecture with RLHF + Q-learning☆155Updated 8 months ago
- Famous Vision Language Models and Their Architectures☆295Updated last week
- Reference implementation of Mistral AI 7B v0.1 model.☆26Updated 8 months ago