THUDM / CogAgent
An open-sourced end-to-end VLM-based GUI Agent
★753 · Updated this week
Alternatives and similar repositories for CogAgent:
Users who are interested in CogAgent are comparing it to the repositories listed below.
- An LLM-based Web Navigating Agent (KDD'24) · ★815 · Updated 4 months ago
- Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use. · ★986 · Updated last week
- WebWalker: Benchmarking LLMs in Web Traversal · ★318 · Updated 2 weeks ago
- An LLM-based Agent that predicts its tasks proactively. · ★299 · Updated last month
- LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA · ★465 · Updated last month
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · ★221 · Updated last month
- Build & Optimize your RAG. · ★397 · Updated this week
- Parsing-free RAG supported by VLMs · ★590 · Updated last month
- An open-source framework for collaborative AI agents, enabling diverse, distributed agents to team up and tackle complex tasks through in… · ★658 · Updated 4 months ago
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24) · ★397 · Updated 2 months ago
- PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides · ★594 · Updated this week
- ★393 · Updated this week
- [ICLR 2025] The First Multimodal Search Engine Pipeline and Benchmark for LMMs · ★414 · Updated 3 weeks ago
- Agent S: an open agentic framework that uses computers like a human · ★808 · Updated 3 weeks ago
- "MiniRAG: Making RAG Simpler with Small and Free Language Models" · ★714 · Updated last week
- ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction · ★2,083 · Updated last week
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent · ★223 · Updated 2 weeks ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. · ★660 · Updated this week
- OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking · ★390 · Updated last week
- 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents. · ★490 · Updated 3 weeks ago
- The model, data and code for the visual GUI Agent SeeClick · ★312 · Updated 2 months ago
- Repo for NAACL 2025 Paper "Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization" · ★234 · Updated 3 weeks ago
- Open-sourced, Fast and Context-aware Action Grounding from GUI Instructions for GUI/Computer-use Agents · ★320 · Updated last week
- Search-o1: Agentic Search-Enhanced Large Reasoning Models · ★628 · Updated last week
- Profile-Based Long-Term Memory for AI Applications · ★551 · Updated this week
- ★308 · Updated 2 months ago
- An LLM-based Agent for the New Automation Paradigm - Agentic Process Automation · ★820 · Updated last year
- Code and implementations for the paper "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments" by Zhiheng Xi e… · ★392 · Updated 2 months ago
- Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding" · ★321 · Updated 3 weeks ago