cxcscmu/Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

☆660

Alternatives and similar repositories for Craw4LLM

Users who are interested in Craw4LLM are comparing it to the repositories listed below.

Babelscape / LLM-Oasis
This repository contains the resource introduced in the paper: "Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis"…
☆25 · Oct 15, 2025 · Updated 9 months ago
tianyaxiang / neurapress
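NeuraPress is a modern Markdown editor focused on providing a high-quality typesetting experience for WeChat Official Accounts. Responsive design with mobile device support. Paired with DeepSeek and the WeChat Official Account Assistant, you can publish well-formatted articles from your phone even in spare moments.
☆1,816 · Apr 21, 2026 · Updated 3 months ago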
thiswillbeyourgithub / wdoc
Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, …
☆519 · Jul 3, 2026 · Updated 3 weeks ago
ZongqianLi / ReasonGraph
[ACL 2025 Demo] Repository for the demo and paper: ReasonGraph: Visualisation of Reasoning Paths
☆513 · Mar 9, 2026 · Updated 4 months ago
refly-ai / refly
The first open-source agent skills builder. Define skills by vibe workflow, run on Claude Code, Cursor, Codex & more. Build Clawdbot 🦞· …
☆7,461 · Mar 25, 2026 · Updated 3 months ago
chrischoy / WhisperChain
Speech to Text but with all the bells and whistles and most importantly AI! AI will clean up your filler words, edit and will refine what…
☆333 · Feb 9, 2025 · Updated last year
GitHamza0206 / simba
Open-source, production-ready customer service with built-in evals and monitoring
☆1,451 · Jun 18, 2026 · Updated last month
X-PLUG / MM_StoryAgent
☆306 · Aug 23, 2024 · Updated last year
liyown / ai-trend-publish
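TrendPublish: fully automated AI content generation and publishing system | WeChat Official Account automation | multi-source data scraping (Twitter/X, websites) | DeepseekAI, Qwen, and iFlytek models | intelligent content analysis and ranking | scheduled publishing | multi-template support | Node.js | TypeScript |…
☆3,079 · Jun 14, 2026 · Updated last month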
mwatkins1970 / SAE_Feature_Interpretability_Tool
A tool to assist in the interpretation of learned features in sparse autoencoders (in particular the four SAEs trained by Joseph Bloom o…
☆19 · Oct 4, 2024 · Updated last year
chuanruihu / Level-Navi-Agent-Search
The Level-Navi Agent, a framework that requires no training and utilizes large language models for deep query understanding and precise s…
☆81 · Dec 27, 2024 · Updated last year
chatmcp / mcp-server-chatsum
Query and Summarize your chat messages.
☆1,030 · Dec 4, 2024 · Updated last year
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Updated this week
Aria-Zhangjl / StoryWeaver
[AAAI 2025] StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization
☆227 · Jul 18, 2026 · Updated last week
nicekate / AI-ContentCraft
AI ContentCraft is an all-in-one content creation suite that helps creators generate stories, podcast scripts, and multimedia content usi…
☆396 · Jul 4, 2025 · Updated last year
Ji-Cather / GraphAgent
Code for ACL25-findings. An LLM-based agent simulation framework that simulates human behavior and generates dynamic, text-based social g…
☆96 · Mar 15, 2026 · Updated 4 months ago
UCSC-VLAA / story-iter
[ICLR 2026] A Training-free Iterative Framework for Long Story Visualization
☆959 · Apr 2, 2026 · Updated 3 months ago
rag-web-ui / rag-web-ui
RAG Web UI is an intelligent dialogue system based on RAG (Retrieval-Augmented Generation) technology.
☆3,073 · Apr 6, 2026 · Updated 3 months ago
TeamWiseFlow / xiaobei
A self-media customer-acquisition agent tailored for OPC and micro, small, and medium-sized enterprises
☆8,347 · Updated this week
microsoft / PIKE-RAG
PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
☆2,475 · Sep 10, 2025 · Updated 10 months ago
InternLM / MindSearch
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
☆6,897 · Jul 4, 2025 · Updated last year
stepfun-ai / Step-Audio
☆34 · Mar 16, 2026 · Updated 4 months ago
ammaarreshi / Gemini-Search
Perplexity-style AI search engine clone built with Gemini 2.0 Flash and Grounding
☆2,068 · Jan 4, 2025 · Updated last year
zilliztech / deep-searcher
Open Source Deep Research Alternative to Reason and Search on Private Data. Written in Python.
☆8,013 · Nov 19, 2025 · Updated 8 months ago
alex-oos / ai-wechat-bot
Connect all AI products to your WeChat and build your personal AI assistant to help with more of your everyday tasks.
☆423 · Mar 11, 2026 · Updated 4 months ago
allenai / olmocr
Toolkit for linearizing PDFs for LLM datasets/training
☆19,171 · Mar 25, 2026 · Updated 3 months ago
mshumer / OpenDeepResearcher
☆2,776 · May 2, 2025 · Updated last year
jina-ai / node-DeepResearch
Keep searching, reading webpages, and reasoning until it finds the answer (or exceeds the token budget)
☆5,201 · May 1, 2026 · Updated 2 months ago
tianyi-lab / C3PO
[COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing"
☆21 · Apr 9, 2025 · Updated last year
NVIDIA-AI-Blueprints / pdf-to-podcast
Transform PDFs into AI podcasts for engaging on-the-go audio content.
☆858 · Jun 26, 2026 · Updated 3 weeks ago
Huanshere / VideoLingo
Netflix-level subtitle cutting, translation, alignment, and even dubbing - one-click fully automated AI video subtitle team | Netflix-level subtitle cut…
☆17,837 · Jul 2, 2026 · Updated 3 weeks ago
jimmyliao / linebot
LINEBot
☆13 · Apr 7, 2025 · Updated last year
yzfly / pocketpal-ai-zh
Pocket AI: put the world's knowledge in your pocket. Chinese version of pocketpal-ai.
☆587 · Jul 2, 2026 · Updated 3 weeks ago
akazwz / openchat-monorepo
A modern full-stack AI chatbot application built with React and Cloudflare Workers using Connect RPC, with Web, mobile app, and desktop support via Tauri
☆565 · Jun 6, 2025 · Updated last year
zjunlp / OmniThink
[EMNLP 2025] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
☆486 · Aug 23, 2025 · Updated 11 months ago
microsoft / OmniParser
A simple screen parsing tool towards pure vision based GUI agent
☆25,186 · Updated this week
WangWenhao0716 / PDF-Embedding
[NeurIPS 2024] The official implementation of "Image Copy Detection for Diffusion Models"
☆18 · Oct 1, 2024 · Updated last year
HiveNexus / HiveChat
An AI chat bot for small and medium-sized teams, supporting models such as Deepseek, Open AI, Claude, and Gemini. An AI chat application designed for small and medium-sized teams, supporting De…
☆1,158 · Sep 16, 2025 · Updated 10 months ago
PySpur-Dev / pyspur
A visual playground for agentic workflows: Iterate over your agents 10x faster
☆5,764 · Jun 29, 2026 · Updated 3 weeks ago