adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

☆6,356

Alternatives and similar repositories for trafilatura

Users who are interested in trafilatura often compare it with the repositories listed below.

scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆377 · May 29, 2026 · Updated 2 months ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆820 · Feb 25, 2025 · Updated last year
dottxt-ai / outlines
Structured Outputs
☆15,419 · Updated this week
adbar / courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆178 · Updated this week
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean…
☆15,210 · Updated this week
stanfordnlp / dspy
DSPy: The framework for programming—not prompting—language models
☆36,460 · Updated this week
datalab-to / marker
Convert PDF to markdown + JSON quickly with high accuracy
☆37,994 · Jul 20, 2026 · Updated last week
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,765 · Updated this week
run-llama / llama_index
LlamaIndex is the leading document agent and OCR platform
☆51,196 · Updated this week
BerriAI / litellm
The fastest, litest AI Gateway. Rust core with Python SDK. Call 100+ LLM APIs in OpenAI (or native) format with cost tracking, guardrails…
☆54,835 · Updated this week
deepset-ai / haystack
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and a…
☆26,051 · Updated this week
guidance-ai / guidance
A guidance language for controlling large language models.
☆21,696 · May 21, 2026 · Updated 2 months ago
unclecode / crawl4ai
🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN
☆75,515 · Updated this week
codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
☆15,125 · Jul 21, 2026 · Updated last week
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,239 · Jul 22, 2026 · Updated last week
ScrapeGraphAI / Scrapegraph-ai
Python scraper based on AI
☆28,770 · Jul 20, 2026 · Updated last week
buriy / python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
☆2,894 · Jan 26, 2026 · Updated 6 months ago
AndyTheFactory / newspaper4k
📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
☆1,131 · Jul 19, 2026 · Updated last week
docling-project / docling
Get your documents ready for gen AI
☆63,950 · Updated this week
567-labs / instructor
structured outputs for llms
☆13,650 · Updated this week
mozilla / readability
A standalone version of the readability lib
☆11,365 · Jul 9, 2026 · Updated 3 weeks ago
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆87,317 · Updated this week
apify / crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Dow…
☆9,373 · Updated this week
unslothai / unsloth
Unsloth is a local UI for training and running Gemma 4, Qwen3.6, DeepSeek, Kimi, GLM and other models.
☆69,060 · Updated this week
fhamborg / news-please
news-please - an integrated web crawler and information extractor for news that just works
☆2,475 · Apr 14, 2026 · Updated 3 months ago
adbar / htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
☆154 · Jul 21, 2026 · Updated last week
qdrant / qdrant
Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI. Also available in the cl…
☆33,646 · Updated this week
weblyzard / inscriptis
A python based HTML to text conversion library, command line client and Web service.
☆345 · Jul 23, 2026 · Updated last week
mem0ai / mem0
Universal memory layer for AI Agents
☆61,841 · Updated this week
Aider-AI / aider
aider is AI pair programming in your terminal
☆47,782 · May 22, 2026 · Updated 2 months ago
lancedb / lancedb
Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less.
☆11,014 · Updated this week
google / langextract
A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive vi…
☆37,920 · Updated this week
jina-ai / reader
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
☆11,747 · May 22, 2026 · Updated 2 months ago
agno-agi / agno
Build, run, and manage agent platforms.
☆41,489 · Updated this week
langchain-ai / langchain
The agent engineering platform.
☆142,699 · Updated this week
matthewwithanm / python-markdownify
Convert HTML to Markdown
☆2,229 · Jun 30, 2026 · Updated 3 weeks ago
searxng / searxng
SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither track…
☆34,586 · Updated this week
argilla-io / argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
☆5,063 · Updated this week
letta-ai / letta
Platform for stateful agents: AI with advanced memory that can learn and self-improve over time.
☆24,011 · Jul 22, 2026 · Updated last week