dragnet-org/dragnet

Just the facts -- web page content extraction

☆1,274

Alternatives and similar repositories for dragnet

Users who are interested in dragnet are comparing it to the libraries listed below.

seomoz / dragnet_data
Training/test data for Dragnet
☆42 · Jan 29, 2015 · Updated 11 years ago
ziyan / spider
Web Content Extraction Through Machine Learning
☆185 · Apr 4, 2014 · Updated 12 years ago
rodricios / eatiht
An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
☆430 · Jan 16, 2026 · Updated 6 months ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆819 · Feb 25, 2025 · Updated last year
buriy / python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
☆2,894 · Jan 26, 2026 · Updated 5 months ago
misja / python-boilerpipe
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
☆542 · Jul 17, 2021 · Updated 5 years ago
currentslab / extractnet
A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built in metadata extraction all in one pac…
☆299 · May 19, 2025 · Updated last year
codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
☆15,121 · Updated this week
dalab / web2text
Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18
☆169 · Oct 28, 2021 · Updated 4 years ago
nikitautiu / learnhtml
Web content extraction using machine learning
☆34 · Mar 3, 2021 · Updated 5 years ago
grangier / python-goose
Html Content / Article Extractor, web scraping lib in Python
☆4,101 · Mar 10, 2026 · Updated 4 months ago
datalib / libextract
Extract data from websites using basic statistical magic
☆505 · Oct 2, 2020 · Updated 5 years ago
dragnet-org / dragnet_data
code and data used to build a training dataset for dragnet models
☆10 · Nov 29, 2020 · Updated 5 years ago
kohlschutter / boilerpipe
Work in progress transmit from Google Code
☆1,126 · Jan 3, 2018 · Updated 8 years ago
seomoz / mltk
mltk - Moz Language Tool Kit
☆12 · Mar 6, 2015 · Updated 11 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
srijiths / readabilityBUNDLE
A bundle of html content extraction algorithms
☆121 · Mar 27, 2015 · Updated 11 years ago
scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆376 · May 29, 2026 · Updated last month
TeamHG-Memex / html-text
Extract text from HTML
☆135 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆966 · Apr 1, 2026 · Updated 3 months ago
scrapy / scrapely
A pure-python HTML screen-scraping library
☆1,884 · Apr 4, 2022 · Updated 4 years ago
MohamedHmini / iww
AI based web-wrapper for web-content-extraction
☆102 · Feb 6, 2023 · Updated 3 years ago
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆6,334 · Updated this week
goose3 / goose3
A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
☆912 · Updated this week
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
seomoz / vocab
Vocabulary using n-grams
☆16 · Mar 30, 2018 · Updated 8 years ago
Webhose / article-date-extractor
Automatically extracts and normalizes an online article or blog post publication date
☆120 · Aug 10, 2023 · Updated 2 years ago
fhamborg / news-please
news-please - an integrated web crawler and information extractor for news that just works
☆2,472 · Apr 14, 2026 · Updated 3 months ago
tomazk / Text-Extraction-Evaluation
Framework for evaluating text extraction algorithms implemented as web services
☆42 · Jun 30, 2012 · Updated 14 years ago
weblyzard / inscriptis
A python based HTML to text conversion library, command line client and Web service.
☆345 · Updated this week
seomoz / qdr
Query-Document Relevance
☆42 · Feb 6, 2015 · Updated 11 years ago
nik0spapp / sdalg
Web page segmentation and noise removal
☆55 · Feb 4, 2024 · Updated 2 years ago
explosion / spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
☆33,772 · May 19, 2026 · Updated 2 months ago
rodricios / crawl-to-the-future
An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors
☆35 · Mar 19, 2015 · Updated 11 years ago
scrapinghub / frontera
A scalable frontier for web crawlers
☆1,332 · Jun 6, 2025 · Updated last year
pydepta / pydepta
A python implementation of DEPTA
☆84 · Jan 14, 2017 · Updated 9 years ago
mozilla / readability
A standalone version of the readability lib
☆11,356 · Jul 9, 2026 · Updated 2 weeks ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48 · Dec 17, 2021 · Updated 4 years ago