scrapinghub/webstruct

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/scrapinghub/webstruct)

scrapinghub / webstruct

NER toolkit for HTML data

☆259

Alternatives and similar repositories for webstruct

Users that are interested in webstruct are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

scrapinghub / webpager
View on GitHub
Paginating the web
☆37Feb 11, 2014Updated 12 years ago
scrapinghub / mdr
View on GitHub
A python library detect and extract listing data from HTML page.
☆110May 5, 2017Updated 9 years ago
pydepta / pydepta
View on GitHub
A python implementation of DEPTA
☆84Jan 14, 2017Updated 9 years ago
scrapinghub / aile
View on GitHub
Automatic Item List Extraction
☆85Jun 15, 2016Updated 10 years ago
TeamHG-Memex / Formasaurus
View on GitHub
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆121Apr 8, 2026Updated 3 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
xtannier / WebAnnotator
View on GitHub
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48Dec 17, 2021Updated 4 years ago
TeamHG-Memex / html-text
View on GitHub
Extract text from HTML
☆135Apr 8, 2026Updated 3 months ago
scrapinghub / shub
View on GitHub
Scrapinghub Command Line Client
☆129Updated this week
scrapinghub / arche
View on GitHub
Analyze scraped data
☆47Dec 9, 2019Updated 6 years ago
scrapinghub / page_finder
View on GitHub
Find which links on a web page are pagination links
☆29Jan 12, 2017Updated 9 years ago
scrapinghub / extruct
View on GitHub
Extract embedded metadata from HTML markup
☆966Apr 1, 2026Updated 3 months ago
scrapinghub / flatson
View on GitHub
Tool to flatten stream of JSON-like objects, configured via schema
☆33Oct 19, 2019Updated 6 years ago
scrapy / scrapely
View on GitHub
A pure-python HTML screen-scraping library
☆1,884Apr 4, 2022Updated 4 years ago
scrapinghub / aduana
View on GitHub
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆54May 21, 2024Updated 2 years ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
mariuspodean / CJ-PYTHON-01
View on GitHub
☆11Jul 6, 2020Updated 6 years ago
scrapinghub / page_clustering
View on GitHub
A simple algorithm for clustering web pages, suitable for crawlers
☆33Mar 6, 2017Updated 9 years ago
scrapinghub / skinfer
View on GitHub
Skinfer is a tool for inferring and merging JSON schemas
☆141Apr 24, 2024Updated 2 years ago
TeamHG-Memex / deep-deep
View on GitHub
Adaptive crawler which uses Reinforcement Learning methods
☆167Apr 8, 2026Updated 3 months ago
TeamHG-Memex / autopager
View on GitHub
Detect and classify pagination links
☆107Apr 8, 2026Updated 3 months ago
scrapinghub / exporters
View on GitHub
Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
☆39May 21, 2024Updated 2 years ago
scrapinghub / scrapyrt
View on GitHub
HTTP API for Scrapy spiders
☆882Jun 29, 2026Updated 3 weeks ago
dragnet-org / dragnet
View on GitHub
Just the facts -- web page content extraction
☆1,274Jul 8, 2025Updated last year
ParallelProcessingLab / fogfaas
View on GitHub
FoGFaaS: Add serverless computing (faas) to ifogsim
☆22Mar 30, 2025Updated last year
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
povilasb / scrapy-html-storage
View on GitHub
Scrapy downloader middleware that stores response HTMLs to disk.
☆18Apr 14, 2026Updated 3 months ago
scrapinghub / scrapy-poet
View on GitHub
Page Object pattern for Scrapy
☆127Jun 8, 2026Updated last month
TeamHG-Memex / MaybeDont
View on GitHub
A component that tries to avoid downloading duplicate content
☆28Apr 8, 2026Updated 3 months ago
stummjr / scrapy-fieldstats
View on GitHub
A Scrapy extension to log items coverage when the spider shuts down
☆18Apr 11, 2020Updated 6 years ago
scrapinghub / js2xml
View on GitHub
Convert Javascript code to an XML document
☆188Mar 14, 2022Updated 4 years ago
scrapinghub / spidermon
View on GitHub
Scrapy Extension for monitoring spiders execution.
☆561May 28, 2026Updated last month
scrapy-plugins / scrapy-pagestorage
View on GitHub
A scrapy extension to store requests and responses information in storage service
☆27Mar 11, 2022Updated 4 years ago
scrapinghub / scaws
View on GitHub
Extensions for using Scrapy on Amazon AWS
☆32Dec 5, 2012Updated 13 years ago
seagatesoft / webdext
View on GitHub
Intelligent Web Data Extractor
☆74Dec 5, 2022Updated 3 years ago
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
scrapy / itemloaders
View on GitHub
Library to populate items using XPath and CSS with a convenient API
☆49Updated this week
TeamHG-Memex / arachnado
View on GitHub
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162Apr 8, 2026Updated 3 months ago
dbpedia / dbpedia-docs
View on GitHub
A tutorial about DBpedia and Linked Data in general
☆24Nov 7, 2014Updated 11 years ago
TeamHG-Memex / extract-html-diff
View on GitHub
extract difference between two html pages
☆33Apr 8, 2026Updated 3 months ago
scrapinghub / price-parser
View on GitHub
Extract price amount and currency symbol from a raw text string
☆346Mar 19, 2026Updated 4 months ago
peterwaksman / Narwhal
View on GitHub
Narwhal is a keyword and KEY NARRATIVE manager that creates language-aware classes. Because Narhwal does not use NLP it avoids complexity…
☆12Oct 16, 2018Updated 7 years ago
scrapinghub / splash
View on GitHub
Lightweight, scriptable browser as a service with an HTTP API
☆4,190Aug 2, 2024Updated last year