TeamHG-Memex/html-text

Extract text from HTML

☆135

Alternatives and similar repositories for html-text

Users who are interested in html-text are comparing it to the libraries listed below.

zytedata / zyte-autoextract
Python clients for Zyte AutoExtract API
☆41 · Jan 17, 2022 · Updated 4 years ago
TeamHG-Memex / extract-html-diff
extract difference between two html pages
☆33 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
TeamHG-Memex / Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆121 · Apr 8, 2026 · Updated 3 months ago
mariuspodean / CJ-PYTHON-01
☆11 · Jul 6, 2020 · Updated 6 years ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆28 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆65 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / deep-deep
Adaptive crawler which uses Reinforcement Learning methods
☆167 · Apr 8, 2026 · Updated 3 months ago
rmax / scrapy-inline-requests
A decorator to write coroutine-like spider callbacks.
☆109 · Dec 26, 2022 · Updated 3 years ago
TeamHG-Memex / undercrawler
A generic crawler
☆81 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / scrapy-poet
Page Object pattern for Scrapy
☆127 · Jun 8, 2026 · Updated last month
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
scrapy / itemloaders
Library to populate items using XPath and CSS with a convenient API
☆49 · Updated this week
TeamHG-Memex / autologin
A project to attempt to automatically login to a website given a single seed
☆129 · Apr 8, 2026 · Updated 3 months ago
scrapy / pypydispatcher
A fork of http://pydispatcher.sourceforge.net/ with PyPy support
☆16 · Jul 3, 2017 · Updated 9 years ago
scrapinghub / andi
Library for annotation-based dependency injection
☆24 · Updated this week
zytedata / zyte-spider-templates
Spider templates for automatic crawlers.
☆35 · Mar 26, 2026 · Updated 3 months ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
TeamHG-Memex / autopager
Detect and classify pagination links
☆107 · Apr 8, 2026 · Updated 3 months ago
zytedata / html-text
☆19 · Oct 6, 2025 · Updated 9 months ago
stummjr / scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
☆18 · Apr 11, 2020 · Updated 6 years ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆966 · Apr 1, 2026 · Updated 3 months ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
scrapinghub / shublang
Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
☆16 · Jul 9, 2024 · Updated 2 years ago
TeamHG-Memex / tor-proxy
a tor socks proxy docker image
☆12 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / scrapy-rotating-proxies
use multiple proxies with Scrapy
☆775 · Apr 8, 2026 · Updated 3 months ago
croqaz / Stones
🗿Stones: Persistent key-value containers, compatible with Python dict
☆17 · Jul 15, 2024 · Updated 2 years ago
scrapinghub / mdr
A python library detect and extract listing data from HTML page.
☆110 · May 5, 2017 · Updated 9 years ago
scrapinghub / price-parser
Extract price amount and currency symbol from a raw text string
☆346 · Mar 19, 2026 · Updated 4 months ago
TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / web-poet
Web scraping Page Objects core library
☆107 · Jul 10, 2026 · Updated last week
dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,274 · Jul 8, 2025 · Updated last year
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15 · Feb 9, 2014 · Updated 12 years ago
alecxe / scrapy-beautifulsoup
Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
☆22 · Sep 26, 2016 · Updated 9 years ago
scrapinghub / page_clustering
A simple algorithm for clustering web pages, suitable for crawlers
☆33 · Mar 6, 2017 · Updated 9 years ago
dogancanbakir / soft-404
A classifier for detecting soft 404 pages
☆17 · Sep 10, 2022 · Updated 3 years ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
bkj / wit
Algorithms for "schema matching"
☆26 · Jul 6, 2016 · Updated 10 years ago