zytedata / clear-html
Remove DIVs, style stuff and normalize HTML preserving structure information
☆11 · Updated 2 months ago
Alternatives and similar repositories for clear-html
Users who are interested in clear-html are comparing it to the libraries listed below.
- ☆40 · Updated 7 months ago
- Datasette plugin for searching all searchable tables at once ☆27 · Updated last month
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 · Updated last year
- LLM plugin for embeddings using sentence-transformers ☆73 · Updated 8 months ago
- Loadable spellfix1 extension for sqlite as python package ☆26 · Updated last year
- Web scraping Page Objects core library ☆104 · Updated last week
- Via Text Density Simple Web Crawler With Go ☆13 · Updated 2 years ago
- Common crawl extractor ☆84 · Updated last year
- ☆20 · Updated 8 months ago
- Python client for Zyte API ☆27 · Updated 2 months ago
- Python JSON benchmarking and "correctness". ☆36 · Updated 2 years ago
- Remote web browser automation. ☆23 · Updated last year
- Fast and robust date extraction from web pages, with Python or on the command-line ☆142 · Updated last month
- Dockerized FastAPI wrapper around the recognize-anything image recognition models ☆25 · Updated last year
- pyppeteer stealth plugin, attempts to look like a normal browser ☆24 · Updated last year
- A Python interface for the Chrome DevTools Protocol. Enables direct control of Chrome without external automation drivers. ☆20 · Updated 3 weeks ago
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆54 · Updated last month
- Flatten, format, and export any JSON-like data to CSV (or any other string output). ☆17 · Updated 4 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆154 · Updated last week
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 · Updated 3 months ago
- A helper library full of URL-related heuristics. ☆73 · Updated 3 months ago
- The Web Scraping Club Free Repository ☆156 · Updated last month
- The Architecture of a Web Crawler: Building a Google-Inspired Distributed Web Crawler ☆123 · Updated last year
- Spider ported to Python ☆99 · Updated 10 months ago
- Spider templates for automatic crawlers. ☆33 · Updated 2 weeks ago
- Tools for running OCR against files stored in S3 ☆120 · Updated 3 years ago
- A polite and user-friendly downloader for Common Crawl data ☆63 · Updated 4 months ago
- Tools to create simple and consistent interfaces to complicated and varied data sources. ☆13 · Updated last month
- Create an LLM XML context document from an llms.txt file ☆23 · Updated last year
- List of free and checked http, https, socks4 and socks5 proxies ☆17 · Updated 2 weeks ago