zytedata / clear-html
Remove DIVs, style stuff and normalize HTML preserving structure information
☆11 Updated 3 weeks ago
Alternatives and similar repositories for clear-html
Users who are interested in clear-html are comparing it to the libraries listed below.
- ☆34 Updated 4 months ago
- Autogenerated CDP utilities that enable Python to control Chrome directly, without external automation drivers. ☆17 Updated 5 months ago
- Remote web browser automation. ☆20 Updated last year
- Common crawl extractor ☆80 Updated last year
- aiohttp-like interface to chromium. based on selenium_driverless to bypass cloudflare ☆57 Updated last month
- The Web Scraping Club Free Repository ☆151 Updated 4 months ago
- ☆26 Updated last year
- Python SDK for Permit.io: Plug & Play Application Level Authorization ☆14 Updated this week
- Dockerized FastAPI wrapper around the recognize-anything image recognition models ☆25 Updated last year
- Neural search engine for discovering semantically similar Python repositories on GitHub ☆28 Updated last year
- httpx transport for curl_cffi (python bindings for curl-impersonate) ☆23 Updated last month
- Spider ported to Python ☆94 Updated 8 months ago
- Datasette plugin for searching all searchable tables at once ☆25 Updated last year
- Create an LLM XML context document from an llms.txt file ☆22 Updated last year
- A polite and user-friendly downloader for Common Crawl data ☆57 Updated last month
- Create a static website with Fly - HTML from the example ☆21 Updated last year
- Scrapfly Python SDK for headless browsers and proxy rotation ☆47 Updated 2 weeks ago
- LLM plugin for embeddings using sentence-transformers ☆72 Updated 5 months ago
- pyppeteer stealth plugin, attempts to look like a normal browser ☆23 Updated 11 months ago
- Scrape various open data directories to create an index of what's available out there ☆37 Updated 7 months ago
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆24 Updated 6 months ago
- Generate embeddings for images and text using CLIP with LLM ☆74 Updated last year
- TextractAI: Extract and process text from PDFs using Python, OpenAI API, and OCR techniques. ☆14 Updated last year
- GO GO EXPERIMENTAL LAB ☆16 Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆148 Updated 8 months ago
- Zyte API integration for Scrapy ☆38 Updated last month
- A microservice for document conversion at scale ☆79 Updated this week
- Minimal set of tools to conduct stealthy scraping. ☆159 Updated 2 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 Updated 3 weeks ago
- Create "perfect" snapshots of web pages ☆33 Updated last month