zytedata / clear-html
Remove DIVs, style stuff and normalize HTML preserving structure information
☆11 Updated 2 weeks ago
Alternatives and similar repositories for clear-html
Users who are interested in clear-html are comparing it to the libraries listed below.
- Create an LLM XML context document from an llms.txt file ☆22 Updated last year
- Datasette plugin for searching all searchable tables at once ☆27 Updated last week
- Web scraping Page Objects core library ☆102 Updated 2 weeks ago
- Remote web browser automation. ☆22 Updated last year
- LLM plugin for embeddings using sentence-transformers ☆72 Updated 6 months ago
- Common crawl extractor ☆80 Updated last year
- Run embedding models using ONNX ☆35 Updated last year
- Spider templates for automatic crawlers. ☆32 Updated last month
- YouTube Transcript Cleaner is a simple web-based application that improves the readability of YouTube transcripts. ☆26 Updated 8 months ago
- Multi-agent workflows and complex Agent interactions, both via YAML manifest and programmatic usage. MCP & ACP (Agent Client Protocol) s… ☆29 Updated last week
- ☆19 Updated last year
- Unleash the full potential of exascale LLMs on consumer-class GPUs, proven by extensive benchmarks, with no long-term adjustments and min… ☆25 Updated last year
- LLM access to pplx-api ☆31 Updated 2 weeks ago
- ☆26 Updated last year
- Spider ported to Python ☆97 Updated 9 months ago
- Neural search engine for discovering semantically similar Python repositories on GitHub ☆26 Updated last year
- Library that helps use puppeteer in scrapy. ☆52 Updated 3 months ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 Updated 2 months ago
- A FastAPI extension for integrating common AI agent frameworks. ☆45 Updated 9 months ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 Updated last year
- GO GO EXPERIMENTAL LAB ☆16 Updated 2 weeks ago
- agenty ☆43 Updated 8 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆149 Updated last week
- estela, an elastic web scraping cluster 🕸 ☆191 Updated this week
- Python wrapper for Ferret ☆43 Updated 3 years ago
- This repository contains a Retrieval-Augmented Generation (RAG) framework developed in C++ for high performance and scalability, with CUD… ☆106 Updated 2 months ago
- ☆20 Updated 7 months ago
- Git scrapers for scraping the fediverse ☆16 Updated this week
- Extract structured data from any unstructured web page ☆41 Updated last year
- Page Object pattern for Scrapy ☆123 Updated 3 weeks ago