matiskay / html-cluster
A command-line tool to cluster HTML pages based on structural and style similarity.
☆20 Updated 2 weeks ago
Alternatives and similar repositories for html-cluster
Users who are interested in html-cluster are comparing it to the libraries listed below.
- Simple heuristic for measuring web page similarity (& data set) ☆90 Updated 7 years ago
- Compare HTML similarity using structural and style metrics ☆212 Updated 2 years ago
- Broad crawler for domain discovery ☆19 Updated 7 years ago
- A component that tries to avoid downloading duplicate content ☆27 Updated 7 years ago
- Pre-built Scrapy spiders for AutoExtract ☆19 Updated last year
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 Updated last year
- Extract the difference between two HTML pages ☆32 Updated 7 years ago
- Spell correct entire sentences using nltk freqdist and symspell ☆19 Updated 7 years ago
- Package to facilitate URL clustering ☆68 Updated 9 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 Updated last year
- Python clients for Zyte AutoExtract API ☆40 Updated 3 years ago
- Analyze scraped data ☆46 Updated 5 years ago
- Show summary of a large number of URLs in a Jupyter Notebook