iipc / urlcanon
url canonicalization library for python and java
☆33, updated 2 years ago
Related projects
Alternatives and complementary repositories for urlcanon
- Sort-friendly URI Reordering Transform (SURT) python module (☆40, updated 3 months ago)
- Centralised repository for WARC usage specifications. (☆100, updated this week)
- Trough: Big data, small databases. (☆40, updated 3 months ago)
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. (☆42, updated 6 years ago)
- CDXJ Indexing of WARC/ARCs (☆21, updated last week)
- Utility library to turn country names into ISO two-letter codes (☆66, updated this week)
- Web archive index server based on RocksDB (☆32, updated this week)
- Webrecorders DevTools Protocol Automation Library (☆17, updated 2 years ago)
- A classifier for detecting soft 404 pages (☆57, updated last year)
- A component that tries to avoid downloading duplicate content (☆27, updated 6 years ago)
- (☆18, updated 8 years ago)
- WARC and ARC indexing and discovery tools. (☆117, updated 3 months ago)
- Common web archive utility code. (☆50, updated last month)
- A pure-Python robots.txt parser with support for modern conventions. (☆55, updated this week)
- utility to fetch provenance information from Internet Archive's Wayback Machine (☆13, updated 2 years ago)
- A Memento Aggregator CLI and Server in Go (☆57, updated 6 months ago)
- Command line tool for digging into WARC files (☆34, updated 3 weeks ago)
- A queue-controlled browser automation tool for improving web crawl quality (☆60, updated 4 years ago)
- A search interface and wayback machine for the UKWA Solr based warc-indexer framework. (☆102, updated last week)
- Web application for distributed compute analysis of Archive-It web archive collections. (☆15, updated 2 months ago)
- Performance-focused replacement for Python urllib (☆21, updated 6 years ago)
- Wikipedia citation tool for Google Books, New York Times, ISBN, DOI and more (☆21, updated 8 years ago)
- python library for extracting html microdata (☆165, updated last year)
- Grabbing all news. (☆62, updated 4 years ago)
- Pluggable DSL that uses pipes to perform a series of linear transformations to extract data (☆15, updated 4 months ago)
- A scrapy extension to store requests and responses information in storage service (☆26, updated 2 years ago)
- (☆11, updated last year)
- https://mimesniff.spec.whatwg.org/ implementation for Python (☆14, updated 10 months ago)
- extract difference between two html pages (☆32, updated 6 years ago)
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki (☆25, updated 3 months ago)