iipc / urlcanon
URL canonicalization library for Python and Java
☆34 · Updated 3 years ago
Alternatives and similar repositories for urlcanon
Users that are interested in urlcanon are comparing it to the libraries listed below
- Sort-friendly URI Reordering Transform (SURT) python module ☆42 · Updated 10 months ago
- Trough: Big data, small databases. ☆42 · Updated 11 months ago
- Webrecorder's DevTools Protocol Automation Library ☆17 · Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- The god of human readable numbers ☆13 · Updated 5 years ago
- Centralised repository for WARC usage specifications. ☆115 · Updated 7 months ago
- Python library for reading and writing warc files ☆241 · Updated 3 years ago
- Python implementation of WHATWG URL Living Standard ☆21 · Updated last year
- Tools for helping you work with web platform archive downloads. ☆17 · Updated 5 years ago
- CDXJ Indexing of WARC/ARCs ☆26 · Updated 6 months ago
- Python WSGI Middleware for adding HTTP/S proxy support to any WSGI Application ☆24 · Updated 4 years ago
- URL Transformation, Sanitization ☆103 · Updated last year
- A queue-controlled browser automation tool for improving web crawl quality ☆61 · Updated 3 months ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆191 · Updated 3 years ago
- WARC and ARC indexing and discovery tools. ☆124 · Updated 3 months ago
- A pure-Python robots.txt parser with support for modern conventions. ☆69 · Updated this week
- Nondestructive warc-in-tar to warc conversion ☆26 · Updated 12 years ago
- Common web archive utility code. ☆55 · Updated last month
- IMAP server based on Twitter statuses ☆55 · Updated 15 years ago
- A component that tries to avoid downloading duplicate content ☆27 · Updated 7 years ago
- A helper library full of URL-related heuristics. ☆69 · Updated 2 weeks ago
- A list of tools related to W(eb)ARC(hive) ☆62 · Updated 10 years ago
- Wikipedia citation tool for Google Books, New York Times, ISBN, DOI and more ☆22 · Updated 8 years ago
- Microformats2 parser written in Python ☆106 · Updated 7 months ago
- A Memento Aggregator CLI and Server in Go ☆65 · Updated 3 months ago
- Parsing and validation of URIs (RFC 3986) and IRIs (RFC 3987) ☆46 · Updated last year
- A client for the Archive-It And Webrecorder WASAPI Data Transfer API ☆16 · Updated 5 years ago
- Scrapy schema validation pipeline and Item builder using JSON Schema ☆44 · Updated 4 years ago
- utility to fetch provenance information from Internet Archive's Wayback Machine ☆13 · Updated 3 years ago