iipc / urlcanon
URL canonicalization library for Python and Java
☆36Updated 3 years ago
Alternatives and similar repositories for urlcanon
Users who are interested in urlcanon are comparing it to the libraries listed below:
- Sort-friendly URI Reordering Transform (SURT) python module☆44Updated 2 months ago
- Python library for reading and writing warc files☆244Updated 3 years ago
- Centralised repository for WARC usage specifications.☆118Updated last month
- Sickle: OAI-PMH for Humans☆114Updated 2 years ago
- A pure-Python robots.txt parser with support for modern conventions.☆72Updated last week
- Create and edit WARC and WACZ files☆17Updated 11 months ago
- Trough: Big data, small databases.☆40Updated last year
- CoCrawler is a versatile web crawler built using modern tools and concurrency.☆191Updated 3 years ago
- A tiny library for Python text normalisation. Useful for ad-hoc text processing.☆157Updated 2 months ago
- Webrecorders DevTools Protocol Automation Library☆17Updated 3 years ago
- Utility library to turn country names into ISO two-letter codes☆71Updated 3 months ago
- Streaming WARC/ARC library for fast web archive IO☆438Updated 11 months ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…☆130Updated 3 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Updated 7 years ago
- URL normalization for Python☆99Updated 6 months ago
- CDXJ Indexing of WARC/ARCs☆30Updated 11 months ago
- A queue-controlled browser automation tool for improving web crawl quality☆63Updated 3 months ago
- python library for extracting html microdata☆166Updated 2 years ago
- track changes to the news, where news is anything with an RSS feed☆179Updated 5 years ago
- An experimental Python parser for MediaWiki syntax with a focus on extensibility and comprehensibility☆60Updated 3 years ago
- A Memento Client Library in Python☆26Updated 7 years ago
- The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting"☆86Updated 2 years ago
- Python implementation of WHATWG URL Living Standard☆21Updated last year
- Serving content from a WARC☆62Updated 12 years ago
- Github mirror - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)☆37Updated last year
- A client for the Archive-It And Webrecorder WASAPI Data Transfer API☆16Updated 6 years ago
- Python package for harvesting records from OAI-PMH provider(s).☆64Updated 3 years ago
- Seeder - Czech webarchive curating tool and public site☆17Updated 2 months ago
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive…☆26Updated 2 years ago
- Command line tool for digging into WARC files☆47Updated this week