iipc / urlcanon
URL canonicalization library for Python and Java
☆36 Updated 3 years ago
Alternatives and similar repositories for urlcanon
Users that are interested in urlcanon are comparing it to the libraries listed below
- Sort-friendly URI Reordering Transform (SURT) python module ☆44 Updated 3 months ago
- Python library for reading and writing warc files ☆246 Updated 3 years ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆131 Updated last month
- Streaming WARC/ARC library for fast web archive IO ☆441 Updated last year
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆191 Updated 3 years ago
- A Memento Client Library in Python ☆26 Updated 7 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 Updated 8 years ago
- A tiny library for Python text normalisation. Useful for ad-hoc text processing. ☆157 Updated 3 months ago
- A pure-Python robots.txt parser with support for modern conventions. ☆75 Updated 2 weeks ago
- Trough: Big data, small databases. ☆41 Updated last year
- URL normalization for Python ☆99 Updated 8 months ago
- Create and edit WARC and WACZ files ☆20 Updated last year
- CDXJ Indexing of WARC/ARCs ☆31 Updated last year
- Python implementation of WHATWG URL Living Standard ☆21 Updated last year
- Centralised repository for WARC usage specifications. ☆120 Updated 2 months ago
- Utility library to turn country names into ISO two-letter codes ☆71 Updated 4 months ago
- track changes to the news, where news is anything with an RSS feed ☆179 Updated 5 years ago
- Sickle: OAI-PMH for Humans ☆115 Updated 2 years ago
- python library for extracting html microdata ☆166 Updated 2 years ago
- Grabbing all news. ☆62 Updated 6 years ago
- Pure Python wrapper to the Yajl C Library ☆85 Updated last year
- A list of tools related to W(eb)ARC(hive) ☆66 Updated 11 years ago
- WASAPI data transfer APIs ☆48 Updated 3 years ago
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive… ☆26 Updated 3 years ago
- [OBSOLETE] Replaced by https://gitlab.wikimedia.org/toolforge-repos/python-toolforge ☆22 Updated 2 years ago
- ☆11 Updated last month
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ☆155 Updated 2 months ago
- URL Transformation, Sanitization ☆103 Updated last year
- ϲοnfuѕаblе_һοmоɡlyphs ☆162 Updated last year
- A queue-controlled browser automation tool for improving web crawl quality ☆63 Updated 4 months ago