iipc / urlcanon
URL canonicalization library for Python and Java
☆36 · Updated 3 years ago
Alternatives and similar repositories for urlcanon
Users who are interested in urlcanon are comparing it to the libraries listed below.
- Sort-friendly URI Reordering Transform (SURT) Python module ☆44 · Updated last month
- Python library for reading and writing WARC files ☆243 · Updated 3 years ago
- Centralised repository for WARC usage specifications. ☆117 · Updated 2 weeks ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆130 · Updated 3 months ago
- Trough: Big data, small databases. ☆40 · Updated last year
- A tiny library for Python text normalisation. Useful for ad-hoc text processing. ☆155 · Updated last month
- Create and edit WARC and WACZ files ☆17 · Updated 10 months ago
- Python library for extracting HTML microdata ☆166 · Updated 2 years ago
- Streaming WARC/ARC library for fast web archive IO ☆434 · Updated 10 months ago
- Webrecorders DevTools Protocol Automation Library ☆17 · Updated 3 years ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆191 · Updated 3 years ago
- Track changes to the news, where news is anything with an RSS feed ☆179 · Updated 5 years ago
- A component that tries to avoid downloading duplicate content ☆27 · Updated 7 years ago
- Web archive index server based on RocksDB