iipc / urlcanon
URL canonicalization library for Python and Java
☆36 Updated 3 years ago
Alternatives and similar repositories for urlcanon
Users who are interested in urlcanon are comparing it to the libraries listed below.
- Sort-friendly URI Reordering Transform (SURT) python module ☆44 Updated 4 months ago
- Python library for reading and writing warc files ☆247 Updated 3 years ago
- Trough: Big data, small databases. ☆41 Updated last year
- Create and edit WARC and WACZ files ☆20 Updated last year
- Utility library to turn country names into ISO two-letter codes ☆71 Updated 5 months ago
- Centralised repository for WARC usage specifications. ☆120 Updated 3 months ago
- track changes to the news, where news is anything with an RSS feed ☆182 Updated 5 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 Updated 8 years ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆192 Updated 3 years ago
- A queue-controlled browser automation tool for improving web crawl quality ☆64 Updated 5 months ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆132 Updated last month
- Python implementation of WHATWG URL Living Standard ☆21 Updated last year
- Webrecorders DevTools Protocol Automation Library ☆18 Updated 3 years ago
- A tiny library for Python text normalisation. Useful for ad-hoc text processing. ☆157 Updated 4 months ago
- Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) ☆168 Updated 4 months ago
- CDXJ Indexing of WARC/ARCs ☆31 Updated last year
- A pure-Python robots.txt parser with support for modern conventions. ☆76 Updated last month
- A component that tries to avoid downloading duplicate content ☆27 Updated last week
- Python WSGI Middleware for adding HTTP/S proxy support to any WSGI Application ☆24 Updated 5 years ago
- Scrapy schema validation pipeline and Item builder using JSON Schema ☆45 Updated 4 years ago
- Tools for helping you work with web platform archive downloads. ☆18 Updated 5 years ago
- python library for extracting html microdata ☆167 Updated 2 years ago
- Convert Directories, Files and ZIP Files to Web Archives (WARC) ☆91 Updated 8 months ago
- WASAPI data transfer APIs ☆48 Updated 3 years ago
- Serving content from a WARC ☆62 Updated 13 years ago
- Extract text from HTML ☆134 Updated last week
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ☆156 Updated 3 months ago
- CSS Selectors for Python ☆305 Updated last month
- URL normalization for Python ☆99 Updated 8 months ago