edgi-govdata-archiving / wayback
A Python API to the Internet Archive Wayback Machine
☆71 Updated 7 months ago
Alternatives and similar repositories for wayback:
Users interested in wayback are comparing it to the libraries listed below.
- A helper library full of URL-related heuristics. ☆69 Updated last week
- Wayback Machine API interface & a command-line tool ☆516 Updated last year
- Converts WARC files to static HTML ☆44 Updated 9 months ago
- The little things give you away... A collection of various small helper stuff – Mirror repo only, no longer kept in sync, refer to gitea.… ☆23 Updated 4 years ago
- A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service ☆177 Updated 5 months ago
- ETL pipeline, graphical explorer and general toolbox for investigations with follow the money data ☆16 Updated last year
- A maximum-strength name parser for record linkage. ☆36 Updated last month
- Extract networks of entities from journalistic reporting ☆48 Updated last year
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive… ☆26 Updated 2 years ago
- Pre-built Scrapy spiders for AutoExtract ☆19 Updated 11 months ago
- Alternative robots parser module for Python ☆17 Updated 3 weeks ago
- A PDF classifier ensemble with REST API service ☆23 Updated 4 years ago
- Dataset: BuzzFeed News “Trending” Strip, 2018–2023 ☆19 Updated last year
- Python-based Wikidata framework for easy dataframe extraction ☆43 Updated last year
- A Python implementation of Lunr.js 🌖 ☆196 Updated 3 weeks ago
- Sort-friendly URI Reordering Transform (SURT) Python module ☆41 Updated 8 months ago
- 🌬️ urlExpander is a Python package for expanding shortened links (URLs). ☆73 Updated 2 years ago
- A modern Python library for writing maintainable web scrapers. ☆247 Updated 8 months ago
- A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine. ☆113 Updated last year
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆39 Updated this week
- Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation. ☆149 Updated 2 months ago
- API client for Aleph, supports bulk entity and document upload. ☆28 Updated 5 months ago
- Some tools to help analyze the Twitter archive ☆62 Updated 7 months ago
- An experimental implementation of Burrows's Delta in Python 3 ☆21 Updated 3 years ago
- America's most comprehensive dictionary of campaign finance jargon. A free resource created by and for data journalists. ☆17 Updated last month
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Tag news stories based on models trained on the NYT corpus. ☆42 Updated 2 years ago
- Extract text from HTML ☆135 Updated 4 years ago
- How hard is it to get a list of all local news sites in the United States (LOL) ☆8 Updated 4 years ago
- A framework for quick web archiving; canonical repository: https://gitea.arpa.li/JustAnotherArchivist/qwarc ☆27 Updated 3 years ago