edgi-govdata-archiving / wayback
A Python API to the Internet Archive Wayback Machine
☆71 · Updated 8 months ago
Alternatives and similar repositories for wayback:
Users who are interested in wayback are comparing it to the libraries listed below.
- Wayback Machine API interface & a command-line tool ☆523 · Updated last year
- A helper library full of URL-related heuristics. ☆69 · Updated last month
- A framework for quick web archiving; canonical repository: https://gitea.arpa.li/JustAnotherArchivist/qwarc ☆27 · Updated 3 years ago
- The little things give you away... A collection of various small helper stuff – Mirror repo only, no longer kept in sync, refer to gitea.… ☆23 · Updated 4 years ago
- Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked … ☆33 · Updated 2 years ago
- Parse government documents into well-formed JSON ☆68 · Updated 2 months ago
- ☆62 · Updated 3 months ago
- Support for writing WARC files with Scrapy ☆21 · Updated 5 years ago
- UNOFFICIAL Python API to interface with Parler.com ☆53 · Updated 9 months ago
- The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords ☆74 · Updated last year
- Some tools to help analyze the Twitter archive ☆62 · Updated 8 months ago
- A Python module for clustering creators of social media content into networks ☆74 · Updated 3 years ago
- Template repository and README for submissions to Bellingcat's Global Hackathon ☆16 · Updated 2 years ago
- Alternative robots parser module for Python ☆17 · Updated last month
- Save an RSS or Atom feed to a SQLite database ☆50 · Updated 2 years ago
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆40 · Updated 2 weeks ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆26 · Updated 8 months ago
- 🌬️ urlExpander is a Python package for expanding shortened links (URLs). ☆73 · Updated 2 years ago
- A set of utilities for processing MediaWiki XML dump data. ☆53 · Updated 2 months ago
- Common Crawl extractor ☆75 · Updated 11 months ago
- Python-based Wikidata framework for easy dataframe extraction ☆44 · Updated last year
- Add website scraping abilities to Datasette ☆62 · Updated 2 years ago
- Dataset: BuzzFeed News “Trending” Strip, 2018–2023 ☆19 · Updated last year
- A PDF classifier ensemble with REST API service ☆23 · Updated 4 years ago
- Python functions for applied use of schema.org ☆36 · Updated 3 years ago
- Tag news stories based on models trained on the NYT corpus. ☆42 · Updated 2 years ago
- A Python tool to search for and remove duplicated files in messy datasets ☆16 · Updated 4 months ago
- Extract place names from a URL or text, and add context to those names, for example distinguishing between a country, region or city. ☆129 · Updated last year
- WARC and ARC indexing and discovery tools. ☆123 · Updated last month
- ETL pipeline, graphical explorer and general toolbox for investigations with follow the money data ☆21 · Updated last year