macbre / mediawiki-dump
Python package for working with MediaWiki XML content dumps
☆25 · Updated 2 weeks ago
Alternatives and similar repositories for mediawiki-dump
Users who are interested in mediawiki-dump are comparing it to the libraries listed below.
- A set of utilities for processing MediaWiki XML dump data. ☆61 · Updated 11 months ago
- Translate files using Argos Translate ☆30 · Updated 2 weeks ago
- A simple RAG chatbot that can retrieve from a mediawiki data dump ☆22 · Updated last year
- A Memento Aggregator CLI and Server in Go ☆76 · Updated 11 months ago
- A tool for detecting viruses and NSFW material in WARC files ☆17 · Updated last month
- Libzim binding for Python: read/write ZIM files in Python ☆97 · Updated 2 months ago
- wabac.js - Web Archive Browsing Augmentation Client ☆122 · Updated last week
- Converts WARC files to static HTML ☆51 · Updated 4 months ago
- Streaming WARC/ARC library for fast web archive IO ☆446 · Updated last year
- A polite and user-friendly downloader for Common Crawl data ☆67 · Updated 5 months ago
- ☆56 · Updated last year
- A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service ☆188 · Updated 3 weeks ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ☆156 · Updated 4 months ago
- A Wikimedia Toolforge tool for exporting ebooks from Wikisources. [This repo has moved to https://gitlab.wikimedia.org/toolforge-repos/w… ☆87 · Updated 5 months ago
- Convert Directories, Files and ZIP Files to Web Archives (WARC) ☆92 · Updated 9 months ago
- Python library for reading and writing warc files ☆247 · Updated 3 years ago
- A list of things related to software, literature, and other content for 🕣 Memento ☆105 · Updated last week
- A post-processing tool for scanned sheets of paper. ☆85 · Updated last year
- A Python library to parse MediaWiki WikiText ☆317 · Updated 8 months ago
- Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) ☆169 · Updated 5 months ago
- Command line tool for digging into WARC files ☆50 · Updated last week
- A toolchain of tasks for sequencing and fingerprinting book fulltext ☆46 · Updated last year
- A Python-based HTML to text conversion library, command line client and Web service. ☆334 · Updated 2 months ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆132 · Updated 2 months ago
- A Tool To Push Web Resources Into Web Archives ☆429 · Updated 2 years ago
- A tool for collecting archival slivers of the web and web archives ☆17 · Updated 11 months ago
- Scraper for downloading the entire ebooks repository of Project Gutenberg ☆155 · Updated last week
- Specifications developed and maintained by the Webrecorder community. ☆140 · Updated 3 months ago
- Arquivo.pt main goal is the preservation and access of web contents that are no longer available online. During the developing of the PW… ☆53 · Updated 3 months ago
- Centralised repository for WARC usage specifications. ☆124 · Updated 3 months ago