A queue-controlled browser automation tool for improving web crawl quality
☆64Aug 13, 2025Updated 6 months ago
Alternatives and similar repositories for umbra
Users who are interested in umbra are comparing it to the libraries listed below:
- Using social media to steer web archiving and curation.☆18Nov 20, 2015Updated 10 years ago
- Web archive index server based on RocksDB☆38Updated this week
- Collects multimedia content shared through social networks.☆19Feb 18, 2015Updated 11 years ago
- "Old SFM" -- manage rules and streams from social data sources, starting with twitter.☆86Aug 10, 2023Updated 2 years ago
- Trough: Big data, small databases.☆41Jul 25, 2024Updated last year
- Sort-friendly URI Reordering Transform (SURT) python module☆45Sep 11, 2025Updated 5 months ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆156Oct 8, 2025Updated 4 months ago
- a framework and language for exploring and analyzing feeds of social media data.☆23Jan 25, 2012Updated 14 years ago
- common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text☆35Sep 30, 2016Updated 9 years ago
- LINKED DATA QUALITY REPORTS☆41May 20, 2022Updated 3 years ago
- Site Hound (previously THH) is a Domain Discovery Tool☆23Feb 10, 2026Updated 3 weeks ago
- ☆10Jun 10, 2016Updated 9 years ago
- A semantic web crawler☆20Sep 20, 2010Updated 15 years ago
- This project deals with hierarchical classification of web pages based on dmoz dataset.☆14Apr 10, 2014Updated 11 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆47Dec 4, 2017Updated 8 years ago
- Network Defender Toolkit☆18Jun 11, 2013Updated 12 years ago
- A set of tools and experimental scripts used to achieve multimodal learning with nonnegative matrix factorization (NMF).☆18Jul 22, 2016Updated 9 years ago
- A platform for collecting, analyzing, and visualizing social media data.☆13Dec 27, 2020Updated 5 years ago
- A re-useable, stand-alone version of LittleSis network storytelling tool☆12Jan 30, 2016Updated 10 years ago
- ☆11Nov 21, 2025Updated 3 months ago
- Graphical analysis of PDF structure.☆13Jan 9, 2017Updated 9 years ago
- Browser-based annotation tool for Framenet☆16Jan 27, 2015Updated 11 years ago
- Embedr.eu - Image Embedding Service (IES) with support for IIIF, OEmbed, zoomable viewer in an iFrame☆15Dec 5, 2015Updated 10 years ago
- Convert real-time bidding (RTB) models to the AppNexus Bonsai language☆15Oct 17, 2017Updated 8 years ago
- A PHP class that examines websites to learn about the software used.☆22Oct 1, 2020Updated 5 years ago
- A recommender system for GitHub repositories☆14Jun 21, 2014Updated 11 years ago
- Tweets annotated with coarse-grained sense labels (supersenses)☆13Jun 13, 2014Updated 11 years ago
- Presentation for the NYU Data Lab December 2015☆14Dec 2, 2015Updated 10 years ago
- HbbTV Application Template☆18Nov 13, 2014Updated 11 years ago
- A set of distinct value estimators that give probabilistic bounds on a set's cardinality☆22Dec 9, 2019Updated 6 years ago
- Parallelized web crawler written in Golang☆15Oct 2, 2018Updated 7 years ago
- JavaScript based graph visualization library with emphasis on customization and modularity.☆13Mar 21, 2019Updated 6 years ago
- Archive Research Services Workshop☆31Sep 29, 2017Updated 8 years ago
- Check out https://github.com/webrecorder/webrecorder for newer version matching https://webrecorder.io☆38Oct 16, 2015Updated 10 years ago
- Numeric Fu for the command line☆110Oct 2, 2020Updated 5 years ago
- A design prototype for DocNow to learn with☆14Apr 8, 2017Updated 8 years ago
- ReproZip for the Preservation of Web Applications☆17May 6, 2024Updated last year
- ☆23Mar 7, 2015Updated 10 years ago
- ❗ This repository is no longer maintained ❗☆15May 29, 2020Updated 5 years ago