jvanz / libwarc
C++ library to parse WARC files
☆11Updated 6 years ago
Alternatives and similar repositories for libwarc
Users that are interested in libwarc are comparing it to the libraries listed below
- Classifying the content of domains☆56Updated 2 years ago
- A PDF classifier ensemble with REST API service☆23Updated 4 years ago
- A place to collect and share knowledge about liberating data from PDFs☆54Updated 3 years ago
- Tools to work with the Google DNS over HTTPS API in R☆25Updated 5 years ago
- extract difference between two html pages☆32Updated 7 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Updated 7 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)☆25Updated 7 years ago
- ☆11Updated 6 years ago
- Command line tool to convert spreadsheets to databases, made for the UK's Office for National Statistics.☆80Updated last year
- Site Hound (previously THH) is a Domain Discovery Tool☆23Updated 4 years ago
- Using your audience as a hive mind for deep learning☆18Updated 6 years ago
- A queue-controlled browser automation tool for improving web crawl quality☆61Updated 4 months ago
- A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (me…☆14Updated 3 years ago
- Trough: Big data, small databases.☆42Updated last year
- Fast filtering and animation of large dynamic networks☆39Updated 9 years ago
- Data on newspaper presidential endorsements☆30Updated 4 years ago
- Mad (╯°□°)╯'ing☆10Updated 2 years ago
- JavaScript based graph visualization library with emphasis on customization and modularity.☆13Updated 6 years ago
- scraper for facebook, gab, google and tiktok☆21Updated last month
- Machine assisted dossiers☆19Updated 7 years ago
- Source code that reproduces the results from the paper "Who Let The Trolls Out? Towards Understanding State-Sponsored Trolls" (https://ar…☆20Updated 6 years ago
- Download data on all of Donald Trump's (@realDonaldTrump) tweets☆42Updated 6 years ago
- code to remove "noise" from hOCR output of Tesseract OCR.☆14Updated 8 years ago
- PageOneX. Analyzing front pages☆52Updated 8 months ago
- Dexter document monitor for MMA☆17Updated last year
- twitter archives of political figures☆81Updated 8 years ago
- R client for the Virustotal Public API. Virustotal is a Google service that analyzes files and URLs for viruses etc.☆12Updated 2 years ago
- A script for rapidly sampling a proportion of lines from a file☆19Updated 10 years ago
- A tool for the geospatial analysis, literary network visualization, and plot mapping of ancient texts☆14Updated 6 years ago
- A repository of materials for a proposed class on automated story bots.☆49Updated 6 years ago