jvanz / libwarc
C++ library to parse WARC files
☆11 Updated 6 years ago
Alternatives and similar repositories for libwarc
Users that are interested in libwarc are comparing it to the libraries listed below
- Open source software for image correlation, distance and analysis ☆61 Updated 2 years ago
- A queue-controlled browser automation tool for improving web crawl quality ☆63 Updated 4 months ago
- Esper instance for TV news analysis ☆40 Updated 3 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆25 Updated 8 years ago
- ☆11 Updated 6 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 8 years ago
- Trough: Big data, small databases. ☆40 Updated last year
- A component that tries to avoid downloading duplicate content ☆27 Updated 7 years ago
- Fast filtering and animation of large dynamic networks ☆39 Updated 9 years ago
- Exploring internet domain names with deep learning using vector embeddings ☆20 Updated 7 years ago
- mltk - Moz Language Tool Kit ☆12 Updated 10 years ago
- Simhashing in C++ ☆135 Updated 2 years ago
- Implementation of perceptual image hash calculation in Python ☆133 Updated 2 years ago
- extract difference between two html pages ☆32 Updated 7 years ago
- A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (me… ☆14 Updated 4 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 4 years ago
- scraper for facebook, gab, google and tiktok ☆21 Updated 6 months ago
- Serving content from a WARC ☆62 Updated 12 years ago
- Download *ALL* the submissions from Hacker News ☆51 Updated 11 years ago
- Word lists for analyzing media reporting ☆23 Updated 7 years ago
- Sort-friendly URI Reordering Transform (SURT) python module ☆44 Updated 3 months ago
- Classifying the content of domains ☆58 Updated 3 months ago
- Browsertrix: Containerized High-Fidelity Browser-Based Automated Crawling + Behavior System ☆87 Updated 4 years ago
- Mad (╯°□°)╯'ing ☆10 Updated 3 years ago
- Tools to work with the Google DNS over HTTPS API in R ☆24 Updated 5 years ago
- Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation fo… ☆106 Updated 2 years ago
- Locality-sensitive hashing algorithm for text similarity comparisons ☆59 Updated 8 months ago
- Grabbing all news. ☆62 Updated 5 years ago
- Search for similar short strings ☆53 Updated 5 years ago
- A PDF classifier ensemble with REST API service ☆23 Updated 4 years ago