jvanz / libwarc
C++ library to parse WARC files
☆11 · Updated 7 years ago
Alternatives and similar repositories for libwarc
Users that are interested in libwarc are comparing it to the libraries listed below
- A queue-controlled browser automation tool for improving web crawl quality ☆64 · Updated 5 months ago
- Fast filtering and animation of large dynamic networks ☆39 · Updated 9 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Updated 8 years ago
- A LevelDB backed URL unshortening microservice written in JavaScript ☆31 · Updated 3 years ago
- Python library for reading and writing warc files ☆247 · Updated 3 years ago
- A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (me… ☆14 · Updated 4 years ago
- mltk - Moz Language Tool Kit ☆12 · Updated 10 years ago
- C++ implementation of hamming distance algorithm HmSearch using Kyoto Cabinet ☆42 · Updated 9 years ago
- ☆11 · Updated 6 years ago
- ☆24 · Updated 10 years ago
- This repository contains tool and collections dataset for detecting off-topic pages from Web archived collections. ☆18 · Updated 10 years ago
- Trough: Big data, small databases. ☆41 · Updated last year
- A place to collect and share knowledge about liberating data from PDFs ☆55 · Updated 4 years ago
- A PDF classifier ensemble with REST API service ☆23 · Updated 4 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 · Updated 4 years ago
- Classifying the content of domains ☆58 · Updated 4 months ago
- Using your audience as a hive mind for deep learning ☆18 · Updated 7 years ago
- Simhashing in C++ ☆136 · Updated 2 years ago
- Sort-friendly URI Reordering Transform (SURT) python module ☆44 · Updated 4 months ago
- SimString ☆113 · Updated 4 years ago
- Mad (╯°□°)╯'ing ☆10 · Updated 3 years ago
- Weighted MinHash implementation on CUDA (multi-gpu). ☆121 · Updated 2 years ago
- Open-Source Information Retrieval Reproducibility Challenge ☆50 · Updated 10 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 · Updated 4 years ago
- Traptor -- A distributed Twitter feed ☆26 · Updated 3 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆25 · Updated 8 years ago
- Roaring Bitmap in Cython ☆82 · Updated last year
- Images of Text to Text: Call Tesseract from Python and OCR a directory of pdfs ☆16 · Updated 6 years ago
- (Mental) maps of texts with kernel density estimation and force-directed networks. ☆108 · Updated 10 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Updated 8 years ago