jvanz / libwarc
C++ library to parse WARC files
☆12 Updated 5 years ago
Alternatives and similar repositories for libwarc:
Users who are interested in libwarc are comparing it to the libraries listed below.
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆25 Updated 7 years ago
- Sort-friendly URI Reordering Transform (SURT) Python module ☆41 Updated 5 months ago
- Serving content from a WARC ☆61 Updated 12 years ago
- Trough: Big data, small databases. ☆40 Updated 5 months ago
- A queue-controlled browser automation tool for improving web crawl quality ☆60 Updated 4 years ago
- Inverted file indexing and retrieval optimized for short texts. Supports auto-suggest and query segment classification. ☆33 Updated last year
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆43 Updated 7 years ago
- Mad (╯°□°)╯'ing ☆10 Updated 2 years ago
- A platform for collecting, analyzing, and visualizing social media data. ☆12 Updated 4 years ago
- WASAPI data transfer APIs ☆43 Updated 2 years ago
- Code to remove "noise" from hOCR output of Tesseract OCR. ☆14 Updated 8 years ago
- Extract the difference between two HTML pages ☆32 Updated 6 years ago
- A component that tries to avoid downloading duplicate content ☆27 Updated 6 years ago
- A workflow system for Natural Language Processing. ☆21 Updated 5 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆33 Updated 3 years ago
- Globally optimal geometric matching. ☆10 Updated 8 years ago
- RESTful API around the PETRARCH coding software ☆10 Updated 3 years ago
- C++11 library for fast fuzzy searching ☆14 Updated 9 years ago
- A PDF classifier ensemble with REST API service ☆23 Updated 3 years ago
- Machine assisted dossiers ☆19 Updated 7 years ago
- This repository contains a tool and a collections dataset for detecting off-topic pages in Web archive collections. ☆18 Updated 9 years ago
- ☆11 Updated last year
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 3 years ago
- ☆24 Updated 9 years ago
- Examples of corrupt CSV files and how they trick various parsers ☆10 Updated 8 years ago
- Datasette plugin for serving media based on a SQL query ☆19 Updated 2 years ago
- Stoplists for African languages generated from the ASP corpus ☆14 Updated 9 years ago
- Tools for exploring the contents of web archive files. ☆39 Updated 4 years ago
- A dockerized, queued high fidelity web archiver based on Squidwarc ☆56 Updated 6 months ago