jvanz / libwarc
C++ library to parse WARC files
☆12 Updated 5 years ago
Related projects
Alternatives and complementary repositories for libwarc
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆24 Updated 7 years ago
- Webrecorder's DevTools Protocol Automation Library ☆17 Updated 2 years ago
- Trough: Big data, small databases. ☆40 Updated 3 months ago
- Corpus Build OCR platform ☆8 Updated last year
- A queue-controlled browser automation tool for improving web crawl quality ☆60 Updated 4 years ago
- This repository contains a tool and a dataset of collections for detecting off-topic pages in Web archive collections. ☆18 Updated 9 years ago
- Code to remove "noise" from the hOCR output of Tesseract OCR. ☆14 Updated 8 years ago
- Tools to analyze web archives ☆20 Updated 8 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆42 Updated 6 years ago
- A C++ library implementing fast language model estimation using the 1-Sort algorithm. ☆17 Updated last year
- RESTful API around the PETRARCH coding software ☆10 Updated 3 years ago
- Stoplists for African languages generated from the ASP corpus ☆14 Updated 8 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 Updated 2 years ago
- An index data structure for approximate string search. ☆23 Updated 5 years ago
- A PDF classifier ensemble with REST API service ☆23 Updated 3 years ago
- R library for common information retrieval metrics ☆13 Updated last year
- Examples of corrupt CSV files and how they trick various parsers ☆10 Updated 8 years ago
- An interactive 3D web viewer of up to million points on one screen that represent data. Provides interaction for viewing high-dimensional… ☆26 Updated 6 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 Updated 2 years ago
- WASAPI data transfer APIs ☆42 Updated 2 years ago
- A place to collect and share knowledge about liberating data from PDFs ☆53 Updated 2 years ago
- A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (me… ☆15 Updated 3 years ago
- Anytime Ranking for Impact-Ordered Indexes ☆12 Updated 7 years ago
- Sort-friendly URI Reordering Transform (SURT) python module ☆40 Updated 3 months ago
- Tools for working with Optical Character Recognition output ☆16 Updated 10 years ago