jvanz / libwarc
C++ library to parse WARC files
☆12 Updated 6 years ago
Alternatives and similar repositories for libwarc:
Users who are interested in libwarc are comparing it to the libraries listed below.
- Code to remove "noise" from hOCR output of Tesseract OCR. ☆14 Updated 8 years ago
- Data science tools from Moz ☆22 Updated 8 years ago
- Tools for working with Optical Character Recognition output ☆16 Updated 11 years ago
- A C++ library implementing fast language models estimation using the 1-Sort algorithm. ☆17 Updated last year
- An index data structure for approximate string search. ☆23 Updated 5 years ago
- A queue-controlled browser automation tool for improving web crawl quality ☆60 Updated this week
- Inverted file indexing and retrieval optimized for short texts. Supports auto-suggest and query segment classification. ☆33 Updated last year
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 Updated 7 years ago
- An efficient data structure for fast string similarity searches ☆22 Updated 4 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆25 Updated 7 years ago
- Trough: Big data, small databases. ☆40 Updated 7 months ago
- Images of Text to Text: Call Tesseract from Python and OCR a directory of PDFs ☆15 Updated 5 years ago
- Extract the differences between two HTML pages ☆32 Updated 6 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- A PDF classifier ensemble with REST API service ☆23 Updated 4 years ago
- Google Books word frequencies for words in the CMU Pronunciation Dictionary ☆14 Updated 7 years ago
- Linked SDMX ☆17 Updated 10 years ago
- A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of ling… ☆15 Updated last year
- Mad (╯°□°)╯'ing ☆10 Updated 2 years ago
- Serving content from a WARC ☆61 Updated 12 years ago
- C++11 library for fast fuzzy searching ☆14 Updated 9 years ago
- Produce a stream of citation data coming off Wikimedia ☆12 Updated 7 years ago
- Wikipedia Data Analysis Toolkit ☆26 Updated 8 years ago
- Collects multimedia content shared through social networks. ☆19 Updated 10 years ago
- PLOS Subject Area Thesaurus ☆40 Updated 4 months ago
- Documentation and research output for Depsy (see https://github.com/impactstory/depsy for source of Depsy itself) ☆22 Updated 8 years ago
- Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images. ☆24 Updated 9 years ago