internetarchive / archive-pdf-tools
Fast PDF generation and compression. Deals with millions of pages daily.
☆112 · Updated 7 months ago
Alternatives and similar repositories for archive-pdf-tools:
Users who are interested in archive-pdf-tools are comparing it to the repositories listed below.
- Efficient hOCR tooling ☆42 · Updated last month
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆187 · Updated last month
- Master repository which includes most other OCR-D repositories as submodules ☆72 · Updated last month
- Conversions between various OCR formats ☆74 · Updated last year
- A post-processing tool for scanned sheets of paper. ☆80 · Updated last year
- smoothscan is a tool to convert scanned text into a vectorized output form. ☆67 · Updated 11 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆74 · Updated 7 months ago
- Automatic de-keystoning for single camera DIY book scanners. ☆48 · Updated 4 years ago
- Tools to process books in a cloud based pipeline system ☆57 · Updated last week
- Ergonomic line-by-line transcription of scanned text. ☆51 · Updated 4 years ago
- Documentation and use cases for ALTO XML ☆41 · Updated 6 years ago
- Web based JavaScript GUI library for proofreading/editing hOCR ☆95 · Updated 6 years ago
- An OCR evaluation tool ☆65 · Updated last month
- Industry-based resolutions for issues and errata reported against any PDF-related specification ☆69 · Updated last month
- Perseus Treebank Data ☆72 · Updated 9 months ago
- Specifications developed and maintained by the Webrecorder community. ☆128 · Updated 2 months ago
- OCR evaluation brought to you by University of Alicante ☆67 · Updated 2 years ago
- ☆42 · Updated 11 months ago
- Centralised repository for WARC usage specifications. ☆109 · Updated 4 months ago
- A list of things related to software, literature, and other content for 🕣 Memento ☆95 · Updated 9 months ago
- A Wikimedia Toolforge tool for exporting ebooks from Wikisources. ☆82 · Updated last week
- CDXJ Indexing of WARC/ARCs ☆25 · Updated 3 months ago
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆39 · Updated last week
- PhiloLogic4 ☆38 · Updated 3 months ago
- Comparing warc files ☆17 · Updated 6 years ago
- Convert Directories, Files and ZIP Files to Web Archives (WARC) ☆85 · Updated this week
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆55 · Updated 8 months ago
- Note: the repo has been moved to https://gitlab.com/readcoop/Transkribus/TranskribusCore ☆37 · Updated 4 years ago
- A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service ☆177 · Updated 5 months ago
- Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user ac… ☆52 · Updated last month