measuresforjustice / textricator
Textricator is a tool to extract text from documents and generate structured data.
☆348Updated 4 months ago
Alternatives and similar repositories for textricator
Users who are interested in textricator are comparing it to the libraries listed below.
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N…☆270Updated 2 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)☆191Updated 2 months ago
- A cross-platform command line tool for parallelised content extraction and analysis.☆247Updated last week
- Ergonomic line-by-line transcription of scanned text.☆53Updated 4 years ago
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!☆292Updated 2 months ago
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based☆326Updated last year
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools…☆294Updated 3 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.☆395Updated 11 months ago
- Run Overview on your own system☆125Updated 4 years ago
- A simple OpenRefine reconciliation service that runs on top of a CSV file☆121Updated 10 years ago
- Shell script to run OpenRefine in batch mode (import, transform, export). It orchestrates OpenRefine (server) and a python client that co…☆86Updated last year
- Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Tex…☆1,059Updated 3 months ago
- Data Curator - share usable open data☆276Updated 3 years ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.☆144Updated last year
- A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR …☆65Updated last year
- Web based JavaScript GUI library for proofreading/editing hOCR☆95Updated 6 years ago
- Ocular is a state-of-the-art historical OCR system.☆264Updated last year
- ☆210Updated 4 years ago
- ☆102Updated 3 weeks ago
- An online annotation platform for teaching and learning in the humanities.☆108Updated last week
- PDF to XML ALTO file converter☆248Updated 3 weeks ago
- python app/framework for 'all things ISBN' including metadata, descriptions, covers...☆229Updated 2 years ago
- A framework for creating web-based knowledge maps☆204Updated this week
- Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format☆46Updated 4 months ago
- The hOCR Embedded OCR Workflow and Output Format☆74Updated 11 months ago
- Fast PDF generation and compression. Deals with millions of pages daily.☆118Updated last month
- ALTO XML schema - latest and all former versions☆53Updated last year
- The smart and simple way to automate document assembly☆408Updated 7 years ago
- A scraping command line tool for the modern web☆261Updated 8 years ago
- Open Semantic Visual Linked Data Graph Explorer: Open Source tool (web app) and user interface (UI) for discovery, exploration and visuali…☆84Updated 5 years ago