measuresforjustice / textricator
Textricator is a tool to extract text from documents and generate structured data.
☆345 Updated 3 months ago
Alternatives and similar repositories for textricator
Users who are interested in textricator are comparing it to the libraries listed below.
- Extract tables from PDF pages. ☆292 Updated 5 years ago
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆294 Updated 3 years ago
- Scripts and results from our OCR roundup, available on Source ☆150 Updated 6 years ago
- A post-processing tool for scanned sheets of paper. ☆82 Updated last year
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 7 years ago
- PDF to XML ALTO file converter ☆244 Updated 2 weeks ago
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆268 Updated 2 years ago
- The hOCR Embedded OCR Workflow and Output Format ☆73 Updated 10 months ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆189 Updated last month
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- An online annotation platform for teaching and learning in the humanities. ☆108 Updated this week
- A cross-platform command line tool for parallelised content extraction and analysis. ☆245 Updated 3 weeks ago
- A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR … ☆65 Updated last year
- Ergonomic line-by-line transcription of scanned text. ☆52 Updated 4 years ago
- A framework for creating web-based knowledge maps ☆204 Updated this week
- A simple OpenRefine reconciliation service that runs on top of a CSV file ☆120 Updated 9 years ago
- ☆136 Updated 2 years ago
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆450 Updated last year
- MOVED TO https://gitlab.com/crossref/pdfextract ☆509 Updated 7 years ago
- Shell script to run OpenRefine in batch mode (import, transform, export). It orchestrates OpenRefine (server) and a python client that co… ☆85 Updated last year
- Extract tables from PDF files ☆357 Updated 9 years ago
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! ☆290 Updated last month
- Automatic alignment of books between HathiTrust, Internet Archive, Google Books, etc. ☆35 Updated 2 months ago
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ☆322 Updated last year
- A curated list of resources around PDF files ☆134 Updated 10 months ago
- A post-processing tool for scanned sheets of paper. ☆1,084 Updated 11 months ago
- Industry supported, open source PDF/A validation library ☆292 Updated last week
- Ocular is a state-of-the-art historical OCR system. ☆263 Updated last year
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆31 Updated 6 months ago
- veraPDF GUI, CLI and installer ☆87 Updated this week