measuresforjustice / textricator
Textricator is a tool to extract text from documents and generate structured data.
☆345 · Updated 9 months ago
Related projects:
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆254 · Updated last year
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆176 · Updated last month
- Scripts and results from our OCR roundup, available on Source ☆150 · Updated 5 years ago
- Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Tex… ☆957 · Updated last year
- A self-hosted search engine for documents. ☆585 · Updated this week
- Web based JavaScript GUI library for proofreading/editing hOCR ☆89 · Updated 6 years ago
- web interface for recoll desktop search ☆268 · Updated 4 years ago
- Ergonomic line-by-line transcription of scanned text. ☆47 · Updated 3 years ago
- A post-processing tool for scanned sheets of paper. ☆1,010 · Updated 2 months ago
- High-level build project for all LAPDF-Text submodules ☆103 · Updated 9 years ago
- A cross-platform command line tool for parallelised content extraction and analysis. ☆237 · Updated last week
- Extract tables from PDF pages. ☆274 · Updated 4 years ago
- An application that brings humanities research methods to data visualization. ☆170 · Updated 3 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆363 · Updated last month
- PDF to XML ALTO file converter ☆209 · Updated this week
- A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip! ☆266 · Updated 7 months ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆215 · Updated 4 years ago
- tool for collectively summarizing large discussions ☆141 · Updated last year
- A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools… ☆294 · Updated 2 years ago
- Run Overview on your own system ☆124 · Updated 3 years ago
- A framework for creating web-based knowledge maps ☆194 · Updated this week
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ☆290 · Updated 11 months ago
- Lightweight web scraping toolkit for documents and structured data. ☆311 · Updated 8 months ago
- A web interface to extract tabular data from PDFs ☆1,557 · Updated 4 months ago
- An expandable and scalable OCR pipeline ☆86 · Updated 6 years ago
- Extract tables from PDF files ☆354 · Updated 8 years ago
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆28 · Updated 9 months ago
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 · Updated 6 years ago
- Industry supported, open source PDF/A validation library ☆270 · Updated 2 weeks ago