apache / tika-docker
Convenience Docker images for Apache Tika Server
☆137 Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for tika-docker
- Apache Tika Server as a Docker Image ☆170 Updated 2 years ago
- ☆162 Updated 3 weeks ago
- PDF to XML ALTO file converter ☆216 Updated 2 months ago
- Official Dockerfile for Apache Solr ☆27 Updated last month
- GROBID extension for identifying and normalizing physical quantities. ☆75 Updated 2 months ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆370 Updated 3 months ago
- Benchmarking PDF libraries ☆226 Updated last year
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆74 Updated 2 years ago
- A component orchestration engine ☆27 Updated 11 months ago
- Demonstration of searching PDF document with Solr, Tika, and Tesseract ☆30 Updated last month
- Weaviate Web UI ☆22 Updated last year
- A Helm chart to deploy Apache Tika on Kubernetes. ☆26 Updated last week
- Python bindings to PDFium ☆427 Updated 3 weeks ago
- Apache Tika Server with Tesseract 4 Docker Setup ☆21 Updated 3 years ago
- A suite of Machine Learning / Deep Learning Dockerfiles to allow Apache Tika to extract objects and to produce textual captions for image… ☆21 Updated 5 months ago
- Elasticsearch/Solr Sandbox for exploring explain information and tweaking ☆135 Updated 8 months ago
- Tools to construct and process webgraphs from Common Crawl data ☆80 Updated this week
- A spaCy wrapper for GliNER ☆91 Updated 4 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆289 Updated 6 months ago
- Software that makes labeling PDFs easy. ☆391 Updated 6 months ago
- 🦦 weasel: A small and easy workflow system ☆67 Updated 4 months ago
- A high performance bibliographic information service: https://biblio-glutton.readthedocs.io ☆127 Updated 2 months ago
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ☆298 Updated last year
- A Python library to chunk/group your texts based on semantic similarity. ☆85 Updated 4 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆126 Updated 3 weeks ago
- ☆585 Updated 3 weeks ago
- A machine learning tool for fishing entities ☆248 Updated last week
- A curated list of awesome data annotation tools ☆194 Updated 2 years ago
- Logical structure analysis for visually structured documents ☆83 Updated 2 years ago
- Open Source REST API for named entity extraction, named entity linking, named entity disambiguation, recommendation & reconciliation of e… ☆180 Updated 2 years ago