apache / tika-docker
Convenience Docker images for Apache Tika Server
☆126 Updated 2 months ago
Related projects:
- Apache Tika Server as a Docker Image ☆170 Updated 2 years ago
- Entity resolution for Elasticsearch. ☆156 Updated last month
- Demonstration of searching PDF document with Solr, Tika, and Tesseract ☆30 Updated last year
- Official Dockerfile for Apache Solr ☆26 Updated last week
- Towards an open source stack for e-commerce search ☆141 Updated this week
- GROBID extension for identifying and normalizing physical quantities. ☆72 Updated last week
- Apache Tika Server with Tesseract 4 Docker Setup ☆21 Updated 3 years ago
- A bundle of useful Elasticsearch plugins ☆110 Updated 5 months ago
- Improve your Elasticsearch, OpenSearch, Solr, Vectara, Algolia and Custom Search search quality. ☆279 Updated this week
- PDF to XML ALTO file converter ☆209 Updated this week
- Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms. ☆368 Updated last week
- A basic tool that extracts the structure from the PDF files of scientific articles. ☆70 Updated 2 years ago
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆254 Updated last year
- A natural language search microservice ☆95 Updated 3 years ago
- A curated list of resources around PDF files ☆89 Updated last month
- Open Source, Distributed, Big Data Enterprise Search Engine ☆68 Updated this week
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 6 years ago
- Elasticsearch/Solr Sandbox for exploring explain information and tweaking ☆135 Updated 6 months ago
- 🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro. ☆276 Updated 3 years ago
- Ergonomic line-by-line transcription of scanned text. ☆47 Updated 3 years ago
- Question Answering annotation platform - Plateforme d'annotation ☆87 Updated 3 years ago
- Index Common Crawl archives in tabular format ☆105 Updated last week
- Tesseract 4 OCR Runtime Environment - Docker Container ☆97 Updated 5 years ago
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆100 Updated 4 months ago
- Elasticsearch lemmatizer for 15 languages ☆104 Updated 3 months ago
- 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ☆290 Updated 11 months ago
- A spaCy wrapper for GliNER ☆77 Updated 2 months ago
- A machine learning tool for fishing entities ☆239 Updated last week
- LexPredict Legal Dictionaries ☆107 Updated 2 years ago
- A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test… ☆61 Updated last month