apache / tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
☆2,985 Updated this week
Alternatives and similar repositories for tika
Users who are interested in tika are comparing it to the libraries listed below.
- Mirror of Apache PDFBox ☆2,820 Updated last week
- Elasticsearch File System Crawler (FS Crawler) ☆1,394 Updated this week
- Apache Nutch is an extensible and scalable web crawler ☆3,013 Updated last month
- Mirror of Apache POI ☆2,016 Updated this week
- Apache Lucene and Solr open-source search software ☆4,379 Updated 7 months ago
- Apache OpenNLP ☆1,509 Updated this week
- Apache Solr open-source search software ☆1,382 Updated last week
- Apache NiFi ☆5,319 Updated this week
- iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with … ☆2,099 Updated this week
- Logstash - transport and process your logs, events, or other data ☆14,483 Updated this week
- Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community. ☆1,586 Updated last month
- Apache Lucene open-source search software ☆2,965 Updated this week
- Java JNA wrapper for Tesseract OCR API ☆1,672 Updated 3 months ago
- JODConverter automates document conversions using LibreOffice or Apache OpenOffice. ☆1,473 Updated this week
- JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files ☆2,207 Updated last week
- Apache ActiveMQ Classic ☆2,351 Updated last week
- Apache Drill is a distributed MPP query layer for self describing data ☆1,972 Updated last week
- Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or produ… ☆5,828 Updated this week
- Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on… ☆6,325 Updated this week
- Capturing JVM- and application-level metrics. So you know what's going on. ☆7,847 Updated last week
- Drools is a rule engine, DMN engine and complex event processing (CEP) engine for Java. ☆6,021 Updated this week
- MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap-memory. It is a fast and easy to use embedded Java dat… ☆4,975 Updated 11 months ago
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆2,969 Updated this week
- Convenience Docker images for Apache Tika Server ☆188 Updated last month
- Code for Quartz Scheduler ☆6,499 Updated 3 weeks ago
- documents4j is a Java library for converting documents into another document format ☆573 Updated 3 months ago
- jOOQ is the best way to write SQL in Java ☆6,377 Updated last week
- Elasticsearch Java Rest Client. ☆2,116 Updated 2 years ago
- JSON to JSON transformation library written in Java. ☆1,609 Updated 9 months ago
- This is mavenised Luke: Lucene Toolbox Project ☆1,544 Updated 5 years ago