chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

☆1,661

Alternatives and similar repositories for tika-python

Users that are interested in tika-python are comparing it to the libraries listed below.

chrismattmann / tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
☆108 · Jun 2, 2026 · Updated last month
fedelemantuano / tika-app-python
Python bindings for Apache Tika
☆24 · Aug 20, 2020 · Updated 5 years ago
deanmalmgren / textract
extract text from any document. no muss. no fuss.
☆4,655 · Updated this week
apache / tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
☆3,843 · Updated this week
LogicalSpark / docker-tikaserver
Apache Tika Server as a Docker Image
☆172 · Jul 17, 2022 · Updated 3 years ago
aptivate / python-tika
Python wrapper for Apache Tika, made to be easy_installed
☆26 · Apr 17, 2012 · Updated 14 years ago
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,001 · Mar 13, 2026 · Updated 3 months ago
explosion / spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
☆33,732 · May 19, 2026 · Updated last month
euske / pdfminer
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
☆5,285 · Dec 7, 2022 · Updated 3 years ago
py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
☆10,116 · Jun 30, 2026 · Updated last week
flairNLP / flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
☆14,380 · Oct 27, 2025 · Updated 8 months ago
chrismattmann / nutch-python
Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.
☆39 · Apr 15, 2016 · Updated 10 years ago
seatgeek / fuzzywuzzy
Fuzzy String Matching in Python
☆9,260 · Feb 24, 2023 · Updated 3 years ago
chartbeat-labs / textacy
NLP, before and after spaCy
☆2,241 · Sep 22, 2023 · Updated 2 years ago
piskvorky / gensim
Topic Modelling for Humans
☆16,461 · Nov 1, 2025 · Updated 8 months ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,513 · Jun 17, 2026 · Updated 3 weeks ago
chrismattmann / trec-dd-polar
A dataset downloaded from the deep and scientific web across three major Polar data centers for use in research.
☆13 · Sep 8, 2017 · Updated 8 years ago
snorkel-team / snorkel
A system for quickly generating training data with weak supervision
☆5,987 · Jun 8, 2026 · Updated last month
chezou / tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
☆2,314 · Dec 5, 2024 · Updated last year
deepset-ai / haystack
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and a…
☆25,836 · Updated this week
jalan / pdftotext
☆1,062 · Jun 28, 2026 · Updated 2 weeks ago
vi3k6i5 / flashtext
Extract Keywords from sentence or Replace keywords in sentences.
☆5,713 · Apr 13, 2025 · Updated last year
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆10,196 · Updated this week
spotify / luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, vis…
☆18,745 · Jul 1, 2026 · Updated last week
USCDataScience / NLTKRest
This is a REST Server endpoint built using Flask and Python.
☆24 · Nov 16, 2022 · Updated 3 years ago
khundman / marve
For extracting measurements and related entities from text
☆58 · May 6, 2020 · Updated 6 years ago
pcodding / hadoop_ctakes
Hadoop integration code for working with Apache cTAKES
☆10 · Feb 11, 2014 · Updated 12 years ago
atlanhq / camelot
Camelot: PDF Table Extraction for Humans
☆3,716 · Jan 5, 2023 · Updated 3 years ago
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,705 · Jul 2, 2026 · Updated last week
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,773 · Jul 4, 2026 · Updated last week
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,755 · Aug 15, 2024 · Updated last year
doccano / doccano
Open source annotation tool for machine learning practitioners.
☆10,691 · Apr 14, 2026 · Updated 2 months ago
miso-belica / sumy
Module for automatic summarization of text documents and HTML pages.
☆3,693 · Jun 23, 2026 · Updated 2 weeks ago
allenai / allennlp
An open-source NLP research library, built on PyTorch.
☆11,886 · Nov 22, 2022 · Updated 3 years ago
facebookresearch / pytext
A natural language modeling framework based on PyTorch
☆6,296 · Oct 17, 2022 · Updated 3 years ago
dask / dask
Parallel computing with task scheduling
☆13,860 · Jul 1, 2026 · Updated last week
chrismattmann / etllib
This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading …
☆18 · Jan 27, 2024 · Updated 2 years ago
JohnSnowLabs / spark-nlp
State of the Art Natural Language Processing
☆4,140 · Updated this week
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆4,985 · Updated this week