ShayHill / docx2python
Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
☆164 Updated this week
Related projects
Alternatives and complementary repositories for docx2python
- 80x faster and 95% accurate language identification with Fasttext ☆138 Updated 9 months ago
- Python bindings to PDFium ☆419 Updated last week
- Python binding to Poppler-cpp pdf library ☆97 Updated 2 months ago
- Streamlit PDF viewer ☆107 Updated 2 weeks ago
- A fast and lightweight pure Python library for splitting text into semantically meaningful chunks. ☆172 Updated 3 months ago
- Demos, examples and utilities using PyMuPDF ☆566 Updated 4 months ago
- A Python tool to help extract information from structured PDFs. ☆379 Updated last week
- Simplify DOCX files to JSON ☆219 Updated last month
- Convert html to docx ☆73 Updated 4 months ago
- ☆160 Updated 2 weeks ago
- Benchmarking PDF libraries ☆222 Updated last year
- Logical structure analysis for visually structured documents ☆82 Updated 2 years ago
- Python API for PDF documents ☆116 Updated 2 months ago
- A pure Python-based utility to extract text and images from docx files. ☆512 Updated last year
- Software that makes labeling PDFs easy. ☆390 Updated 5 months ago
- The Python docx package cannot read paragraphs, tables and images in document order. It can only render all the paragraphs at once or all… ☆72 Updated 7 months ago
- A Docker-powered service for PDF document layout analysis. This service provides a powerful and flexible PDF analysis service. The servic… ☆170 Updated last week
- A Python asyncio wrapper for Tesseract-OCR. ☆20 Updated 2 weeks ago
- Adobe PDFServices python SDK Samples ☆131 Updated this week
- Pure-Python full-text search library ☆580 Updated 10 months ago
- Run OCR, extract information from documents and classify them. In addition, annotate documents and build custom NLP and computer vision m… ☆62 Updated this week
- Python3 bindings for the Compact Language Detector v3 (CLD3) ☆149 Updated last year
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆173 Updated last year
- Python interface to Apache PDFBox command-line tools. ☆75 Updated last year
- Streamlit Named Entity Recognition (NER) annotation custom component ☆39 Updated 2 years ago
- `pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author. ☆100 Updated 7 months ago
- A Python library for extracting text from PDFs without losing the formatting of the PDF content. ☆72 Updated 2 years ago
- DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis ☆265 Updated last year
- Streamlit Annotation Tools is a Streamlit component that gives you access to various annotation tools (labeling, highlighting, etc.) for … ☆77 Updated 10 months ago
- ☆25 Updated last year