JonathanLink / PDFLayoutTextStripper
Converts a PDF file into a text file while preserving the layout of the original PDF. Useful, for instance, for extracting the content of a table in a PDF file. It is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
☆1,581 Updated last year
Alternatives and similar repositories for PDFLayoutTextStripper:
Users who are interested in PDFLayoutTextStripper are comparing it to the libraries listed below:
- Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip. ☆1,272 Updated 4 years ago
- (Java) A Method to Extract Tabular Content from PDF Files ☆332 Updated last year
- Extract tables from PDF files ☆1,876 Updated last month
- 🤖 A Node queue API for generating PDFs using headless Chrome. Comes with a CLI, S3 storage and webhooks for notifying subscribers about … ☆2,628 Updated 10 months ago
- The magic of Google Autocomplete while you're typing. Anywhere. ☆1,540 Updated last year
- Generates a quiz for a Wikipedia page using parts of speech and text chunking. ☆804 Updated 4 years ago
- The Berkeley Document Summarizer is a learning-based, single-document summarization system that extracts source document content, exploit… ☆741 Updated 5 years ago
- Capture website screenshots with optional device and network emulation as jpg, png or pdf (with web fonts!) using Electron / Chrome. ☆548 Updated 3 years ago
- Lightweight time management CLI tool ☆363 Updated 8 years ago
- A business card in LaTeX. ☆683 Updated 2 years ago
- Scan, index, and archive all of your paper documents (acquired by Mayan EDMS) ☆2,562 Updated 6 years ago
- Uses Microsoft Computer Vision API to caption images in an HTML file and fills out its alternative text attributes with the related capti… ☆623 Updated 7 years ago
- A template for self-hosted bookmarks using HTML & jQuery. ☆662 Updated 5 years ago
- Send your stdin to google sheets ☆546 Updated 4 years ago
- Using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes. ☆143 Updated last year
- Make a self hosted OpenVPN server in 15 minutes ☆808 Updated 7 years ago
- Problem Solving ☆900 Updated 5 years ago
- Fast C based HTML 5 parsing for python ☆682 Updated 4 months ago
- Python script to do PDF OCR conversion using Tesseract ☆373 Updated last year
- +2600 developer-related blogs and publications. ☆636 Updated 7 years ago
- 👨🏭 Set up your Linux server with plain shell scripts ☆1,173 Updated 3 years ago
- Neural network OCR. ☆1,129 Updated 8 years ago
- A PDF comparison utility in Python. ☆463 Updated last month
- Evaluating the performance and accuracy of ABBYY FineReader's OCR on Senate Financial Disclosure scanned forms ☆130 Updated 8 years ago
- Run your own OCR-as-a-Service using Tesseract and Docker ☆1,349 Updated last year
- Visualisation Markdown ☆663 Updated 2 years ago
- Population based metaheuristic for password cracking. Siga (Simple genetic algorithm) ☆414 Updated 7 years ago
- Personal document manager (Linux/Windows) -- Moved to Gnome's Gitlab ☆2,432 Updated 6 years ago
- Client library for Minimal Chat ☆677 Updated 2 years ago
- A simple browser/client-side web scraper. ☆241 Updated 7 years ago