Unstructured-IO / pipeline-paddleocr
Pipeline for converting PDFs to raw text with PaddleOCR
☆23Updated 2 years ago
Alternatives and similar repositories for pipeline-paddleocr
Users who are interested in pipeline-paddleocr are comparing it to the libraries listed below.
- Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provi…☆41Updated 10 months ago
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip…☆80Updated this week
- Open-source observability for your LLM application.☆53Updated last year
- ☆201Updated this week
- Self-host llmapi server, making it really easy to access LLMs!☆37Updated 2 years ago
- Data extraction with Donut ML model☆57Updated last year
- ☆40Updated 2 years ago
- An open-source, cloud-native serving framework for large multi-modal models (LMMs).☆165Updated 2 years ago
- Repository for deepdoctection tutorial notebooks☆50Updated last month
- A set of tools to create synthetically-generated data from documents☆39Updated 5 months ago
- hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six☆198Updated last year
- Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.☆29Updated 2 years ago
- Private ChatGPT/Perplexity. Securely unlocks knowledge from confidential business information.☆77Updated last year
- Keyword Extraction and Analysis Pipeline & Application with KeyBERT and Taipy☆17Updated 2 years ago
- Demo example of consumer goods categorization☆30Updated 2 years ago
- Takes normal text as input and generates SQL commands using OpenAI's GPT-3☆15Updated 5 years ago
- Excel spreadsheet crawler and table parser for data extraction and querying☆164Updated 11 months ago
- Demo app with Loguru logging, async middleware to generate X-request-Id. Works with Gunicorn or Uvicorn, and is safe to use with async/th…☆10Updated 4 years ago
- The code for the Sales Dashboard demo☆16Updated 8 months ago
- Multimodal RAG with PyMuPDF☆43Updated last year
- 🛤️ Pathik - High-Performance Web Crawler ⚡☆30Updated 10 months ago
- Docling core data types and transformations☆225Updated this week
- A Prodigy plugin for PDF annotation☆37Updated 6 months ago
- ☆20Updated last year
- This is a proof-of-concept of using an LLM to find and extract meaningful data without parsing the html too much.☆30Updated 2 years ago
- DocLLM: A layout-aware generative language model for multimodal document understanding☆137Updated 2 years ago
- Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized on PDFs and HTML files. Extensive support of tabular…☆83Updated last month
- A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GRO…☆53Updated 10 months ago
- A library to extract the main content from html. Developed for information on LLM and for feeding data into LangChain and LlamaIndex.☆52Updated last year
- An enterprise-grade AI retriever designed to streamline AI integration into your applications, ensuring cutting-edge accuracy.☆292Updated 7 months ago