Unstructured-IO / pipeline-paddleocr
Pipeline for converting PDFs to raw text with PaddleOCR
☆23 Updated last year
Alternatives and similar repositories for pipeline-paddleocr
Users who are interested in pipeline-paddleocr are comparing it to the libraries listed below.
- Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provi… ☆36 Updated 2 months ago
- ☆22 Updated last month
- A library to extract the main content from HTML. Developed for information on LLM and for feeding data into LangChain and LlamaIndex. ☆39 Updated last year
- Create fast graph language models from converted PDF documents for knowledge extraction and Q&A. ☆50 Updated 3 months ago
- ☆22 Updated last year
- A Python library to define and validate data types in Docling. ☆134 Updated this week
- Open-source observability for your LLM application. ☆51 Updated 4 months ago
- A Python library to chunk/group your texts based on semantic similarity. ☆97 Updated 10 months ago
- A microframework for creating simple AI agents. ☆91 Updated 9 months ago
- A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GRO… ☆50 Updated 2 months ago
- Multimodal LLM Application with PyMuPDF4LLM ☆36 Updated 7 months ago
- A JS web client for connecting to Pipecat bots with voice and vision ☆44 Updated 4 months ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆126 Updated last year
- CLIP as a service - Embed image and sentences, object recognition, visual reasoning, image classification and reverse image search ☆61 Updated last year
- Elasticsearch integration into LangChain ☆56 Updated 3 months ago
- ☆122 Updated 2 months ago
- A simple Next.js frontend to explore your local Weaviate collections and data ☆21 Updated last month
- A repository for creating, and sample code for consuming an ONNX embedding model ☆30 Updated last year
- GLiNER model in a FastAPI microservice. ☆44 Updated 5 months ago
- TextEmbed is a REST API crafted for high-throughput and low-latency embedding inference. It accommodates a wide variety of embedding mode… ☆23 Updated 8 months ago
- Diff filtering, text mapping, and windowed transforms for LLM apps ☆15 Updated 3 weeks ago
- ☆57 Updated last year
- This project enhances the construction of RAG applications by addressing challenges, improving accessibility, scalability, and managing d… ☆143 Updated last year
- Data extraction with Donut ML model ☆57 Updated 9 months ago
- Nougat is Meta AI's revolutionary OCR model designed to transcribe scientific PDFs into an easy-to-use Markdown format. ☆22 Updated last year
- I have explained how to create a superior RAG pipeline for complex PDFs using LlamaParse. We can extract text and tables from PDFs and QA on… ☆45 Updated last year
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 6 months ago
- ☆30 Updated 2 weeks ago
- DSPy in action with open-source LLMs. ☆71 Updated last year
- Hybrid Search (BM25 & Vector) with SQLite ☆15 Updated 9 months ago