Unstructured-IO / pipeline-paddleocr
Pipeline for converting PDFs to raw text with PaddleOCR
☆23 Updated last year
Alternatives and similar repositories for pipeline-paddleocr:
Users who are interested in pipeline-paddleocr are comparing it to the libraries listed below.
- Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provi… ☆35 Updated last month
- LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing, optimized for cloud deployment. ☆49 Updated 6 months ago
- GLiNER model in a FastAPI microservice. ☆41 Updated 4 months ago
- A library to extract the main content from html. Developed for information on LLM and for feeding data into LangChain and LlamaIndex. ☆38 Updated 11 months ago
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆125 Updated last year
- Efficient few-shot learning with cross-encoders. ☆51 Updated last year
- Open-source observability for your LLM application. ☆51 Updated 3 months ago
- Data extraction with Donut ML model ☆57 Updated 8 months ago
- Repository for deepdoctection tutorial notebooks ☆44 Updated 5 months ago
- A library integrating embedding and reranker models from OpenAI, SentenceTransformers etc for semantic search in vector database. ☆39 Updated 3 weeks ago
- Universal text classifier for generative models ☆24 Updated 9 months ago
- An integration of Qdrant ANN vector database backend with txtai ☆24 Updated 8 months ago
- Benchmark study on LanceDB, an embedded vector DB, for full-text search and vector search ☆24 Updated last year
- 🦦 weasel: A small and easy workflow system ☆83 Updated 9 months ago
- A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GRO… ☆50 Updated last month
- Python API for https://vespa.ai, the open big data serving engine ☆121 Updated this week
- Application configuration and scripts for search on https://docs.vespa.ai/ ☆12 Updated last month
- Keyword spaCy is a spaCy pipeline component for extracting keywords from text using cosine similarity. ☆11 Updated last year
- AnyModal is a Flexible Multimodal Language Model Framework for PyTorch ☆91 Updated 4 months ago
- 🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform ☆38 Updated last year
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆67 Updated 5 months ago
- Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines. ☆27 Updated 2 years ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 5 months ago
- Generate pydantic models from JSON Schema ☆22 Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆105 Updated 2 weeks ago