VikParuchuri / pdftext

Extract structured text from PDFs quickly

☆485

Alternatives and similar repositories for pdftext

Users who are interested in pdftext are comparing it to the libraries listed below.

poloclub / unitable
UniTable: Towards a Unified Table Foundation Model
☆473 · Updated last year
VikParuchuri / tabled
Detect and extract tables to markdown and csv
☆747 · Updated 4 months ago
conjuncts / gmft
Lightweight, performant, deep table extraction
☆466 · Updated this week
isaacus-dev / semchunk
A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.
☆315 · Updated 2 months ago
AnswerDotAI / byaldi
Use late-interaction multi-modal models such as ColPali in just a few lines of code.
☆789 · Updated 4 months ago
tonywu71 / colpali-cookbooks
Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳
☆292 · Updated this week
Unstructured-IO / unstructured-inference
☆183 · Updated this week
getomni-ai / benchmark
OCR Benchmark
☆495 · Updated last week
ai8hyf / TF-ID
TF-ID: Table/Figure IDentifier for academic papers
☆235 · Updated 10 months ago
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,436 · Updated last week
docling-project / docling-parse
Simple package to extract text with coordinates from programmatic PDFs
☆126 · Updated last month
aurelio-labs / semantic-chunkers
☆225 · Updated 6 months ago
PragmaticMachineLearning / docai
Structured information extraction from documents
☆315 · Updated 8 months ago
enoch3712 / ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
☆1,265 · Updated last week
MinishLab / semhash
Fast Semantic Text Deduplication & Filtering
☆697 · Updated last week
pypdfium2-team / pypdfium2
Python bindings to PDFium
☆578 · Updated last week
superlinear-ai / raglite
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
☆992 · Updated this week
pymupdf / RAG
RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
☆937 · Updated 3 weeks ago
adithya-s-k / VARAG
Vision-Augmented Retrieval and Generation (VARAG) - Vision first RAG Engine
☆461 · Updated 4 months ago
xhluca / bm25s
Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy
☆1,193 · Updated this week
docling-project / docling-ibm-models
☆122 · Updated this week
docling-project / docling-core
A Python library to define and validate data types in Docling.
☆137 · Updated last week
jina-ai / late-chunking
Code for explaining and evaluating late chunking (chunked pooling)
☆396 · Updated 5 months ago
diicellman / dspy-rag-fastapi
FastAPI wrapper around DSPy
☆243 · Updated last year
dleemiller / WordLlama
Things you can do with the token embeddings of an LLM
☆1,443 · Updated 2 months ago
benbrandt / text-splitter
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from R…
☆430 · Updated this week
LlmKira / fast-langdetect
⚡️ 80x faster Fasttext language detection out of the box | Split text by language
☆205 · Updated 2 months ago
alvin-r / databonsai
Clean & curate your data with LLMs.
☆492 · Updated 11 months ago
PrithivirajDamodaran / FlashRank
Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cro…
☆809 · Updated 6 months ago
nlmatics / llmsherpa
Developer APIs to Accelerate LLM Projects
☆1,675 · Updated 7 months ago