clulab / pdf2txt
Convert PDF files to TXT
☆35 Updated last year
Alternatives and similar repositories for pdf2txt
Users who are interested in pdf2txt are comparing it to the libraries listed below.
- multimodal document analysis ☆166 Updated last year
- DocLLM: A layout-aware generative language model for multimodal document understanding ☆129 Updated last year
- ☆62 Updated last year
- Code and data for "StructLM: Towards Building Generalist Models for Structured Knowledge Grounding" (COLM 2024) ☆75 Updated last year
- A Python library to chunk/group your texts based on semantic similarity. ☆99 Updated last year
- Logical structure analysis for visually structured documents ☆92 Updated 3 years ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆179 Updated 2 years ago
- ☆197 Updated last week
- 80x faster and 95% accurate language identification with Fasttext ☆161 Updated last year
- Deployment of a light and full OpenAI API for production with vLLM to support /v1/embeddings with all embedding models. ☆44 Updated last year
- Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset ☆27 Updated 2 years ago
- Guideline following Large Language Model for Information Extraction ☆409 Updated last year
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction ☆80 Updated last year
- Create fast graph language models from converted PDF documents for knowledge extraction and Q&A. ☆57 Updated 9 months ago
- An index of PDF-centric corpora ☆144 Updated 4 months ago
- Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON) ☆448 Updated last year
- Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task… ☆179 Updated last year
- A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents ☆27 Updated 2 years ago
- Python API for https://vespa.ai, the open big data serving engine ☆146 Updated this week
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆110 Updated last year
- [TACL, EMNLP 2025 Oral] Code, datasets, and checkpoints for the paper "CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Thr… ☆32 Updated last year
- We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datas… ☆80 Updated 2 years ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆91 Updated 2 months ago
- The code and data for "StructGPT: A general framework for Large Language Model to Reason on Structured Data" ☆102 Updated last year
- EnriCo: Enriched Representation and Globally Constrained Inference for Entity and Relation Extraction ☆26 Updated last year
- ☆22 Updated last year
- ☆28 Updated last year
- Code accompanying "How I learned to start worrying about prompt formatting". ☆110 Updated 5 months ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆212 Updated last month
- minimal pytorch implementation of bm25 (with sparse tensors) ☆104 Updated 2 weeks ago