kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

☆4,538

Alternatives and similar repositories for grobid

Users who are interested in grobid are comparing it to the libraries listed below.

kermitt2 / grobid-client-python
Python client for GROBID Web services
☆385 Updated last week
allenai / science-parse
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
☆685 Updated last year
allenai / pdffigures2
Given a scholarly PDF, extract figures, tables, captions, and section titles.
☆712 Updated last year
CeON / CERMINE
Content ExtRactor and MINEr
☆509 Updated 3 years ago
titipata / scipdf_parser
Python PDF parser for scientific publications: content and figures
☆446 Updated last year
allenai / s2orc-doc2json
Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
☆452 Updated last year
deepdoctection / deepdoctection
A Repo For Document AI
☆3,112 Updated this week
microsoft / table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the o…
☆2,819 Updated last year
allenai / papermage
library supporting NLP and CV research on scientific papers
☆785 Updated last year
urschrei / pyzotero
Pyzotero: a Python client for the Zotero API
☆1,152 Updated 2 weeks ago
scholarly-python-package / scholarly
Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
☆1,759 Updated 8 months ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆9,433 Updated this week
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,560 Updated this week
neuml / paperai
📄 🤖 AI for medical and scientific papers
☆1,663 Updated 5 months ago
camelot-dev / excalibur
A web interface to extract tabular data from PDFs
☆1,787 Updated 11 months ago
pymupdf / PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
☆8,750 Updated this week
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆5,121 Updated 3 months ago
lukasschwab / arxiv.py
Python wrapper for the arXiv API
☆1,422 Updated last month
CrossRef / rest-api-doc
Documentation for Crossref's REST API. For questions or suggestions, see https://community.crossref.org/
☆783 Updated last year
facebookresearch / nougat
Implementation of Nougat Neural Optical Understanding for Academic Documents
☆9,766 Updated 10 months ago
run-llama / llama_cloud_services
Knowledge Agents and Management in the Cloud
☆4,223 Updated 3 weeks ago
allenai / scibert
A BERT model for scientific text.
☆1,665 Updated 3 years ago
chrismattmann / tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
☆1,640 Updated 8 months ago
ferru97 / PyPaperBot
PyPaperBot is a Python tool for downloading scientific papers using Google Scholar, Crossref, SciHub, and SciDB.
☆590 Updated last year
Future-House / paper-qa
High accuracy RAG for answering questions from scientific documents with citations
☆7,952 Updated this week
inukshuk / anystyle
Fast citation reference parsing
☆1,198 Updated 7 months ago
explosion / spacy-llm
🦙 Integrating LLMs into structured NLP pipelines
☆1,358 Updated 11 months ago
nlmatics / llmsherpa
Developer APIs to Accelerate LLM Projects
☆1,743 Updated last year
allenai / scispacy
A full spaCy pipeline and models for scientific/biomedical documents.
☆1,910 Updated 3 weeks ago
chartbeat-labs / textacy
NLP, before and after spaCy
☆2,235 Updated 2 years ago