IllDepence / unarXive
A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
☆256 Updated 11 months ago
Related projects:
- Pretraining Efficiently on S2ORC! ☆133 Updated last year
- Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON) ☆332 Updated 5 months ago
- Multimodal document analysis ☆159 Updated 3 months ago
- ☆78 Updated 4 months ago
- Dataset accompanying the SPECTER model ☆127 Updated last year
- SciRepEval benchmark training and evaluation scripts ☆67 Updated 4 months ago
- Details and links for the ACL Anthology corpus/collection, including .bib, .pdf, and GROBID extractions of the PDFs ☆167 Updated 11 months ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆166 Updated last year
- Dataset collection and preprocessing framework for NLP extreme multitask learning ☆143 Updated 2 months ago
- A set of scripts to grab public datasets from resources related to arXiv ☆399 Updated 3 months ago
- Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper) ☆63 Updated last year
- S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/ ☆802 Updated 4 months ago
- Data and models for the SciFact verification task. ☆222 Updated 11 months ago
- Get answers to research questions from 200M+ papers. ☆203 Updated 8 months ago
- What's In My Big Data (WIMBD): a toolkit for analyzing large text datasets ☆174 Updated last week
- ☆183 Updated 4 months ago
- Science-parse version 2 ☆228 Updated 4 years ago
- Reverse Instructions to generate instruction tuning data with corpus examples ☆201 Updated 6 months ago
- SPECTER: Document-level Representation Learning using Citation-informed Transformers ☆508 Updated last year