IllDepence / unarXive
A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
☆259Updated last month
Related projects
Alternatives and complementary repositories for unarXive
- Pretraining Efficiently on S2ORC!☆136Updated 3 weeks ago
- Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)☆348Updated 7 months ago
- ☆82Updated 6 months ago
- Dataset accompanying the SPECTER model☆127Updated last year
- multimodal document analysis☆160Updated 5 months ago
- This repository provides details and links to the ACL anthology corpus/collection including .bib, .pdf and grobid extractions of the pdfs☆167Updated last year
- Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper)☆64Updated 2 years ago
- Get answers to research questions from 200M+ papers.☆203Updated 10 months ago
- SciRepEval benchmark training and evaluation scripts☆67Updated 6 months ago
- SPECTER: Document-level Representation Learning using Citation-informed Transformers☆517Updated last year
- A set of scripts to grab public datasets from resources related to arXiv☆410Updated 6 months ago
- ☆185Updated 6 months ago
- Incorporating VIsual LAyout Structures for Scientific Text Classification☆173Updated last year
- The original implementation of Min et al. "Nonparametric Masked Language Modeling" (paper: https://arxiv.org/abs/2212.01349)☆156Updated last year
- ☆179Updated last year
- Data and models for the SciFact verification task.☆225Updated last year
- Dataset collection and preprocessing framework for NLP extreme multitask learning☆149Updated 4 months ago
- Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.☆177Updated 2 years ago
- S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/☆830Updated 6 months ago
- A framework for few-shot evaluation of autoregressive language models.☆101Updated last year
- The autoregressive information extraction system GenIE (Generative Information Extraction) implemented in PyTorch.☆99Updated last year
- Code for T-Few from "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"☆431Updated last year
- ☆134Updated 2 years ago
- [ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links☆419Updated 2 years ago
- Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.☆160Updated last year
- Scalable training for dense retrieval models.☆271Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web☆174Updated last year
- Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.☆50Updated last year
- Reverse Instructions to generate instruction tuning data with corpus examples☆206Updated 8 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers☆122Updated 8 months ago