arXiv / zzzArchived_arxiv-fulltext
arXiv plain text extraction
☆41, updated last year
Related projects:
- A file utility for accessing both local and remote files through a unified interface. (☆36, updated last month)
- Documentation effort for the BookCorpus dataset (☆30, updated 3 years ago)
- 💫 A spaCy package for Yohei Tamura's Rust tokenizations library (☆27, updated 10 months ago)
- A library for squeakily cleaning and filtering language datasets. (☆45, updated last year)
- A diff tool for language models (☆42, updated 8 months ago)
- One stop shop for all things carp (☆58, updated 2 years ago)
- Bayesian Assessment of Hypotheses (☆24, updated last year)
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 (☆18, updated last year)
- Tokenization across languages. Useful as preprocessing for subword tokenization. (☆21, updated last year)
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine (☆31, updated 2 years ago)
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… (☆84, updated 6 months ago)
- Vespa application making an index of the CORD-19 dataset. (☆39, updated 2 weeks ago)
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) (☆59, updated last year)
- StAtutory Reasoning Assessment (☆11, updated last year)
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… (☆42, updated 10 months ago)
- Embedding Recycling for Language models (☆38, updated last year)
- This repository contains code for cleaning your training data of benchmark data to help combat data snooping. (☆25, updated last year)
- This repo contains data and code for the paper "Reasoning over Public and Private Data in Retrieval-Based Systems." (☆46, updated 2 months ago)
- A summarization dataset consisting of over 17k open access business journal articles. (☆9, updated 3 years ago)
- Source code and data for Like a Good Nearest Neighbor (☆28, updated 7 months ago)
- Website for MS MARCO (☆27, updated 4 months ago)
- 🤗 Disaggregators: Curated data labelers for in-depth analysis. (☆66, updated last year)
- Detecting gibberish as a type of sentiment analysis with GPT-2 (☆24, updated 3 years ago)
- Open source library for few-shot NLP (☆78, updated last year)
- A Python tool for building large-scale Wikipedia-based Information Retrieval datasets (☆44, updated 3 years ago)