webis-de / webis-tldr-17-corpus
Code for constructing TLDR corpus from Reddit dataset
☆26 Updated 3 years ago
Alternatives and similar repositories for webis-tldr-17-corpus
Users that are interested in webis-tldr-17-corpus are comparing it to the libraries listed below
- ☆90 Updated 3 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆84 Updated last year
- 🤗 Disaggregators: Curated data labelers for in-depth analysis. ☆67 Updated 2 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆44 Updated 5 years ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated 2 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆33 Updated last year
- multimodal document analysis ☆166 Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web ☆177 Updated 2 years ago
- ☆79 Updated last year
- ☆43 Updated 2 years ago
- Developing tools to automatically analyze datasets ☆75 Updated 10 months ago
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆29 Updated 2 years ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆49 Updated last year
- Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal … ☆32 Updated 4 years ago
- ☆44 Updated 10 months ago
- GenieNLP: A versatile codebase for any NLP task ☆89 Updated last year
- Finding semantically meaningful and accurate prompts. ☆48 Updated last year
- For experiments involving instruct gpt. Currently used for documenting open research questions. ☆71 Updated 2 years ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 Updated 3 years ago
- ☆184 Updated 2 years ago
- ☆14 Updated 11 months ago
- Code for SaGe subword tokenizer (EACL 2023) ☆26 Updated 9 months ago
- Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21) ☆41 Updated 4 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆73 Updated last year
- Tools for managing datasets for governance and training. ☆83 Updated last month
- A dataset for pretraining language models targeted for legal tasks. ☆139 Updated 3 years ago
- Open source library for few shot NLP ☆79 Updated 2 years ago
- ☆101 Updated 2 years ago
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development ☆20 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆34 Updated 4 years ago