bigscience-workshop/metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

☆29

Alternatives and similar repositories for metadata

Users who are interested in metadata are comparing it to the repositories listed below.

huggingface / that_is_good_data
☆65 · Aug 7, 2023 · Updated 2 years ago
joeljang / temporalwiki
[EMNLP 2022] TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models
☆75 · May 15, 2024 · Updated 2 years ago
nlpsoc / Style-Embeddings
☆42 · Oct 3, 2024 · Updated last year
chkla / NLP2CSS-Tutorial
Tutorial on Transformers 🤖, HuggingFace 🤗 and Social Science Applications 👥 @ IC2S2
☆17 · Aug 8, 2021 · Updated 4 years ago
cjbarrie / blueskyr
☆14 · Sep 27, 2023 · Updated 2 years ago
koheiw / wordvector
Train word and document vectors using quanteda
☆16 · Apr 6, 2026 · Updated 3 months ago
philschmid / deep-learning-remote-runner
☆16 · Aug 10, 2022 · Updated 3 years ago
norakassner / mlama
☆25 · Jan 22, 2024 · Updated 2 years ago
naverlabseurope / ALPS2024-MT-LAB
CD20200004 from 01/01/2021 to 31/12/2023 - LIG UGA - Python Notebook and Models for the MT Lab @ ALPS 2022
☆13 · Apr 1, 2024 · Updated 2 years ago
SALT-NLP / multi-value
Complete set of English dialect transformation rules and evaluation code
☆16 · Jun 7, 2024 · Updated 2 years ago
MeLeLBGU / tokenizers_intrinsic_benchmark
Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"
☆13 · Nov 26, 2024 · Updated last year
bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318 · Mar 20, 2023 · Updated 3 years ago
easonnie / ChaosNLI
[EMNLP 2020] Collective HumAn OpinionS on Natural Language Inference Data
☆42 · Apr 7, 2022 · Updated 4 years ago
Receiling / PSPE
Pretrained Span and span Pair Encoder, code for "Pre-training Entity Relation Encoder with Intra-span and Inter-span Information.", EMNLP2…
☆18 · Jan 26, 2022 · Updated 4 years ago
dykang / xslue
ACL 2021 paper "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding" by Dongyeop Kang and Eduard Hovy
☆15 · Jul 19, 2021 · Updated 5 years ago
aidos-lab / magnipy
Metric Space Magnitude Computations
☆15 · Jun 30, 2026 · Updated 2 weeks ago
UIC-Liu-Lab / DGA
[EMNLP 2022] Adapting a Language Model While Preserving its General Knowledge
☆21 · Feb 12, 2023 · Updated 3 years ago
liangstein / ByteNet-Keras
Character-level French-to-English translator implemented in Keras
☆10 · Jun 15, 2017 · Updated 9 years ago
kristinagligoric / confidence-driven-inference
☆17 · Jul 23, 2025 · Updated 11 months ago
facebookresearch / dynabench
Dynamic Adversarial Benchmarking platform
☆26 · Jun 22, 2022 · Updated 4 years ago
sbera7 / Dialogue-act-classification
Dialogue Act classification
☆18 · Jan 15, 2024 · Updated 2 years ago
ijmarshall / cochrane-nlp
files for systematic review automation project
☆17 · May 17, 2016 · Updated 10 years ago
wuningxi / Talks
Slides from previous talks.
☆29 · Nov 23, 2023 · Updated 2 years ago
simison / couchspinner
Couchsurfing profile importer and previewer.
☆14 · Jan 8, 2023 · Updated 3 years ago
ielab / SIGIR2017-SysRev-Collection
A Test Collection for Evaluating Retrieval of Studies for Inclusion in Systematic Reviews
☆12 · Sep 22, 2023 · Updated 2 years ago
facebookresearch / dynalab
The Python library with command line tools to interact with Dynabench (https://dynabench.org/), such as uploading models.
☆56 · Jun 23, 2022 · Updated 4 years ago
INESCTEC / kep
Keyphrase Extraction Package
☆10 · Aug 24, 2020 · Updated 5 years ago
pacotvj99 / testsampleR
☆14 · Jan 25, 2026 · Updated 5 months ago
hipstas / AudiAnnotate
Workflows for generating AV editions and exhibits using IIIF manifests by HiPSTAS and Brumfield Labs.
☆17 · Nov 17, 2024 · Updated last year
joaopalotti / cmu_67300
This is the repository for the CMU course 67-300: Search Engines
☆11 · Nov 8, 2023 · Updated 2 years ago
fedenanni / Computational-Text-Analysis-2018-19
2018 Computational Text Analysis Notebooks, University of Mannheim
☆13 · Nov 22, 2018 · Updated 7 years ago
carmanzhang / LAGOS-AND
LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation
☆11 · Dec 8, 2022 · Updated 3 years ago
m-clark / R-models
A quick reference for how to run many models in R.
☆13 · May 19, 2018 · Updated 8 years ago
begab / mamus
Source code accompanying the ICLR2020 publication 'Massively Multilingual Sparse Word Representations' https://openreview.net/forum?id=Hy…
☆12 · Aug 15, 2023 · Updated 2 years ago
POSTECH-CVLab / daily-reading-group
☆14 · May 8, 2022 · Updated 4 years ago
chkla / CSS-Events
Summer/winter schools, workshops and conferences in computational social science 🫂
☆46 · Dec 9, 2025 · Updated 7 months ago
kudkudak / python-for-data-processing
Lab for Jagiellonian University course
☆10 · Jul 2, 2016 · Updated 10 years ago
kenlimmj / fightin-words
A scikit-learn compliant implementation of Monroe et al.'s Fightin' Words analysis method.
☆11 · May 26, 2026 · Updated last month
WuraolaOyewusi / How-to-use-ScispaCy-for-Biomedical-Named-Entity-Recognition-Abbreviation-Resolution-and-link-UMLS
☆10 · Aug 11, 2019 · Updated 6 years ago