bigscience-workshop/data_sourcing

This directory gathers the tools developed by the Data Sourcing Working Group

☆31

Alternatives and similar repositories for data_sourcing

Users that are interested in data_sourcing are comparing it to the libraries listed below.

MagedSaeed / generate-sequences
A Python package made to generate sequences (greedy and beam-search) from PyTorch (not necessarily HF transformers) models.
☆19 · Dec 12, 2025 · Updated 7 months ago
DDMAL / IIIF-AV-player
IIIF Audio/Video Player
☆14 · Oct 26, 2023 · Updated 2 years ago
kaisdukes / quran-neural-chunker
A data preprocessor for the Quranic Treebank using neural networks. Divides longer verses into smaller chunks.
☆12 · Jul 4, 2023 · Updated 3 years ago
liviniuk / GANwriting
☆10 · Sep 5, 2020 · Updated 5 years ago
BKHMSI / Font-To-Sketch
☆16 · Aug 22, 2023 · Updated 2 years ago
wietsedv / gpt2-recycle
As good as new. How to successfully recycle English GPT-2 to make models for other languages (ACL Findings 2021)
☆48 · Aug 2, 2021 · Updated 4 years ago
cahya-wirawan / artificial-commonvoice
Common Voice Generator using Speech Synthesizer
☆14 · Jul 28, 2021 · Updated 4 years ago
NoelDeMartin / Japanese-Character-Recognition
Sample application integrating Android and TensorFlow
☆12 · Feb 5, 2021 · Updated 5 years ago
IqbalLx / Hanacaraka-AI
Image classification for Javanese script. This project is our final project for Google Bangkit Academy.
☆12 · Feb 22, 2021 · Updated 5 years ago
cldf / pycldf
Python package to read and write CLDF datasets
☆21 · Updated this week
zaidalyafeai / Browser-Sentiment-Classification
Sentiment Classification in the browser using TensorFlow.js
☆25 · Apr 17, 2018 · Updated 8 years ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
ctylim / rhuffle
Line shuffler for huge text files that do not fit in memory
☆13 · Dec 1, 2022 · Updated 3 years ago
rcourivaud / SymSpellCompound
SymSpell Compound implementation in Python
☆11 · Feb 6, 2018 · Updated 8 years ago
quran-lyric / lyrics
Quran lyric database
☆16 · May 20, 2018 · Updated 8 years ago
caarlos0-graveyard / github-vacations
Automagically ignore all notifications related to work when you are on vacation
☆21 · Aug 21, 2020 · Updated 5 years ago
qurator-spk / neat
Named entity annotation tool
☆28 · Jul 6, 2023 · Updated 3 years ago
nmntz / website-monitor
A tool written in Go that helps you monitor a collection of websites using various metrics.
☆12 · Nov 9, 2021 · Updated 4 years ago
CeciPani / DrXAI
☆13 · Nov 22, 2022 · Updated 3 years ago
Aazhar / keras2tensorflow
Tutorial on running a Keras model in C++ and Python TensorFlow
☆11 · Oct 30, 2018 · Updated 7 years ago
eth-cscs / UserLabDay
CSCS User Lab Day – Meet the Swiss National Supercomputing Centre
☆13 · Sep 12, 2025 · Updated 10 months ago
sweetpeach / hummingbird
Code and Hummingbird dataset for EMNLP 2021 paper "Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica"
☆14 · Apr 13, 2022 · Updated 4 years ago
hoelzro / Act
A Conference Toolkit (Git conversion of the Subversion repository)
☆19 · Aug 24, 2012 · Updated 13 years ago
Harsh120 / Ancient-Tamil-Script-Recognition
☆15 · Mar 2, 2026 · Updated 4 months ago
DH-Center-Tuebingen / ThesauRex
Web-based editor for SKOS-based ontologies
☆27 · Jun 12, 2026 · Updated last month
IIIF / iiif-stories
Community repository for documenting stories and use cases related to uses of the International Image Interoperability Framework.
☆23 · Mar 1, 2017 · Updated 9 years ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
bellabf / dimensional-reduction
☆13 · Jun 20, 2022 · Updated 4 years ago
bigcode-project / bigcode-inference-benchmark
☆19 · Aug 10, 2024 · Updated last year
h-munakata / Lighthouse-Wrapper-for-Audio-Moment-Retrieval
☆13 · Mar 23, 2026 · Updated 3 months ago
duyichao / NPDA-KNN-ST
Official implementation of EMNLP'2022 paper "Non-Parametric Domain Adaptation for End-to-End Speech Translation"
☆11 · Oct 26, 2022 · Updated 3 years ago
timvieira / rl
Reference implementation of algorithms for reinforcement learning and Markov decision processes.
☆12 · Jan 28, 2021 · Updated 5 years ago
jienagu / rpivotTableMD
This is a Shiny app to fetch users' activity and interact with an R Markdown (pdf/word) report
☆17 · Apr 22, 2019 · Updated 7 years ago
EdAbati / outlines-haystack
Use `outlines` generators with Haystack.
☆14 · Updated this week
dasayan05 / stroke-ae
Bezier AE approach to sketch generation
☆31 · Jul 7, 2020 · Updated 6 years ago
bigscience-workshop / data_tooling
Tools for managing datasets for governance and training.
☆91 · May 25, 2026 · Updated last month
kawalpemilu / kawalpemilu2019-www
Public-facing site of KawalPemilu 2019
☆16 · Jun 30, 2023 · Updated 3 years ago
fawwaz / yes-i-am-a-github-pro
A Chrome plugin to put GitHub's Pro badge on your profile
☆18 · Feb 25, 2019 · Updated 7 years ago
meenmo / Forecasting_Stock_Returns_via_Supervised_Learning
Stat 479 Project
☆12 · Dec 22, 2018 · Updated 7 years ago