Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆68Jan 7, 2026Updated last month
Alternatives and similar repositories for web-languages
Users that are interested in web-languages are comparing it to the libraries listed below
- Source stories from the African Storybook Project in Markdown format☆22Jan 25, 2026Updated last month
- LLM-aided data filtering☆14Dec 3, 2024Updated last year
- Shan Natural Language Processing tools inspired by PythaiNLP☆14Updated this week
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Updated this week
- Plug-and-play Search Interfaces with Pyserini and Hugging Face☆32Aug 5, 2023Updated 2 years ago
- Chaos-Engineering-Style CI Pipelines to make sure Weaviate handles whatever the real world throws at it.☆23Feb 25, 2026Updated last week
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024☆20Apr 6, 2025Updated 10 months ago
- Hugging Face and Pyserini interoperability☆19May 18, 2023Updated 2 years ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)☆74Apr 1, 2025Updated 11 months ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource…☆26Feb 16, 2026Updated 2 weeks ago
- BPE modification that implements removing of the intermediate tokens during tokenizer training.☆26Nov 25, 2024Updated last year
- ☆12Jul 26, 2021Updated 4 years ago
- Translator in a box☆31Feb 1, 2026Updated last month
- Code for SaGe subword tokenizer (EACL 2023)☆27Nov 30, 2024Updated last year
- A benchmark for evaluating logical reasoning of LLMs☆23Jan 25, 2024Updated 2 years ago
- A PHP-based application to create and manage anonymous surveys with restricted access for selected participants.☆10Nov 20, 2024Updated last year
- [NeurIPS 2025] MergeBench: A Benchmark for Merging Domain-Specialized LLMs☆43Feb 11, 2026Updated 3 weeks ago
- A Directory of Online Newspaper Sources for 70+ Languages☆31Apr 15, 2021Updated 4 years ago
- ☆43Apr 26, 2025Updated 10 months ago
- Repository of quantitative finance applications in Python☆23Jan 20, 2026Updated last month
- ☆19Updated this week
- Model Openness Tool☆44Jan 28, 2026Updated last month
- A Python wrapper for libhackrf☆12Jul 10, 2023Updated 2 years ago
- A library for probing Stockfish's NNUEs. The code for reading parameters and forward propagation is taken from Stockfish☆12Nov 18, 2025Updated 3 months ago
- Easy Setup, File-based, Offline Capable Federated Learning and Computations☆22Feb 11, 2026Updated 3 weeks ago
- MATLAB/Octave generator of Hamming ECC coding. Output format is Verilog HDL.☆12Dec 27, 2022Updated 3 years ago
- COVID-19 Cases in New York City☆11Mar 26, 2021Updated 4 years ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets☆227Nov 16, 2024Updated last year
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…☆36Jun 8, 2023Updated 2 years ago
- Python library for Myanmar language☆38Feb 14, 2024Updated 2 years ago
- [CVPR 2025 🔥] ALM-Bench is a multilingual multi-modal diverse cultural benchmark for 100 languages across 19 categories. It assesses the…☆47May 26, 2025Updated 9 months ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset.☆96Feb 9, 2023Updated 3 years ago
- ☆17Jan 11, 2025Updated last year
- `dev`, `build`, and `preview` scripts like Vite to generate static HTML websites from React/JSX☆11Mar 19, 2025Updated 11 months ago
- PwnHub is a CTF collaboration platform written in Bash, originally built as a response to a joke about the Bash Stack by yousuckatprogram…☆20Jun 13, 2025Updated 8 months ago
- [ICLR 2024 Spotlight] Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Communi…☆11Mar 29, 2024Updated last year
- ☆12Sep 27, 2024Updated last year
- A repository for resources relating to NLP in the Balochi language☆19Jun 3, 2023Updated 2 years ago
- Residual Quantization Autoencoder, used for interpreting LLMs☆14Jan 1, 2025Updated last year