commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆69, updated last month
Alternatives and similar repositories for web-languages
Users who are interested in web-languages are comparing it to the repositories listed below.
- 🔬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 (☆186, updated 2 months ago)
- Libraries, Archives and Museums (LAM) (☆88, updated 3 years ago)
- Small python package to measure OCR quality and other related metrics. (☆26, updated last year)
- ☆67, updated last year
- Efficiently find the best-suited language model (LM) for your NLP task (☆134, updated 6 months ago)
- Truly flash implementation of DeBERTa disentangled attention mechanism. (☆76, updated last week)
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. (☆111, updated last year)
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way (☆18, updated 3 months ago)
- Code for collecting, processing, and preparing datasets for the Common Pile (☆249, updated 4 months ago)
- Layout Analysis Dataset with Segmonto (LADaS) (☆23, updated 6 months ago)
- A BERT-based application for reusable text classification at scale (☆38, updated 2 years ago)
- Pre-train Static Word Embeddings (☆94, updated 4 months ago)
- ☆43, updated 3 weeks ago
- BPE modification that removes intermediate tokens during tokenizer training. (☆26, updated last year)
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax. (☆59, updated last year)
- Notebooks for training universal 0-shot classifiers on many different tasks (☆139, updated last year)
- Code for SaGe subword tokenizer (EACL 2023) (☆27, updated last year)
- ParaNames: A multilingual resource for parallel names (☆39, updated last year)
- AfroLID, a powerful neural toolkit for African language identification that covers 517 African languages. (☆35, updated this week)
- Python library to use Pleias-RAG models (☆68, updated 9 months ago)
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. (☆64, updated last year)
- The pipeline for the OSCAR corpus (☆176, updated 2 months ago)
- This repository contains an easy and intuitive approach to using SetFit in combination with spaCy. (☆81, updated 2 years ago)
- Generalist and Lightweight Model for Text Classification (☆169, updated last week)
- My NER Experiments with ModernBERT and Ettin (☆26, updated 6 months ago)
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024 (☆20, updated 10 months ago)
- 🗺️ Data Cleaning and Textual Data Visualization 🗺️ (☆199, updated 8 months ago)
- State-of-the-art paired encoder and decoder models (17M-1B params) (☆58, updated 6 months ago)
- Multimodal document analysis (☆166, updated 2 months ago)
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. (☆17, updated 5 months ago)