commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
★58 · Updated last week
Alternatives and similar repositories for web-languages
Users interested in web-languages are comparing it to the libraries listed below:
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ★162 · Updated 4 months ago
- Small Python package to measure OCR quality and other related metrics. ★25 · Updated last year
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. ★61 · Updated last year
- Efficiently find the best-suited language model (LM) for your NLP task ★127 · Updated 2 months ago
- Libraries, Archives and Museums (LAM) ★87 · Updated 3 years ago
- The robust European language model benchmark. ★129 · Updated this week
- Layout Analysis Dataset with Segmonto (LADaS) ★21 · Updated 3 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ★109 · Updated last year
- ★67 · Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ★233 · Updated last month
- Truly flash implementation of DeBERTa disentangled attention mechanism. ★66 · Updated 3 weeks ago
- Data Cleaning and Textual Data Visualization ★189 · Updated 4 months ago
- GlotCC Dataset and Pipeline -- NeurIPS 2024 ★20 · Updated 6 months ago
- Datamodels for Hugging Face tokenizers ★77 · Updated 3 weeks ago
- A BERT-based application for reusable text classification at scale ★38 · Updated 2 years ago
- Python library to use Pleias-RAG models ★63 · Updated 5 months ago
- ★112 · Updated 10 months ago
- Notebooks for training universal 0-shot classifiers on many different tasks ★136 · Updated 9 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax. ★59 · Updated last year
- Pre-train Static Word Embeddings ★87 · Updated last month
- ★42 · Updated 3 months ago
- Work with static vector models ★30 · Updated 5 months ago
- Let's build better datasets, together! ★262 · Updated 9 months ago
- A list of awesome open source projects in the machine learning field, whose developers are mainly based in Germany ★47 · Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ★68 · Updated 2 years ago
- Generalist and Lightweight Model for Text Classification ★163 · Updated 4 months ago
- Code for the MTEB Arena ★23 · Updated 3 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ★68 · Updated 3 months ago
- Code for SaGe subword tokenizer (EACL 2023) ★26 · Updated 10 months ago
- GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ★15 · Updated 2 months ago