commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
★50 · Updated this week
Alternatives and similar repositories for web-languages
Users who are interested in web-languages are comparing it to the libraries listed below.
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ★147 · Updated 2 months ago
- Data Cleaning and Textual Data Visualization ★183 · Updated 2 months ago
- Efficiently find the best-suited language model (LM) for your NLP task ★125 · Updated 2 weeks ago
- Notebooks for training universal 0-shot classifiers on many different tasks ★133 · Updated 7 months ago
- ★67 · Updated last year
- A BERT-based application for reusable text classification at scale ★38 · Updated 2 years ago
- A library for working with prompt templates locally or on the Hugging Face Hub. ★48 · Updated 5 months ago
- Libraries, Archives and Museums (LAM) ★85 · Updated 2 years ago
- Work with static vector models ★28 · Updated 3 months ago
- Code for collecting, processing, and preparing datasets for the Common Pile ★216 · Updated 2 weeks ago
- The robust European language model benchmark. ★114 · Updated this week
- Pre-train Static Word Embeddings ★85 · Updated 2 months ago
- Generalist and Lightweight Model for Text Classification ★148 · Updated last month
- ★104 · Updated 7 months ago
- A list of awesome open source projects in the machine learning field, whose developers are mainly based in Germany ★44 · Updated 10 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ★59 · Updated last year
- Small Python package to measure OCR quality and other related metrics. ★25 · Updated last year
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ★210 · Updated 3 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ★66 · Updated last month
- Truly flash implementation of DeBERTa disentangled attention mechanism. ★62 · Updated 2 months ago
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ★80 · Updated last year
- Let's build better datasets, together! ★260 · Updated 7 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ★108 · Updated last year
- The pipeline for the OSCAR corpus ★171 · Updated last year
- Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data … ★206 · Updated this week
- Python library to use Pleias-RAG models ★61 · Updated 3 months ago
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o… ★142 · Updated 3 weeks ago
- Fine-tune ModernBERT on a large Dataset with Custom Tokenizer Training ★67 · Updated 6 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx like syntax. ★59 · Updated last year
- Python API for https://vespa.ai, the open big data serving engine ★135 · Updated this week