commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆55Updated last week
Alternatives and similar repositories for web-languages
Users who are interested in web-languages are comparing it to the libraries listed below.
- Libraries, Archives and Museums (LAM)☆85Updated 2 years ago
- A BERT-based application for reusable text classification at scale☆38Updated 2 years ago
- Small python package to measure OCR quality and other related metrics.☆25Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023☆160Updated 3 months ago
- Efficiently find the best-suited language model (LM) for your NLP task☆127Updated last month
- ☆67Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2…☆68Updated 2 years ago
- Layout Analysis Dataset with Segmonto (LADaS)☆21Updated 2 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆63Updated 2 weeks ago
- A list of awesome open source projects in the machine learning field whose developers are mainly based in Germany☆46Updated last year
- Pre-train Static Word Embeddings☆85Updated 2 weeks ago
- Code for the MTEB Arena☆23Updated 2 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.☆110Updated last year
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆61Updated last year
- Generalist and Lightweight Model for Text Classification☆159Updated 3 months ago
- Python library to use Pleias-RAG models☆62Updated 4 months ago
- The robust European language model benchmark.☆123Updated this week
- A library for working with prompt templates locally or on the Hugging Face Hub.☆50Updated 6 months ago
- A polite and user-friendly downloader for Common Crawl data☆57Updated last month
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way☆14Updated 2 months ago
- 🔢 Work with static vector models☆30Updated 5 months ago
- 🗺️ Data Cleaning and Textual Data Visualization 🗺️☆186Updated 4 months ago
- A collection of datasets and other resources for legal text processing.☆122Updated 2 weeks ago
- Code for collecting, processing, and preparing datasets for the Common Pile☆227Updated last week
- ☆110Updated 9 months ago
- This repository contains an easy and intuitive approach to using SetFit in combination with spaCy.☆80Updated 2 years ago
- Robust and fast topic models with sentence-transformers.☆80Updated this week
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction☆80Updated last year
- Fact checking baseline combining dense retrieval and textual entailment☆30Updated last month
- MAFAND-MT☆58Updated last year