[NeurIPS 2024] GlotCC Dataset and Pipeline
★20 · Apr 6, 2025 · Updated last year
Alternatives and similar repositories for GlotCC
Users who are interested in GlotCC are comparing it to the repositories listed below.
- [ACL 2025] Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ★11 · Apr 6, 2025 · Updated last year
- [WWW 2026] GlotWeb: Web Indexing for Minority Languages ★17 · Apr 14, 2026 · Updated 3 weeks ago
- ★10 · Oct 2, 2024 · Updated last year
- [NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining ★18 · Nov 26, 2023 · Updated 2 years ago
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks ★12 · Nov 9, 2021 · Updated 4 years ago
- Repository for "Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages" ★15 · Oct 4, 2024 · Updated last year
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ★18 · Nov 4, 2025 · Updated 6 months ago
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs ★47 · Sep 19, 2025 · Updated 7 months ago
- The SETimes.HR+ Croatian dependency treebank ★16 · Dec 27, 2016 · Updated 9 years ago
- Gather pagegraph data from all over the internet ★32 · Updated this week
- A RAG that can scale ★11 · May 28, 2024 · Updated last year
- Klexikon: A German Dataset for Joint Summarization and Simplification ★16 · Oct 5, 2022 · Updated 3 years ago
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba ★38 · Oct 16, 2025 · Updated 6 months ago
- ★17 · Jan 5, 2023 · Updated 3 years ago
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval" ★35 · Sep 20, 2025 · Updated 7 months ago
- Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs ★13 · Feb 13, 2024 · Updated 2 years ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ★17 · Mar 20, 2024 · Updated 2 years ago
- ★45 · Feb 11, 2026 · Updated 2 months ago
- Sparse Embedding Compression for Scalable Retrieval in Recommender Systems ★35 · Nov 21, 2025 · Updated 5 months ago
- A missing piece of the Python multitask (both threads and processes) API: An extension that supports stateful worker pools & size-aware i… ★29 · Mar 8, 2026 · Updated last month
- ★25 · Apr 28, 2020 · Updated 6 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ★23 · Jun 30, 2025 · Updated 10 months ago
- Fine-tune OpenAI models for text classification, question answering, and more ★17 · May 1, 2023 · Updated 3 years ago
- A proposed standard `NOCK` for a Parquet format that supports efficient distributed serialization of multiple kinds of graph technologies ★21 · Apr 27, 2026 · Updated last week
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ★32 · Jun 20, 2023 · Updated 2 years ago
- Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and visi… ★32 · Feb 7, 2025 · Updated last year
- TyDiP Multilingual Politeness dataset and code ★12 · Oct 15, 2023 · Updated 2 years ago
- Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family. ★36 · Apr 1, 2025 · Updated last year
- ★43 · May 27, 2025 · Updated 11 months ago
- Tool for sentiment analysis annotation ★13 · Mar 26, 2025 · Updated last year
- Featurize words into orthographic and phonological vectors. ★42 · May 20, 2023 · Updated 2 years ago
- LVAS-Agent Code Base ★20 · Apr 15, 2025 · Updated last year
- Finite-state script normalization and processing utilities ★47 · Apr 16, 2026 · Updated 3 weeks ago
- Code and data for the WSDM '19 paper "Crosslingual Document Embedding as Reduced-Rank Ridge Regression (Cr5)" ★30 · Aug 17, 2019 · Updated 6 years ago
- The Code and Script of "David's Slingshot: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis" ★34 · Jun 13, 2025 · Updated 10 months ago
- Keyphrase Extraction Prototypes ★15 · Nov 24, 2016 · Updated 9 years ago
- A neural network that jointly part-of-speech tags and lemmatizes sentences, boosting accuracy for morphologically-rich languages (Czech, … ★34 · Apr 5, 2019 · Updated 7 years ago
- Python library to use Pleias-RAG models ★71 · May 1, 2025 · Updated last year
- ★23 · May 22, 2024 · Updated last year