A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18Nov 26, 2023Updated 2 years ago
Alternatives and similar repositories for ofa
Users who are interested in ofa are comparing it to the libraries listed below
- ☆10Sep 13, 2022Updated 3 years ago
- Human Rights Corpus (#인권코퍼스)☆31Oct 6, 2023Updated 2 years ago
- ☆23Oct 30, 2023Updated 2 years ago
- Difference-based Contrastive Learning for Korean Sentence Embeddings☆23Feb 24, 2026Updated last week
- PyTorch implementation of NAACL 2021 paper "Multi-view Subword Regularization"☆26Jun 2, 2021Updated 4 years ago
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models☆11Jan 19, 2024Updated 2 years ago
- ☆10Dec 17, 2020Updated 5 years ago
- 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment☆11Apr 6, 2025Updated 10 months ago
- ☆10Dec 28, 2023Updated 2 years ago
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Aug 13, 2025Updated 6 months ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.☆88Sep 12, 2024Updated last year
- ☆12Dec 6, 2024Updated last year
- Bias, Hate classification with KoELECTRA 👿☆27Jun 12, 2023Updated 2 years ago
- PathPiece tokenizer☆13Nov 10, 2024Updated last year
- Getting interpretable dimensions in word embedding spaces.☆15Jul 6, 2023Updated 2 years ago
- MINERS ⛏️: The semantic retrieval benchmark for evaluating multilingual language models. (EMNLP 2024 Findings)☆14Oct 3, 2024Updated last year
- ☆15Mar 8, 2024Updated last year
- Code for Zero-Shot Tokenizer Transfer☆143Jan 14, 2025Updated last year
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"☆36Jun 7, 2025Updated 8 months ago
- Enhanced version of WikiExtractor: a Wikipedia dump extractor☆28Sep 17, 2025Updated 5 months ago
- Adapts Google's official ROUGE implementation so that it can be used with Korean☆18Jan 3, 2024Updated 2 years ago
- This repository includes the masking vocabulary used in the ICLR 2021 spotlight PMI-Masking paper☆14Aug 9, 2021Updated 4 years ago
- ☆36Oct 4, 2023Updated 2 years ago
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024☆20Apr 6, 2025Updated 10 months ago
- LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)☆18May 10, 2023Updated 2 years ago
- ☆59Jan 2, 2024Updated 2 years ago
- Code for ACL 2023 Paper: ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER☆21Jul 19, 2023Updated 2 years ago
- 🖋 Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024☆21Feb 17, 2026Updated 2 weeks ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023☆188Nov 19, 2025Updated 3 months ago
- Korean Nested Named Entity Corpus☆20May 13, 2023Updated 2 years ago
- ☆17Dec 16, 2022Updated 3 years ago
- AutoRAG example about benchmarking Korean embeddings.☆43Oct 2, 2024Updated last year
- ☆20Apr 28, 2021Updated 4 years ago
- PyTorch source code of NAACL 2021 paper "Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Tran…☆18Oct 18, 2022Updated 3 years ago
- CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean☆48Dec 23, 2024Updated last year
- MeCab model trained with OpenKorPos.☆23Jun 19, 2022Updated 3 years ago
- A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.☆64Jul 6, 2025Updated 7 months ago
- ☆20Dec 16, 2020Updated 5 years ago
- [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision☆96Oct 30, 2024Updated last year