retarfi/language-pretraining

Pre-training Language Models for Japanese

☆50

Alternatives and similar repositories for language-pretraining

Users who are interested in language-pretraining are comparing it to the repositories listed below.

megagonlabs / cocosum
Code & Data for Comparative Opinion Summarization via Collaborative Decoding (Iso et al.; Findings of ACL 2022)
☆23 · Mar 3, 2025 · Updated last year
kzinmr / transformers_ner_ja
Japanese NER with Transformers + PyTorch-Lightning + MLflow Tracking
☆15 · Nov 20, 2022 · Updated 3 years ago
DaisukeBekki / JSeM
Japanese semantic test suite (FraCaS counterpart and extensions)
☆13 · Apr 21, 2026 · Updated 3 months ago
WorksApplications / SudachiTra
Japanese tokenizer for Transformers
☆80 · Dec 15, 2023 · Updated 2 years ago
daac-tools / python-vaporetto
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. (Python wrapper)
☆21 · May 30, 2026 · Updated last month
hppRC / bert-classification-tutorial-2024
[2024 edition] Text classification with BERT
☆30 · Jul 8, 2024 · Updated 2 years ago
nobu-g / cohesion-analysis
Code for COLING 2020 Paper
☆13 · Feb 3, 2026 · Updated 5 months ago
singletongue / wikipedia-utils
Utility scripts for preprocessing Wikipedia texts for NLP
☆78 · Apr 9, 2024 · Updated 2 years ago
Nikkei / semantic-shift-stability
Implementation of Semantic Shift Stability (AACL 2022, IC2S2 2023, JNLP)
☆17 · Dec 24, 2024 · Updated last year
cl-tohoku / AIO2_DPR_baseline
https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
☆16 · Aug 4, 2022 · Updated 3 years ago
octanove / shiba
PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
☆89 · Nov 3, 2023 · Updated 2 years ago
tylerachang / word-acquisition-language-models
Word acquisition in neural language models (TACL 2022).
☆21 · Jan 30, 2025 · Updated last year
Nikkei / fast-mia
A framework designed to streamline the evaluation of Membership Inference Attacks (MIA) against Large Language Models (LLMs). By leveragi…
☆15 · Updated this week
wwwcojp / ja_sentence_segmenter
Japanese sentence segmentation library for Python
☆75 · Updated this week
himkt / awesome-bert-japanese
📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
☆132 · Mar 15, 2023 · Updated 3 years ago
kajyuuen / daaja
This repository has implementations of data augmentation for NLP for Japanese.
☆64 · Feb 16, 2023 · Updated 3 years ago
yahoojapan / JGLUE
JGLUE: Japanese General Language Understanding Evaluation
☆346 · Mar 31, 2025 · Updated last year
daac-tools / vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
☆295 · Updated this week
cl-tohoku / keigo_transfer_task
Evaluation dataset for the honorific (keigo) conversion task
☆21 · Nov 24, 2022 · Updated 3 years ago
utanaka2000 / fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
☆25 · Mar 16, 2021 · Updated 5 years ago
ujiuji1259 / shinra-attribute-extraction
☆11 · Sep 7, 2021 · Updated 4 years ago
laboroai / Laboro-ParaCorpus
Scripts for creating a Japanese-English parallel corpus and training NMT models
☆19 · Nov 9, 2021 · Updated 4 years ago
megagonlabs / jrte-corpus
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
☆77 · Jun 23, 2023 · Updated 3 years ago
izuna385 / Wikia-and-Wikipedia-EL-Dataset-Creator
You can create datasets from Wikia/Wikipedia that can be used for entity recognition and Entity Linking. Dumps for ja-wiki and VTuber-wik…
☆18 · May 2, 2021 · Updated 5 years ago
oleg-panichev / WiDS-Datathon-2020-Second-place-solution
WiDS Datathon 2020 Second place solution
☆10 · Jul 6, 2023 · Updated 3 years ago
colorfulscoop / sbert-ja
Code to train Sentence BERT Japanese model for Hugging Face Model Hub
☆11 · Aug 8, 2021 · Updated 4 years ago
AtsunoriFujita / sagemaker_nlp_examples
NLP examples (mostly Japanese) on AWS
☆12 · May 31, 2022 · Updated 4 years ago
megagonlabs / UD_Japanese-GSD
Japanese data from the Google UDT 2.0.
☆28 · Mar 24, 2023 · Updated 3 years ago
HojiChar / HojiChar
The robust text processing pipeline framework enabling customizable, efficient, and metric-logged text preprocessing.
☆128 · Jul 17, 2026 · Updated last week
stockmarkteam / ner-wikipedia-dataset
Japanese named entity recognition dataset built from Wikipedia
☆143 · Sep 2, 2023 · Updated 2 years ago
kajyuuen / funer
Funer is a rule-based Named Entity Recognition tool.
☆22 · Apr 21, 2022 · Updated 4 years ago
yahoojapan / yskip
Incremental Skip-gram Model with Negative Sampling
☆69 · Jun 30, 2019 · Updated 7 years ago
verypluming / JaNLI
☆17 · May 31, 2023 · Updated 3 years ago
ndl-lab / ndlngramdata
Dataset of n-gram frequency statistics for OCR text data created from digitized materials
☆17 · Jan 10, 2023 · Updated 3 years ago
aws-samples / bokete-denshosen
ボケて電笑戦 (bokete DENSHOSEN) Workshop
☆43 · May 16, 2022 · Updated 4 years ago
intellygenta / InteractiveParallelCoordinates
Python code for interactive parallel coordinates visualization in Jupyter Notebook.
☆12 · Sep 8, 2019 · Updated 6 years ago
SkelterLabsInc / JaQuAD
JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
☆110 · Mar 2, 2022 · Updated 4 years ago
yagays / nayose-wikipedia-ja
Japanese name aggregation (nayose) dataset built from Wikipedia
☆35 · Mar 10, 2020 · Updated 6 years ago
hppRC / defsent
DefSent: Sentence Embeddings using Definition Sentences
☆23 · Aug 5, 2021 · Updated 4 years ago