singletongue/wikipedia-utils

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/singletongue/wikipedia-utils)

singletongue / wikipedia-utils

Utility scripts for preprocessing Wikipedia texts for NLP

☆78

Alternatives and similar repositories for wikipedia-utils

Users that are interested in wikipedia-utils are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

megagonlabs / UD_Japanese-GSD
View on GitHub
Japanese data from the Google UDT 2.0.
☆28Mar 24, 2023Updated 3 years ago
nobu-g / cohesion-analysis
View on GitHub
Code for COLING 2020 Paper
☆13Feb 3, 2026Updated 5 months ago
megagonlabs / instruction_ja
View on GitHub
Japanese instruction data (日本語指示データ)
☆24Jul 13, 2023Updated 3 years ago
inspection-ai / japanese-toxic-dataset
View on GitHub
☆22Jan 11, 2023Updated 3 years ago
HojiChar / HojiChar
View on GitHub
The robust text processing pipeline framework enabling customizable, efficient, and metric-logged text preprocessing.
☆128Jul 17, 2026Updated last week
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
utanaka2000 / fairseq
View on GitHub
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
☆25Mar 16, 2021Updated 5 years ago
wwwcojp / ja_sentence_segmenter
View on GitHub
japanese sentence segmentation library for python
☆75Jul 18, 2026Updated last week
WorksApplications / uzushio
View on GitHub
☆24Mar 18, 2026Updated 4 months ago
megagonlabs / bunkai
View on GitHub
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
☆200Mar 26, 2024Updated 2 years ago
WorksApplications / SudachiTra
View on GitHub
Japanese tokenizer for Transformers
☆81Dec 15, 2023Updated 2 years ago
AUGMXNT / shisa
View on GitHub
☆43Mar 30, 2024Updated 2 years ago
retarfi / language-pretraining
View on GitHub
Pre-training Language Models for Japanese
☆50Jul 2, 2023Updated 3 years ago
yahoojapan / JGLUE
View on GitHub
JGLUE: Japanese General Language Understanding Evaluation
☆346Mar 31, 2025Updated last year
aiishii / JEMHopQA
View on GitHub
☆30Apr 10, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
EhimeNLP / AcademicRoBERTa
View on GitHub
☆10Sep 3, 2024Updated last year
samuelbroscheit / wikiextractor-wikimentions
View on GitHub
A tool for extracting plain text and internal Wikipedia links from Wikipedia dumps
☆11Apr 18, 2019Updated 7 years ago
ndl-lab / huriganacorpus-aozora
View on GitHub
青空文庫及びサピエの点字データから作成した振り仮名コーパスのデータセット
☆22Jan 17, 2024Updated 2 years ago
megagonlabs / ginza-transformers
View on GitHub
Use custom tokenizers in spacy-transformers
☆16Aug 9, 2022Updated 3 years ago
KoichiYasuoka / SuPar-UniDic
View on GitHub
Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
☆21Feb 28, 2026Updated 4 months ago
himkt / awesome-bert-japanese
View on GitHub
📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
☆132Mar 15, 2023Updated 3 years ago
sonoisa / t5-japanese
View on GitHub
日本語T5モデル
☆118Sep 15, 2025Updated 10 months ago
nknytk / albert-japanese-tinysegmenter
View on GitHub
Pretrained models, codes and guidances to pretrain official ALBERT(https://github.com/google-research/albert) on Japanese Wikipedia
☆13Sep 26, 2023Updated 2 years ago
ku-nlp / WikipediaAnnotatedCorpus
View on GitHub
☆30Jul 1, 2026Updated 3 weeks ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
daac-tools / find-simdoc
View on GitHub
Finding all pairs of similar documents time- and memory-efficiently
☆62Mar 13, 2025Updated last year
DaisukeBekki / JSeM
View on GitHub
Japanese semantic test suite (FraCaS counterpart and extensions)
☆13Apr 21, 2026Updated 3 months ago
kajyuuen / daaja
View on GitHub
This repository has implementations of data augmentation for NLP for Japanese.
☆64Feb 16, 2023Updated 3 years ago
WorksApplications / chikkarpy
View on GitHub
Japanese synonym library
☆55Feb 7, 2022Updated 4 years ago
cl-tohoku / keigo_transfer_task
View on GitHub
敬語変換タスクにおける評価用データセット
☆21Nov 24, 2022Updated 3 years ago
BandaiNamcoResearchInc / DistilBERT-base-jp
View on GitHub
☆161Oct 19, 2020Updated 5 years ago
Language-Media-Lab / commonsense-moral-ja
View on GitHub
☆15Nov 20, 2025Updated 8 months ago
yuzu-ai / japanese-llm-ranking
View on GitHub
☆50Apr 10, 2024Updated 2 years ago
kunishou / do-not-answer-ja
View on GitHub
☆24Dec 15, 2023Updated 2 years ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
nttcslab / japanese-dialog-transformers
View on GitHub
Code for evaluating Japanese pretrained models provided by NTT Ltd.
☆246Jun 21, 2023Updated 3 years ago
hppRC / bert-classification-tutorial
View on GitHub
【2023年版】BERTによるテキスト分類
☆234May 28, 2024Updated 2 years ago
hppRC / defsent
View on GitHub
DefSent: Sentence Embeddings using Definition Sentences
☆23Aug 5, 2021Updated 4 years ago
shimo-lab / Universal-Geometry-with-ICA
View on GitHub
Discovering Universal Geometry in Embeddings with ICA (Published in EMNLP 2023)
☆22Jun 17, 2025Updated last year
stockmarkteam / ner-wikipedia-dataset
View on GitHub
Wikipediaを用いた日本語の固有表現抽出データセット
☆143Sep 2, 2023Updated 2 years ago
sonoisa / clip-japanese
View on GitHub
日本語CLIPモデル
☆13Sep 15, 2025Updated 10 months ago
junya-takayama / DIRECT
View on GitHub
DIRECT: Direct and Indirect REsponses in Conversational Text Corpus
☆17Jul 1, 2021Updated 5 years ago