tongzhou21/Oasis

☆23

Alternatives and similar repositories for Oasis

Users who are interested in Oasis are comparing it to the repositories listed below.

ZigeW / data_management_LLM
Collection of training data management explorations for large language models
☆343 · Aug 2, 2024 · Updated last year
limenlp / safer-instruct
This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17 · Feb 22, 2024 · Updated 2 years ago
sail-sg / ActivePRM
☆21 · Apr 16, 2025 · Updated last year
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14 · Aug 8, 2025 · Updated 11 months ago
Qichuzyy / POA
Official implementation of ECCV24 paper: POA
☆24 · Aug 8, 2024 · Updated last year
USTC-StarTeam / ZIP
arXiv 2024 | ZIP: entropy-law data selection for efficient LLM alignment.
☆28 · Jun 10, 2026 · Updated last month
zhuyunqi96 / LoraLPrun
☆13 · May 21, 2023 · Updated 3 years ago
EleutherAI / pile_dedupe
Pile Deduplication Code
☆18 · May 15, 2023 · Updated 3 years ago
NEUIR / ConAE
[EMNLP 2022] This is the code repo for our EMNLP’22 paper "Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder"…
☆13 · Oct 20, 2022 · Updated 3 years ago
cedrickchee / llama
Inference code for LLaMA 2 models
☆31 · Jul 7, 2024 · Updated 2 years ago
plm-team / PLM
PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
☆21 · Mar 18, 2025 · Updated last year
lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
shjwudp / c4-dataset-script
Inspired by Google C4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese…
☆136 · Jun 7, 2023 · Updated 3 years ago
nelson-liu / website
☆13 · Feb 5, 2022 · Updated 4 years ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 · Jun 22, 2023 · Updated 3 years ago
xyltt / LPT
This repo contains the code for Late Prompt Tuning.
☆12 · Dec 22, 2025 · Updated 7 months ago
Hu-Junfeng / PKU-Chinese-Paraphrase-Corpus
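A paraphrase corpus built from multiple Chinese translations of classic literary works. The corpus is restricted to research and teaching use. Copyright of the texts belongs to the original authors.
☆12 · Jul 26, 2018 · Updated 7 years ago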
Olivia-fsm / DoGE
Codebase for ICML submission "DOGE: Domain Reweighting with Generalization Estimation"
☆21 · Feb 29, 2024 · Updated 2 years ago
INK-USC / FiD-ICL
"FiD-ICL: A Fusion-in-Decoder Approach for Efficient In-Context Learning" (ACL 2023)
☆15 · Jul 24, 2023 · Updated 3 years ago
dheeraj7596 / Small2Large
☆18 · Feb 20, 2024 · Updated 2 years ago
EleutherAI / pile-cc
☆16 · Mar 25, 2022 · Updated 4 years ago
OpenMatch / ThinkNote
[EACL] This is the code repo for our paper "Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist C…
☆111 · Oct 9, 2025 · Updated 9 months ago
zhzihao / WikiGenBench
WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario (COLING 2025)
☆13 · Jan 5, 2025 · Updated last year
GAIR-NLP / daVinci-Dev
[ICML 2026 Oral] Agent-native Mid-training for Software Engineering
☆71 · Jun 7, 2026 · Updated last month
xueyouluo / wiki-error-extract
Extracts an error-correction corpus from Wikipedia edit-history data.
☆12 · Apr 6, 2022 · Updated 4 years ago
zhang-wei-chao / DC-PDD
This repository presents the original implementation of Pretraining Data Detection for Large Language Models: A Divergence-based Calibrat…
☆23 · May 21, 2025 · Updated last year
yangzhipeng1108 / moss-finetune-and-moss-finetune-int8
Implements fine-tuning for MOSS int8 and fixes the model-saving issue in the original MOSS project.
☆17 · Jun 1, 2023 · Updated 3 years ago
chtmp223 / suri
Suri: Multi-constraint instruction following for long-form text generation [EMNLP’24]
☆27 · Oct 3, 2025 · Updated 9 months ago
FreedomIntelligence / DPTDR
Code for COLING22 paper, DPTDR: Deep Prompt Tuning for Dense Passage Retrieval
☆26 · Aug 7, 2023 · Updated 2 years ago
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆49 · May 26, 2024 · Updated 2 years ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆74 · May 13, 2025 · Updated last year
ryanzhumich / sparc_atis_pytorch
☆10 · Oct 28, 2019 · Updated 6 years ago
leokhoa / Open-DocLLM
☆16 · Apr 3, 2024 · Updated 2 years ago
UriSha / EmbeddinglessNMT
The implementation of "Neural Machine Translation without Embeddings", NAACL 2021
☆33 · Jun 9, 2021 · Updated 5 years ago
JiangYanting / English_books_classification_Program
A small program that automatically labels English-language literature according to the Chinese Library Classification.
☆13 · Oct 29, 2024 · Updated last year
SqueezeAILab / LLM2LLM
[ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
☆196 · Mar 25, 2024 · Updated 2 years ago
freesunshine0316 / sembleu
SemBleu: A Robust Metric for AMR Parsing Evaluation
☆12 · Feb 22, 2021 · Updated 5 years ago
RAIVNLab / MatFormer-OLMo
Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference…
☆31 · Nov 14, 2023 · Updated 2 years ago