CarperAI/squeakily

A library for squeakily cleaning and filtering language datasets.

☆50

Alternatives and similar repositories for squeakily

Users who are interested in squeakily are comparing it to the libraries listed below.

CarperAI / treasure_trove
☆21 · Aug 27, 2023 · Updated 2 years ago

serega / gaoya
Locality Sensitive Hashing
☆81 · May 29, 2026 · Updated last month

vaguenebula / AlpacaDataReflect
An experiment to see if chatgpt can improve the output of the stanford alpaca dataset
☆12 · Mar 29, 2023 · Updated 3 years ago

laramohan / wikillm
LLMs as Collaboratively Edited Knowledge Bases
☆52 · Feb 8, 2026 · Updated 5 months ago

CarperAI / decontamination
This repository contains code for cleaning your training data of benchmark data to help combat data snooping.
☆28 · Apr 21, 2023 · Updated 3 years ago

Alignment-Lab-AI / datagen
a pipeline for using api calls to agnostically convert unstructured data into structured training data
☆32 · Sep 22, 2024 · Updated last year

r-three / RAD
Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
☆45 · Oct 1, 2025 · Updated 9 months ago

davidbrochart / ipyhtmx
Build modern UIs in Jupyter with Python
☆12 · Dec 28, 2022 · Updated 3 years ago

teknium1 / RawTransform
A repository of prompts and Python scripts for intelligent transformation of raw text into diverse formats.
☆34 · May 29, 2023 · Updated 3 years ago

lucidrains / autoregressive-linear-attention-cuda
CUDA implementation of autoregressive linear attention, with all the latest research findings
☆46 · May 23, 2023 · Updated 3 years ago

rian-dolphin / fasthtml-chat
A chat implementation for FastHTML
☆12 · Sep 14, 2025 · Updated 10 months ago

emrgnt-cmplxty / SmolTrainer
☆21 · Oct 6, 2023 · Updated 2 years ago

huu4ontocord / MDEL
Multi-Domain Expert Learning
☆66 · Jan 23, 2024 · Updated 2 years ago

Gryphe / BlockMerge_Gradient
Merge Transformers language models by use of gradient parameters.
☆214 · Aug 8, 2024 · Updated last year

dvruette / concept-guidance
Code accompanying the paper "A Language Model's Guide Through Latent Space". It contains functionality for training and using concept vec…
☆21 · Feb 23, 2024 · Updated 2 years ago

luohongyin / EntST
Entailment self-training
☆27 · May 30, 2023 · Updated 3 years ago

johnrobinsn / alpaca_lora_30b_4bit
☆18 · Apr 3, 2023 · Updated 3 years ago

oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago

ChrisHayduk / qlora-multi-gpu
QLoRA with Enhanced Multi GPU Support
☆38 · Aug 8, 2023 · Updated 2 years ago

thunlp / CSS-LM
CSS-LM: Contrastive Semi-supervised Fine-tuning of Pre-trained Language Models
☆11 · Jul 1, 2023 · Updated 3 years ago

phihung / fh_utils
A collection of utilities for FastHTML projects.
☆14 · Oct 23, 2024 · Updated last year

alea-institute / kl3m-data
KL3M training data collection and preprocessing
☆22 · Apr 14, 2025 · Updated last year

reactorsh / ambrosia
clean up your LLM datasets
☆113 · May 30, 2023 · Updated 3 years ago

kernelmachine / cbtm
Code repository for the c-BTM paper
☆109 · Sep 26, 2023 · Updated 2 years ago

kaiokendev / cutoff-len-is-context-len
Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit
☆62 · Jun 21, 2023 · Updated 3 years ago

hadasah / btm
☆79 · Apr 29, 2024 · Updated 2 years ago

commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jun 19, 2026 · Updated last month

joeljang / ELM
[ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning
☆99 · Apr 26, 2023 · Updated 3 years ago

TehVenomm / LM_Transformers_BlockMerge
Image Diffusion block merging technique applied to transformers based Language Models.
☆55 · May 8, 2023 · Updated 3 years ago

lablab-ai / webgpu-llm-chrome-extension-starter
WebLLM Chrome Extension Starter Pack.
☆12 · Aug 10, 2023 · Updated 2 years ago

EleutherAI / polyglot-data
data related codebase for polyglot project
☆19 · Mar 30, 2023 · Updated 3 years ago

cooelf / CompassMTL
Task Compass: Scaling Multi-task Pre-training with Task Prefix (EMNLP 2022: Findings) (stay tuned & more will be updated)
☆22 · Oct 17, 2022 · Updated 3 years ago

gonglinyuan / metro_t0
Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023)
☆22 · Nov 1, 2023 · Updated 2 years ago

SkunkworksAI / hydra-moe
☆416 · Nov 2, 2023 · Updated 2 years ago

lucidrains / st-moe-pytorch
Implementation of ST-Moe, the latest incarnation of MoE after years of research at Brain, in Pytorch
☆385 · Jun 17, 2024 · Updated 2 years ago

Exant64 / CWE
☆13 · Updated this week

crowsonkb / torch-dist-utils
Utilities for PyTorch distributed
☆26 · Feb 27, 2025 · Updated last year

fastai / apl-study
fast.ai APL study group notes
☆29 · Oct 19, 2022 · Updated 3 years ago

jiacheng-ye / ZeroGen
[EMNLP 2022] Code for our paper “ZeroGen: Efficient Zero-shot Learning via Dataset Generation”.
☆47 · Feb 18, 2022 · Updated 4 years ago