AlekseyKorshuk / chat-data-pipeline

Chat data cleaning, filtering and deduplication pipeline.

☆16

Alternatives and similar repositories for chat-data-pipeline:

Users who are interested in chat-data-pipeline are comparing it to the repositories listed below.

CarperAI / squeakily
A library for squeakily cleaning and filtering language datasets.
☆45 Updated last year
LAION-AI / riverbed
Tools for content datamining and NLP at scale
☆42 Updated 7 months ago
1rgs / tokenwiz
A clone of OpenAI's Tokenizer page for HuggingFace Models
☆44 Updated last year
deep-diver / LLM-Pref-Mark-UI
☆37 Updated last year
huu4ontocord / MDEL
Multi-Domain Expert Learning
☆67 Updated last year
chai-research / lmgym
Code base for internal reward models and PPO training
☆23 Updated last year
argilla-io / distilabel-spin-dibt
Repository containing the SPIN experiments on the DIBT 10k ranked prompts
☆24 Updated 10 months ago
teknium1 / stanford_alpaca-replit
Modified Stanford-Alpaca Trainer for Training Replit's Code Model
☆40 Updated last year
kaiokendev / cutoff-len-is-context-len
Demonstration that finetuning a RoPE model on larger sequences than the pre-trained model adapts the model's context limit
☆63 Updated last year
Narsil / hf-chat
☆25 Updated last month
CarperAI / treasure_trove
☆22 Updated last year
S1M0N38 / dspy-arxiv
Explore the use of DSPy for extracting features from PDFs 🔎
☆38 Updated 10 months ago
Birch-san / booru-embed
[WIP] Transformer to embed Danbooru labelsets
☆13 Updated 10 months ago
Birch-san / llama-play
Command-line script for inferencing from models such as LLaMA, in a chat scenario, with LoRA adaptations
☆33 Updated last year
shrahimim / Boosting-Theory-of-Mind-in-LLMs-with-Prompting
☆31 Updated last year
huggingface / huggingface-inference-toolkit
Hugging Face Inference Toolkit used to serve transformers, sentence-transformers, and diffusers models.
☆61 Updated 2 weeks ago
taylorai / onnx_embedding_models
Utilities for loading and running text embeddings with ONNX
☆43 Updated 5 months ago
geov-ai / geov
The GeoV model is a large language model designed by Georges Harik and uses Rotary Positional Embeddings with Relative distances (RoPER).…
☆121 Updated last year
pacman100 / peft-codegen-25
☆24 Updated last year
FL33TW00D / embd
GPU accelerated client-side embeddings for vector search, RAG etc.
☆66 Updated last year
Scikud / AnythingButWrappers
☆13 Updated last year
daniel-furman / sft-demos
Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets.
☆66 Updated 3 months ago
Pleias / Various-Finetuning
Set of scripts to finetune LLMs
☆36 Updated 10 months ago
scottlogic-alex / prm800k-denorm
Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format
☆27 Updated last year
arjunbansal / awesome-oss-llm-ift-rlhf
Collection of open source implementations of LLMs with IFT and RLHF that are striving to get to ChatGPT level of performance
☆51 Updated last year
ChrisHayduk / qlora-multi-gpu
QLoRA with Enhanced Multi GPU Support
☆36 Updated last year
Rallio67 / language-model-agents
Experiments with generating open-source language model assistants
☆97 Updated last year
Alignment-Lab-AI / datagen
A pipeline for using API calls to agnostically convert unstructured data into structured training data
☆29 Updated 4 months ago
cognitivecomputations / SystemChat
☆30 Updated 6 months ago
moomou / listening-with-llm
☆16 Updated last year