AlekseyKorshuk / chat-data-pipeline
Chat data cleaning, filtering and deduplication pipeline.
☆16 Updated last year
Alternatives and similar repositories for chat-data-pipeline:
Users who are interested in chat-data-pipeline compare it to the libraries listed below.
- A library for squeakily cleaning and filtering language datasets. ☆45 Updated last year
- Tools for content datamining and NLP at scale ☆42 Updated 7 months ago
- A clone of OpenAI's Tokenizer page for HuggingFace Models ☆44 Updated last year
- ☆37 Updated last year
- Multi-Domain Expert Learning ☆67 Updated last year
- Code base for internal reward models and PPO training ☆23 Updated last year
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts ☆24 Updated 10 months ago
- Modified Stanford-Alpaca Trainer for Training Replit's Code Model ☆40 Updated last year
- Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit ☆63 Updated last year
- ☆25 Updated last month
- ☆22 Updated last year
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆38 Updated 10 months ago
- [WIP] Transformer to embed Danbooru labelsets ☆13 Updated 10 months ago
- Command-line script for inferencing from models such as LLaMA, in a chat scenario, with LoRA adaptations ☆33 Updated last year
- ☆31 Updated last year
- Hugging Face Inference Toolkit used to serve transformers, sentence-transformers, and diffusers models. ☆61 Updated 2 weeks ago
- utilities for loading and running text embeddings with onnx ☆43 Updated 5 months ago
- The GeoV model is a large language model designed by Georges Harik and uses Rotary Positional Embeddings with Relative distances (RoPER).… ☆121 Updated last year
- ☆24 Updated last year
- GPU accelerated client-side embeddings for vector search, RAG etc. ☆66 Updated last year
- ☆13 Updated last year
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆66 Updated 3 months ago
- Set of scripts to finetune LLMs ☆36 Updated 10 months ago
- Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format ☆27 Updated last year
- Collection of open source implementations of LLMs with IFT and RLHF that are striving to get to ChatGPT level of performance ☆51 Updated last year
- QLoRA with Enhanced Multi GPU Support ☆36 Updated last year
- Experiments with generating opensource language model assistants ☆97 Updated last year
- a pipeline for using api calls to agnostically convert unstructured data into structured training data ☆29 Updated 4 months ago
- ☆30 Updated 6 months ago
- ☆16 Updated last year