AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆15 · Updated last year
Alternatives and similar repositories for setu
Users who are interested in setu are comparing it to the libraries listed below.
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆38 · Updated last year
- Solving data for LLMs - Create quality synthetic datasets! ☆150 · Updated 11 months ago
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆78 · Updated last year
- Recipes and resources for building, deploying, and fine-tuning generative AI with Fireworks. ☆130 · Updated this week
- Writing Blog Posts with Generative Feedback Loops! ☆50 · Updated last year
- Verbosity control for AI agents ☆64 · Updated last year
- Repository for fine-tuning gemma models using unsloth for indic languages ☆97 · Updated last year
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆68 · Updated last month
- Simple Graph Memory for AI applications ☆89 · Updated 7 months ago
- Data extraction with LLM on CPU ☆68 · Updated 2 years ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 · Updated 2 years ago
- AnyModal is a Flexible Multimodal Language Model Framework for PyTorch ☆103 · Updated 11 months ago
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆71 · Updated last year
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆51 · Updated last year
- Simple examples using Argilla tools to build AI ☆56 · Updated last year
- Embed anything. ☆27 · Updated last year
- Supervised instruction finetuning for LLM with HF trainer and Deepspeed ☆36 · Updated 2 years ago
- Dataset Viber is your chill repo for data collection, annotation and vibe checks. ☆45 · Updated last year
- PyLate efficient inference engine ☆68 · Updated 3 months ago
- DSPy program/pipeline inspector widget for Jupyter/VSCode Notebooks. ☆43 · Updated last year
- This repo is the central repo for all the RAG Evaluation reference material and partner workshop ☆77 · Updated 7 months ago
- Dynamic Metadata based RAG Framework ☆78 · Updated 2 weeks ago
- Machine Learning Serving focused on GenAI with simplicity as the top priority. ☆59 · Updated 2 months ago
- a LLM cookbook, for building your own from scratch, all the way from gathering data to training a model ☆166 · Updated last year
- Agentic RAG to help you build a startup🚀 ☆55 · Updated 8 months ago
- LLM-Training-API: Including Embeddings & ReRankers, mergekit, LaserRMT ☆27 · Updated last year