AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆14 · Updated 10 months ago
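The four stages named in the description can be pictured with a minimal sketch. This is not Setu's actual API: it uses plain Python rather than Apache Spark, and every function name, the `min_words` threshold, and the hash-based exact deduplication are assumptions chosen only to illustrate the order of the stages.

```python
import hashlib
import re


def prepare(raw_docs):
    """Stage 1 (hypothetical): wrap raw text in a uniform document record."""
    return [{"id": i, "text": text} for i, text in enumerate(raw_docs)]


def clean_and_analyze(doc):
    """Stage 2 (hypothetical): normalize whitespace and compute a simple statistic."""
    text = re.sub(r"\s+", " ", doc["text"]).strip()
    return {**doc, "text": text, "word_count": len(text.split())}


def flag_and_filter(docs, min_words=3):
    """Stage 3 (hypothetical): flag documents that are too short, then drop them."""
    flagged = [{**d, "too_short": d["word_count"] < min_words} for d in docs]
    return [d for d in flagged if not d["too_short"]]


def deduplicate(docs):
    """Stage 4 (hypothetical): exact deduplication by hashing the cleaned text."""
    seen, unique = set(), []
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def run_pipeline(raw_docs, min_words=3):
    """Run the four stages in order and return the surviving documents."""
    docs = prepare(raw_docs)
    docs = [clean_and_analyze(d) for d in docs]
    docs = flag_and_filter(docs, min_words)
    return deduplicate(docs)
```

Calling `run_pipeline` on a list of raw strings returns the cleaned, filtered, and deduplicated records. A real Spark-based pipeline would express each stage as a DataFrame transformation, and would typically use fuzzy rather than exact deduplication.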
Alternatives and similar repositories for setu:
Users who are interested in setu are comparing it to the libraries listed below.
- Data extraction with an LLM on CPU ☆68 · Updated last year
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆33 · Updated 9 months ago
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ☆101 · Updated 2 weeks ago
- RAG with PostgreSQL (Nebius) and pgvector ☆24 · Updated 4 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆67 · Updated 4 months ago
- Embed anything. ☆29 · Updated 10 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆145 · Updated 2 months ago
- Supervised instruction finetuning for LLMs with HF Trainer and DeepSpeed ☆34 · Updated last year
- Dynamic Metadata based RAG Framework ☆72 · Updated 7 months ago
- Text to Python Objects via an LLM Function Call ☆57 · Updated last year
- This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resultin… ☆23 · Updated last year
- Testing PaliGemma 2 finetuning on reasoning dataset ☆18 · Updated 2 months ago
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆105 · Updated 5 months ago
- Agentic RAG to help you build a startup 🚀 ☆16 · Updated 2 weeks ago
- Writing Blog Posts with Generative Feedback Loops! ☆47 · Updated last year
- Collection of recipes aiding Gen AI model development ☆100 · Updated 2 weeks ago
- Simple examples using Argilla tools to build AI ☆53 · Updated 4 months ago
- A project that enables identification and classification of the intent of a message with dynamic labels ☆36 · Updated 3 months ago
- Radiantloom Email Assist 7B is an email-assistant large language model fine-tuned from Zephyr-7B-Beta, over a custom-curated dataset of 1… ☆14 · Updated last year
- Repository for fine-tuning Gemma models using Unsloth for Indic languages ☆89 · Updated last year
- Quick Notebook Tutorials ☆32 · Updated last month
- Running load tests on a FastAPI application using Locust ☆13 · Updated 4 months ago
- Example implementation of Iteration of Thought. Give it a star if you like the project ☆39 · Updated 3 months ago