AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆14 Updated 11 months ago
Alternatives and similar repositories for setu:
Users who are interested in setu are comparing it to the repositories listed below.
- ☆1 Updated 9 months ago
- Data extraction with LLM on CPU ☆68 Updated last year
- ☆74 Updated 6 months ago
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆34 Updated 10 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆145 Updated 3 months ago
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆106 Updated 6 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 5 months ago
- RAG with PostgreSQL (Nebius) and pgvector ☆24 Updated 4 months ago
- A project that enables identification and classification of the intent of a message with dynamic labels ☆38 Updated 4 months ago
- Fine-tuning ModernBERT-embed-base on synthetic domain-specific data for improvement on unseen queries ☆26 Updated 3 months ago
- Example implementation of Iteration of Thought - Give a star if you like the project ☆40 Updated 3 months ago
- GenAI Experimentation ☆58 Updated 2 months ago
- An overview of GRPO & DeepSeek-R1 Training with Open Source GRPO Model Fine Tuning ☆31 Updated 2 months ago
- Own your AI, search the web with it🌐😎 ☆84 Updated 3 months ago
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ☆102 Updated 3 weeks ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 Updated 9 months ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆105 Updated last week
- This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resultin… ☆23 Updated last year
- AI agent with RAG+ReAct on Indian Constitution & BNS ☆61 Updated 5 months ago
- Repository for fine-tuning Gemma models using Unsloth for Indic languages ☆89 Updated last year
- Build reliable, secure, and production-ready AI apps easily. ☆71 Updated this week
- RAG example using DSPy, Gradio, FastAPI ☆78 Updated last year
- OpenMindedChatbot is a Proof Of Concept that leverages the power of Open source Large Language Models (LLM) with Function Calling capabil… ☆29 Updated last year
- Reliable RAG setup that uses Semantic Double Merging Chunking from LlamaIndex, Qdrant Hybrid Search, ColBERT for reranking and Google Gem… ☆38 Updated 4 months ago
- Dataset Viber is your chill repo for data collection, annotation and vibe checks. ☆47 Updated 7 months ago
- Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) ☆76 Updated 2 months ago
- This project involves using LlamaIndex Multi Agents concierge system and Qdrant vector database to customize the RAG application with use… ☆50 Updated 8 months ago
- ☆19 Updated 6 months ago
- Fine-tune an LLM to perform batch inference and online serving. ☆109 Updated this week
- ☆29 Updated last year