AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆16 Updated last year
Alternatives and similar repositories for setu
Users who are interested in setu compare it to the libraries listed below.
- Data extraction with LLM on CPU ☆68 Updated last year
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆106 Updated 8 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 Updated 10 months ago
- This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resultin… ☆23 Updated last year
- Repository for fine-tuning gemma models using unsloth for indic languages ☆92 Updated last year
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ☆105 Updated 2 months ago
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆35 Updated 11 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 7 months ago
- A simple, consistent and extendable toolkit for IndicTrans2 ☆32 Updated last month
- Example implementation of Iteration of Thought - Give a star if you like the project ☆41 Updated 5 months ago
- A project that enables identification and classification of an intent of a message with dynamic labels ☆39 Updated 5 months ago
- Fine-tune an LLM to perform batch inference and online serving. ☆111 Updated last week
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆70 Updated 7 months ago
- Repository containing awesome resources regarding Hugging Face tooling. ☆47 Updated last year
- ☆43 Updated 3 months ago
- Repository of the code base for KT Generation process that we worked at Google Cloud and Searce GenAI Hackathon. ☆74 Updated last year
- ☆59 Updated 2 weeks ago
- This repo is the central repo for all the RAG Evaluation reference material and partner workshop ☆64 Updated last month
- Using various instructor clients evaluating the quality and capabilities of extractions and reasoning. ☆51 Updated 8 months ago
- Build Agentic workflows with function calling using open LLMs ☆26 Updated this week
- Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs ☆26 Updated 4 months ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆16 Updated last year
- ☆92 Updated 2 months ago
- Code repository for "Introducing Airavata: Hindi Instruction-tuned LLM" ☆59 Updated 7 months ago
- Writing Blog Posts with Generative Feedback Loops! ☆48 Updated last year
- Diagnose the performance of your RAG🩺 ☆36 Updated 2 months ago
- Machine Learning Serving focused on GenAI with simplicity as the top priority. ☆58 Updated last month
- A framework for fine-tuning retrieval-augmented generation (RAG) systems. ☆87 Updated this week
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers. ☆33 Updated 3 weeks ago
- AI agent with RAG+ReAct on Indian Constitution & BNS ☆65 Updated 7 months ago