AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆16 Updated 11 months ago
Alternatives and similar repositories for setu
Users who are interested in setu compare it to the libraries listed below.
- Data extraction with LLM on CPU ☆69 Updated last year
- ☆19 Updated 6 months ago
- OpenMindedChatbot is a Proof Of Concept that leverages the power of Open source Large Language Models (LLM) with Function Calling capabil… ☆29 Updated last year
- Embed anything. ☆29 Updated 11 months ago
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆106 Updated 7 months ago
- Quick Notebook Tutorials ☆32 Updated 3 months ago
- ☆32 Updated last year
- ☆18 Updated 11 months ago
- Writing Blog Posts with Generative Feedback Loops! ☆47 Updated last year
- A seamless matchmaking application that is programmed with Cohere Command R+, Stanford NLP DSPy framework, Weaviate Vector store and Crew… ☆59 Updated last year
- Fun project: LLM powered RAG Discord Bot that works seamlessly on CPU ☆32 Updated last year
- RAG with PostgreSQL (Nebius) and pgvector ☆24 Updated 5 months ago
- ☆1 Updated 10 months ago
- A project that enables identification and classification of an intent of a message with dynamic labels ☆39 Updated 4 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 6 months ago
- GenAI Experimentation ☆58 Updated 3 weeks ago
- Build a Streamlit Chatbot using Langchain, ColBERT, Ragatouille, and ChromaDB ☆119 Updated last year
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆35 Updated 11 months ago
- Tools for formatting large language model prompts. ☆13 Updated last year
- Repository for fine-tuning gemma models using unsloth for indic languages ☆92 Updated last year
- Example implementation of Iteration of Thought - Give a star if you like the project ☆41 Updated 4 months ago
- Data extraction with LLM on CPU ☆85 Updated last year
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 Updated 10 months ago
- Verbosity control for AI agents ☆63 Updated 11 months ago
- Mistral + Haystack: build RAG pipelines that rock 🤘 ☆103 Updated last year
- Dynamic Metadata based RAG Framework ☆75 Updated 9 months ago
- ☆39 Updated last year
- This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resultin… ☆23 Updated last year
- Radiantloom Email Assist 7B is an email-assistant large language model fine-tuned from Zephyr-7B-Beta, over a custom-curated dataset of 1… ☆14 Updated last year
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ☆103 Updated last month