AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆16 · Updated last year
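The four stages named in the description can be sketched roughly as follows. This is a minimal illustration in plain Python, not Setu's actual API and not Spark code: every function name, the `min_words` threshold, and the exact-hash deduplication are assumptions chosen for the example.

```python
import hashlib
import re


def prepare(raw_docs):
    """Stage 1 (document preparation): wrap raw strings as records with an id."""
    return [{"id": i, "text": text} for i, text in enumerate(raw_docs)]


def clean(doc):
    """Stage 2 (cleaning and analysis): collapse whitespace, record a word count."""
    text = re.sub(r"\s+", " ", doc["text"]).strip()
    return {**doc, "text": text, "n_words": len(text.split())}


def flag_and_filter(docs, min_words=3):
    """Stage 3 (flagging and filtering): drop documents below a word-count threshold."""
    return [d for d in docs if d["n_words"] >= min_words]


def deduplicate(docs):
    """Stage 4 (deduplication): keep the first document for each exact text hash."""
    seen = set()
    unique = []
    for d in docs:
        digest = hashlib.sha256(d["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def run_pipeline(raw_docs, min_words=3):
    """Chain the four hypothetical stages in the order the description lists them."""
    docs = [clean(d) for d in prepare(raw_docs)]
    return deduplicate(flag_and_filter(docs, min_words))


if __name__ == "__main__":
    sample = [
        "Hello   world  again",
        "hello world again",
        "too short",
        "Another distinct document here",
    ]
    for doc in run_pipeline(sample):
        print(doc["id"], doc["text"])
```

In a Spark-based pipeline, each stage would more likely be a transformation over a distributed DataFrame, and deduplication is often approximate rather than an exact hash match; the sketch only shows the order of the stages.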
Alternatives and similar repositories for setu
Users who are interested in setu are comparing it to the libraries listed below.
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆36 · Updated last year
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 · Updated 11 months ago
- ☆20 · Updated 8 months ago
- Data extraction with LLM on CPU ☆68 · Updated last year
- Quick Notebook Tutorials ☆32 · Updated 4 months ago
- ☆1 · Updated 11 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 · Updated 7 months ago
- Repository for fine-tuning gemma models using unsloth for indic languages ☆94 · Updated last year
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 · Updated last year
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆107 · Updated 8 months ago
- Writing Blog Posts with Generative Feedback Loops! ☆48 · Updated last year
- Repository of the code base for the KT Generation process that we worked on at the Google Cloud and Searce GenAI Hackathon. ☆74 · Updated last year
- ☆31 · Updated last year
- OpenMindedChatbot is a Proof Of Concept that leverages the power of Open source Large Language Models (LLM) with Function Calling capabil… ☆29 · Updated last year
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ☆105 · Updated 2 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆149 · Updated 5 months ago
- ☆31 · Updated 5 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆42 · Updated last year
- Testing speed and accuracy of RAG with and without a Cross Encoder Reranker. ☆48 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 · Updated 5 months ago
- KMD is a collection of conversational exchanges between patients and doctors on various medical topics. It aims to capture the intricaci… ☆24 · Updated last year
- Embed anything. ☆28 · Updated last year
- Radiantloom Email Assist 7B is an email-assistant large language model fine-tuned from Zephyr-7B-Beta, over a custom-curated dataset of 1… ☆14 · Updated last year
- Collection of resources for RL and Reasoning ☆25 · Updated 4 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆58 · Updated last month
- Repository containing awesome resources regarding Hugging Face tooling. ☆47 · Updated last year
- Code repository for "Introducing Airavata: Hindi Instruction-tuned LLM" ☆59 · Updated 8 months ago
- Using modal.com to process FineWeb-edu data ☆20 · Updated 2 months ago
- GenAI Experimentation ☆57 · Updated 2 months ago
- ☆19 · Updated 10 months ago