stanford-crfm / EUAIActJune15
Stanford CRFM's initiative to assess potential compliance with the draft EU AI Act
☆92 · Updated 11 months ago
Related projects:
- Fiddler Auditor is a tool to evaluate language models. ☆163 · Updated 6 months ago
- The Foundation Model Transparency Index ☆65 · Updated 3 months ago
- 📚 A curated list of papers & technical articles on AI Quality & Safety ☆155 · Updated 11 months ago
- Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning ☆40 · Updated 9 months ago
- Sample notebooks and prompts for LLM evaluation ☆104 · Updated 5 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆96 · Updated 10 months ago
- WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting. ☆27 · Updated last month
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆109 · Updated 11 months ago
- AI Verify ☆111 · Updated this week
- This is the reproduction repository for my 🤗 Hugging Face blog post on synthetic data ☆57 · Updated 7 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆86 · Updated 3 months ago
- TalkToModel gives anyone the power of XAI through natural language conversations 💬! ☆108 · Updated last year
- This is an open-source tool to assess and improve the trustworthiness of AI systems. ☆70 · Updated this week
- A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, to… ☆50 · Updated this week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆93 · Updated 5 months ago
- Let's build better datasets, together! ☆195 · Updated last month
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆209 · Updated 8 months ago
- Functional Benchmarks and the Reasoning Gap ☆74 · Updated last month
- Mistral + Haystack: build RAG pipelines that rock 🤘 ☆99 · Updated 7 months ago
- Building a chatbot powered by a RAG pipeline to read, summarize, and quote the most relevant papers related to the user query. ☆161 · Updated 4 months ago
- Automating enterprise workflows with multimodal agents ☆83 · Updated last month
- Red-Teaming Language Models with DSPy ☆116 · Updated 5 months ago
- Make it easy to automatically and uniformly measure the behavior of many AI systems. ☆25 · Updated last week
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 · Updated 2 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆65 · Updated 2 months ago
- Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate. ☆91 · Updated this week
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆50 · Updated this week