tcapelle / mistral_wandb
A full-fledged mistral+wandb
☆13 · Updated 11 months ago
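The description is terse, so here is a minimal sketch of what "mistral+wandb" typically involves: calling the Mistral chat API and logging the prompt, response, and token usage to a Weights & Biases run. This is not taken from the repository itself; it assumes the v1 `mistralai` SDK, a `MISTRAL_API_KEY` environment variable, and a hypothetical W&B project name.

```python
import os

import wandb
from mistralai import Mistral  # assumes the v1 mistralai SDK

# Hypothetical project name; the repo's actual W&B setup may differ.
run = wandb.init(project="mistral-wandb-demo")

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

prompt = "Summarize what Weights & Biases is in one sentence."
response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content

# Log token counts as metrics and the text pair as a table row,
# so each call is visible in the W&B run.
run.log(
    {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "chat": wandb.Table(columns=["prompt", "response"], data=[[prompt, answer]]),
    }
)
run.finish()
```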
Alternatives and similar repositories for mistral_wandb
Users interested in mistral_wandb are comparing it to the libraries listed below.
- A small library of LLM judges ☆232 · Updated 3 weeks ago
- ☆145 · Updated 11 months ago
- This is the reproduction repository for my 🤗 Hugging Face blog post on synthetic data ☆68 · Updated last year
- A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. ☆163 · Updated this week
- Sample notebooks and prompts for LLM evaluation ☆135 · Updated last month
- ☆20 · Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆79 · Updated 9 months ago
- A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use… ☆123 · Updated last week
- Inference-time scaling for LLMs-as-a-judge. ☆251 · Updated this week
- ☆40 · Updated last year
- LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments ☆219 · Updated this week
- ☆97 · Updated 2 weeks ago
- ☆53 · Updated last year
- Official Repo for CRMArena and CRMArena-Pro ☆101 · Updated 3 weeks ago
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆250 · Updated 9 months ago
- Official Code Release for "Training a Generally Curious Agent" ☆28 · Updated 2 months ago
- This is the repo for the LegalBench-RAG paper: https://arxiv.org/abs/2408.10343 ☆108 · Updated last month
- Source code for the collaborative reasoner research project at Meta FAIR. ☆95 · Updated 3 months ago
- Includes examples of how to evaluate LLMs ☆23 · Updated 8 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 · Updated last year
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆98 · Updated 8 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆60 · Updated 10 months ago
- ☆70 · Updated this week
- ☆52 · Updated last year
- Learning to route instances for Human vs AI Feedback (ACL 2025 Main) ☆23 · Updated 2 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆128 · Updated last year
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆217 · Updated last year
- ARAGOG - Advanced RAG Output Grading. Exploring and comparing various Retrieval-Augmented Generation (RAG) techniques on AI research paper… ☆107 · Updated last year
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆101 · Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 · Updated 4 months ago