microsoft / lost_in_conversation

Code that accompanies the public release of the paper Lost in Conversation (https://arxiv.org/abs/2505.06120)

☆148

Alternatives and similar repositories for lost_in_conversation

Users who are interested in lost_in_conversation compare it to the repositories listed below.

zai-org / ComplexFuncBench
Complex Function Calling Benchmark.
☆123 Updated 6 months ago
Nardien / agent-distillation
Official Code Repository for the paper "Distilling LLM Agent into Small Models with Retrieval and Code Tools"
☆130 Updated this week
NVIDIA / When2Call
A dataset for training and evaluating LLMs on decision making about "when (not) to call" functions
☆30 Updated 3 months ago
THU-KEG / Agentic-Reward-Modeling
[ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
☆99 Updated last month
facebookresearch / ReasonIR
Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks".
☆188 Updated last month
AlexCuadron / ThinkingAgent
Systematic evaluation framework that automatically rates overthinking behavior in large language models.
☆91 Updated 2 months ago
dwzhu-pku / LongEmbed
LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)
☆139 Updated 8 months ago
voidism / Lookback-Lens
Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps"
☆130 Updated 11 months ago
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆169 Updated 3 weeks ago
InternLM / SWE-Fixer
☆108 Updated 2 months ago
sunnynexus / RetroLLM
RetroLLM: Empowering LLMs to Retrieve Fine-grained Evidence within Generation [ACL 2025]
☆115 Updated 6 months ago
huggingface / fineweb-2
☆174 Updated last month
bespokelabsai / verifiers
Verifiers for LLM Reinforcement Learning
☆68 Updated 3 months ago
QwenLM / WorldPM
☆90 Updated 2 months ago
sail-sg / sailcraft
🚢 Data Toolkit for Sailor Language Models
☆94 Updated 5 months ago
Ayanami0730 / deep_research_bench
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
☆245 Updated this week
SALT-NLP / collaborative-gym
Framework and toolkits for building and evaluating collaborative agents that can work together with humans.
☆91 Updated 3 months ago
SALT-NLP / demonstrated-feedback
☆125 Updated 10 months ago
DataArcTech / LLM-as-a-Judge
☆128 Updated 4 months ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆233 Updated 9 months ago
liuqi6777 / llm4ranking
Large language models for document ranking.
☆64 Updated 2 months ago
TIGER-AI-Lab / General-Reasoner
General Reasoner: Advancing LLM Reasoning Across All Domains
☆156 Updated last month
orionw / promptriever
The first dense retrieval model that can be prompted like an LM
☆81 Updated 2 months ago
Liyan06 / MiniCheck
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [EMNLP 2024]
☆174 Updated 7 months ago
ScalerLab / JudgeBench
☆91 Updated 9 months ago
ReasoningTransfer / Transferability-of-LLM-Reasoning
☆80 Updated 2 weeks ago
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
wang-research-lab / agentinstruct
Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners"
☆115 Updated 10 months ago
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆161 Updated 2 weeks ago
salesforce / summary-of-a-haystack
Codebase accompanying the Summary of a Haystack paper.
☆79 Updated 10 months ago