kkk-an / COFFTEA

Code for the Findings of EMNLP 2023 paper "Coarse-to-Fine Dual Encoders are Better Frame Identification Learners"

☆12

Alternatives and similar repositories for COFFTEA

Users who are interested in COFFTEA are comparing it to the repositories listed below.

Yifan-Song793 / GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
☆30 · Updated 10 months ago
hkust-nlp / Laser
Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
☆41 · Updated 2 weeks ago
TobiasLee / VEC
Visual and Embodied Concepts evaluation benchmark
☆21 · Updated last year
TingchenFu / MathIF
Instruction-following benchmark for large reasoning models
☆28 · Updated last week
Spico197 / MoE-SFT
🍼 Official implementation of Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
☆38 · Updated 8 months ago
SihengLi99 / TextBind
[ACL 2024] TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
☆46 · Updated last year
dqxiu / KAssess
☆14 · Updated last year
lfy79001 / S3Eval
[NAACL 2024] A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models
☆32 · Updated 11 months ago
hkust-nlp / mstar
[ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning
☆60 · Updated 5 months ago
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆16 · Updated 5 months ago
GAIR-NLP / MoPS
[ACL 2024] Code for "MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation"
☆36 · Updated 10 months ago
GAIR-NLP / BeHonest
BeHonest: Benchmarking Honesty in Large Language Models
☆33 · Updated 9 months ago
chenllliang / MMEvalPro
[NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs
☆24 · Updated 8 months ago
kkk-an / UltraIF
Code for the paper "UltraIF: Advancing Instruction Following from the Wild".
☆13 · Updated 2 months ago
CriticBench / CriticBench
[ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
☆25 · Updated last year
PKU-TANGENT / ConFiguRe
Dataset and baseline for the COLING 2022 long paper (oral): "ConFiguRe: Exploring Discourse-level Chinese Figures of Speech"
☆11 · Updated last year
lscpku / VITATECS
☆18 · Updated 10 months ago
M3-IT / YING-VLM
Vision Large Language Models trained on M3IT instruction tuning dataset
☆17 · Updated last year
sail-sg / ActivePRM
☆15 · Updated last month
YujieLu10 / Seeker
☆10 · Updated last year
Mikivishy / FullFront
The official code repository for the FullFront benchmark
☆16 · Updated 2 weeks ago
GAIR-NLP / weak-to-strong-reasoning
☆59 · Updated 9 months ago
kiaia / GIRAFFE
Extending context length of visual language models
☆11 · Updated 5 months ago
zhaochen0110 / Timo
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
☆21 · Updated 7 months ago
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆50 · Updated 5 months ago
RLHFlow / RAFT
This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re…
☆31 · Updated 8 months ago
lancopku / MUKI
[Findings of EMNLP22] From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models
☆19 · Updated 2 years ago
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆22 · Updated 5 months ago
hkust-nlp / GUIMid
☆18 · Updated last month
chtmp223 / suri
Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)
☆22 · Updated 6 months ago