LLM-based metrics like GEMBA predict many ties, but the way that ties should be handled in Kendall’s tau for meta-evaluating metrics has been a longstanding issue. We propose an update to the meta-evaluation methodology to handle ties. arxiv.org/pdf/2305.14324…
First, we show that existing Kendall variants have shortcomings related to how they handle ties, and, in some cases, ties can be exploited to game the correlations. A metric could have taken advantage of this to inflate its correlations in the WMT’22 metrics shared task.
@_danieldeutsch Hi Dan, do you have code for this paper? I would like to use this meta-evaluation to validate my current metrics.