turns out ACL reviewers kinda hated it, so arxiv it is. it's not a huge paper, but I think it has a message that's worth sharing. the title is "Two Kinds of Recall". arxiv.org/pdf/2303.10527…
the story is that we traditionally think of pattern/rule-based systems as "precise but low recall" and neural/learned systems as "(a bit less) precise but (very) high recall". and this is true... sort of. there are two kinds of recall, and neural systems fall short on the 2nd one
the first kind of recall ("d-recall") is about diversity: the ability to cover many distinct cases. the second kind ("e-recall") is about exhaustiveness: being consistently right on all the sentences that adhere to some pattern.
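to make the distinction concrete, here's a minimal toy sketch (my own illustrative formulation, not the paper's exact metrics): given per-sentence correctness labels grouped by syntactic pattern, d-recall rewards covering many distinct patterns at least once, while e-recall asks how consistently the model handles each pattern.

```python
from collections import defaultdict

def group_by_pattern(results):
    """Group (pattern, correct) records into pattern -> [bool, ...]."""
    by_pattern = defaultdict(list)
    for pattern, correct in results:
        by_pattern[pattern].append(correct)
    return by_pattern

def d_recall(results):
    """Diversity: fraction of distinct patterns handled at least once."""
    by_pattern = group_by_pattern(results)
    covered = sum(1 for v in by_pattern.values() if any(v))
    return covered / len(by_pattern)

def e_recall(results):
    """Exhaustiveness: worst-case per-pattern accuracy."""
    by_pattern = group_by_pattern(results)
    return min(sum(v) / len(v) for v in by_pattern.values())

# toy data: (pattern_id, was_the_model_correct_on_this_sentence)
results = [
    ("appositive", True), ("appositive", True), ("appositive", False),
    ("passive", True), ("passive", True), ("passive", True),
    ("cleft", True), ("cleft", False), ("cleft", False),
]
print(d_recall(results))  # 1.0 -> looks great: every pattern covered at least once
print(e_recall(results))  # 0.3333333333333333 -> but far from exhaustive on "cleft"
```

the point of the sketch: an aggregate accuracy or a "did it ever get this pattern right" view can look fine while the worst-case per-pattern view exposes the blind spot.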
and neural systems are not good at e-recall. they have weird and unexpected blind spots. i demonstrate it with a quick experiment. i look at sentences that follow a very clear syntactic pattern, and then ask a SQuAD-QA model about them. and it fails to identify many of them.
yes, the Q is a bit odd and unnatural. but it does manage to answer many other sentences with this pattern and question. just not these ones. why? who knows. does it "understand" language? eh. is this particular example fixable? sure. is the underlying issue fixable? much harder.
after review, i tried also with text-davinci-003. it is much better of course. but... still not exhaustive.
i demonstrated this for a qa/ie task and a simple case, but i think the issue is much more pervasive. models have blind spots. they are not consistent/exhaustive. and our datasets *suck* at evaluating e-recall (exhaustiveness). this is an issue. we should attempt to eval better.
how do we do better? i don't know. this is hard. i leave it as a challenge. but we really should find a way to evaluate also the exhaustiveness of models, and not just their diversity. (we don't want to be beaten by pattern-based systems, do we?)
@yoavgo Seems like you’re talking about consistency, rather than exhaustiveness.
@arumshisky i thought about this term at some point, but consistency is not a recall-oriented property, imo. no?