turns out ACL reviewers kinda hated it, so arxiv it is. it's not a huge paper, but I think it has a message that's worth sharing. the title is "Two Kinds of Recall". arxiv.org/pdf/2303.10527…
the story is that we traditionally think of pattern/rule-based systems as "precise but low recall" and neural/learned systems as "(a bit less) precise but (very) high recall". and this is true... sort of. there are two kinds of recall, and neural systems fall short on the 2nd one
the first kind of recall ("d-recall") is about diversity: the ability to cover many distinct cases. the second kind ("e-recall") is about exhaustiveness: being consistently right on all the sentences that adhere to some pattern.
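to make the distinction concrete, here's a minimal toy sketch (my own illustrative formulation, not the paper's exact metrics): given per-sentence correctness labels grouped by syntactic pattern, d-recall rewards covering many distinct patterns at least once, while e-recall asks how consistently the model handles each pattern.

```python
from collections import defaultdict

def group_by_pattern(results):
    """Group (pattern, correct) records into pattern -> [bool, ...]."""
    by_pattern = defaultdict(list)
    for pattern, correct in results:
        by_pattern[pattern].append(correct)
    return by_pattern

def d_recall(results):
    """Diversity: fraction of distinct patterns handled at least once."""
    by_pattern = group_by_pattern(results)
    covered = sum(1 for v in by_pattern.values() if any(v))
    return covered / len(by_pattern)

def e_recall(results):
    """Exhaustiveness: worst-case per-pattern accuracy."""
    by_pattern = group_by_pattern(results)
    return min(sum(v) / len(v) for v in by_pattern.values())

# toy data: (pattern_id, was_the_model_correct_on_this_sentence)
results = [
    ("appositive", True), ("appositive", True), ("appositive", False),
    ("passive", True), ("passive", True), ("passive", True),
    ("cleft", True), ("cleft", False), ("cleft", False),
]
print(d_recall(results))  # 1.0 -> looks great: every pattern covered at least once
print(e_recall(results))  # 0.3333333333333333 -> but far from exhaustive on "cleft"
```

the point of the sketch: an aggregate accuracy or a "did it ever get this pattern right" view can look fine while the worst-case per-pattern view exposes the blind spot.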
and neural systems are not good at e-recall. they have weird and unexpected blind spots. i demonstrate it with a quick experiment. i look at sentences that follow a very clear syntactic pattern, and then ask a SQuAD-QA model about them. and it fails to identify many of them.
yes, the Q is a bit odd and unnatural. but it does manage to answer many other sentences with this pattern and question. just not these ones. why? who knows. does it "understand" language? eh. is this particular example fixable? sure. is the underlying issue fixable? much harder.
after review, i tried also with text-davinci-003. it is much better of course. but... still not exhaustive.
i demonstrated this for a qa/ie task and a simple case, but i think the issue is much more pervasive. models have blind spots. they are not consistent/exhaustive. and our datasets *suck* at evaluating e-recall (exhaustiveness). this is an issue. we should attempt to eval better.
how do we do better? i don't know. this is hard. i leave it as a challenge. but we really should find a way to evaluate also the exhaustiveness of models, and not just their diversity. (we don't want to be beaten by pattern-based systems, do we?)
@yoavgo Seems like you’re talking about consistency, rather than exhaustiveness.
@arumshisky i thought about this term at some point, but consistency is not a recall-oriented property, imo. no?