This is probably the first encoder-decoder model in the recent model race.
@srchvrs Most of the multimodal LLMs I know are enc-dec models, which I believe include Gemini and GPT-4V.
@sir_deenicus @xwang_lk Agree it's not common. I have double-checked that Gemini is decoder-only.
@srchvrs @sir_deenicus Uncommon for text-only LLMs but common for multimodal LLMs. I think you are referring to text-only Gemini.
@srchvrs @sir_deenicus By Gemini. Currently, it is very difficult to build a decoder-only model that encodes video, images, text, and audio all within the same sequence. It is worth researching but quite challenging.
@xwang_lk @srchvrs @sir_deenicus google-research.github.io/seanet/audiopa… AudioPaLM is text-audio decoder-only