Abstract: Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most of the existing models compress ...