Abstract: This paper presents a grounded language-image pretraining (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and ...