This repository will provide the details and code for the model, dataset, and benchmark of LLaVA-ST, a model designed for fine-grained spatial-temporal multimodal understanding. LLaVA-ST demonstrates ...