Pipeline
Long video datasets are difficult to construct due to the substantial manual effort required to select, watch, understand and annotate long videos with free-form natural language. Answering challenging questions about longer videos is often a multimodal task that may involve listening to the audio track in addition to watching the video. It may also be a non-linear task, because it is sometimes necessary to rewind and rewatch key parts to answer a question. Proposing suitable high-level questions that are not trivially solved by observing just a few frames is also difficult for people to do consistently and with sufficient variety.
To solve this problem we propose a semi-automatic pipeline that first generates candidate multiple-choice questions using a variety of strong vision-language models (VLMs) and large language models (LLMs) with carefully designed prompts, and then lets human annotators filter and correct the proposed questions to reduce errors and bias. To reduce human effort, we leverage automatic tools to (1) find suitable videos, (2) extract useful signals, and then (3) automatically generate video-level captions, questions and answers.
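The division of labor between the automatic stages and the human pass can be sketched as follows. This is a hypothetical skeleton under assumed names: every function here (`find_videos`, `extract_signals`, `generate_candidates`, `run_pipeline`) is a placeholder for a stage described above, not the actual implementation.

```python
# Hypothetical skeleton of the semi-automatic pipeline; all function
# names are illustrative stand-ins for the stages described in the text.

def find_videos(pool):
    # (1) keep only suitable videos (diversity / content filters)
    return [v for v in pool if v.get("suitable", True)]

def extract_signals(video):
    # (2) useful signals, e.g. an ASR transcript and per-frame captions
    return {"asr": video.get("asr", ""), "frames": video.get("frames", [])}

def generate_candidates(llm, signals):
    # (3) captions -> candidate multiple-choice questions via an LLM
    return llm(signals)

def run_pipeline(pool, llm, rater):
    candidates = [generate_candidates(llm, extract_signals(v))
                  for v in find_videos(pool)]
    # Human annotators filter/correct proposals to reduce errors and bias.
    return [c for c in candidates if rater(c)]
```

The key design point is that models only *propose* questions; nothing enters the dataset without passing the human `rater` at the end.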
Our pipeline begins with the selection of video content. We filter videos to increase visual and demographic diversity. We also remove videos with mostly static content, as well as gaming videos and animated content. In the next stage, we extract two types of captions from the resulting videos: automatic speech recognition (ASR) captions and frame captions. For the latter, we prompt a VLM to describe video frames sampled at one frame per second. The next step summarizes these captions by segmenting the video into shots, grouping them by topic and prompting an LLM to summarize ASR and frame-level captions into shot-level captions.
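As a minimal sketch of the shot-segmentation step, the snippet below groups per-second frame captions into shots whenever consecutive captions diverge in topic. The word-overlap (Jaccard) heuristic and the threshold are assumptions standing in for the actual segmentation method, which is not specified here.

```python
# Sketch: group per-second frame captions into shots by topic change.
# Jaccard word overlap is an assumed stand-in for the real similarity measure.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two captions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def segment_into_shots(frame_captions, threshold=0.25):
    """Start a new shot whenever consecutive captions diverge in topic."""
    shots, current = [], [frame_captions[0]]
    for prev, cap in zip(frame_captions, frame_captions[1:]):
        if jaccard(prev, cap) < threshold:
            shots.append(current)
            current = []
        current.append(cap)
    shots.append(current)
    return shots

captions = [
    "a chef chops onions in a kitchen",
    "the chef stirs onions in a pan",
    "a crowd cheers in a stadium",
    "the crowd cheers as players cross the stadium field",
]
shots = segment_into_shots(captions)  # two shots: kitchen, then stadium
```

Each resulting shot's ASR and frame captions would then be passed to an LLM to be condensed into a single shot-level caption.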
Given these captions, the pipeline generates multiple-choice questions in two stages. In the first stage, we prompt an LLM to generate a set of challenging questions and answers, providing it with the video captions as context. In the second stage, we prompt the LLM with a generated question-answer pair and the video captions and ask it to generate four decoy answers. Decoys must be incorrect but plausible answers to the question. The final stage of the pipeline is human verification, where we ask human raters to filter or correct incorrect questions, answers and decoys.
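The two-stage question generation can be sketched as two prompts to the same LLM. The prompt wording, the `llm` interface and the `fake_llm` stand-in below are all assumptions for illustration; the actual prompts are carefully designed and not reproduced here.

```python
# Sketch of two-stage MCQ generation with an assumed llm() interface.
# Prompts are illustrative, not the carefully designed prompts used in practice.

QA_PROMPT = (
    "You are given captions describing a video.\n"
    "Captions:\n{captions}\n"
    "Write one challenging question about the whole video and its answer."
)

DECOY_PROMPT = (
    "Captions:\n{captions}\n"
    "Question: {question}\nCorrect answer: {answer}\n"
    "Write 4 incorrect but plausible answers (decoys) to the question."
)

def generate_mcq(llm, captions: str):
    """Stage 1: question + answer. Stage 2: four plausible decoys."""
    question, answer = llm(QA_PROMPT.format(captions=captions))
    decoys = llm(DECOY_PROMPT.format(captions=captions,
                                     question=question, answer=answer))
    return {"question": question, "answer": answer, "decoys": decoys}

# Trivial stand-in LLM so the sketch runs end to end.
def fake_llm(prompt: str):
    if "decoys" in prompt:
        return ["decoy %d" % i for i in range(1, 5)]
    return ("What dish is being prepared in the video?", "onion soup")

mcq = generate_mcq(fake_llm, "a chef chops and cooks onions")
```

Conditioning the decoy prompt on the full captions, rather than the question alone, is what makes decoys plausible in context instead of generically wrong; the human raters then catch any decoy that is accidentally correct.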