Meta AI在视频理解方面取得了令人瞩目标里程碑式成就,推出了LongVU,这是一种开创性的模型,可以或许理解从前对人工智能体系来说具有挑战性的长视频。 研究论文 "LongVU:用于长视频语言理解的时空自顺应压缩 "提出了一种革命性的方法,使人工智能可以或许有效地处理和理解长达几分钟乃至一小时的视频,而这在从前是无法实现的。
多模态大语言模型(MLLM)在理解和分析视频内容方面取得了可喜的进展。 然而,受限于给定的上下文长度,处理长视频仍然是一项重大挑战。 为了解决这一限定,我们提出了一种时空自顺应压缩机制 LongVU,以淘汰视频标志的数量,同时保留长视频的视觉细节。 我们的想法是使用跨模态查询和帧间依赖关系,自顺应地淘汰视频中的时空冗余。 具体来说,我们使用 DINOv2 特征来删除相似度高的冗余帧。 然后,我们使用文本引导的跨模态查询来选择性地淘汰帧特征。 此外,我们还根据帧与帧之间的时间依赖关系,对帧进行空间标志缩减。 我们的自顺应压缩策略在有限的上下文长度内有效地处理了大量帧,险些没有丧失任何视觉信息。 在各种视频理解基准测试中,我们的 LongVU 始终超越现有方法,尤其是在长达一小时的视频理解使命(如 VideoMME 和 MLVU)中。 在轻量级 LLM 的情况下,我们的 LongVU 还能有效地扩展到更小的规模,并具有最先进的视频理解性能。
LongVU 架构
LongVU 的结构。 给定一个密集采样的视频帧,我们首先使用 DINOv2 去除冗余帧,然后融合 SigLIP 和 DINOv2 的剩余帧特征。 然后,我们通过跨模态查询有选择地淘汰视觉标志。 末了,我们基于时间依赖性进行空间标志压缩,以进一步满意 LLM 的有限上下文长度。
示例
- # git clone https://github.com/Vision-CAIR/LongVU
- import numpy as np
- import torch
- from longvu.builder import load_pretrained_model
- from longvu.constants import (
- DEFAULT_IMAGE_TOKEN,
- IMAGE_TOKEN_INDEX,
- )
- from longvu.conversation import conv_templates, SeparatorStyle
- from longvu.mm_datautils import (
- KeywordsStoppingCriteria,
- process_images,
- tokenizer_image_token,
- )
- from decord import cpu, VideoReader
- tokenizer, model, image_processor, context_len = load_pretrained_model(
- "./checkpoints/longvu_qwen", None, "cambrian_qwen",
- )
- model.eval()
- video_path = "./examples/video1.mp4"
- qs = "Describe this video in detail"
- vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
- fps = float(vr.get_avg_fps())
- frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
- video = []
- for frame_index in frame_indices:
- img = vr[frame_index].asnumpy()
- video.append(img)
- video = np.stack(video)
- image_sizes = [video[0].shape[:2]]
- video = process_images(video, image_processor, model.config)
- video = [item.unsqueeze(0) for item in video]
- qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
- conv = conv_templates["qwen"].copy()
- conv.append_message(conv.roles[0], qs)
- conv.append_message(conv.roles[1], None)
- prompt = conv.get_prompt()
- input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
- stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
- keywords = [stop_str]
- stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
- with torch.inference_mode():
- output_ids = model.generate(
- input_ids,
- images=video,
- image_sizes=image_sizes,
- do_sample=False,
- temperature=0.2,
- max_new_tokens=128,
- use_cache=True,
- stopping_criteria=[stopping_criteria],
- )
- pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
复制代码 Github:https://github.com/Vision-CAIR/LongVU
如何 24GB VRAM 运行
https://github.com/Vision-CAIR/LongVU/issues/6
- # git clone https://github.com/Vision-CAIR/LongVU
- import numpy as np
- import torch
- from longvu.builder import load_pretrained_model
- from longvu.constants import (
- DEFAULT_IMAGE_TOKEN,
- IMAGE_TOKEN_INDEX,
- )
- from longvu.conversation import conv_templates, SeparatorStyle
- from longvu.mm_datautils import (
- KeywordsStoppingCriteria,
- process_images,
- tokenizer_image_token,
- )
- from decord import cpu, VideoReader
- tokenizer, model, image_processor, context_len = load_pretrained_model(
- "Vision-CAIR/LongVU_Qwen2_7B",
- model_base=None,
- model_name="cambrian_qwen",
- device="cuda:0"
- )
- model.eval()
- video_path = "./examples/video1.mp4"
- qs = "Describe this video in detail"
- vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
- fps = float(vr.get_avg_fps())
- # frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
- num_frames = 1000 if len(vr) > 1000 else len(vr)
- frame_indices = np.array([i for i in range(0, num_frames, round(fps),)])
- video = []
- for frame_index in frame_indices:
- img = vr[frame_index].asnumpy()
- video.append(img)
- video = np.stack(video)
- image_sizes = [video[0].shape[:2]]
- video = process_images(video, image_processor, model.config)
- video = [item.unsqueeze(0) for item in video]
- qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
- conv = conv_templates["qwen"].copy()
- conv.append_message(conv.roles[0], qs)
- conv.append_message(conv.roles[1], None)
- prompt = conv.get_prompt()
- input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
- stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
- keywords = [stop_str]
- stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
- # with torch.inference_mode():
- # output_ids = model.generate(
- # input_ids,
- # images=video,
- # image_sizes=image_sizes,
- # do_sample=False,
- # temperature=0.2,
- # max_new_tokens=128,
- # use_cache=True,
- # stopping_criteria=[stopping_criteria],
- # )
- attention_mask = torch.ones_like(input_ids)
- with torch.inference_mode():
- output_ids = model.generate(
- input_ids,
- attention_mask=attention_mask,
- images=video,
- image_sizes=image_sizes,
- do_sample=True,
- temperature=0.2,
- pad_token_id=tokenizer.eos_token_id,
- max_new_tokens=512,
- use_cache=True,
- stopping_criteria=[stopping_criteria],
- )
- pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
复制代码 输出:
‘The video begins with a scene featuring two characters in an animated setting, one dressed in a bright yellow and red outfit with a mask, and the other in a blue and white traditional robe, standing on a rocky terrain with a green, leaf-like structure and a mountainous backdrop. The character in the yellow and red outfit is seen making a gesture with their right hand, while the other character appears to be speaking or reacting to the first character. The scene then transitions to a misty, ethereal environment where the same two characters are now standing on a staircase leading to a building with a golden roof, surrounded by smoke or clouds. The character in the yellow and red outfit is now holding a sword, while the other character is holding a fan, and both are looking up at the building. The scene shifts again to a large, ornate building with a golden roof, where a figure in a white and red outfit is seen descending a staircase, with smaller figures in white and red attire standing on the steps, and a large, white, cloud-like object in the foreground. The final scene shows the same building with the figure in white and red now seated on a golden throne, surrounded by smaller figures in white and red, and a large, white, cloud-like object still in the foreground, suggesting a ceremonial or significant event taking place.’
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。 |