IT评测·应用市场-qidao123.com
Title:
Alibaba International Open-Sources the Ovis2 Series of Multimodal Large Language Models, Available in Six Versions
Author: 圆咕噜咕噜
Posted: 2025-3-7 06:11
On February 21, 2025, Alibaba's international business team announced that its new multimodal large language model series, Ovis2, has been officially open-sourced.
Ovis2 is the latest release in the Ovis series from Alibaba's international team. Compared with the previous 1.6 release, Ovis2 brings significant improvements in both data construction and training methodology. It strengthens the capability density of the smaller models, and substantially improves chain-of-thought (CoT) reasoning through instruction tuning and preference learning. In addition, Ovis2 adds video and multi-image processing, and enhances multilingual capability and OCR in complex scenes, markedly improving the model's practicality.
The open-sourced Ovis2 series comprises six versions: 1B, 2B, 4B, 8B, 16B, and 34B, each achieving state-of-the-art (SOTA) results among models of the same size. In particular, Ovis2-34B stands out on the authoritative OpenCompass leaderboard: on the general multimodal capability leaderboard it ranks second among all open-source models, surpassing many 70B open-source flagship models with less than half their parameters, and on the multimodal mathematical reasoning leaderboard it ranks first among all open-source models, with the other sizes also showing strong reasoning ability. These results not only validate the effectiveness of the Ovis architecture, but also demonstrate the open-source community's potential to advance multimodal large models.
Ovis2's architecture is designed to address the mismatch between how the visual and textual modalities are embedded. It consists of three key components: a visual tokenizer, a visual embedding table, and an LLM. The visual tokenizer splits the input image into patches, extracts features with a vision transformer, and maps the features through a visual head onto "visual words," yielding probabilistic visual tokens. The visual embedding table stores an embedding vector for each visual word; the LLM then concatenates the visual embeddings with the text embeddings and processes them to generate the text output, completing the multimodal task.
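To make the token pipeline concrete, here is a minimal PyTorch sketch of the probabilistic visual-token lookup described above. All sizes and tensors are made-up illustrations, not the real model's dimensions or weights:

```python
import torch

# Hypothetical sizes for illustration only.
vocab_size, embed_dim, num_patches = 16, 8, 4

# Visual head output: one logit vector over the "visual vocabulary" per patch.
logits = torch.randn(num_patches, vocab_size)

# Probabilistic visual tokens: a softmax distribution per patch.
visual_tokens = torch.softmax(logits, dim=-1)

# Visual embedding table: one embedding vector per visual word.
embedding_table = torch.randn(vocab_size, embed_dim)

# Each patch embedding is the probability-weighted mix of visual-word
# embeddings, structurally mirroring the LLM's text-token embedding lookup.
visual_embeds = visual_tokens @ embedding_table

print(visual_embeds.shape)  # torch.Size([4, 8])
```

The soft (probabilistic) lookup is what lets visual inputs share the same table-lookup embedding mechanism as text tokens.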
For training, Ovis2 adopts a four-stage strategy to fully develop its multimodal understanding. Stage 1 freezes most of the LLM and ViT parameters and trains the visual module to learn the mapping from visual features to embeddings. Stage 2 further strengthens the visual module's feature extraction, improving high-resolution image understanding, multilingual capability, and OCR. Stage 3 uses conversational visual-caption data to align the visual embeddings with the LLM's dialogue format. Stage 4 applies multimodal instruction training and preference learning, further improving instruction following and output quality across modalities.
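The stage-1 freezing scheme can be sketched as follows; the three `nn.Linear` modules are tiny stand-ins for the real components, and all names and shapes are illustrative, not the actual Ovis2 code:

```python
import torch.nn as nn

# Tiny stand-ins for the three components (illustrative only).
vit = nn.Linear(32, 16)           # stands in for the ViT backbone
visual_module = nn.Linear(16, 8)  # stands in for the visual head + embedding table
llm = nn.Linear(8, 8)             # stands in for the LLM

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: freeze most of the LLM and ViT; train only the visual module so it
# learns to map visual features into the embedding space.
set_trainable(vit, False)
set_trainable(llm, False)
set_trainable(visual_module, True)

trainable = sum(p.numel() for m in (vit, visual_module, llm)
                for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for m in (vit, visual_module, llm) for p in m.parameters())
print(f'trainable params: {trainable}/{total}')  # prints "trainable params: 136/736"
```

Later stages would flip the same switches to unfreeze more of the stack as alignment progresses.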
To improve video understanding, Ovis2 introduces an innovative keyframe selection algorithm that picks the most useful frames based on frame-text relevance, combinatorial diversity among frames, and temporal order. Using high-dimensional conditional similarity computation, determinantal point processes (DPP), and a Markov decision process (MDP), the algorithm selects keyframes efficiently within a limited visual context, boosting video understanding performance.
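The full DPP/MDP machinery is beyond a short example, but a simplified greedy relevance-plus-diversity selector conveys the core idea; the features below are random stand-ins for real frame and text embeddings, not the paper's algorithm:

```python
import torch

torch.manual_seed(0)
num_frames, dim, budget = 20, 6, 4  # illustrative sizes

# Random stand-ins for per-frame features and the text-query feature.
frame_feats = torch.nn.functional.normalize(torch.randn(num_frames, dim), dim=-1)
text_feat = torch.nn.functional.normalize(torch.randn(dim), dim=-1)

relevance = frame_feats @ text_feat   # frame-text similarity (relevance term)
selected = [int(relevance.argmax())]  # start from the most relevant frame

while len(selected) < budget:
    # Penalize frames similar to anything already chosen (diversity term).
    sim_to_selected = (frame_feats @ frame_feats[selected].T).max(dim=1).values
    score = relevance - sim_to_selected
    score[selected] = float('-inf')   # never pick the same frame twice
    selected.append(int(score.argmax()))

selected.sort()                       # keep temporal order (sequentiality)
print(selected)
```

A DPP replaces the greedy penalty with a principled joint diversity model, and the MDP handles the sequential budget; this sketch only shows why all three criteria matter.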
The Ovis2 series performs especially well on the OpenCompass multimodal leaderboards, with models of different sizes achieving SOTA results across multiple benchmarks. For example, Ovis2-34B ranks second on the general multimodal capability leaderboard and first on the mathematical reasoning leaderboard. Ovis2 also leads on the video understanding leaderboard, further confirming its strength on multimodal tasks.
The Alibaba international team says that open source is a key force driving progress in AI. By sharing the Ovis2 research publicly, the team hopes to explore the frontier of multimodal large models with developers worldwide and inspire more innovative applications. The Ovis2 code is now open-sourced on GitHub, the models are available on Hugging Face and ModelScope, an online demo is provided, and the accompanying paper has been published on arXiv for developers and researchers.
We evaluate Ovis2 with VLMEvalKit, as used in the OpenCompass multimodal and reasoning leaderboards.
Image benchmarks
| Benchmark | Qwen2.5-VL-3B | SAIL-VL-2B | InternVL2.5-2B-MPO | Ovis1.6-3B | InternVL2.5-1B-MPO | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | **77.1** | 73.6 | 70.7 | 74.1 | 65.8 | 68.4 | 76.9 |
| MMStar | 56.5 | 56.5 | 54.9 | 52.0 | 49.5 | 52.1 | **56.7** |
| MMMU (val) | **51.4** | 44.1 | 44.6 | 46.7 | 40.3 | 36.1 | 45.6 |
| MathVista (testmini) | 60.1 | 62.8 | 53.4 | 58.9 | 47.7 | 59.4 | **64.1** |
| HallusionBench | 48.7 | 45.9 | 40.7 | 43.8 | 34.8 | 45.2 | **50.2** |
| AI2D | 81.4 | 77.4 | 75.1 | 77.8 | 68.5 | 76.4 | **82.7** |
| OCRBench | 83.1 | 83.1 | 83.8 | 80.1 | 84.3 | **89.0** | 87.3 |
| MMVet | 63.2 | 44.2 | **64.2** | 57.6 | 47.2 | 50.0 | 58.3 |
| MMBench (test) | 78.6 | 77.0 | 72.8 | 76.6 | 67.9 | 70.2 | **78.9** |
| MMT-Bench (val) | 60.8 | 57.1 | 54.4 | 59.2 | 50.8 | 55.5 | **61.7** |
| RealWorldQA | 66.5 | 62.0 | 61.3 | **66.7** | 57.0 | 63.9 | 66.0 |
| BLINK | **48.4** | 46.4 | 43.8 | 43.8 | 41.0 | 44.0 | 47.9 |
| QBench | 74.4 | 72.8 | 69.8 | 75.8 | 63.3 | 71.3 | **76.2** |
| ABench | 75.5 | 74.5 | 71.1 | 75.2 | 67.5 | 71.3 | **76.6** |
| MTVQA | 24.9 | 20.2 | 22.6 | 21.1 | 21.7 | 23.7 | **25.6** |
| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B-MPO | MiniCPM-o-2.6 | Ovis1.6-9B | InternVL2.5-4B-MPO | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 82.6 | 82.0 | 80.6 | 80.5 | 77.8 | 81.4 | **83.6** |
| MMStar | 64.1 | **65.2** | 63.3 | 62.9 | 61.0 | 61.9 | 64.6 |
| MMMU (val) | 56.2 | 54.8 | 50.9 | 55.0 | 51.8 | 49.0 | **57.4** |
| MathVista (testmini) | 65.8 | 67.9 | **73.3** | 67.3 | 64.1 | 69.6 | 71.8 |
| HallusionBench | **56.3** | 51.7 | 51.1 | 52.2 | 47.5 | 53.8 | **56.3** |
| AI2D | 84.1 | 84.5 | 86.1 | 84.4 | 81.5 | 85.7 | **86.6** |
| OCRBench | 87.7 | 88.2 | 88.9 | 83.0 | 87.9 | **91.1** | 89.1 |
| MMVet | 66.6 | **68.1** | 67.2 | 65.0 | 66.0 | 65.5 | 65.1 |
| MMBench (test) | 83.4 | 83.2 | 83.2 | 82.7 | 79.6 | 83.2 | **84.9** |
| MMT-Bench (val) | 62.7 | 62.5 | 62.3 | 64.9 | 61.6 | 65.2 | **66.6** |
| RealWorldQA | 68.8 | 71.1 | 68.0 | 70.7 | 64.4 | 71.1 | **72.5** |
| BLINK | 56.1 | **56.6** | 53.9 | 48.5 | 50.6 | 53.0 | 54.3 |
| QBench | 77.9 | 73.8 | 78.7 | 76.7 | 71.5 | 78.1 | **78.9** |
| ABench | 75.6 | 77.0 | **77.5** | 74.4 | 75.9 | **77.5** | 76.4 |
| MTVQA | 28.5 | 27.2 | 23.1 | 19.2 | 28.0 | 29.4 | **29.7** |
| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B-MPO | InternVL2.5-26B-MPO | Ovis1.6-27B | LLaVA-OV-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | **87.8** | 85.4 | 84.2 | 82.2 | 84.4 | 85.6 | 86.6 |
| MMStar | **71.1** | 70.1 | 67.7 | 63.5 | 65.8 | 67.2 | 69.2 |
| MMMU (val) | **67.9** | 63.8 | 56.4 | 60.3 | 56.6 | 60.7 | 66.7 |
| MathVista (testmini) | 70.8 | 73.6 | 71.5 | 70.2 | 68.4 | 73.7 | **76.1** |
| HallusionBench | 58.8 | **59.7** | 52.4 | 54.1 | 47.9 | 56.8 | 58.8 |
| AI2D | 88.2 | 87.9 | 86.2 | 86.6 | 86.2 | 86.3 | **88.3** |
| OCRBench | 88.1 | 89.4 | **90.5** | 85.6 | 74.1 | 87.9 | 89.4 |
| MMVet | 76.7 | 72.6 | 68.1 | 68.0 | 60.6 | 68.4 | **77.1** |
| MMBench (test) | **88.2** | 86.4 | 85.4 | 84.6 | 85.6 | 87.1 | 87.8 |
| MMT-Bench (val) | 69.1 | 69.1 | 65.7 | 68.2 | - | 69.2 | **71.2** |
| RealWorldQA | **75.9** | 74.4 | 73.7 | 72.7 | 73.9 | 74.1 | 75.6 |
| BLINK | 62.3 | **63.2** | 62.6 | 48.0 | - | 59.0 | 60.1 |
| QBench | - | 76.1 | 76.0 | 77.7 | - | 79.5 | **79.8** |
| ABench | - | 78.6 | **79.4** | 76.5 | - | **79.4** | 78.7 |
| MTVQA | - | **31.2** | 28.7 | 26.5 | - | 30.3 | 30.6 |
Video benchmarks
| Benchmark | Qwen2.5-VL-3B | InternVL2.5-2B | InternVL2.5-1B | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|
| VideoMME (wo/w subs) | **61.5/67.6** | 51.9/54.1 | 50.3/52.3 | 48.6/49.5 | 57.2/60.8 |
| MVBench | 67.0 | **68.8** | 64.3 | 60.32 | 64.9 |
| MLVU (M-Avg/G-Avg) | 68.2/- | 61.4/- | 57.3/- | 58.5/3.66 | **68.6**/3.86 |
| MMBench-Video | **1.63** | 1.44 | 1.36 | 1.26 | 1.57 |
| TempCompass | **64.4** | - | - | 51.43 | 62.64 |

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B | LLaVA-OV-7B | InternVL2.5-4B | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 65.1/71.6 | 64.2/66.9 | 58.2/61.5 | 62.3/63.6 | 64.0/66.3 | **68.0/71.6** |
| MVBench | 69.6 | **72.0** | 56.7 | 71.6 | 68.45 | 68.15 |
| MLVU (M-Avg/G-Avg) | 70.2/- | 68.9/- | 64.7/- | 68.3/- | 70.8/4.23 | **76.4**/4.25 |
| MMBench-Video | 1.79 | 1.68 | - | 1.73 | 1.69 | **1.85** |
| TempCompass | **71.7** | - | - | - | 67.02 | 69.28 |

| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B | InternVL2.5-26B | LLaVA-OneVision-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | **73.3/79.1** | 70.7/73.1 | 66.9/69.2 | 66.2/69.5 | 70.0/74.4 | 71.2/75.6 |
| MVBench | 70.4 | 74.4 | **75.2** | 59.4 | 68.6 | 70.3 |
| MLVU (M-Avg/G-Avg) | 74.6/- | 75.3/- | 72.3/- | 68.0/- | 77.7/4.44 | **77.8**/4.59 |
| MMBench-Video | **2.02** | 1.82 | 1.86 | - | 1.92 | 1.98 |
| TempCompass | 74.8 | - | - | - | 74.16 | **75.97** |
Usage
pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation
Ovis2-1B
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## cot-style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## multiple-images input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## video input (require `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')
Batch inference
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# preprocess inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

# left-pad sequences to a common length (flip, right-pad, flip back)
batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# generate outputs
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')
References
Code: https://github.com/AIDC-AI/Ovis
Model (Hugging Face): https://huggingface.co/AIDC-AI/Ovis2-34B
Model (ModelScope): https://modelscope.cn/collections/Ovis2-1e2840cb4f7d45
Demo: https://huggingface.co/spaces/AIDC-AI/Ovis2-16B
arXiv: https://arxiv.org/abs/2405.20797