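At NeurIPS 2024 (Conference on Neural Information Processing Systems), research on multimodal learning and large language models kept its strong momentum. From an analysis of more than 3,000 accepted papers, we found:
- More than 200 papers are strongly related to multimodal learning, about 8% of the total
- Research on large language models and multimodal large models accounts for more than 15%
- The acceptance figures suggest the field is still growing rapidly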
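The papers collected below are strongly related to multimodal learning and multimodal large models. Each is listed with its title and abstract, plus direction tags that I assigned based on the abstract. Current MLLM research concentrates on the following areas: visual recognition, visual understanding, visual alignment, visual perception, preference alignment (hallucination mitigation), efficient models, agents, and reinforcement learning.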
#NeurIPS2024 #MultimodalLearning #LargeLanguageModels #ArtificialIntelligence #MachineLearning #VisualRecognition #MLLM #DeepLearning #AIResearchTrends #ComputerVision
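This post is a systematic summary of the multimodal learning papers at NeurIPS 2024. If you are interested in this line of research, follow along for later updates.
The author's new blog: BbiHH’s blog | bbihh.top. It has recently been tracking multimodal research trends in papers from top AI conferences; feel free to follow.
(A partial selection of the collected papers: 26 papers, with more updates to come; reading time ~30 min)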
·Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
We present a novel framework for OCR-free document understanding based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multiscale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for the pretrained MLLMs, we propose a hierarchical visual feature aggregation module designed to reduce the number of input tokens to LLMs. Our approach leverages feature pyramid hierarchy with cross-attentive pooling, effectively handling the trade-off between information loss and efficiency without being affected by varying document image sizes. Additionally, we introduce a novel instruction tuning task that aims to enhance model readability by incorporating text positional information within images, which is robust to text truncation issue. Through comprehensive experiments, we demonstrate the efficacy of our framework in achieving outstanding document understanding performance on various tasks.
Document understanding, OCR-free, multi-scale features, hierarchical aggregation
(Document understanding, efficient perception)
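The abstract above does not spell out the cross-attentive pooling step, so the following is only a minimal sketch of the general idea. It assumes each window of fine-scale tokens is summarized by a single query (here simply the window mean) that attends over the tokens in that window. The names `softmax` and `attentive_pool`, the mean-as-query choice, and the absence of learned projections are illustrative assumptions, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(tokens, window):
    """Pool each run of `window` tokens into one attention-weighted token.

    Assumption: the query for each window is the mean of its tokens;
    a trained module would use learned query/key/value projections.
    """
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        query = [sum(t[d] for t in group) / len(group) for d in range(dim)]
        scores = [
            sum(q * k for q, k in zip(query, t)) / math.sqrt(dim) for t in group
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, group)) for d in range(dim)]
        )
    return pooled


# 8 fine-scale tokens of dimension 2 become 2 pooled tokens.
fine_tokens = [[float(i), 1.0] for i in range(8)]
coarse_tokens = attentive_pool(fine_tokens, window=4)
```

The point of the sketch is the trade-off the abstract mentions: the LLM sees four times fewer tokens, while the attention weights decide which fine-scale details survive the pooling.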
·M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
This paper presents M$^3$GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal control and generation signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M$^3$GPT’s superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks.
Motion generation, multimodal unification, multitask learning, zero-shot generalization
(Motion comprehension and generation)
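The first principle in the abstract above, discrete vector quantization feeding a single LLM vocabulary, can be illustrated with a small sketch. It assumes nearest-neighbor lookup in a fixed codebook and assumes the motion codes are placed directly after the text token ids; the codebook values, the vocabulary size of 100, and the helper names are hypothetical and chosen only for illustration.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared L2)."""
    distances = [
        sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook
    ]
    return distances.index(min(distances))


def to_shared_vocab(features, codebook, text_vocab_size):
    """Quantize features and shift their ids past the text vocabulary.

    Assumption: motion codes occupy ids
    [text_vocab_size, text_vocab_size + len(codebook)).
    """
    return [text_vocab_size + nearest_code(f, codebook) for f in features]


# Hypothetical 2-D codebook with three entries and a text vocabulary of 100 ids.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
motion_features = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
token_ids = to_shared_vocab(motion_features, codebook, text_vocab_size=100)
print(token_ids)  # [100, 101, 102]
```

Once every modality is expressed as ids in one range, text, music, and motion tokens can be interleaved in a single sequence for the language model.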
·Training-Free Visual Prompt Learning for Multimodal Large Language Models
In this work, we propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them. Our approach involves adjusting visual tokens from the MLP output during inference, controlling which text prompt tokens attend to which visual tokens. We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referential abilities into MLLMs. Our method supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits controllability and interpretability.
Visual referring, training-free optimization, multiple annotation forms
(Prompt learning, post-training)
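The energy-based optimization described in the abstract above can be sketched in a toy setting. The sketch assumes a single attention head, one text query, visual tokens used directly as keys, and an energy equal to the negative attention mass inside the referred region; the per-token offsets, step count, and learning rate are hypothetical choices, and the gradient is written out by hand rather than taken from the paper.

```python
import math


def attention(query, keys):
    """Softmax attention weights of one text query over visual keys."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_offsets(query, visual_tokens, region, steps=20, lr=1.0):
    """Gradient descent on per-token offsets for E = -sum(attn[i] for i in region).

    Model weights stay frozen; only the offsets added to the visual tokens change.
    """
    dim = len(query)
    offsets = [[0.0] * dim for _ in visual_tokens]
    for _ in range(steps):
        keys = [
            [v + o for v, o in zip(tok, off)]
            for tok, off in zip(visual_tokens, offsets)
        ]
        attn = attention(query, keys)
        region_mass = sum(attn[i] for i in region)
        for j, off in enumerate(offsets):
            # dE/ds_j for softmax attention, then the chain rule through
            # the score s_j = query . key_j / sqrt(dim).
            d_score = -attn[j] * ((1.0 if j in region else 0.0) - region_mass)
            for d in range(dim):
                off[d] -= lr * d_score * query[d] / math.sqrt(dim)
    return offsets


query = [1.0, 0.5]
visual_tokens = [[0.2, 0.1], [0.0, 0.3], [0.1, 0.0], [0.4, 0.2]]
region = {1, 2}  # indices of the visual tokens inside the referred box or mask
before = sum(attention(query, visual_tokens)[i] for i in region)
offsets = optimize_offsets(query, visual_tokens, region)
shifted = [[v + o for v, o in zip(t, off)] for t, off in zip(visual_tokens, offsets)]
after = sum(attention(query, shifted)[i] for i in region)
```

After the loop, `after` is larger than `before`: the query attends more strongly to the referred tokens even though no model parameter was updated, which is the sense in which the method is training-free.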
·Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pretraining. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by $\sim$73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performances, competitive to existing approaches in a series of benchmarks. Codes and models will be released.
Pre-training acceleration, multi-scale vision, token scaling
(Pre-training)
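To make the idea of resampling at several spatial scales concrete, here is a rough sketch in which plain average pooling stands in for the learned visual resamplers of the abstract above. The grid sizes used for the two training phases are hypothetical, and the sketch does not reproduce the paper's compound token scaling strategy; it only shows how the visual token count grows when finer scales are added after pre-training.

```python
def pool_grid(feature_map, grid):
    """Average-pool an H x W map of feature vectors into grid x grid tokens.

    Assumes grid <= H and grid <= W so that no pooling cell is empty.
    """
    height, width = len(feature_map), len(feature_map[0])
    dim = len(feature_map[0][0])
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            ys = range(gy * height // grid, (gy + 1) * height // grid)
            xs = range(gx * width // grid, (gx + 1) * width // grid)
            cells = [feature_map[y][x] for y in ys for x in xs]
            tokens.append([sum(c[d] for c in cells) / len(cells) for d in range(dim)])
    return tokens


def multiscale_tokens(feature_map, grids):
    """Concatenate the tokens produced at each scale, coarse to fine."""
    tokens = []
    for grid in grids:
        tokens.extend(pool_grid(feature_map, grid))
    return tokens


# Hypothetical 4 x 4 feature map with 1-D features holding the values 0..15.
feature_map = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
pretrain_tokens = multiscale_tokens(feature_map, grids=(1, 2))     # 1 + 4 tokens
finetune_tokens = multiscale_tokens(feature_map, grids=(1, 2, 4))  # 1 + 4 + 16 tokens
```

Because attention cost grows with sequence length, running the long pre-training phase on the short token list and switching to the longer one only for fine-tuning is where a wall-clock saving of the kind reported in the abstract would come from.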
·Visual Perception by Large Language Model’s Weights
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to low-rank perceptual weights since the visual information is redundant. Due to the low-rank property, our generated perceptual weights exhibit a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference.
Visual perception, parameter-space alignment, low-rank weights, computational optimization
(Efficient visual perception)
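The parameter-space alignment in the abstract above amounts to adding a low-rank, image-dependent update to an LLM weight matrix instead of lengthening the input sequence. The sketch below assumes the two low-rank factors have already been produced from the image by some generator; how VLoRA actually generates them is not specified here, and the matrices and helper names are hypothetical toy values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_low_rank(weight, down, up):
    """Return weight + up @ down, where up is (out x r) and down is (r x in)."""
    delta = matmul(up, down)
    return [
        [w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)
    ]


def matvec(weight, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical frozen LLM weight (2 x 2) and rank-1 perceptual factors for one image.
llm_weight = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]    # (out x r)
down = [[0.5, 0.5]]    # (r x in)
merged = merge_low_rank(llm_weight, down, up)

text_hidden = [2.0, 4.0]  # a text-token activation; no visual tokens in the input
print(matvec(merged, text_hidden))  # [5.0, 10.0]
```

The rank `r` controls how much image information the update can carry: the factors hold `r * (in + out)` numbers instead of `in * out`, which is the same saving LoRA exploits.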
·Dense Connector for MLLMs
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://anonymous.4open.science/r/DCNIPS.
Vision interface, multi-layer visual features, plug-and-play, cross-modal generality
(Efficient visual perception)
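One simple way to use multi-layer visual features, as the Dense Connector abstract above proposes, is to concatenate the per-token features of several encoder layers along the channel dimension before the projector. The sketch below shows only that one variant under stated assumptions: the chosen layer indices, the toy feature values, and the averaging projector weights are hypothetical, and the paper's actual connector design may differ.

```python
def concat_layers(layer_features, layer_ids):
    """Concatenate per-token features from selected encoder layers along channels.

    layer_features[l][t] is the feature vector of token t at layer l.
    The number of tokens is unchanged; only the channel width grows.
    """
    selected = [layer_features[l] for l in layer_ids]
    num_tokens = len(selected[0])
    return [sum((layer[t] for layer in selected), []) for t in range(num_tokens)]


def project(tokens, weight):
    """Linear projector: one row of `weight` per output channel."""
    return [[sum(w * v for w, v in zip(row, tok)) for row in weight] for tok in tokens]


# Hypothetical encoder with 3 layers, 2 visual tokens, and 2 channels per layer.
layer_features = [
    [[0.0, 0.1], [0.2, 0.3]],  # shallow layer
    [[1.0, 1.1], [1.2, 1.3]],  # middle layer
    [[2.0, 2.1], [2.2, 2.3]],  # final layer
]
dense_tokens = concat_layers(layer_features, layer_ids=[0, 2])

# Toy projector that averages the shallow and final features channel by channel.
projector = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
llm_inputs = project(dense_tokens, projector)
```

Because the token count stays at two, the LLM's sequence length is untouched, which is consistent with the abstract's claim of minimal additional computational overhead.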