3. 【CoRL 2023】PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
Motivation
Enabling robots to understand and execute manipulation tasks specified by natural language instructions is a long-standing goal of robotics.
Mainstream approaches to language-guided manipulation use 2D image representations, which struggle to combine multi-view cameras and to infer precise 3D positions and spatial relations.
A good example of how to write related work:
Most existing work on language-guided robotic manipulation uses 2D image representations [1, 2, 3, 4]. BC-Z [1] applies ResNet [5] to encode a single-view image for action prediction. Hiveformer [3] employs transformers [6] to jointly encode multi-view images and all the history. Recent advances in vision and language learning [7, 8] have further paved the way in image-based manipulation [4]. CLIPort [4] and InstructRL [9] take advantage of pretrained vision-and-language models [8, 10] to improve generalization in multi-task manipulation. GATO [11] and PALM-E [12] jointly train robotic tasks with massive web image-text data for better representation and task reasoning.
Although 2D image-based policies have achieved promising results, they have inherent limitations for manipulation in the 3D world. First, they do not take full advantage of multi-view cameras for visual occlusion reasoning, as multi-view images are not explicitly aligned with each other, as shown in Figure 1. Second, accurately inferring the precise 3D positions and spatial relations [13] from 2D images is a significant challenge. Current 2D approaches mainly rely on extensive pretraining and sufficient in-domain data to achieve satisfactory performance.
To overcome the limitations of 2D-based manipulation policy learning, recent work has turned to 3D-based methods. 3D representations provide a natural way to fuse multi-view observations and enable more precise 3D localization. For example, PerAct adopts an action-centric approach that classifies the next active voxel from a high-dimensional input of over one million voxels, achieving state-of-the-art results in multi-task language-guided manipulation. However, such action-centric 3D voxels suffer from quantization error and computational inefficiency. An alternative 3D representation, the point cloud, has been used successfully for 3D object detection, segmentation, and grounding. Yet effective and efficient processing of 3D point clouds for robotic manipulation remains underexplored. Moreover, existing work mainly focuses on single-task manipulation and lacks the versatility to incorporate language instructions for completing multiple tasks.
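The quantization error of voxel-based representations mentioned above is easy to quantify. Below is a minimal sketch (hypothetical numbers, not from any of the papers) that snaps a random point cloud to a 1 cm voxel grid and measures how much sub-voxel precision is lost, which is the error a voxel classifier like PerAct inherits:

```python
import numpy as np

# Illustrative only: snapping continuous 3D points to a voxel grid
# discards sub-voxel precision. Sizes and resolution are assumptions.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(1000, 3))  # points in a 1 m^3 workspace

voxel_size = 0.01  # 1 cm grid; 1 m^3 at this resolution is already 100^3 = 1e6 voxels
voxel_idx = np.floor(points / voxel_size).astype(int)
reconstructed = (voxel_idx + 0.5) * voxel_size  # each point replaced by its voxel center

# Distance between each original point and its voxel-center surrogate
error = np.linalg.norm(points - reconstructed, axis=1)
print(f"mean quantization error: {error.mean() * 1000:.2f} mm")
print(f"max  quantization error: {error.max() * 1000:.2f} mm")
```

The worst-case error is half the voxel diagonal (about 8.7 mm at 1 cm resolution), whereas a raw point cloud keeps full continuous precision with far fewer elements than a dense voxel grid.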
Approach
Appendix I
Voxel-based representations have been used in several domains that specifically benefit from 3D understanding. Like in object detection [91, 92], object search [93], and vision-language grounding [94, 95], voxel maps have been used to build persistent scene representations [96]. In Neural Radiance Fields (NeRFs), voxel feature grids have dramatically reduced training and rendering times [97, 98]. Similarly, other works in robotics have used voxelized representations to embed viewpoint-invariance for driving [99] and manipulation [100]. The use of latent vectors in Perceiver [1] is broadly related to voxel hashing [101] from computer graphics. Instead of using a location-based hashing function to map voxels to fixed size memory, PerceiverIO uses cross attention to map the input to fixed size latent vectors, which are trained end-to-end. Another major difference is the treatment of unoccupied space. In graphics, unoccupied space does not affect rendering, but in PERACT, unoccupied space is where a lot of “action detections” happen. Thus the relationship between unoccupied and occupied space, i.e., scene, objects, robot, is crucial for learning action representations.
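The cross-attention mechanism described above, which maps an arbitrarily large input to fixed-size latent vectors, can be sketched in a few lines. This is a minimal illustration with random weights and assumed shapes, not PerceiverIO's actual implementation; the point is that the output size depends only on the number of latents, not on the input size:

```python
import numpy as np

# PerceiverIO-style cross attention (illustrative sketch, random weights):
# a large input set is compressed into a fixed number of latent vectors.
rng = np.random.default_rng(0)

n_input, d_input = 10_000, 64   # e.g. flattened voxel features (could be 1e6)
n_latent, d_latent = 256, 64    # fixed-size latent array, trained end-to-end

inputs = rng.normal(size=(n_input, d_input))
latents = rng.normal(size=(n_latent, d_latent))

# Projection matrices (in a real model these are learned parameters)
W_q = rng.normal(size=(d_latent, d_latent)) / np.sqrt(d_latent)
W_k = rng.normal(size=(d_input, d_latent)) / np.sqrt(d_input)
W_v = rng.normal(size=(d_input, d_latent)) / np.sqrt(d_input)

Q = latents @ W_q                        # (n_latent, d): latents act as queries
K = inputs @ W_k                         # (n_input, d)
V = inputs @ W_v                         # (n_input, d)

scores = Q @ K.T / np.sqrt(d_latent)     # (n_latent, n_input)
scores -= scores.max(axis=1, keepdims=True)  # softmax numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)

out = attn @ V   # (n_latent, d): fixed size, regardless of how large n_input is
print(out.shape)
```

Unlike location-based voxel hashing, where each voxel maps to a fixed memory slot, here every input element can contribute to every latent, so occupied and unoccupied space are related through the attention weights rather than ignored.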
Limitations
Appendix L discusses these at length:
Generalization to Novel Instances and Objects.
5. 【CoRL 2023 (Oral)】RVT: Robotic View Transformer for 3D Object Manipulation
6. 【2024 Baidu】VIHE: Virtual In-Hand Eye Transformer for 3D Robotic Manipulation
2D image-based manipulation
Motivation
Existing methods typically treat the 3D workspace uniformly, ignoring the inductive bias that the space near the end-effector naturally matters most for manipulation tasks. Prior work has highlighted the value of in-hand views: for example, studies have shown that in-hand views reveal more task-relevant detail, which is especially beneficial for high-precision tasks. Similarly, it has been shown that incorporating in-hand views reduces distractions irrelevant to gripper actions, thereby improving generalization.
7. Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation