Six-degrees-of-freedom (6-DoF) object pose estimation is essential for robotic grasping and autonomous driving. While estimating pose from a single RGB image is highly desirable for real-world applications, it presents significant challenges. Many approaches therefore incorporate supplementary information, such as depth data, to derive valuable geometric cues. However, deep neural networks still fail to extract adequate features from object regions in RGB images. To overcome these limitations, we introduce the Geometry-Focused Attention Network (GFA-Net), a novel framework that extracts more comprehensive features by analyzing critical geometric and textural characteristics of objects. GFA-Net leverages Point-wise Feature Attention (PFA) to capture subtle pose differences, guiding the network to localize object regions and to interpret point-wise discrepancies as pose shifts. In addition, a Geometry Feature Aggregation Module (GFAM) integrates multi-scale geometric feature maps to distill crucial geometric features. The resulting dense 2D-3D correspondences are then passed to a Perspective-n-Point (PnP) module for 6-DoF pose computation. Experimental results on the LINEMOD and Occlusion LINEMOD datasets show that the proposed method is highly competitive with state-of-the-art approaches, achieving 96.54% and 49.35% accuracy, respectively, under the ADD-S metric with a 0.10d threshold.
Keywords: RGB image; deep learning; dense correspondences; geometric feature; pose estimation.
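As context for the final stage of the pipeline, the following is a minimal sketch of recovering a 6-DoF pose from dense 2D-3D correspondences via PnP; it is not the paper's implementation. It assumes OpenCV's RANSAC-based solver (cv2.solvePnPRansac), and the intrinsics, synthetic correspondences, and noise level are placeholder assumptions chosen purely for illustration.

```python
import numpy as np
import cv2

# Placeholder pinhole intrinsics (values resemble the LINEMOD camera; assumption).
K = np.array([[572.4114, 0.0,      325.2611],
              [0.0,      573.5704, 242.0490],
              [0.0,      0.0,      1.0]])

# Stand-in for a network's dense prediction: sample object-frame 3D points and
# project them with a known pose to synthesize noisy 2D pixel correspondences.
rng = np.random.default_rng(0)
pts_3d = rng.uniform(-0.05, 0.05, size=(500, 3))   # object-frame points (m)
rvec_gt = np.array([0.1, 0.2, 0.3])                # ground-truth rotation (axis-angle)
tvec_gt = np.array([0.02, -0.01, 0.60])            # ground-truth translation (m)
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = pts_2d.reshape(-1, 2) + rng.normal(0.0, 0.5, size=(500, 2))  # pixel noise

# RANSAC-based PnP tolerates the outliers a dense predictor inevitably produces.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, distCoeffs=None, reprojectionError=3.0)

if ok:
    R, _ = cv2.Rodrigues(rvec)   # axis-angle vector -> 3x3 rotation matrix
    print("Recovered R:\n", R)
    print("Recovered t:", tvec.ravel())
```

In GFA-Net the correspondences come from the network's dense predictions rather than synthetic projections; only the generic PnP stage is illustrated here.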