Cross-view geo-localization (CVGL) aims to determine the capture location of a street-view image by matching it against a corresponding 2D map, such as satellite imagery. While recent bird's-eye-view (BEV)-based methods have advanced this task by addressing viewpoint and appearance differences, existing approaches typically rely on either OpenStreetMap (OSM) data or satellite imagery alone, limiting localization robustness due to single-modality constraints. This paper presents a novel CVGL method that fuses OSM data with satellite imagery, leveraging their complementary strengths to enhance localization robustness. We integrate the semantic richness and structural information of OSM with the high-resolution visual detail of satellite imagery to create a unified 2D geospatial representation. In addition, we employ a transformer-based BEV perception module that uses attention mechanisms to construct fine-grained BEV features from street-view images for matching against the fused map features. Compared with state-of-the-art methods that use only OSM data, our approach achieves substantial gains, improving recall on the KITTI benchmark by 12.05% and 12.06% for lateral and longitudinal localization within a 1-m error, respectively.
Keywords: OpenStreetMap; cross-view geo-localization; data fusion; satellite imagery.
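To make the described pipeline concrete, the minimal PyTorch sketch below illustrates the two components the abstract names: fusing rasterized OSM layers with satellite image features into a unified 2D geospatial representation, and a cross-attention BEV decoder that lifts street-view image features into a BEV grid for correlation-based matching. This is an illustrative sketch only, not the paper's implementation; the module names (`MapFusion`, `BEVQueryDecoder`, `localization_scores`), channel counts, grid sizes, and the concatenation-plus-convolution fusion strategy are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MapFusion(nn.Module):
    """Fuse rasterized OSM semantic layers with satellite imagery into one
    2D feature map (assumed strategy: channel concatenation + 1x1 conv)."""

    def __init__(self, osm_channels=8, sat_channels=3, out_channels=64):
        super().__init__()
        self.osm_encoder = nn.Sequential(
            nn.Conv2d(osm_channels, out_channels, 3, padding=1), nn.ReLU())
        self.sat_encoder = nn.Sequential(
            nn.Conv2d(sat_channels, out_channels, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * out_channels, out_channels, 1)

    def forward(self, osm, sat):
        # osm: (B, osm_channels, H, W) rasterized map layers
        # sat: (B, sat_channels, H, W) aligned satellite tile
        return self.fuse(
            torch.cat([self.osm_encoder(osm), self.sat_encoder(sat)], dim=1))


class BEVQueryDecoder(nn.Module):
    """Transformer-style cross-attention: learnable BEV grid queries attend
    to flattened street-view image features to build a BEV feature map."""

    def __init__(self, dim=64, bev_size=32, num_heads=4):
        super().__init__()
        self.bev_size = bev_size
        self.queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feats):
        # img_feats: (B, N, dim) flattened street-view feature tokens
        b = img_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, img_feats, img_feats)  # queries attend to image
        bev = self.proj(bev)                          # (B, H*W, dim)
        return bev.transpose(1, 2).reshape(b, -1, self.bev_size, self.bev_size)


def localization_scores(bev, map_feat):
    """Slide each normalized BEV feature over its fused map feature as a
    correlation kernel; the argmax of the score map is the estimated pose."""
    b, c, h, w = bev.shape
    bev = F.normalize(bev.reshape(b, -1), dim=1).reshape(b, c, h, w)
    scores = torch.stack([
        F.conv2d(map_feat[i:i + 1], bev[i:i + 1]) for i in range(b)
    ]).squeeze(1)
    return scores  # (B, 1, H'-h+1, W'-w+1) score map over translations


if __name__ == "__main__":
    fusion, decoder = MapFusion(), BEVQueryDecoder()
    map_feat = fusion(torch.randn(2, 8, 64, 64), torch.randn(2, 3, 64, 64))
    bev = decoder(torch.randn(2, 196, 64))        # e.g. 14x14 image tokens
    print(localization_scores(bev, map_feat).shape)  # (2, 1, 33, 33)
```

Cross-attention from learnable BEV grid queries is one common way to lift perspective-view features into BEV (as in BEVFormer-style decoders); the paper's actual module may differ in query design, positional encoding, and matching strategy.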