Why Self-Supervised BEV Segmentation?

Existing approaches that generate instantaneous Bird's-Eye-View (BEV) segmentation maps from frontal view (FV) images are trained in a fully-supervised manner and therefore rely on large annotated BEV datasets. However, BEV ground truth annotation requires HD maps, annotated 3D point clouds, and/or 3D bounding boxes, all of which are extremely resource-intensive and difficult to obtain. Some approaches circumvent this limitation by leveraging data from simulation environments, but they often suffer from the large domain gap between simulated and real-world images, which reduces their performance on real-world data.

There is thus a significant need for methods that generate instantaneous BEV maps without relying on large amounts of annotated BEV data. In this work, we address the aforementioned challenges by proposing SkyEye, the first self-supervised learning framework for generating an instantaneous semantic map in BEV, given a single monocular image in FV. Find out more about our novel self-supervised framework in the approach section!

Technical Approach

SkyEye Architecture

Network Architecture
Figure: Overview of our proposed self-supervised BEV semantic mapping framework, SkyEye. The core component of our approach is the latent voxel grid \(\mathcal{V}_{0}\) that serves as a joint feature representation for segmentation tasks in FV and BEV. We encode spatial and semantic information into \(\mathcal{V}_{0}\) using implicit supervision during a pretraining step, followed by explicit supervision in a subsequent refinement step with pseudolabels generated by a self-supervised depth prediction pipeline. The path in red denotes the flow of information during inference.

The goal of our SkyEye framework is to generate instantaneous BEV semantic maps without relying on any ground truth supervision in BEV. The core idea behind our approach is to generate an intermediate 3D voxel grid that serves as a joint feature representation for both FV and BEV segmentation. This joint representation allows us to leverage ground truth supervision in FV to augment the BEV semantic learning procedure.

Our SkyEye framework comprises five major components: (i) an image encoder that generates 2D features from the input monocular RGB image, (ii) a lifting module that generates the 3D voxel grid using a learned depth distribution, (iii) an FV semantic head that generates the FV semantic predictions, (iv) a BEV semantic head that generates the instantaneous BEV semantic map, and (v) an independent self-supervised depth network that generates the BEV pseudolabels. The encoder employs an EfficientDet-D3 backbone to produce four feature scales from an input RGB image, which we subsequently merge into a composite 2D feature map. The lifting module lifts the 2D features to a 3D voxel grid representation using the camera projection equation coupled with a learned depth distribution that provides the likelihood of features in a given voxel. We then process the voxel grid depth-wise or height-wise to generate the output in FV or BEV, respectively. The self-supervised depth network is independent of the aforementioned model and is only used to generate the BEV semantic pseudolabels. It follows the strategy outlined in Monodepth2 but replaces the default backbone with an EfficientDet-D3 backbone.
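
The lifting step can be illustrated with a minimal NumPy sketch. It is not the SkyEye implementation: the function name `lift_features`, the tensor shapes, and the plain sums standing in for the learned FV and BEV heads are assumptions made for illustration. The sketch also stays in the camera frustum (image rows act as a stand-in for height) and omits the resampling into a metric voxel grid via the camera projection equation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lift_features(feat_2d, depth_logits):
    """Spread 2D features along depth, weighted by a per-pixel depth distribution.

    feat_2d:      (C, H, W) composite image features
    depth_logits: (D, H, W) unnormalised scores over D depth bins (hypothetical head output)
    returns:      (C, D, H, W) frustum-shaped feature volume
    """
    depth_prob = softmax(depth_logits, axis=0)   # sums to 1 over the D bins at every pixel
    return feat_2d[:, None] * depth_prob[None]   # outer product via broadcasting

rng = np.random.default_rng(0)
C, D, H, W = 4, 6, 3, 5
feat = rng.normal(size=(C, H, W))
logits = rng.normal(size=(D, H, W))

volume = lift_features(feat, logits)
fv_feat = volume.sum(axis=1)    # collapse depth  -> (C, H, W), input to an FV head
bev_feat = volume.sum(axis=2)   # collapse height -> (C, D, W), input to a BEV head
```

Because the depth weights sum to one at every pixel, collapsing along depth recovers the original 2D features in this toy setting, while collapsing along height yields a depth-by-width layout that resembles a top-down view.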

Implicit Supervision

Implicit Supervision
Figure: Semantic predictions of SkyEye for future time steps (\(t_1, t_2, t_3, ...\)) using the FV image of only the initial time step \(t_{0}\). The disocclusion of sidewalk in the semantic predictions indicates that SkyEye can reason about both occluded regions and spatial extents of objects in the scene with the encoded semantic information.

Autonomous driving scenes comprise many static elements such as parked cars and buildings which establish a strong framework for enforcing spatial consistency of the scene over multiple time steps. We exploit this characteristic of the real world and generate the implicit supervision signal by enforcing consistency between FV semantic predictions over multiple time steps. To this end, we predict the FV semantic maps for the initial time step (\(t_{0}\)) as well as future time steps (\(t_1, ..., t_n\)) using only the intermediate voxel grid representation at time step \(t_{0}\). This formulation helps the network learn a spatially consistent volumetric representation of the scene, which can then be leveraged during finetuning to generate spatially consistent BEV maps. Further, this formulation also helps the voxel grid encode complementary information from multiple images which plays a pivotal role in predicting an accurate BEV semantic map from the limited view of only a single time step.
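
A minimal sketch of the geometry behind this idea, under stated assumptions: the relative ego-motion between \(t_0\) and a future step \(t_k\) is taken as known, the voxel centres are moved into the future camera frame with it, and the per-step FV segmentation losses are simply averaged. The helper names (`transform_points`, `project`, `implicit_loss`), the pose convention, and the use of cross-entropy are illustrative choices, not details confirmed on this page.

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 rigid transform T to (N, 3) points."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ T.T)[:, :3]

def project(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points (z forward) to pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def cross_entropy(logits, labels):
    """Mean cross-entropy for (N, num_classes) logits and (N,) integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def implicit_loss(logits_per_step, labels_per_step):
    """Average the FV segmentation loss over the time steps t_0 ... t_n."""
    losses = [cross_entropy(l, y) for l, y in zip(logits_per_step, labels_per_step)]
    return float(np.mean(losses))

# Hypothetical intrinsics and a single voxel centre in the t_0 camera frame.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
voxel_centres = np.array([[1.0, 0.0, 10.0]])

# Assumed ego-motion: the vehicle has driven 2 m forward by t_k,
# so static points move 2 m closer in the t_k camera frame.
T_0_to_k = np.eye(4)
T_0_to_k[2, 3] = -2.0

uv_t0 = project(voxel_centres, K)
uv_tk = project(transform_points(voxel_centres, T_0_to_k), K)

# Toy loss over two time steps with uninformative (all-zero) logits for 3 classes.
loss = implicit_loss([np.zeros((2, 3)), np.zeros((2, 3))],
                     [np.array([0, 2]), np.array([1, 1])])
```

In this toy example the static point shifts from column 60 to column 62.5 as the vehicle approaches it, which is the kind of predictable motion of static scene elements that the consistency signal relies on.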

Explicit Supervision

Pseudolabel Generation Pipeline
Figure: Overview of our pseudolabel generation pipeline. We lift the FV semantic ground truth annotations into the 3D world to generate a point cloud \(\mathcal{\hat{P}}_{k\rightarrow 0}\). We accumulate the static regions over multiple time steps to generate an accumulated point cloud \(\dot{\mathcal{P}}_0\). We then densify the static classes in BEV using a sequence of morphological dilate and erode operations to generate \(\hat{\mathcal{B}}^s\). In parallel, we cluster points belonging to dynamic objects and fit boxes around the detected clusters to generate \(\hat{\mathcal{B}}^d\). Finally, we merge \(\hat{\mathcal{B}}^s\) and \(\hat{\mathcal{B}}^d\) to generate BEV pseudolabels \(\hat{\mathcal{B}}^{pl}\).

Explicit supervision accounts for the lack of gradient flow through the BEV head during the pretraining step by leveraging the training signal from explicitly generated BEV pseudolabels. These BEV pseudolabels are generated using a self-supervised protocol, thus preserving the self-supervised nature of our SkyEye framework. The pseudolabel generation pipeline comprises three steps, namely, (i) a depth prediction pipeline to lift FV semantic annotations into BEV yielding a semantic point cloud, (ii) an instance generation module based on the density-based clustering algorithm DBSCAN, and (iii) a densification module to generate dense segmentation masks from sparse depth predictions for static classes. We generate the pseudolabels for a given time step by first lifting a window of FV semantic ground truth labels into 3D using a self-supervised depth estimation network, which yields a set of semantic point clouds. We then accumulate the semantic point clouds and orthographically project them into BEV. Subsequently, we apply morphological dilate and erode operations on the static regions to generate a dense representation for static regions in BEV. In parallel, we cluster the points belonging to dynamic classes using the DBSCAN algorithm to generate multiple discernible object clusters, and fit ellipses around them to estimate their centre as well as their extents. Finally, we overlay the estimated bounding boxes over the dense BEV map to generate the final BEV pseudolabels for the given time step.
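
The individual steps of this pipeline can be sketched in plain NumPy. This is a simplified illustration rather than our released code: all function names, the 3x3 structuring element, the grid parameters, and the toy points are assumptions, the points are assumed to be already filtered by semantic class, the accumulation over multiple time steps is skipped, and axis-aligned min/max boxes stand in for the fitted extents.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to (H*W, 3) camera-frame points (x right, y down, z forward)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def to_bev(points, x_min, z_min, cell, shape):
    """Orthographic projection: drop the height (y) and mark occupied (row=z, col=x) cells."""
    bev = np.zeros(shape, dtype=bool)
    cols = np.floor((points[:, 0] - x_min) / cell).astype(int)
    rows = np.floor((points[:, 2] - z_min) / cell).astype(int)
    ok = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    bev[rows[ok], cols[ok]] = True
    return bev

def dilate(mask):
    """3x3 binary dilation with zero padding."""
    h, w = mask.shape
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + h, dc:dc + w]
    return out

def close_gaps(mask):
    """Dilate, then erode. Erosion is computed as the complement of dilating the
    complement, so cells outside the grid count as occupied during erosion."""
    return ~dilate(~dilate(mask))

def dbscan(points, eps, min_pts):
    """Minimal O(N^2) DBSCAN; returns one cluster id per point, -1 for noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbours[i]) < min_pts:
            continue                      # already handled, or not a core point
        visited[i] = True
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if len(neighbours[j]) < min_pts:
                continue                  # border point: joins the cluster, does not expand it
            for k in neighbours[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                if not visited[k]:
                    visited[k] = True
                    stack.append(k)
        cluster += 1
    return labels

# (i) Lift a toy 2x3 depth map (constant 10 m) with hypothetical intrinsics.
K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 0.5], [0.0, 0.0, 1.0]])
pts = backproject(np.full((2, 3), 10.0), K)

# (iii) Static class: sparse hits along one BEV row are densified by closing.
road_pts = np.array([[2.5, 0.0, 2.5], [4.5, 0.0, 2.5], [6.5, 0.0, 2.5]])
sparse = to_bev(road_pts, x_min=0.0, z_min=0.0, cell=1.0, shape=(5, 9))
dense = close_gaps(sparse)

# (ii) Dynamic class: cluster BEV points (x, z) and take min/max extents per cluster.
car_pts = np.array([[1.0, 5.0], [1.2, 5.1], [0.9, 5.3], [1.1, 5.2],
                    [6.0, 9.0], [6.2, 9.1], [6.1, 8.9],
                    [20.0, 20.0]])
labels = dbscan(car_pts, eps=0.5, min_pts=3)
boxes = [(car_pts[labels == c].min(axis=0), car_pts[labels == c].max(axis=0))
         for c in range(labels.max() + 1)]
```

In practice, library routines such as `scipy.ndimage.binary_closing` and `sklearn.cluster.DBSCAN` would be the more natural choice for the closing and clustering steps; they are written out here only to keep the sketch dependency-free.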



A software implementation of this project using the PyTorch framework can be found in our GitHub repository.

Bird's Eye View Semantic Segmentation Datasets

We introduce two BEV semantic segmentation datasets for autonomous driving, namely, KITTI-360-BEV and Waymo-BEV that provide semantic BEV ground truths for the KITTI-360 and Waymo datasets respectively. We provide annotations for 8 classes in the KITTI-360-BEV dataset, and annotations for 6 classes in the Waymo-BEV dataset.

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the KITTI-360-BEV or Waymo-BEV datasets, please consider citing the paper mentioned in the Publication section.


Publication

Nikhil Gosala*, Kürsat Petek*, Paulo L.J. Drews-Jr, Wolfram Burgard, Abhinav Valada
"SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images"
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.




This work was partly funded by the German Research Foundation (DFG) Emmy Noether Program grant number 468878300, the Bundesministerium für Bildung und Forschung (BMBF) grant number FKZ 16ME0027, a CAPES-Alexander von Humboldt Foundation fellowship, and a hardware grant from NVIDIA.