A Unified Encoder for Efficient Multi-Task Inference

Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Huy-Dung Nguyen ¹

Anass Bairouk ¹

Mirjana Maras ¹

Wei Xiao ²

Tsun-Hsuan Wang ²

Patrick Chareyre ¹

Ramin Hasani ²

Marc Blanchon ¹

Daniela Rus ²

¹Hybrid Intelligence part of Capgemini Engineering

²Computer Science and Artificial Intelligence Lab, MIT

Abstract

Autonomous driving holds great potential to transform road safety and traffic efficiency by minimizing human error and reducing congestion. A key challenge in realizing this potential is the accurate estimation of steering angles, which is essential for effective vehicle navigation and control. Recent breakthroughs in deep learning have made it possible to estimate steering angles directly from raw camera inputs. However, the limited available navigation data can hinder optimal feature learning, impacting the system's performance in complex driving scenarios. In this paper, we propose a shared encoder trained on multiple computer vision tasks critical for urban navigation, such as depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By incorporating diverse visual information used by humans during navigation, this unified encoder might enhance steering angle estimation. To achieve effective multi-task learning within a single encoder, we introduce a multi-scale feature network for pose estimation to improve depth learning. Additionally, we employ knowledge distillation from a multi-backbone model pretrained on these navigation tasks to stabilize training and boost performance. Our findings demonstrate that a shared backbone trained on diverse visual tasks is capable of providing overall perception capabilities. While our performance in steering angle estimation is comparable to existing methods, the integration of human-like perception through multi-task learning holds significant potential for advancing autonomous driving systems.

Model architecture

Model Overview

Fig. 1: Overview of our multi-task training strategy. Let \(I_s\), \(I_t\), and \(I_1, I_2, \ldots, I_{16}\) represent the source image, target image, and 16 sequential images, respectively. Their corresponding features, denoted as \(f_s\), \(f_t\), and \(f_1, f_2, \ldots, f_{16}\), are extracted using a shared encoder. These features can be concatenated when necessary for subsequent processing.

Fig. 2: Simplified architecture of our model:
(a) Depth network using target image features \(f_t\) to output depth \(\mathbf{d}_t\).
(b) Multi-scale pose network using source and target image features \(f_s, f_t\) to output relative pose \(\mathbf{T}_{t \rightarrow s}\).
(c) 3D Scene Flow \(\mathbf{F}_C\) and Motion mask \(\mathbf{M}\) networks using RGB images and features \(f_s, f_t\).
(d) Segmentation network outputting panoptic, instance, and semantic segmentations.
(e) Loss computation \(L_{\text{ssup}}\) for joint training of depth, pose, 3D scene flow, and motion mask segmentation.
We denote rigid flow \(\mathbf{F}_R\), independent flow \(\mathbf{F}_I\), final flow, and sampled target image \(\hat{\mathbf{I}}_t\).

Fig. 3: Our shared encoder architecture based on Swin Transformer [1]

Fig. 4: Pose decoder architecture details.

Fig. 5: Depth decoder architecture details. The depth decoder is based on the TransDssl architecture [2].

Fig. 6: 3D Scene Flow and Motion mask decoder architecture details.

Method	PQ (↑)	AP (↑)	IoU (↑)
OneFormer	55.8	28.4	74.3
Our multi-task model	56.0	28.6	74.2

Method

PQ (↑)

AP (↑)

IoU (↑)

OneFormer

55.8

28.4

74.3

Our multi-task model

56.0

28.6

74.2

Method	IM	Semantics	N° frame at inference	D	Error metric (↓)	Accuracy metric (↑)
Monodepth2			1	K	0.115	0.903	4.863	0.193	0.877	0.959	0.981
LiteMono			1	K	0.101	0.729	4.454	0.178	0.897	0.965	0.983
Struct2Depth	✓	✓	1	K	0.141	1.026	5.290	0.215	0.816	0.945	0.979
RM-Depth	✓		1	K	0.107	0.687	4.476	0.181	0.883	0.964	0.984
Dynamo-Depth	✓		1	K	0.112	0.758	4.505	0.183	0.873	0.959	0.984
Our (wo 3D scene flow)			1	K	0.109	0.818	4.654	0.184	0.884	0.963	0.983
Struct2Depth	✓	✓	1	CS	0.145	1.737	7.280	0.205	0.813	0.942	0.978
Li et al.	✓	✓	1	CS	0.119	1.290	6.980	0.190	0.846	0.952	0.982
RM-Depth	✓		1	CS	0.100	0.839	5.774	0.154	0.895	0.976	0.993
Zhong et al.	✓		2	CS	0.098	0.946	5.553	0.148	0.908	0.977	0.992
ManyDepth	✓		2	CS	0.114	1.193	6.223	0.170	0.875	0.967	0.989
Our	✓	✓	1	CS	0.106	1.033	5.913	0.158	0.888	0.974	0.982

Method

Semantics

N° frame at inference

Error metric (↓)

Accuracy metric (↑)

Abs_rel

S_qrel

RMSE

RMSE_log

δ < 1.25

δ < 1.25²

δ < 1.25³

Monodepth2

0.115

0.903

4.863

0.193

0.877

0.959

0.981

LiteMono

0.101

0.729

4.454

0.178

0.897

0.965

0.983

Struct2Depth

✓

0.141

1.026

5.290

0.215

0.816

0.945

0.979

RM-Depth

✓

0.107

0.687

4.476

0.181

0.883

0.964

0.984

Dynamo-Depth

✓

0.112

0.758

4.505

0.183

0.873

0.959

0.984

Our (wo 3D scene flow)

0.109

0.818

4.654

0.184

0.884

0.963

0.983

Struct2Depth

✓

0.145

1.737

7.280

0.205

0.813

0.942

0.978

Li et al.

✓

0.119

1.290

6.980

0.190

0.846

0.952

0.982

RM-Depth

✓

0.100

0.839

5.774

0.154

0.895

0.976

0.993

Zhong et al.

✓

0.098

0.946

5.553

0.148

0.908

0.977

0.992

ManyDepth

✓

0.114

1.193

6.223

0.170

0.875

0.967

0.989

Our

✓

0.106

1.033

5.913

0.158

0.888

0.974

0.982

Model	Training Error	Test Error
CNN-GRU 64 units	1.25±1.02	5.06±6.64
CNN-LSTM 64 units	0.19±0.05	3.17±3.85
VAE-LSTM 64 units	0.54±0.26	4.70±4.80
VAE-LSTM 19 units	0.60±0.30	6.75±8.33
Our Swin-AttnPool (ImageNet pretrained)	2.91±2.23	11.03±10.03
Our Swin-AttnPool (Encoder unfrozen)	9.62±3.24	16.46±11.13
Our Swin-AttnPool (Encoder frozen)	1.64±1.63	5.41±6.06

Model

Training Error

Test Error

CNN-GRU 64 units

1.25±1.02

5.06±6.64

CNN-LSTM 64 units

0.19±0.05

3.17±3.85

VAE-LSTM 64 units

0.54±0.26

4.70±4.80

VAE-LSTM 19 units

0.60±0.30

6.75±8.33

Our Swin-AttnPool (ImageNet pretrained)

2.91±2.23

11.03±10.03

Our Swin-AttnPool (Encoder unfrozen)

9.62±3.24

16.46±11.13

Our Swin-AttnPool (Encoder frozen)

1.64±1.63

5.41±6.06

@misc{nguyen2024humaninsightsdrivenlatent, title={Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference}, author={Huy-Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon and Daniela Rus}, year={2024}, eprint={2409.10095}, url={https://arxiv.org/abs/2409.10095}}

Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Abstract

Model architecture

Model Overview

Qualitative results

Quantitative results

Passive Lane-Keeping Evaluation Results

Demo Videos

References

Citation