Selective Temporal Fusion using Recurrent Attention for End-to-End Autonomous Driving
DOI: https://doi.org/10.5324/zpjv0r03

Keywords: End-to-End Autonomous Driving, Recurrent Neural Networks, Imitation Learning, Attention, Temporal Processing, CARLA

Abstract
In end-to-end autonomous driving (E2E-AD), understanding the complex and dynamic environment of the driving scene is crucial. Temporal information supports this by extending perception beyond what is observable in a single frame. While some E2E-AD architectures, such as TransFuser++, operate without temporal modeling, various methods for temporal fusion have been explored, from frame-stacking to memory-based methods and, most recently, attention-based recurrent methods. However, existing recurrent attention methods lack a mechanism for forgetting information, distributing attention across all past features even when they are no longer relevant. In this paper, we present a recurrent attention-based temporal fusion module (TFM) with selective forgetting, designed as a drop-in extension for E2E-AD architectures. The TFM fuses current and past information using cross-attention, enabling temporal modeling with minimal impact on inference time, and allows for interpretable retention through attention weight visualization. We integrate a selection mechanism using a void token to allow selective forgetting of irrelevant past information. Applied to the TransFuser++ architecture, our method achieves a driving score of 83.69% on the closed-loop Bench2Drive benchmark and provides qualitative insights into how models retain past information. These results demonstrate its potential as a temporal extension to otherwise temporally unaware architectures.
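The abstract describes fusing current and past features via cross-attention, with a void token that lets the model assign attention mass to "nothing" and thereby forget irrelevant history. The sketch below illustrates that idea in minimal NumPy; the function name, shapes, and the choice to simply drop the void slot's contribution are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_fusion(query, memory, void_token):
    """Cross-attention over past features plus a void token (illustrative sketch).

    query:      (d,)    current-frame feature vector
    memory:     (T, d)  features retained from past frames
    void_token: (d,)    learnable "forget" slot (hypothetical name)
    """
    d = query.shape[0]
    # Keys/values are the past features plus the void token.
    keys = np.vstack([memory, void_token[None, :]])       # (T+1, d)
    scores = keys @ query / np.sqrt(d)                    # (T+1,)
    weights = softmax(scores)                             # sums to 1 incl. void slot
    # Attention mass captured by the void token contributes nothing to the
    # output, so past features attended via the void slot are effectively
    # forgotten.
    fused = weights[:-1] @ memory                         # (d,)
    return fused, weights

# Toy usage: 3 past frames, 4-dimensional features.
rng = np.random.default_rng(0)
fused, weights = temporal_fusion(
    rng.standard_normal(4), rng.standard_normal((3, 4)), rng.standard_normal(4)
)
```

If the void token wins most of the attention, `weights[:-1]` shrinks toward zero and the fused output suppresses the past; visualizing `weights` per step is what makes retention interpretable.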
License
Copyright (c) 2025 Andreas Bentzen Winje, Florian Wintel, Gabriel Hanssen Kiss, Frank Lindseth

This work is licensed under a Creative Commons Attribution 4.0 International License.