Transformer neural networks – how global attention mechanisms improve edge-case performance in L4 autonomy

Transformer neural networks have become the state-of-the-art architecture for large language models such as ChatGPT and have contributed significantly to their success. Researchers and technology companies are now exploring their applicability to other tasks, such as detection and classification on images or lidar point clouds.

At driveblocks, we moved early on this technology and have incorporated attention mechanisms in our neural networks from day one. This blog post gives some insight into the reasoning behind this decision and the results we have achieved by doing so.

The key advantage of the transformer architecture for detection and classification tasks over convolutional neural networks (CNNs) is its ability to incorporate global information from the entire input domain. CNNs usually focus on information that is spatially close to the object being predicted. While this is often sufficient, there are edge cases where this locality can lead the model to false conclusions. In contrast, the attention mechanism in transformer neural networks allows every prediction to incorporate global information from the input space. We found this to be beneficial, for example, in the scenarios shown below.
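To make this difference concrete, the sketch below contrasts a single self-attention step over image patch tokens with a standard 3x3 convolution. It is a minimal, self-contained toy example, not our production network; the token count, embedding size, and variable names are illustrative assumptions.

```python
# Minimal sketch of single-head self-attention over image patch tokens.
# Illustrative toy example only; all shapes and names (num_patches,
# embed_dim) are assumptions, not a description of any production model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_patches, embed_dim = 196, 64             # e.g. a 14x14 grid of patch tokens
x = torch.randn(1, num_patches, embed_dim)   # (batch, tokens, features)

# Learned projections for queries, keys, and values.
w_q = torch.nn.Linear(embed_dim, embed_dim)
w_k = torch.nn.Linear(embed_dim, embed_dim)
w_v = torch.nn.Linear(embed_dim, embed_dim)

q, k, v = w_q(x), w_k(x), w_v(x)

# Every token attends to every other token: this (196 x 196) attention
# matrix is what gives each prediction access to the whole scene, e.g.
# letting a lane-marking token "see" the yellow tunnel lighting elsewhere
# in the image.
attn = F.softmax(q @ k.transpose(-2, -1) / embed_dim**0.5, dim=-1)
out = attn @ v                               # globally informed token features

# A 3x3 convolution, by contrast, mixes each location only with its
# immediate spatial neighbors; global context accumulates slowly over
# many stacked layers.
conv = torch.nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1)
feat_map = x.transpose(1, 2).reshape(1, embed_dim, 14, 14)
local_out = conv(feat_map)                   # receptive field: 3x3 per layer
```

The practical consequence is in the last two lines: the convolutional feature at any pixel depends only on a small neighborhood per layer, while the attention output for every token already depends on the entire input.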

The top row compares a neural network that only has access to local information, like a classical CNN (top left image), with our lane perception model (top right image), which leverages an attention mechanism. The CNN has severe difficulties correctly predicting the class of the detected lane markings: the yellow light in the tunnel leads it to the false conclusion that they are yellow lane markings. In contrast, our model correctly identifies them as white lane markings, as it can leverage additional information about the scene. A similar effect can be observed in winter conditions in the bottom row: while the CNN falsely identifies snow at the side of the road as a white line, our model correctly classifies the guardrail.

In addition, we have found that the attention mechanism leads to more robust predictions in complex scenes. The above figure shows two different scenes from a construction site where the CNN-based network struggles with false positive predictions. In contrast, access to global information from the image input domain allows our network to leverage additional context and avoid these false positives.

Overall, we have found that transformer neural networks significantly enhance robustness and edge-case performance compared to CNN-based architectures. These qualitative findings have also translated into improvements on quantitative metrics, as well as better generalization across various sensor positions and imaging chips.

Do you want to dive deeper into the advantages of transformer neural networks for autonomous driving? Feel free to reach out, and we will set up a call with our team.