Master's Thesis Project

Overview

This thesis investigates how smaller, open-source LLMs can serve as effective judges in an automated feedback loop for traffic scene generation. Rather than relying on large proprietary models, the system demonstrates that properly configured judge models can enhance retrieval performance and mitigate syntax inconsistencies. The evaluation loop, driven by a diversity score metric, increases the diversity of the generated scenarios, addressing a persistent challenge in the autonomous vehicle testing domain.

Research Context

Conducted as part of my Master's thesis at Universität Potsdam in collaboration with Continental Automotive (now Aumovio), this work demonstrates how modern NLP techniques can be applied to safety-critical applications in autonomous vehicle testing and validation. Built upon the TTSG framework and using the CARLA simulator for rendering and validation, the system bridges the gap between natural language descriptions and executable traffic scenarios.
(Note: The code repository developed for this project is proprietary and belongs to Aumovio (Continental Automotive), and therefore cannot be made publicly available.)

Framework Architecture

The overall framework integrates multiple components to create an intelligent traffic scene generation system. The architecture combines retrieval mechanisms, LLM-based evaluation, and iterative refinement to produce high-quality traffic scenarios.

Figure: Overall Framework Architecture
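The generate, judge, and refine cycle described above can be sketched roughly as follows. This is a minimal illustration and not the project's code, which is proprietary: `Verdict`, `refine_with_judge`, `toy_generate`, and `toy_judge` are hypothetical names, and the two toy functions stand in for the main model and the judge model without calling any LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Judge output: an accept flag plus free-text feedback for the next attempt."""
    accepted: bool
    feedback: str = ""


def refine_with_judge(
    description: str,
    generate: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate a scenario, let the judge review it, and retry with its feedback.

    Returns the last candidate and the number of rounds that were used.
    """
    feedback: Optional[str] = None
    candidate = ""
    for round_no in range(1, max_rounds + 1):
        candidate = generate(description, feedback)
        verdict = judge(description, candidate)
        if verdict.accepted:
            return candidate, round_no
        # Carry the judge's feedback into the next generation attempt.
        feedback = verdict.feedback
    return candidate, max_rounds


# Hypothetical stand-ins for the main model and the judge model.
def toy_generate(description: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return description
    return f"{description} [revised: {feedback}]"


def toy_judge(description: str, candidate: str) -> Verdict:
    if "revised" in candidate:
        return Verdict(accepted=True)
    return Verdict(accepted=False, feedback="add an emergency vehicle")


scenario, rounds = refine_with_judge(
    "ambulance approaching an intersection", toy_generate, toy_judge
)
print(rounds, scenario)
```

Passing the generator and the judge as plain callables keeps the loop independent of any particular model, so a small open-source judge can be swapped in without touching the control flow. The `max_rounds` cap is an assumed safeguard against a judge that never accepts.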

Technology Stack

Python, Large Language Models, CARLA Simulator, TTSG Framework, Natural Language Processing, Prompt Engineering, Automated Evaluation, Transformers, PyTorch, LangChain, Docker

Judge Model's Impact on Retrieval

Using CodeLlama 13B as the main model and Llama-3 8B and GPT-mini o4 as judge models, the system evaluates and guides the retrieval process to ensure relevant map layouts are successfully generated.

Figure: Judge Model with CodeLlama 13B
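One way a judge could guide retrieval is to score each retrieved map layout against the scene description and reject the whole batch when nothing is relevant enough. The sketch below illustrates that idea only; how the thesis system actually wires the judge into retrieval is not described here. `select_layout`, `toy_judge_score`, the `threshold` value, and the layout strings are all hypothetical, and the toy scorer uses simple word overlap in place of an LLM judgment.

```python
from typing import Callable, List, Optional


def select_layout(
    query: str,
    candidates: List[str],
    judge_score: Callable[[str, str], float],
    threshold: float = 0.3,
) -> Optional[str]:
    """Return the candidate the judge rates highest, or None if none is relevant."""
    if not candidates:
        return None
    scored = [(judge_score(query, layout), layout) for layout in candidates]
    best_score, best_layout = max(scored, key=lambda pair: pair[0])
    # A None result signals the caller to re-run retrieval with a revised query.
    return best_layout if best_score >= threshold else None


def toy_judge_score(query: str, layout: str) -> float:
    """Hypothetical judge: fraction of query words that appear in the layout name."""
    query_words = set(query.lower().split())
    layout_words = set(layout.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & layout_words) / len(query_words)


layouts = ["straight highway", "four-way intersection", "roundabout"]
chosen = select_layout("crash at a four-way intersection", layouts, toy_judge_score)
rejected = select_layout("snowy mountain pass", layouts, toy_judge_score)
print(chosen, rejected)
```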

Evaluation Loop Results

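The evaluation loop improved overall scenario diversity. With GPT-4o as the main model, the overall diversity score rose from 0.685 without the loop to 0.719 with it, and the mini-o4 configuration with the loop reached 0.75. The gains are not uniform across categories: under GPT-4o, Weather, Intersection, and Two Wheels scored lower with the loop than without it. The table reports the diversity score per scenario category: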

Main Model      GPT-4o   GPT-4o   mini-o4
Eval Loop       FALSE    TRUE     TRUE
Weather         0.80     0.67     0.72
Daily Traffic   0.79     0.85     0.82
Blocking        0.75     0.90     0.53
Cut Off         0.60     0.72     0.83
Crushing        0.50     0.70     0.82
Intersection    0.83     0.68     0.69
Emergency       0.50     0.67     0.8
Two Wheels      0.714    0.56     0.8
Overall         0.685    0.719    0.75
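The Overall row appears to be the unweighted mean of the eight category scores. The short check below recomputes it from the table values and compares the result with the reported figures to within rounding. It also lists which categories improved for GPT-4o once the loop was enabled. The variable names are illustrative, and the averaging rule is inferred from the numbers rather than stated in the source.

```python
from statistics import mean

# Per-category diversity scores copied from the table.
# Column order: GPT-4o without loop, GPT-4o with loop, mini-o4 with loop.
scores = {
    "Weather": (0.80, 0.67, 0.72),
    "Daily Traffic": (0.79, 0.85, 0.82),
    "Blocking": (0.75, 0.90, 0.53),
    "Cut Off": (0.60, 0.72, 0.83),
    "Crushing": (0.50, 0.70, 0.82),
    "Intersection": (0.83, 0.68, 0.69),
    "Emergency": (0.50, 0.67, 0.8),
    "Two Wheels": (0.714, 0.56, 0.8),
}
reported_overall = (0.685, 0.719, 0.75)

# Unweighted mean over the eight categories, one value per column.
overall = [mean(row[col] for row in scores.values()) for col in range(3)]

# Categories where GPT-4o scored higher with the loop than without it.
improved = [
    name for name, (no_loop, with_loop, _) in scores.items() if with_loop > no_loop
]

for computed, reported in zip(overall, reported_overall):
    print(f"computed {computed:.4f}  reported {reported}")
print("improved with loop (GPT-4o):", improved)
```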

Example: Emergency Scenario

To visualize the impact of the evaluation loop, we examined an emergency case scenario. The comparison below shows how the framework evolved the scenario generation before and after implementing the evaluation loop, demonstrating improved complexity, realism, and edge case handling.

Figure: Emergency Scenario, Before vs. After the Evaluation Loop

Key Findings

Future Directions