Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Authors: Usman Syed, Ethan Light, Xingang Guo, Huan Zhang, Lianhui Qin, Yanfeng Ouyang, Bin Hu

Abstract: In this paper, we explore the capabilities of state-of-the-art large language
models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini
1.5 Pro, Llama 3, and Llama 3.1 in solving selected undergraduate-level
transportation engineering problems. We introduce TransportBench, a benchmark
dataset that includes a sample of transportation engineering problems on a wide
range of subjects in the context of planning, design, management, and control
of transportation systems. This dataset is used by human experts to evaluate
the capabilities of various commercial and open-source LLMs, especially their
accuracy, consistency, and reasoning behaviors, in solving transportation
engineering problems. Our comprehensive analysis uncovers the unique strengths
and limitations of each LLM; for example, it shows the impressive accuracy
and some unexpected inconsistent behaviors of Claude 3.5 Sonnet in solving
TransportBench problems. Our study marks a thrilling first step toward
harnessing artificial general intelligence for complex transportation
challenges.

Source: http://arxiv.org/abs/2408.08302v1
