PokerBench: Training Large Language Models to become Professional Poker Players

Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli

Abstract: We introduce PokerBench – a benchmark for evaluating the poker-playing
abilities of large language models (LLMs). As LLMs excel in traditional NLP
tasks, their application to complex, strategic games like poker poses a new
challenge. Poker, an incomplete information game, demands a multitude of skills
such as mathematics, reasoning, planning, strategy, and a deep understanding of
game theory and human psychology. This makes Poker the ideal next frontier for
large language models. PokerBench consists of a comprehensive compilation of
11,000 most important scenarios, split between pre-flop and post-flop play,
developed in collaboration with trained poker players. We evaluate prominent
models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models,
finding that all state-of-the-art LLMs underperform in playing optimal poker.
However, after fine-tuning, these models show marked improvements. We validate
PokerBench by having models with different scores compete with each other,
demonstrating that higher scores on PokerBench lead to higher win rates in
actual poker games. Through gameplay between our fine-tuned model and GPT-4, we
also identify limitations of simple supervised fine-tuning for learning optimal
playing strategy, suggesting the need for more advanced methodologies for
effectively training language models to excel in games. PokerBench thus
presents a unique benchmark for a quick and reliable evaluation of the
poker-playing ability of LLMs as well as a comprehensive benchmark to study the
progress of LLMs in complex game-playing scenarios. The dataset and code will
be made available at: \url{https://github.com/pokerllm/pokerbench}.

Source: http://arxiv.org/abs/2501.08328v1

Archives

Categories

PokerBench: Training Large Language Models to become Professional Poker Players

About the Author

user

Leave a Reply Cancel reply

Recent Posts

Recent Comments

You may also like these

Scaling Rich Style-Prompted Text-to-Speech Datasets

Enough Coin Flips Can Make LLMs Act Bayesian

Predictable Scale: Part I — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Shifting Long-Context LLMs Research from Input to Output