Poster in Workshop: The 2nd Workshop on Advances in Financial AI Workshop: Towards Agentic and Responsible Systems

LLM-Driven Active Listwise Tournaments for Portfolio Selection in Large Asset Universes

Kamer Yuksel ⋅ João Pedro Maia ⋅ Thiago Castro Ferreira ⋅ Mohamed Al-Badrashiny

Abstract

We introduce a scalable framework that uses large language models (LLMs) as listwise fuzzy judges to rank large universes of financial assets through active, multi-item tournaments. Recent preference-learning methods show that LLM-based pairwise evaluations can be efficient: PAIRS fits a Gaussian-process preference model from pairwise data, and Product-of-Experts (PoE) Bradley–Terry models achieve near full-ranking performance with only about 2% of all comparisons. These methods, however, still rely on single-shot pairwise judgments. Such one-off comparisons are inherently noisy, unstable, and difficult to scale, especially in equity universes containing thousands of stocks. Our method overcomes these limitations by prompting the LLM to rank multi-item subsets rather than isolated pairs. These richer listwise judgments are mapped into statistically coherent latent utilities using a Plackett–Luce model, producing a globally consistent ranking. An active learning loop further selects only the most informative subsets at each iteration, substantially reducing the number of LLM calls required for convergence. Existing financial applications typically use LLMs as black-box feature extractors, SWOT scorers, or prompt-based portfolio generators; in contrast, our approach is the first to explicitly treat the LLM as a tournament judge and to fuse its qualitative assessments with probabilistic rank aggregation. The result is a principled, data-efficient, and highly scalable methodology that yields stable long-horizon asset rankings across very large investment universes. This framework bridges modern LLM-based preference learning and portfolio construction, providing a more coherent, interpretable, and sample-efficient alternative to both prior AI–finance approaches and existing pairwise-only LLM ranking techniques.

Empirical evaluations are conducted strictly after the GPT-4o-mini model's training cut-off, ensuring a fully out-of-sample assessment. In this forward-looking setting, the framework produces rankings with strong predictive power (NDCG up to 0.625) and delivers substantial portfolio outperformance relative to the S&P 500: the top 10% portfolio achieves 55.2% higher annualized returns, 62.6% lower maximum drawdown, a 107.3% higher Sharpe ratio, and a 311.4% improvement in Calmar ratio. These results demonstrate that active listwise LLM tournaments can yield economically meaningful and robust real-world investment performance.
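
To illustrate the rank-aggregation step the abstract describes, the sketch below shows one way listwise judgments might be turned into Plackett–Luce latent utilities. It is not the authors' implementation: the function name `fit_plackett_luce`, the minorize-maximize (MM) update, the small pseudo-count `prior`, and the toy rankings are all assumptions chosen for illustration. Each ranking is a list of asset indices ordered best first, as an LLM judge might return for one multi-item subset.

```python
import math

def fit_plackett_luce(rankings, n_items, n_iter=200, prior=0.1):
    """Estimate centered Plackett-Luce log-utilities from partial rankings.

    rankings: list of lists of item indices, each ordered best first
              (hypothetical stand-ins for LLM listwise judgments).
    prior:    assumed pseudo-count that keeps weights finite and positive
              for items that always or never win.
    """
    w = [1.0] * n_items  # positive "worth" parameter per item
    for _ in range(n_iter):
        wins = [prior] * n_items   # stages at which the item was chosen
        denom = [prior] * n_items  # sum of 1 / (stage total) over stages the item took part in
        for r in rankings:
            remaining = sum(w[i] for i in r)  # total worth of items still unranked
            acc = 0.0
            for t in range(len(r) - 1):       # the last position involves no choice
                wins[r[t]] += 1.0
                acc += 1.0 / remaining
                denom[r[t]] += acc            # item at position t took part in stages 0..t
                remaining -= w[r[t]]
            denom[r[-1]] += acc               # last item took part in every stage
        w = [wins[i] / denom[i] for i in range(n_items)]
    logs = [math.log(x) for x in w]
    mean = sum(logs) / n_items
    return [x - mean for x in logs]           # utilities are identified only up to a shift

# Toy example: four assets, four three-item tournaments, all consistent with 0 > 1 > 2 > 3.
toy_rankings = [[0, 1, 2], [1, 2, 3], [0, 2, 3], [0, 1, 3]]
utilities = fit_plackett_luce(toy_rankings, n_items=4)
global_ranking = sorted(range(4), key=lambda i: -utilities[i])
```

Sorting assets by the fitted utilities gives a single global ordering even though no tournament contained every asset. An active loop of the kind the abstract mentions would choose the next subsets based on the current fit, which this sketch does not attempt.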
