
Thanks to the AI4Finance Foundation open-source community for their support.

Introduction

As AI continues to advance at a fast pace, more FinAI agents are being developed for the finance sector, such as FinRL trading agents [1,2,3], FinGPT agents [4,5] with multimodal capabilities [6], and regulatory reporting agents [7]. The Secure FinAI Contest 2026 encourages the development of FinAI agents based on the frameworks FinRL [2,3] and FinGPT [4].

We design five tasks. These tasks allow contestants to work on a range of financial problems and contribute to secure finance using state-of-the-art technologies with privacy-preserving and verifiable computation frameworks. We welcome students, researchers, and engineers who are passionate about finance, machine learning, and security to take part in the contest.

Tasks

Each team can choose to participate in one or more tasks. Prizes will be awarded for each task.

Task I: Adaptive Evaluation and Benchmarking Suite for Financial LLMs and Agents

This task focuses on benchmarking Financial Large Language Models (FinLLMs) and agents using an adaptive testing pipeline. Unlike traditional benchmarks, the adaptive pipeline partitions the test set into difficulty levels and dynamically selects test items based on model performance. This enables more efficient evaluation while preserving rigor. Participants are expected to submit models that can handle a diverse range of financial reasoning and comprehension tasks, optimized for both accuracy and inference efficiency.
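
To make the selection mechanism concrete, here is a minimal sketch of difficulty-adaptive item selection under stated assumptions: a simple staircase rule over a difficulty-partitioned item pool. The function names (`adaptive_eval`, `ask_model`) and the up/down rule are illustrative, not the contest's official pipeline.

```python
import random

def adaptive_eval(items_by_level, ask_model, n_items=50, seed=0):
    """items_by_level: dict mapping difficulty level -> list of (question, answer).
    ask_model: callable(question) -> answer string (your FinLLM under test)."""
    rng = random.Random(seed)
    levels = sorted(items_by_level)
    level_idx = len(levels) // 2          # start at a middle difficulty
    results = []
    for _ in range(n_items):
        level = levels[level_idx]
        question, gold = rng.choice(items_by_level[level])
        correct = ask_model(question).strip().lower() == gold.strip().lower()
        results.append((level, correct))
        # Staircase rule: move harder after a hit, easier after a miss.
        if correct:
            level_idx = min(level_idx + 1, len(levels) - 1)
        else:
            level_idx = max(level_idx - 1, 0)
    return results
```

Because stronger models climb to harder items quickly, this style of pipeline spends its test budget where the model's ability is least certain.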

Datasets: We utilize the standard evaluation suite referenced in BloombergGPT, integrated into the adaptive framework. This includes FPB (sentiment analysis), FiQA-SA (aspect-based sentiment), Headlines (market news classification), NER (entity recognition), and ConvFinQA (conversational financial Q&A).

Task II: Reliable Agentic FinSearch

This task benchmarks the reliability of financial search agents, focusing specifically on eliminating hallucinations and ensuring numerical precision. Participants are expected to fine-tune models and design agent pipelines, which will be evaluated on their ability to retrieve and process financial data without errors.

Datasets:

  • The FinSearchComp benchmark consists of 635 financial questions (e.g., “What was the annual inflation rate in Australia in 2022?”) paired with their ground-truth answers (e.g., “6.6%,” allowing for minor rounding errors). These questions are designed to evaluate an agent’s proficiency in searching and reasoning, with emphasis on numerical and temporal accuracy. The benchmark covers three types of tasks: (1) real-time retrieval of numerical data (Task 1), (2) simple lookup of historical data (Task 2), and (3) complex computation over historical data (Task 3). The dataset includes 244 questions for Task 1, 219 for Task 2, and 172 for Task 3. A minimal answer-checking sketch appears after the link below.
  • Benchmark link: https://huggingface.co/datasets/ByteSeedXpert/FinSearchComp
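
As referenced above, here is a minimal answer-checking sketch: it loads the benchmark with the Hugging Face `datasets` library and compares the first number in a prediction against the ground truth with roughly 1% slack, mirroring the "minor rounding errors" allowance. The split and field names are assumptions; inspect the dataset card before relying on them.

```python
import re
from datasets import load_dataset  # pip install datasets

ds = load_dataset("ByteSeedXpert/FinSearchComp")  # split/field names may vary

def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Compare the first number in each string, allowing ~1% rounding slack."""
    matches = [re.search(r"-?\d+(?:\.\d+)?", s) for s in (pred, gold)]
    if not all(matches):
        # No number found: fall back to exact string comparison.
        return pred.strip().lower() == gold.strip().lower()
    p, g = (float(m.group()) for m in matches)
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)

# Example: numeric_match("The rate was 6.6%", "6.6%") -> True
```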

Task III: Prediction Market Arbitrage

This task focuses on developing trading agents that identify and execute arbitrage opportunities across two prediction markets, Kalshi and Polymarket, for a series of sports events with binary options. Models may incorporate sentiment signals in addition to market data to anticipate market moves when new information changes expectations during a game. Evaluation will be conducted via paper trading, where agents perform simulated trading on real market data without real capital.
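
As a concrete illustration of the opportunity such a pair of binary markets admits, the sketch below checks whether buying YES on one venue and NO on the other costs less than the combined $1 payout. The prices, the flat `fee` parameter, and the function name are illustrative assumptions, not the contest's evaluation logic.

```python
# Cross-venue arbitrage check for one binary event; prices are quoted in
# dollars per $1 payout.
def arbitrage_edge(yes_ask_a: float, no_ask_b: float, fee: float = 0.0) -> float:
    """Guaranteed profit per contract from buying YES on venue A and NO on
    venue B: exactly one leg pays $1 at settlement, so any total cost
    below $1 locks in the difference."""
    return 1.0 - (yes_ask_a + no_ask_b + fee)

# YES at $0.46 on one venue and NO at $0.50 on the other -> $0.04 edge pre-fee.
print(round(arbitrage_edge(0.46, 0.50), 4))  # 0.04
```

In practice the edge must survive fees, fill risk, and the latency between the two venues' feeds, which is where the real-time data sources below come in.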

Datasets:

  • Kalshi Data: Market metadata and public market snapshots collected via the Kalshi API, with optional WebSocket streams for real-time updates, including order book changes, public trades, and price/ticker movements. Kalshi API Docs: https://docs.kalshi.com/

  • Polymarket Data: Market metadata from Polymarket’s public Gamma API (series, events, markets, and outcome identifiers), paired with real-time updates from the CLOB WebSocket market feed to track live pricing and liquidity. Polymarket Developer Docs: https://docs.polymarket.com/quickstart/overview

  • Sentiment Data: Sports news collected from RSS feeds across multiple sources, including breaking headlines and alerts, injury and lineup updates, and other game-related developments that can be used as sentiment inputs; a minimal feed-collection sketch follows this list.
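
As referenced in the last item, here is a minimal sketch of collecting such headlines with the `feedparser` library. The feed URL is a placeholder example, and turning headlines into sentiment scores is left to the participant's own model.

```python
import feedparser  # pip install feedparser

# Placeholder feed URL; substitute any sports RSS source you collect from.
FEED_URL = "https://www.espn.com/espn/rss/news"

def latest_headlines(url: str = FEED_URL, limit: int = 10):
    """Return (published, title) pairs for the newest entries in the feed."""
    feed = feedparser.parse(url)
    return [(entry.get("published", ""), entry.get("title", ""))
            for entry in feed.entries[:limit]]

for published, title in latest_headlines():
    print(published, "|", title)
```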

Task IV: AI for Venture Capital - Prediction of Startup Success

This task tests the ability of large language models to act as venture capitalists by predicting the potential success of early-stage startups. Using the VCBench dataset, which consists of anonymized founder profiles, participants must predict whether a startup will achieve a significant liquidity event (IPO, M&A >$500M, or high-tier funding).

Goal & Constraints:

  • Objective: Predict the binary “Success” label for given founder profiles.
  • Optimization: Participants are encouraged to optimize input templates and output extraction methods alongside model fine-tuning; a minimal extraction-and-scoring sketch follows this list.
  • Metric: F1-Score.
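
As referenced above, here is a sketch of the output-extraction and scoring step: it maps a free-form LLM completion onto the binary label and scores with scikit-learn's `f1_score`. The keyword rule is a deliberately naive assumption, not the official parser.

```python
import re
from sklearn.metrics import f1_score  # pip install scikit-learn

def extract_label(completion: str) -> int:
    """Map a free-form LLM completion to the binary 'Success' label.
    The keyword rule is purely illustrative; tune it to your prompt format."""
    return 1 if re.search(r"\b(success|successful|yes)\b", completion.lower()) else 0

preds = [extract_label(c) for c in ["Success.", "No, unlikely.", "Yes"]]
gold = [1, 0, 1]
print(f1_score(gold, preds))  # 1.0 on this toy example
```

Because F1 is the metric, a permissive extractor that never abstains usually beats one that leaves predictions unparsed.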

Task V: Agentic Trading

This task focuses on developing LLM agents for quantitative trading in a real-time paper trading setting. It formulates trading as a reasoning-to-action problem, in which agents integrate real-time signals, such as prices, news flow, and asset-specific fundamentals, and translate them into trading actions. Evaluation will be conducted via paper trading on an Alpaca paper trading account, with metrics emphasizing profitability and risk.

Datasets:

Participants may fetch historical market data from the Alpaca Market Data API (documentation: https://docs.alpaca.markets/); a minimal sketch is shown below.
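
The following sketch pulls daily bars with the alpaca-py SDK (pip install alpaca-py); substitute your own paper-trading keys. The class names follow the SDK's documented interface at the time of writing, but verify against the current Alpaca docs.

```python
from datetime import datetime

from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame

# Substitute your own (paper-trading) API credentials.
client = StockHistoricalDataClient("YOUR_API_KEY", "YOUR_SECRET_KEY")

request = StockBarsRequest(
    symbol_or_symbols=["AAPL", "MSFT"],
    timeframe=TimeFrame.Day,
    start=datetime(2024, 1, 1),
)
bars = client.get_stock_bars(request)
print(bars.df.head())  # OHLCV bars as a pandas DataFrame
```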

References

[1] Wang, Keyi, et al. "FinRL Contests: Data-Driven Financial Reinforcement Learning Agents for Stock and Crypto Trading." Artificial Intelligence for Engineering (2025). [IET] [arXiv]

[2] Liu, Xiao-Yang, et al. "FinRL-Meta: Market Environments and Benchmarks for Data-Driven Financial Reinforcement Learning." Advances in Neural Information Processing Systems 35 (2022): 1835-1849. [NeurIPS]

[3] Liu, Xiao-Yang, et al. "FinGPT: Democratizing Internet-Scale Data for Financial Large Language Models." arXiv preprint arXiv:2307.10485 (2023). [arXiv]

[4] Lin, Shengyuan, et al. "Evaluation and Benchmarking Suite for Financial Large Language Models and Agents." NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling. [NeurIPS 2025 Workshop]

[5] Wang, Yan, et al. "FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs." arXiv preprint arXiv:2510.08886 (2025). [arXiv]

[6] Wang, Yan, et al. "FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information." arXiv preprint arXiv:2505.20650 (2025). [arXiv]

Contact

Contact email: finrlcontest@gmail.com

Contestants can ask questions on Discord: https://discord.gg/dJY5cKzmkv