OilBench

LLM WTI Crude Prediction Benchmark

Target Asset: CL=F (WTI)

Model Leaderboard

Ranking by Simulated P&L

Baseline Target (Buy & Hold)

Holding $10,000 of WTI Crude from 2026-01-01 to 2026-03-14

$17,190.87
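As a quick sanity check on the baseline, the return implied by the two figures above can be worked out directly. This is a minimal sketch: the function names are hypothetical, and the buy-and-hold helper assumes fractional units with no fees or futures roll costs, which the page does not specify.

```python
def buy_and_hold_final(initial_balance, start_price, end_price):
    """Ending balance of a fully invested position held from start to end.

    Assumes fractional units and no fees or roll costs (not stated on the page).
    """
    units = initial_balance / start_price
    return units * end_price


def implied_return(initial_balance, final_balance):
    """Total return over the period as a fraction of the starting balance."""
    return final_balance / initial_balance - 1.0


# Figures taken from the baseline above.
ret = implied_return(10_000.00, 17_190.87)
print(f"{ret:.2%}")  # prints 71.91%
```

Every model on the leaderboard ends below this $17,190.87 baseline, so none of the listed runs outperformed simply holding the asset over this window.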

Rank	Model	Simulated P&L	Avg Daily Miss	Consistency Risk	PnL Spread	Internal Determinism
1	gemini-3-flash-preview (10x)	$15,880.80 ± $76	$2.05 ± $0.067	$3.07 ± $0.132	$229	±$0.61
2	grok-4_1-fast (10x)	$15,467.10 ± $278	$1.83 ± $0.047	$2.72 ± $0.086	$799	±$0.96
3	gemini-2_5-flash (10x)	$15,400.31 ± $136	$1.99 ± $0.045	$2.96 ± $0.056	$454	±$0.47
4	gpt-5_4 (10x)	$15,109.74 ± $116	$1.63 ± $0.030	$2.59 ± $0.036	$373	±$1.17
5	claude-sonnet-4_6 (10x)	$14,950.86 ± $25	$1.67 ± $0.029	$2.49 ± $0.051	$75	±$0.27
6	kimi-k2_5 (10x)	$14,929.32 ± $187	$1.67 ± $0.064	$2.55 ± $0.084	$661	±$1.35
7	grok-4_20-beta (10x)	$14,869.73 ± $195	$1.77 ± $0.044	$2.56 ± $0.073	$833	±$0.61
8	claude-opus-4_6 (10x)	$14,706.89 ± $72	$1.59 ± $0.012	$2.53 ± $0.020	$195	±$0.11
9	minimax-m2_5 (10x)	$14,619.24 ± $452	$1.77 ± $0.089	$2.68 ± $0.156	$1,472	±$1.75

Interactive Inference Timeline

The chart below overlays the official daily settlement price of WTI Crude on the NYMEX against the target price predicted by the LLM. Hover over any day to view the news context, fundamentals, and economic reasoning the model generated 24 hours earlier to derive its prediction.

WTI Crude Tracking Benchmarks

Avg Daily Miss

Mean Absolute Error (MAE): The average dollar amount by which the model's prediction missed the actual settlement price on a given day.
Lower is better.
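A minimal sketch of the MAE calculation, using made-up prices rather than benchmark data. The function name and the sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute dollar gap between predicted and actual prices."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical three-day example: misses of $1, $2 and $3.
predicted = [70.0, 72.0, 71.0]
actual = [71.0, 70.0, 74.0]
mae = mean_absolute_error(predicted, actual)
print(mae)  # prints 2.0
```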

Consistency Risk

Root Mean Square Error (RMSE): Squares each daily miss before averaging, so large misses are penalized more heavily than small ones. A higher value means the model occasionally makes predictions that are far off.
Lower is better.
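To illustrate why RMSE reacts to outliers, the sketch below compares two hypothetical error patterns with the same average miss of $2: one misses by $2 every day, the other is exact on three days and off by $8 on the fourth. The function name and data are illustrative only.

```python
import math


def root_mean_square_error(predicted, actual):
    """Square root of the mean squared gap between predicted and actual prices."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


actual = [70.0, 70.0, 70.0, 70.0]
steady = root_mean_square_error([72.0, 72.0, 72.0, 72.0], actual)  # $2 off daily
spiky = root_mean_square_error([70.0, 70.0, 70.0, 78.0], actual)   # one $8 miss
print(steady, spiky)  # prints 2.0 4.0
```

Both series have an MAE of $2, but the single large miss doubles the RMSE. RMSE can never be lower than MAE, which matches every row of the leaderboard.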

Simulated P&L

Ending balance of a $10,000 simulated portfolio driven entirely by the LLM's daily asset allocation decisions (0-100% oil) over the benchmark period.
Higher is better.
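The exact simulation rules are not given here, so the following is only a sketch under stated assumptions: the allocation chosen for a day is applied to that day's price change, the uninvested remainder sits in cash earning nothing, and there are no trading costs. The function name and the prices are hypothetical.

```python
def simulate_pnl(allocations, prices, initial_balance=10_000.0):
    """Ending balance when allocations[i] (0.0 to 1.0) is held in oil
    from prices[i] to prices[i + 1]; the rest is idle cash (assumption)."""
    if len(allocations) != len(prices) - 1:
        raise ValueError("need exactly one allocation per price interval")
    balance = initial_balance
    for alloc, p0, p1 in zip(allocations, prices, prices[1:]):
        if not 0.0 <= alloc <= 1.0:
            raise ValueError("allocation must be between 0 and 1")
        daily_return = p1 / p0 - 1.0
        # Only the invested fraction is exposed to the price move.
        balance *= 1.0 + alloc * daily_return
    return balance


# Hypothetical prices: up 10%, then down 10%.
prices = [100.0, 110.0, 99.0]
fully_invested = simulate_pnl([1.0, 1.0], prices)  # about $9,900
well_timed = simulate_pnl([1.0, 0.0], prices)      # about $11,000
print(round(fully_invested, 2), round(well_timed, 2))
```

Under these assumptions a model can only beat buy and hold by stepping out of the market before down days, which is hard to do in a strongly rising market.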

PnL Spread

The difference in ending balance between the best-performing and the worst-performing simulated run of the same model.
Lower is better.
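In code, the spread is simply the maximum minus the minimum over a model's runs. The sketch below uses hypothetical ending balances, not leaderboard data.

```python
def pnl_spread(run_balances):
    """Gap between the best and worst ending balance across runs of one model."""
    if not run_balances:
        raise ValueError("need at least one run")
    return max(run_balances) - min(run_balances)


# Hypothetical ending balances from three runs of the same model.
runs = [15_000.0, 15_200.0, 14_900.0]
spread = pnl_spread(runs)
print(spread)  # prints 300.0
```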

Internal Determinism

The average max-min spread of the predicted target price across runs, calculated separately for each day. In other words, regardless of whether the model is right or wrong about the market that day, how consistently does it stick to its own thesis?
Lower is better: a low value indicates high determinism, a high value indicates erratic predictions.
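One way to read this definition: for each day, take the gap between the highest and lowest target price across the model's runs, then average those gaps over all days. The sketch below follows that reading; the function name, data layout, and values are assumptions.

```python
def internal_determinism(predictions_by_day):
    """Mean of per-day (max - min) target prices across runs.

    predictions_by_day: one inner list per day, holding the target price
    from every run for that day (hypothetical layout).
    """
    if not predictions_by_day or any(not day for day in predictions_by_day):
        raise ValueError("need at least one day and one prediction per day")
    daily_spreads = [max(day) - min(day) for day in predictions_by_day]
    return sum(daily_spreads) / len(daily_spreads)


# Hypothetical three days with three runs each: spreads of $1, $0 and $2.
days = [
    [70.0, 70.5, 71.0],
    [72.0, 72.0, 72.0],
    [68.0, 70.0, 69.0],
]
score = internal_determinism(days)
print(score)  # prints 1.0
```

Note that the actual settlement price never enters the calculation, so the score measures self-agreement only, not accuracy.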