Releasing the GARCH Densities Dataset: 1,000 Trillion Simulations for Financial AI

huggingface.co/datasets/sitmo/garch_densities

We’ve released a new open dataset on Hugging Face: GARCH Densities, a large-scale benchmark for density estimation, option pricing, and risk modeling in quantitative finance.

Created with Paul Wilmott, this dataset contains simulations from the GJR-GARCH model with Hansen skewed-t innovations. Each row links a parameter set

    \[  \Theta = (\alpha, \gamma, \beta, \text{var0}, \eta, \lambda) \]

to the quantile function (inverse CDF) of terminal returns over multiple maturities.
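
For context, the GJR-GARCH(1,1) dynamics behind these parameters take the standard form (the intercept \(\omega\) is not part of \(\Theta\) and is presumably pinned down by the dataset's return normalization):

    \[ r_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \left(\alpha + \gamma\,\mathbf{1}_{\{r_{t-1} < 0\}}\right) r_{t-1}^2 + \beta\,\sigma_{t-1}^2 \]

where the innovations \(z_t\) are i.i.d. Hansen skewed-t with tail parameter \(\eta\) and skewness \(\lambda\), and \(\text{var0} = \sigma_0^2\) is the initial variance.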

[Figure: example CDF and derived PDF curves]

Dataset highlights:

  • ~1,000 trillion simulated price paths across a 6D parameter space
  • Quantiles of normalized returns at 512 probability levels
  • Low-discrepancy Sobol sampling under the constraint \(\alpha + \gamma + \beta < 0.95\) (see the sketch after this list)
  • zstd-compressed Parquet shards (~145 MB each, streaming-friendly)
  • CC-BY-4.0 licensed — free for academic and commercial use
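
As a rough illustration of the sampling scheme, here is a minimal sketch of constrained Sobol sampling with SciPy. The parameter bounds below are illustrative placeholders, not the dataset's actual ranges:

import numpy as np
from scipy.stats import qmc

# Hypothetical bounds for (alpha, gamma, beta, var0, eta, lam) -- placeholders only.
lo = np.array([0.00, 0.00, 0.50, 0.50,  4.0, -0.5])
hi = np.array([0.20, 0.20, 0.98, 2.00, 30.0,  0.5])

sampler = qmc.Sobol(d=6, scramble=True)
u = sampler.random_base2(m=16)      # 2**16 low-discrepancy points in [0, 1)^6
theta = qmc.scale(u, lo, hi)        # map onto the parameter box

# Keep only points satisfying the alpha + gamma + beta < 0.95 constraint.
theta = theta[theta[:, 0] + theta[:, 1] + theta[:, 2] < 0.95]
print(theta.shape)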

Applications:

  • Learning \(\Theta \mapsto\) return distributions (pretrained pricing models)
  • Fast neural surrogates for Monte Carlo simulation
  • Option pricing, VaR/CVaR, and volatility surface modeling
  • Benchmarking generative and density estimation models

[Figure: parameter distributions]

Example usage:

from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (cached after the first call).
ds = load_dataset("sitmo/garch_densities")
train, test = ds["train"], ds["test"]

print(train)           # row count and column names
print(train.features)  # column types
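
Since the shards are streaming-friendly, load_dataset("sitmo/garch_densities", streaming=True) also works when you want to iterate without downloading everything first.

As a sketch of the risk-modeling application listed above, VaR and CVaR can be read straight off a row's quantile grid. The uniform probability levels assumed below are a guess; check the dataset card for the actual grid:

import numpy as np

p = (np.arange(512) + 0.5) / 512   # assumed probability levels of the 512 quantiles
x = np.asarray(train[0]["x"])      # one row's quantiles of the normalized return

alpha = 0.05
var_95 = -np.interp(alpha, p, x)   # 95% Value-at-Risk
cvar_95 = -x[p <= alpha].mean()    # expected shortfall in the 5% tail
print(var_95, cvar_95)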

PyTorch integration:

import torch
from torch.utils.data import DataLoader

param_cols = ["alpha","gamma","beta","var0","eta","lam","ti"]
train = train.with_format("torch", columns=param_cols + ["x"])
loader = DataLoader(train, batch_size=256, shuffle=True)

batch = next(iter(loader))
params, targets = torch.stack([batch[c] for c in param_cols], 1), batch["x"]
print(params.shape, targets.shape)
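
As a starting point for the surrogate-modeling application, here is a minimal sketch (our illustration, not the authors' model): a small MLP regressing the 512 quantile values on the 7 conditioning inputs with an MSE loss.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(len(param_cols), 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 512),             # one output per probability level
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch in loader:                 # one epoch over the training split
    p = torch.stack([batch[c] for c in param_cols], dim=1).float()
    y = batch["x"].float()
    loss = nn.functional.mse_loss(model(p), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

A pinball (quantile) loss or a monotonicity penalty across the 512 outputs would be natural refinements.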

Goal: accelerate research on pretrained neural pricing models and scientific foundation models in finance — replacing slow Monte Carlo with fast, learned surrogates.

We’d love to see how researchers use this dataset. If you find it useful, consider sharing it or referencing it in your work.

License: CC-BY-4.0
Authors: Thijs van den Berg & Paul Wilmott
Link: https://huggingface.co/datasets/sitmo/garch_densities

New release of CGMM v0.4

pip install cgmm

We’ve just released v0.4 of cgmm, our open-source library for Conditional Gaussian Mixture Modelling.

If you’re new to cgmm: it’s a flexible, data-driven way to model conditional distributions beyond Gaussian or linear assumptions. It can:

  • Model non-Gaussian distributions
  • Capture non-linear dependencies
  • Work in a fully data-driven way

This makes it useful in research and applied settings where complexity is the norm rather than the exception.
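
To make the idea concrete without relying on cgmm's own API (which may differ), here is a self-contained sketch of the underlying technique: fit a joint Gaussian mixture over (x, y) with scikit-learn, then condition each component analytically to obtain p(y | x):

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

# Toy heteroskedastic, non-Gaussian data: y | x has an x-dependent noise scale.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)
y = np.sin(x) + 0.1 * (1 + np.abs(x)) * rng.standard_normal(2000)

# Fit a joint mixture over (x, y); conditioning on x is then closed-form.
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.column_stack([x, y]))

def conditional_params(x0):
    """Weights, means, and variances of the mixture p(y | x = x0)."""
    mx, my = gmm.means_[:, 0], gmm.means_[:, 1]
    sxx = gmm.covariances_[:, 0, 0]
    sxy = gmm.covariances_[:, 0, 1]
    syy = gmm.covariances_[:, 1, 1]
    w = gmm.weights_ * norm.pdf(x0, mx, np.sqrt(sxx))  # re-weight components by x
    w /= w.sum()
    mu = my + sxy / sxx * (x0 - mx)                    # conditional means
    var = syy - sxy ** 2 / sxx                         # conditional variances
    return w, mu, var

w, mu, var = conditional_params(1.0)
print("E[y | x=1] ~", float(w @ mu))

The conditional density is itself a mixture, so multimodality, skewness, and heteroskedasticity in y given x come for free.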

What’s new in this release:

  • Mixture of Experts (MoE): softmax-gated experts with linear mean functions (Jordan & Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, 1994); see the density form after this list
  • Direct conditional likelihood optimization: EM algorithm implementation from Jaakkola & Haussler (Expectation-Maximization Algorithms for Conditional Likelihoods, ICML 2000)
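
For reference, the softmax-gated, linear-mean MoE density from the first item above has the form (our notation):

    \[ p(y \mid x) = \sum_{k=1}^{K} \frac{\exp(v_k^\top x)}{\sum_{j=1}^{K} \exp(v_j^\top x)} \, \mathcal{N}\!\left(y;\, w_k^\top x,\, \sigma_k^2\right) \]

with gate parameters \(v_k\) and expert parameters \((w_k, \sigma_k^2)\), all fitted by EM.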

New examples and applications include:

  • VIX volatility Monte Carlo simulation (non-linear, non-Gaussian SDEs)
  • Hourly and seasonal forecasts of temperature, windspeed, and light intensity
  • The Iris dataset and other scikit-learn benchmarks
  • Generative modelling of MNIST handwritten digits

[Figure: example generated digits using cgmm]

If you find this useful, please share it, give it a ⭐ on GitHub, or help spread the word. Your feedback and contributions make a big difference.

Not investment advice. The library continues to evolve, and we welcome ideas, feature requests, and real-world examples!