AANN 25/01/2024
Simultaneous quantile regression
In this example we will use simultaneous quantile regression (SQR), as proposed by Tagasovska and Lopez-Paz (2019), to compute the quantiles of a prediction. SQR provides a way to get a prediction along with a quantification of the uncertainty in that prediction. If you want to get technical, this is an instance of aleatoric uncertainty (i.e. the uncertainty intrinsic to a known stochastic process), as opposed to epistemic uncertainty (i.e. the uncertainty that derives from possible model misspecification). Since I am mainly interested in the aleatoric uncertainty, I will ignore the orthonormal certificates that Tagasovska and Lopez-Paz proposed as a way to manage the epistemic uncertainty.
We will use SQR to estimate the quantiles of a distribution. Since SQR is primarily about the loss function used in the optimisation (not the network architecture itself), we will just use a simple feed-forward network to generate the estimates. We will also use simulated data so we have a known "truth" to compare to.
Data generating process
The input to our neural network is a datum, \(x_i=(x_i^1,\ldots,x_i^m)\), where
\[ x_i^j\sim\text{Normal}(\mu_i,1) \]
for \(j=1,\ldots,m\). Our goal is to use a set of observations, \(x_i\), to predict \(y_i=\mu_i\), where we assume a prior distribution \(\mu_i\sim\text{Normal}(0,10)\) (mean 0, standard deviation 10, matching the scale used in the simulation code below). This model is simple enough that we can calculate the quantiles analytically.
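For reference, the standard conjugate-normal calculation gives the posterior and its \(\tau\)-quantile as
\[ \mu_i \mid x_i \sim \text{Normal}\left(\frac{m\bar{x}_i}{m + \sigma_0^{-2}},\ \frac{1}{m + \sigma_0^{-2}}\right), \qquad q_{\tau}(x_i) = \frac{m\bar{x}_i}{m + \sigma_0^{-2}} + \frac{\Phi^{-1}(\tau)}{\sqrt{m + \sigma_0^{-2}}}, \]
where \(\bar{x}_i\) is the sample mean of \(x_i\), \(\sigma_0 = 10\) is the prior standard deviation and \(\Phi^{-1}\) is the standard normal quantile function.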
Loading packages
PyTorch has some nice random number generation functionality, which saves us from needing to put the values into tensors later. We set the seed to keep things reproducible.
import time
import sys
import os
from pathlib import Path

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import normal, uniform

import pandas as pd
import plotnine as p9
from plotnine import *

current_directory = Path(os.getcwd())
sys.path.append(str(current_directory.parent))
import niceneuron.plot as nn_plot

torch.manual_seed(0)
Define the input and output filenames
There is a replicate_size (in addition to a number of replicates) because we are using a permutation invariant network similar to the one in the previous post, and each data point consists of a set of observations.
loss_png = "loss.png"
loss_csv = "loss.csv"
coverage_png = "coverage.png"

replicate_size = 10
train_num_replicates = 10000
train_num_steps = 10000
train_record_every = 500
test_num_replicates = 100
Define the network architecture
- The network we will use is based on the one from the previous post.
- We call the network LocationNB because it estimates the location parameter of a sample and is reminiscent of the neural Bayes structure.
- To ensure it is permutation invariant we compute three summary statistics as a preprocessing step: the median and the first and second moments.
- A value for tau is given as input so the network knows what quantile to compute.
class LocationNB(nn.Module):
    def __init__(self, m):
        super(LocationNB, self).__init__()
        self._m = m  # dataset size: {z_1,...,z_m}
        self._num_S = 3  # number of summary statistics
        self._q = 10  # latent dimension
        self._p = 1  # output dimension
        self._phi = nn.Sequential(
            nn.Linear(self._num_S + 1, self._q),
            nn.Sigmoid(),
            nn.Linear(self._q, self._p),
        )

    def signed_sqrt(self, x):
        return torch.sign(x) * torch.sqrt(torch.abs(x))

    def forward(self, x, tau):
        # assert x.dim() == 2 and x.shape[1] == self._m
        # assert tau.dim() == 1 and x.shape[0] == tau.shape[0]
        s0 = torch.median(x, dim=1).values
        s1 = torch.mean(x, dim=1)
        s2 = torch.mean(x**2, dim=1)
        tmp = torch.stack([s0, s1, s2], dim=1)
        tmp = self.signed_sqrt(tmp)
        tmp = torch.cat([tmp, tau.unsqueeze(1)], dim=1)
        return self._phi(tmp).squeeze(1)
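As a quick shape check (not part of the original pipeline; the variable names here are purely illustrative), the network maps a batch of replicates and a vector of quantile levels to one estimate per replicate:

# Illustrative shape check: a batch of 4 replicates, each with 10 observations.
_net = LocationNB(10)
_x = torch.randn(4, 10)        # (num_replicates, replicate_size)
_tau = torch.full((4,), 0.5)   # one quantile level per replicate
print(_net(_x, _tau).shape)    # torch.Size([4])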
Define the pinball loss function
We define the pinball loss as a PyTorch loss function so that the network can learn to predict quantiles. To see why this function yields the quantile estimator, you need to solve for the root of the derivative of the expected loss (a short derivation follows the equation below). Mathematically, the loss function \(\ell_{\tau}\) associated with a prediction \(\hat{y}\) for a true value \(y\) is
\[ \ell_{\tau}(\hat{y}, y) =\begin{cases} \tau (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1 - \tau) (\hat{y} - y) & \text{if } y < \hat{y} \end{cases} \]
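To spell out that root-of-the-derivative argument: for a target \(Y\) with distribution function \(F\), differentiating the expected loss with respect to \(\hat{y}\) gives
\[ \frac{\partial}{\partial \hat{y}}\,\mathbb{E}\left[\ell_{\tau}(\hat{y}, Y)\right] = -\tau \Pr(Y \geq \hat{y}) + (1 - \tau) \Pr(Y < \hat{y}) = F(\hat{y}) - \tau, \]
which vanishes exactly when \(F(\hat{y}) = \tau\), i.e. when \(\hat{y}\) is the \(\tau\)-quantile of \(Y\).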
IMPORTANT: This definition allows for the use of a different value of \(\tau\) for each element of the target tensor. This is the distinguishing feature of simultaneous quantile regression. Training the network to learn multiple quantiles simultaneously purportedly provides a partial resolution to the crossing quantile problem, where the ordering of the quantiles is not respected.
class PinballLoss(nn.Module):
    def __init__(self):
        super(PinballLoss, self).__init__()

    def forward(self, predictions, targets, tau):
        err = targets - predictions
        loss = torch.where(err >= 0, tau * err, (tau - 1) * err)
        return torch.mean(loss)
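A minimal usage sketch (the numbers are arbitrary) showing that each element of the batch can carry its own \(\tau\):

# Three identical errors of +1 scored at three different quantile levels.
_preds = torch.tensor([0.0, 0.0, 0.0])
_targets = torch.tensor([1.0, 1.0, 1.0])
_tau = torch.tensor([0.1, 0.5, 0.9])
print(PinballLoss()(_preds, _targets, _tau))  # mean of 0.1, 0.5 and 0.9 -> tensor(0.5000)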
Define data generating process
Because we are going to need separate training and testing datasets, it is helpful to have a function that makes it easy to simulate new data.
def rand_dataset(num_replicates: int, replicate_size: int):
    mu_i = normal.Normal(torch.tensor([0.0]), torch.tensor([10.0])).sample(
        sample_shape=torch.Size([num_replicates])
    )
    x_i = (
        normal.Normal(loc=mu_i, scale=torch.tensor([1.0]))
        .sample(sample_shape=torch.Size([replicate_size]))
        .transpose(0, 1)
        .squeeze(2)
    )
    y_i = mu_i.squeeze(1)
    return y_i, x_i
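A quick check of the shapes this returns (illustrative only):

# y_i has one entry per replicate; x_i has one row of observations per replicate.
_y, _x = rand_dataset(5, replicate_size)
print(_y.shape, _x.shape)  # torch.Size([5]) torch.Size([5, 10])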
Prepare for training
Because it worked out of the box, I didn't see much need to go beyond a standard Adam optimizer.
loc_nb = LocationNB(replicate_size)
pb_loss = PinballLoss()
optimizer = optim.Adam(loc_nb.parameters(), lr=1e-3)
Run the training loop
At each step of the optimisation we simulate a new data set so we don't need to worry about shuffling or precomputing all the data beforehand. It would be nice to extend this to include a stopping criterion.
loc_nb.train()
loss_history = []
training_start_time = time.time()
for step in range(train_num_steps):
    # Sample training data
    train_y, train_x = rand_dataset(train_num_replicates, replicate_size)
    train_tau = (
        uniform.Uniform(torch.tensor([0.0]), torch.tensor([1.0]))
        .sample(sample_shape=train_y.shape)
        .squeeze(1)
    )
    # Forward pass
    preds = loc_nb(train_x, train_tau)
    loss = pb_loss(preds, train_y, train_tau)
    step_loss = loss.item()
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % train_record_every == 0:
        print(f"Step {step} loss: {step_loss}")
        loss_history.append((step, step_loss))
training_finish_time = time.time()
print(f"Training took: {training_finish_time - training_start_time}")
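As a sketch of the stopping criterion mentioned above (the patience and tolerance values here are arbitrary and not part of the original code), one option is to stop once the recorded loss stops improving:

# Hypothetical early-stopping check against the recorded loss history.
def should_stop(loss_history, patience=5, tol=1e-4):
    """Stop if none of the last `patience` recorded losses beat the earlier best by more than `tol`."""
    if len(loss_history) <= patience:
        return False
    recent = [loss for _, loss in loss_history[-patience:]]
    best_before = min(loss for _, loss in loss_history[:-patience])
    return min(recent) > best_before - tol

This could be called inside the `if step % train_record_every == 0` block, breaking out of the loop when it returns True.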
Visualise training results
We can visualise the training error across the steps of the optimisation.
loss_df = pd.DataFrame(loss_history, columns=["step", "loss"])
loss_df.to_csv(loss_csv)
loss_p9 = nn_plot.plot_loss_curve(loss_df, x_var="step", x_lab="")
loss_p9.save(loss_png, height=2.9, width=4.1)
Test the network performance
To test how well-calibrated the predicted quantiles are, we generate a test dataset and check how often the true value falls between the predicted lower and upper quantiles, comparing the empirical coverage with the nominal level. The performance is good, but by no means perfect.
test_y, test_x = rand_dataset(test_num_replicates, replicate_size)
test_alpha = torch.linspace(0.05, 0.95, steps=20)
test_results = []
for ix in range(test_alpha.shape[0]):
    # Central interval [alpha/2, 1 - alpha/2] with nominal coverage 1 - alpha.
    tau_0 = (0.5 * test_alpha[ix]).repeat(test_num_replicates)
    tau_1 = 1 - tau_0
    test_est_lower = loc_nb(test_x, tau_0)
    test_est_upper = loc_nb(test_x, tau_1)
    # Count how many true values fall inside the predicted interval.
    test_correct = torch.sum(
        (test_est_lower <= test_y) & (test_y <= test_est_upper)
    ).item()
    test_results.append((test_alpha[ix].item(), test_correct, test_num_replicates))
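As an extra check that is not part of the original analysis, the network output can also be compared against the analytic posterior quantile derived earlier (a sketch; the choice of \(\tau = 0.9\) is arbitrary):

# Compare the learned quantile with the conjugate-normal posterior quantile (prior sd = 10).
_prec = replicate_size + 1.0 / 100.0                       # posterior precision
_post_mean = test_x.mean(dim=1) * replicate_size / _prec   # posterior mean per replicate
_tau = torch.full((test_num_replicates,), 0.9)
_analytic = _post_mean + normal.Normal(0.0, 1.0).icdf(_tau) / _prec**0.5
_estimated = loc_nb(test_x, _tau)
print(torch.mean(torch.abs(_estimated - _analytic)))       # mean absolute discrepancy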
Visualisation
test_df = pd.DataFrame(test_results, columns=["alpha", "correct", "total"])
(
    ggplot(test_df, aes(x="1-alpha", y="correct/total"))
    + geom_point()
    + geom_abline(intercept=0, slope=1, color="red")
    + scale_x_continuous(limits=(0, 1))
    + scale_y_continuous(limits=(0, 1))
    + theme_bw()
).save(coverage_png, height=2.9, width=4.1)
Discussion
Simultaneous quantile regression produces a generic quantile regression model. However, the method lacks finite-sample guarantees. To overcome this limitation, quantile regression can be combined with conformal prediction, yielding conformalized quantile regression as described by Romano et al. (2019).
Thanks
Thanks to Jackson Kwok for helpful comments on a draft of this.