Alignment Model Customization

The DART-ID alignment model is written in STAN, a statistical modelling language. STAN allows the user to specify inputs, set priors, and assign a target likelihood to minimize. The full model is available on GitHub and reproduced below:

// -*- mode: C -*-

data {
  int<lower=1> num_experiments;
  int<lower=1> num_peptides;
  int<lower=1> num_total_observations;
  int<lower=1> num_pep_exp_pairs;

  int<lower=1> muij_map[num_total_observations];
  int<lower=1> muij_to_exp[num_pep_exp_pairs];
  int<lower=1> muij_to_pep[num_pep_exp_pairs];
  int<lower=1, upper=num_experiments> experiment_id[num_total_observations];
  int<lower=1, upper=num_peptides> peptide_id[num_total_observations];
  
  real<lower=0> retention_times[num_total_observations];
  //real<lower=0> rt_means[num_experiments];
  //real<lower=0> rt_stds[num_experiments];
  real<lower=0> rt_mean;
  real<lower=0> rt_std;
  real<lower=0, upper=1> pep[num_total_observations];

  real<lower=0> max_retention_time;
}
parameters {
  // canonical retention time for given peptide
  real<lower=0, upper=max_retention_time> mu[num_peptides];

  // segmented linear regression parameters for the 
  // retention time of each pep-exp pair
  //
  // beta_0 = y-intercept
  // beta_1 = slope of first segment
  // beta_2 = slope of second segment
  // split_point = x-value of the split b/n the two segments
  real<lower=0> beta_1[num_experiments];
  real<lower=0> beta_2[num_experiments];
  real<lower= -1.0 * min(beta_1) * min(mu) > beta_0[num_experiments];
  real<lower=0> split_point[num_experiments];

  // linear regression parameters for the standard deviation/spread 
  // for each pep-exp pair
  //
  // sigma_slope_global = average sigma slope for the set of pep-exp pairs.
  //                      "moving prior" for sigma_slopes of pairs moving forwards
  // sigma_slope = slope of line
  // sigma_intercept = y-intercept of line - i.e., the smallest sigma_ij possible in that experiment
  real<lower=0> sigma_slope_global;
  real<lower=0> sigma_slope[num_experiments];
  real<lower=0.01> sigma_intercept[num_experiments];

}
transformed parameters {
  // alignment is based on a segemented linear regression
  
  // sigma_ij is the spread of the laplace distribution of 
  // peptide i in experiment j
  real<lower=0.01> sigma_ij[num_pep_exp_pairs];
  
  // muij is the mean retention time for peptide i in experiment j 
  real<lower=0, upper=max_retention_time> muij[num_pep_exp_pairs];
  
  for (i in 1:num_pep_exp_pairs) {
    if(mu[muij_to_pep[i]] < split_point[muij_to_exp[i]]) {
      // first segment of the linear regression (before split_point)
      muij[i] = beta_0[muij_to_exp[i]] + 
        beta_1[muij_to_exp[i]] * mu[muij_to_pep[i]];
    } else if( mu[muij_to_pep[i]] >=  split_point[muij_to_exp[i]]) {
      // second segment of the linear regression (after split_point)
      muij[i] = beta_0[muij_to_exp[i]] + 
        beta_1[muij_to_exp[i]] * split_point[muij_to_exp[i]] + 
        beta_2[muij_to_exp[i]] * (mu[muij_to_pep[i]] - split_point[muij_to_exp[i]]);
    }
    
    // simple linear regression
    //muij[i] = beta_0[muij_to_exp[i]] + beta_1[muij_to_exp[i]] * mu[muij_to_pep[i]];

    // make sure that the retention time is within the physical limits
    if( muij[i] > max_retention_time ) {
      muij[i] = max_retention_time * 0.99;
    }
  }

  // standard deviation grows linearly with time - simple linear regression
  for(i in 1:num_pep_exp_pairs) { 
    sigma_ij[i] = sigma_intercept[muij_to_exp[i]] + 
      sigma_slope[muij_to_exp[i]] / 100 * mu[muij_to_pep[i]];
  }

}
model {
  mu ~ normal(rt_mean, rt_std);

  sigma_slope_global ~ lognormal(0.1, 0.5);
  sigma_slope ~ lognormal(log(sigma_slope_global), 1);
  sigma_intercept ~ lognormal(0, 2);
  
  beta_0 ~ normal(0, 10);
  beta_1 ~ lognormal(0, 0.5);
  beta_2 ~ lognormal(0, 0.5);

  //for(i in 1:num_experiments){
  split_point ~ uniform(0, max_retention_time);
  //}
  
  for(i in 1:num_total_observations) {
    target += log_mix(pep[i],
      normal_lpdf(retention_times[i] | rt_mean, rt_std),
      double_exponential_lpdf(retention_times[i] | muij[muij_map[i]], sigma_ij[muij_map[i]]));
  }
}

Data

The first data block specifies the inputs to the model. Most variables are used for mapping values back and forth. If you need to pass in more data to the model to, for example, change the prior distributions, you need to specify them here and pass them into the model from the code in align.py as well.

Parameters

This section is for defining the parameters: the reference retetion time mu and the experiment fits.

beta_0, beta_1, beta_2, and split_point are the necessary parameters to define the two-segment linear model. If you wish to change the model, you will have to define your new parameters here.

sigma_slope_global, sigma_slope, sigma_intercept are used to model per-experiment RT variances as well as RT-dependent shifts in variances. Possible extensions would be to define a fixed per-experiment RT variance, or a global RT variance, or even a per-peptide RT variance.

Transformed Parameters

This section derives the peptide-experiment predicted retention time muij and variance sigma_ij from the parameters in the previous block. This section defines the logic behind the two-segment linear model for the reference RT and the linear model for the variance.

Model

This section defines the priors for the parameters. New experimental designs, such as longer gradients, a mix of gradient lengths, or a restriction in the data (RT in seconds instead of minutes) may necessitate a change in the prior distributions.

The final piece is the “objective function”, which is represented here as target being the log likelihood of the model over all observations.

Implementing the Model

We recommend that you test and validate any customized model to ensure that it is returning sensible results.

In order to interface the model with the DART-ID python code, you must compile the .stan model code with cmdstan. Link the new model to a model definition by adding the definition in models.py, and ensure that the new model binary is placed in the right location. (Will write up more detailed guide soon…)