Input File Format
DART-ID is used in the Slavov Lab to process the output of MaxQuant searches, but is designed to be able to interface with any other peptide search engine, such as SEQUEST in ProteomeDiscoverer.
DART-ID requires the input to be in tabular, text format (.csv
, .tsv
), where rows represent peptides/PSMs and columns represent features for those observations. As a minimum DART-ID requires four features:
sequence |
A unique identifier for a peptide, usually the canonical amino-acid sequence or a modified/annotated sequence |
raw_file |
A unique identifier for a spectrum file/mass-spectrometry run |
retention_time |
The retention time (elution time) of the peptide, in minutes. Can also be seconds, just make sure to update your prior distributions |
pep |
The error probability of the peptide-spectrum-match. can be provided by the search engine or by a separate program, e.g., Percolator |
DART-ID can also utilize these optional features:
charge |
Used to (optionally) append the ion charge state to the peptide sequence, so that peptides with different charge states are treated as different peptide species. Required if you set add_charge_to_sequence: true |
proteins , leading_protein |
The list of parent proteins and the most likely parent protein, respectively. Used to run the Fido protein inference algorithm. Required if you set run_pi: true |
retention_length |
The base peak width, i.e., the time range between when an ion first elutes to when it last elutes. Use this as a quality score in order to filter out poorly retained ions. Required if you use the retention_length PSM filter |
Mapping input files
There’s no need to manually change and rename your search engine output files to run DART-ID. Column mappings are defined in your yaml
configuration files:
# column mappings for MaxQuant
col_names:
sequence: "Modified sequence"
raw_file: "Raw file"
retention_time: "Retention time"
pep: "PEP"
# optional columns
charge: "Charge"
leading_protein: "Leading razor protein"
proteins: "Proteins"
retention_length: "Retention length"
If you want to adapt to, for example, SEQUEST/ProteomeDiscoverer output, then simply change the mappings above.
# column mappings for Sequest
col_names:
sequence: "Annotated Sequence"
raw_file: "Spectrum File"
retention_time: "RT [min]"
pep: "Percolator PEP"
# optional columns
leading_protein: "Master Protein Accessions"
proteins: "Protein Accessions"