Preprocess outputs from MS signal processing tools for analysis with MSstats

MSstatsPreprocess(
  input,
  annotation,
  feature_columns,
  remove_shared_peptides = TRUE,
  remove_single_feature_proteins = TRUE,
  feature_cleaning = list(remove_features_with_few_measurements = TRUE,
    summarize_multiple_psms = max),
  score_filtering = list(),
  exact_filtering = list(),
  pattern_filtering = list(),
  columns_to_fill = list(),
  aggregate_isotopic = FALSE,
  ...
)

Arguments

input

data.table processed by the MSstatsClean function.

annotation

annotation file generated by a signal processing tool.

feature_columns

character vector of names of columns that define spectral features.

remove_shared_peptides

logical, if TRUE shared peptides will be removed.

remove_single_feature_proteins

logical, if TRUE, proteins that only have one feature will be removed.
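
The effect of this option can be sketched in base R on a toy table. This is only an illustration of the documented behavior under the assumption that a feature is a unique combination of the feature_columns values; it is not the package's implementation, and the toy data and variable names are hypothetical.

```r
# Toy input (hypothetical): P1 has two distinct features, P2 has only one
toy_input = data.frame(
    ProteinName = c("P1", "P1", "P1", "P2", "P2"),
    PeptideSequence = c("AAK", "AAK", "TTK", "GGR", "GGR"),
    PrecursorCharge = c(2, 2, 2, 3, 3),
    stringsAsFactors = FALSE
)
feature_columns = c("PeptideSequence", "PrecursorCharge")

# Build one identifier per feature from the feature-defining columns
feature_id = do.call(paste, c(toy_input[feature_columns], sep = "_"))

# Count distinct features per protein
n_features = tapply(feature_id, toy_input$ProteinName,
                    function(x) length(unique(x)))

# Keep only proteins with more than one feature
multi_feature_proteins = names(n_features)[n_features > 1]
kept = toy_input[toy_input$ProteinName %in% multi_feature_proteins, ]
```

Here only the rows of P1 remain in kept, because P2 is represented by a single feature.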

feature_cleaning

named list with at most two (for MSstats converters) or three (for MSstatsTMT converters) elements. If remove_features_with_few_measurements is TRUE, features with fewer than three measurements across runs will be removed; if FALSE, they are kept. summarize_multiple_psms is a function used to aggregate multiple measurements of a feature in a run. It should return a scalar and accept an na.rm parameter. For MSstatsTMT converters, setting remove_psms_with_any_missing will remove features that have missing values in a run from that run.
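
A minimal sketch of a custom function that satisfies the summarize_multiple_psms contract (accepts na.rm, returns a scalar). The function name median_intensity and the sample intensities are hypothetical and not part of MSstatsConvert.

```r
# Hypothetical summary function: median of repeated measurements of one
# feature in one run. Accepts na.rm and returns a single number.
median_intensity = function(x, na.rm = TRUE) {
    stats::median(x, na.rm = na.rm)
}

# Made-up intensities for one feature measured three times in one run
psm_intensities = c(4023100, NA, 5132500)
summarized = median_intensity(psm_intensities, na.rm = TRUE)

# It could then be supplied as, for example:
# feature_cleaning = list(remove_features_with_few_measurements = TRUE,
#                         summarize_multiple_psms = median_intensity)
```

The default shown in the usage, max, follows the same contract.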

score_filtering

a list of named lists that specify filtering options. Details are provided in the vignette.

exact_filtering

a list of named lists that specify filtering options. Details are provided in the vignette.

pattern_filtering

a list of named lists that specify filtering options. Details are provided in the vignette.
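
In the Examples section, each pattern filter is a named list with col_name, pattern, filter and drop_column elements. The base R sketch below illustrates the effect such a filter appears to have, judging from the logged messages: rows whose column matches the pattern are removed. It is not the package's implementation. The helper apply_pattern_filter and the toy data are hypothetical, and the meaning of drop_column (dropping the column after filtering) is assumed from its name.

```r
# Filter specification with the same elements as in the Examples section
oxidation_filter = list(col_name = "Modifications", pattern = "Oxidation",
                        filter = TRUE, drop_column = TRUE)

# Hypothetical helper that mimics the assumed effect with base R
apply_pattern_filter = function(data, spec) {
    if (spec$filter) {
        # Rows whose column value matches the pattern are removed
        matches = grepl(spec$pattern, data[[spec$col_name]])
        data = data[!matches, , drop = FALSE]
    }
    if (spec$drop_column) {
        # Assumed meaning: the filtering column is dropped afterwards
        data[[spec$col_name]] = NULL
    }
    data
}

# Toy input, not real MaxQuant output
toy_peptides = data.frame(
    PeptideSequence = c("AEAPAAAPAAK", "MKLVR", "TTPGK"),
    Modifications = c("Unmodified", "Oxidation (M)", "Unmodified"),
    stringsAsFactors = FALSE
)
filtered = apply_pattern_filter(toy_peptides, oxidation_filter)
```

Several such specifications can be combined in one call, as in pattern_filtering = list(oxidation = oxidation_filter, m = m_filter) in the Examples section.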

columns_to_fill

a named list of scalars. If provided, columns with names given by the names of this list and values given by its elements will be added to the output data.table.
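
As a rough base R illustration of what this argument describes, the sketch below adds one constant column per list element to a toy table. It is not the package's implementation, and the toy data and variable names are hypothetical.

```r
# Same form as in the Examples section: names become column names and
# each scalar fills every row of its column
columns_to_fill = list(FragmentIon = NA, ProductCharge = NA)

# Toy table standing in for preprocessed output (hypothetical data)
toy_output = data.frame(
    PeptideSequence = c("AEAPAAAPAAK", "TTPGK"),
    PrecursorCharge = c(2, 3),
    stringsAsFactors = FALSE
)

# Add one constant column per element of the list
for (column in names(columns_to_fill)) {
    toy_output[[column]] = rep(columns_to_fill[[column]], nrow(toy_output))
}
```

This is useful when a column required by the MSstats format, such as FragmentIon, has no counterpart in the output of the signal processing tool.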

aggregate_isotopic

logical. If TRUE, isotopic peaks will be summed.

...

additional parameters to data.table::fread.

Value

data.table

Examples

evidence_path = system.file("tinytest/raw_data/MaxQuant/mq_ev.csv",
                            package = "MSstatsConvert")
pg_path = system.file("tinytest/raw_data/MaxQuant/mq_pg.csv",
                      package = "MSstatsConvert")
evidence = read.csv(evidence_path)
pg = read.csv(pg_path)
imported = MSstatsImport(list(evidence = evidence, protein_groups = pg),
                         "MSstats", "MaxQuant")
#> INFO [2021-05-10 23:03:42] ** Raw data from MaxQuant imported successfully.
cleaned_data = MSstatsClean(imported, protein_id_col = "Proteins")
#> INFO [2021-05-10 23:03:42] ** Rows with values of Potentialcontaminant equal to + are removed
#> INFO [2021-05-10 23:03:42] ** Rows with values of Reverse equal to + are removed
#> INFO [2021-05-10 23:03:42] ** Rows with values of Potentialcontaminant equal to + are removed
#> INFO [2021-05-10 23:03:42] ** Rows with values of Reverse equal to + are removed
#> INFO [2021-05-10 23:03:42] ** + Contaminant, + Reverse, + Potential.contaminant proteins are removed.
#> INFO [2021-05-10 23:03:42] ** Raw data from MaxQuant cleaned successfully.
annot_path = system.file("tinytest/raw_data/MaxQuant/annotation.csv",
                         package = "MSstatsConvert")
mq_annot = MSstatsMakeAnnotation(cleaned_data, read.csv(annot_path),
                                 Run = "Rawfile")
#> INFO [2021-05-10 23:03:42] ** Using provided annotation.
#> INFO [2021-05-10 23:03:42] ** Run labels were standardized to remove symbols such as '.' or '%'.
# To filter M-peptides and oxidation peptides
m_filter = list(col_name = "PeptideSequence", pattern = "M",
                filter = TRUE, drop_column = FALSE)
oxidation_filter = list(col_name = "Modifications", pattern = "Oxidation",
                        filter = TRUE, drop_column = TRUE)
msstats_format = MSstatsPreprocess(
    cleaned_data, mq_annot,
    feature_columns = c("PeptideSequence", "PrecursorCharge"),
    columns_to_fill = list(FragmentIon = NA, ProductCharge = NA),
    pattern_filtering = list(oxidation = oxidation_filter, m = m_filter)
)
#> INFO [2021-05-10 23:03:42] ** The following options are used:
#> - Features will be defined by the columns: PeptideSequence, PrecursorCharge
#> - Shared peptides will be removed.
#> - Proteins with a single feature will be removed.
#> - Features with less than 3 measurements across runs will be removed.
#> INFO [2021-05-10 23:03:42] ** Sequences containing Oxidation are removed.
#> INFO [2021-05-10 23:03:42] ** Sequences containing M are removed.
#> INFO [2021-05-10 23:03:42] ** Features with all missing measurements across runs are removed.
#> INFO [2021-05-10 23:03:42] ** Shared peptides are removed.
#> INFO [2021-05-10 23:03:42] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
#> INFO [2021-05-10 23:03:42] ** Features with one or two measurements across runs are removed.
#> INFO [2021-05-10 23:03:42] Proteins with a single feature are removed.
#> INFO [2021-05-10 23:03:42] ** Run annotation merged with quantification data.
# Output in the standard MSstats format
head(msstats_format)
#>                                            Run PeptideSequence PrecursorCharge
#> 1: 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1     AEAPAAAPAAK               2
#> 2: 121219_S_CCES_01_02_LysC_Try_1to10_Mixt_1_2     AEAPAAAPAAK               2
#> 3: 121219_S_CCES_01_03_LysC_Try_1to10_Mixt_1_3     AEAPAAAPAAK               2
#> 4: 121219_S_CCES_01_05_LysC_Try_1to10_Mixt_2_2     AEAPAAAPAAK               2
#> 5: 121219_S_CCES_01_06_LysC_Try_1to10_Mixt_2_3     AEAPAAAPAAK               2
#> 6: 121219_S_CCES_01_08_LysC_Try_1to10_Mixt_3_2     AEAPAAAPAAK               2
#>    Intensity ProteinName Condition BioReplicate Experiment IsotopeLabelType
#> 1:   4023100      P06959         1            1        1_1                L
#> 2:   5132500      P06959         1            1        1_2                L
#> 3:   2761600      P06959         1            1        1_3                L
#> 4:   4091800      P06959         2            2        2_2                L
#> 5:   4727000      P06959         2            2        2_3                L
#> 6:   2258400      P06959         3            3        3_2                L
#>    FragmentIon ProductCharge
#> 1:          NA            NA
#> 2:          NA            NA
#> 3:          NA            NA
#> 4:          NA            NA
#> 5:          NA            NA
#> 6:          NA            NA