% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ScpModel-Workflow.R
\name{ScpModel-Workflow}
\alias{ScpModel-Workflow}
\alias{scpModelWorkflow}
\alias{scpModelFilterPlot}
\title{Modelling single-cell proteomics data}
\usage{
scpModelWorkflow(object, formula, i = 1, name = "model", verbose = TRUE)

scpModelFilterPlot(object, name)
}
\arguments{
\item{object}{An object that inherits from the
\code{SummarizedExperiment} class.}

\item{formula}{A \code{formula} object controlling which variables are
to be modelled.}

\item{i}{A \code{logical}, \code{numeric} or \code{character} indicating which
assay of \code{object} to use as input for modelling. Only a single
assay can be provided. Defaults to the first assay.}

\item{name}{A \code{character(1)} providing the name to use to store or
retrieve the modelling results. When retrieving a model and
\code{name} is missing, the name of the first model found in
\code{object} is used.}

\item{verbose}{A \code{logical(1)} indicating whether to print progress
to the console.}
}
\description{
Function to estimate a linear model for each feature (peptide or
protein) of a single-cell proteomics data set. This is the
modelling step of the \emph{scplainer} workflow.
}
\section{Input data}{


The main input is \code{object}, which inherits from the
\code{SummarizedExperiment} class. The quantitative data are
retrieved using \code{assay(object)}. If \code{object} contains multiple
assays, use the argument \code{i} to specify which assay to take as
input; the function then uses \code{assay(object, i)} as the
quantification input.

The objective of modelling single-cell proteomics data is to
estimate, for each feature (peptide or protein), the effect of
known cell annotations on the measured intensities. These annotations
may contain biological information such as the cell line,
FACS-derived cell type, treatment, etc. We also highly recommend
including technical information, such as the MS acquisition run
information or the chemical label (in case of multiplexed
experiments). These annotations must be available from
\code{colData(object)}. \code{formula} specifies which annotations to use
during modelling.
}

\section{Data modelling workflow}{


The modelling workflow starts by generating a model matrix for
each feature given \code{colData(object)} and \code{formula}. The model
matrix for feature \eqn{i}, denoted \eqn{X_{(i)}}, is adapted to the
pattern of missing values (see section below). Then, the function
fits the model matrix against the quantitative data. In other
words, the function determines for each feature \eqn{i} (row in
the input data) the contribution of each variable in the model.
More formally, the general model definition is:

\deqn{Y_i = \beta_i X^T_{(i)} + \epsilon_i}

where \eqn{Y_i} is the row for feature \eqn{i} of the feature by
cell quantification matrix \eqn{Y}, \eqn{\beta_i} contains the
estimated coefficients for feature \eqn{i} with as many
coefficients as variables to estimate, \eqn{X^T_{(i)}} is the
transposed model matrix generated for feature \eqn{i}, and
\eqn{\epsilon_i} is the row for feature \eqn{i} of the feature by
cell matrix of residuals.

The coefficients are estimated using penalized least squares
regression. Next, the function computes the residual matrix and
the effect matrices. An effect matrix contains the data that is
captured by a given cell annotation. Formally, for each feature
\eqn{i}:

\deqn{\hat{M^f_i} = \hat{\beta^f_i} X^{fT}_{(i)} }

where \eqn{\hat{M^f}} is a feature by cell matrix containing the
data captured by annotation \eqn{f}, \eqn{\hat{\beta^f_i}}
are the estimated coefficients associated with annotation \eqn{f}
for feature \eqn{i}, and \eqn{X^{fT}_{(i)}} is the transposed
model matrix for feature \eqn{i} containing only the variables
associated with annotation \eqn{f}.

All results are stored in an \link{ScpModel} object, which is itself
stored in the metadata of \code{object}. Note that multiple models can
be estimated for the same \code{object}. In that case, provide the
\code{name} argument to store the results in a separate \code{ScpModel}.
}

\section{Feature filtering}{


The proportion of missing values for each feature is high in
single-cell proteomics data. Many features typically have more
coefficients to estimate than observed values. The model cannot be
estimated for these features, which are ignored during further
steps. They are identified by computing the ratio between the
number of observed values and the number of coefficients to
estimate, which we call the \strong{n/p ratio}. Once the model is
estimated, use \code{scpModelFilterPlot(object)} to explore the
distribution of n/p ratios across the features. You can also
extract the n/p ratio for each feature using
\code{scpModelFilterNPRatio(object)}. By default, any feature with
an n/p ratio lower than 1 is ignored. However, features with an
n/p ratio close to 1 may lead to unreliable estimates because there
are too few observed values. You can think of the n/p ratio as
the average number of replicates per coefficient to estimate.
Therefore, you may want to increase the n/p threshold. You can do
so using \code{scpModelFilterThreshold(object) <- npThreshold}.
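
As a worked illustration with hypothetical numbers (they do not come
from any data set), consider a model with 12 coefficients to
estimate. A feature quantified in 30 cells has

\deqn{n/p = 30 / 12 = 2.5}

and is retained under the default threshold of 1 as well as under a
stricter threshold of 2. A feature quantified in only 9 cells has
\eqn{n/p = 9 / 12 = 0.75} and is ignored.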
}

\section{About missing values}{


The data modelling workflow is designed to take the presence of
missing values into account. We highly recommend that you do
\strong{not impute} the data before modelling. Instead, the modelling
approach ignores missing values and generates a model matrix using
only the observed values for each feature. However, the model
matrices for some features may contain highly correlated variables,
leading to near singular designs. We include a small ridge penalty
to reduce the numerical instability associated with correlated
variables.
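
As an illustrative sketch only, assuming a standard ridge estimator
(the penalty \eqn{\lambda} and the identity matrix \eqn{I} are
symbols introduced here, and the exact penalty applied by the
function may differ), the coefficients for feature \eqn{i} would be
obtained from its observed values as:

\deqn{\hat{\beta}_i = Y_i X_{(i)} (X^T_{(i)} X_{(i)} + \lambda I)^{-1}}

With \eqn{\lambda = 0} this reduces to ordinary least squares, which
fails when \eqn{X^T_{(i)} X_{(i)}} is singular. A small positive
\eqn{\lambda} keeps the matrix invertible.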
}

\examples{

data("leduc_minimal")
leduc_minimal
## Overview of available cell annotations
colData(leduc_minimal)

####---- Model data ----####

f <- ~ 1 + ## intercept
    Channel + Set + ## batch variables
    MedianIntensity + ## normalization
    SampleType ## biological variable
leduc_minimal <- scpModelWorkflow(leduc_minimal, formula = f)
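
## Several models can be stored in the same object by providing
## another name. The name "batchOnly" and the reduced formula below
## are chosen for illustration only.
leduc_minimal <- scpModelWorkflow(
    leduc_minimal,
    formula = ~ 1 + Channel + Set,
    name = "batchOnly"
)
scpModelFilterPlot(leduc_minimal, name = "batchOnly")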

####---- n/p feature filtering ----####

## Get n/p ratios
head(scpModelFilterNPRatio(leduc_minimal))

## Plot n/p ratios
scpModelFilterPlot(leduc_minimal)

## Change n/p ratio threshold
scpModelFilterThreshold(leduc_minimal) <- 2
scpModelFilterPlot(leduc_minimal)
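
####---- Sketch of the n/p ratio and ridge fit ----####

## Illustrative sketch in base R on simulated data; this is not the
## code run by scpModelWorkflow(). The objects toyAnnot, toyY and toyX,
## the helper toyRidge() and the penalty value lambda are assumptions
## made for this example. Unlike the real workflow, the model matrix
## is only subset to the observed cells and the intercept is penalized.
set.seed(1)
toyAnnot <- data.frame(
    Set = factor(rep(c("A", "B"), each = 6)),
    SampleType = factor(rep(c("x", "y"), times = 6))
)
toyY <- matrix(
    rnorm(3 * 12), nrow = 3,
    dimnames = list(paste0("pep", 1:3), NULL)
)
toyY["pep1", 1:4] <- NA ## pep1 is observed in 8 cells
toyY["pep2", 1:10] <- NA ## pep2 is observed in 2 cells
toyX <- model.matrix(~ 1 + Set + SampleType, data = toyAnnot)

## n/p ratio: observed values over coefficients to estimate
npRatio <- rowSums(!is.na(toyY)) / ncol(toyX)
npRatio

## Ridge fit for one feature using its observed cells only
toyRidge <- function(y, X, lambda = 1e-3) {
    obs <- !is.na(y)
    Xobs <- X[obs, , drop = FALSE]
    penalty <- diag(lambda, ncol(Xobs))
    drop(solve(crossprod(Xobs) + penalty, crossprod(Xobs, y[obs])))
}

## Fit only the features that pass the default threshold of 1
kept <- names(npRatio)[npRatio >= 1]
toyCoef <- t(sapply(kept, function(i) toyRidge(toyY[i, ], toyX)))
toyCoef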
}
\references{
Vanderaa, C. and Gatto, L. (2023). scplainer: using linear models to
understand mass spectrometry-based single-cell proteomics data.
bioRxiv 2023.12.14.571792. doi:
https://doi.org/10.1101/2023.12.14.571792.
}
\seealso{
This function is part of the \emph{scplainer} workflow, which also
includes \link{ScpModel-VarianceAnalysis},
\link{ScpModel-DifferentialAnalysis} and
\link{ScpModel-ComponentAnalysis} to explore the model results.

\linkS4class{ScpModel} provides functions to extract information from
the \code{ScpModel} object.

\link{scpKeepEffect} and \link{scpRemoveBatchEffect} perform batch
correction for downstream analyses.
}
\author{
Christophe Vanderaa, Laurent Gatto
}
