% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/quality_control.R
\name{tof_assess_clusters_entropy}
\alias{tof_assess_clusters_entropy}
\title{Assess a clustering result by calculating the Shannon entropy of each cell's
Mahalanobis distances to all cluster centroids and flagging outliers.}
\usage{
tof_assess_clusters_entropy(
  tof_tibble,
  cluster_col,
  marker_cols = where(tof_is_numeric),
  entropy_threshold,
  entropy_quantile = 0.9,
  num_closest_clusters,
  augment = FALSE
)
}
\arguments{
\item{tof_tibble}{A `tof_tbl` or `tibble`.}

\item{cluster_col}{An unquoted column name indicating which column in `tof_tibble`
stores the cluster ids for the cluster to which each cell belongs.
Cluster labels can be produced via any method the user chooses - including manual gating,
any of the functions in the `tof_cluster_*` function family, or any other method.}

\item{marker_cols}{Unquoted column names indicating which columns in `tof_tibble`
should be interpreted as markers to be used in the Mahalanobis distance calculation.
Defaults to all numeric columns. Supports tidyselection.}

\item{entropy_threshold}{A scalar indicating the entropy threshold above
which a cell should be considered anomalous. If unspecified, a threshold will
be computed using `entropy_quantile` (see below). Note that entropy is often
between 0 and 1, but can be larger when there are many clusters.}

\item{entropy_quantile}{A scalar between 0 and 1 indicating the entropy quantile
above which a cell should be considered anomalous. Defaults to 0.9, which means
that cells with an entropy above the 90th percentile will be flagged. Ignored
if `entropy_threshold` is specified directly.}

\item{num_closest_clusters}{An integer indicating how many of a cell's closest
cluster centroids should have their Mahalanobis distance included in the entropy
calculation. Lowering this value lets you ignore distances to clusters that are
far away from each cell. Many distant centroids with large distances can
artificially inflate a cell's entropy value and thus distort the result,
although this is rarely an issue empirically.
Defaults to all clusters in `tof_tibble`.}

\item{augment}{A boolean value indicating if the output should column-bind the
computed flags for each cell (see below) as new columns in `tof_tibble` (TRUE) or if
a tibble including only the computed flags should be returned (FALSE, the default).}
}
\value{
If augment = FALSE (the default), a tibble with 2 + NUM_CLUSTERS columns,
where NUM_CLUSTERS is the number of unique clusters in cluster_col.
Two of the columns will be "entropy" (the entropy value for each cell) and "flagged_cell"
(a boolean value indicating if each cell had an entropy value above entropy_threshold).
The other NUM_CLUSTERS columns will contain the Mahalanobis distances from each cell
to each of the clusters in cluster_col (named ".mahalanobis_\{cluster_name\}").
If augment = TRUE, the same 2 + NUM_CLUSTERS columns will be column-bound to
tof_tibble, and the resulting tibble will be returned.
}
\description{
This function evaluates the result of a clustering procedure by calculating
the Mahalanobis distance between each cell and the centroids of all clusters
in the dataset and finding the Shannon entropy of the resulting vector of distances.
All cells with an entropy above a user-specified threshold are flagged
as potentially anomalous. Entropy is minimized (to 0) when a cell is close to
one cluster (or a small number of clusters), but far from the rest of them.
If a cell is close to multiple cluster centroids (i.e. has an ambiguous phenotype),
its entropy will be large.
}
\examples{

# simulate data
sim_data <-
    dplyr::tibble(
        cd45 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd38 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd34 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd19 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cluster_id = c(rep("a", 1000), rep("b", 1000), rep("c", 1000))
    )

# imagine a "reference" dataset in which "cluster a" isn't present
sim_data_reference <-
    sim_data |>
    dplyr::filter(cluster_id \%in\% c("b", "c"))

# if we cluster into the reference dataset, we will force all cells in
# cluster a into a population where they don't fit very well
sim_data <-
    sim_data |>
    tof_cluster(
        healthy_tibble = sim_data_reference,
        healthy_label_col = cluster_id,
        method = "ddpr"
    )

# we can evaluate the clustering quality by calculating the entropy of the
# Mahalanobis distance vector for each cell to all cluster centroids
entropy_result <-
    sim_data |>
    tof_assess_clusters_entropy(
        cluster_col = .mahalanobis_cluster,
        marker_cols = starts_with("cd"),
        entropy_quantile = 0.8,
        augment = TRUE
    )

# most cells in "cluster a" are flagged, and few cells in the other clusters are
flagged_cluster_proportions <-
    entropy_result |>
    dplyr::group_by(cluster_id) |>
    dplyr::summarize(
        prop_flagged = mean(flagged_cell)
    )
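
# Illustrative sketch only (not tidytof's internal code) of the idea in the
# Description section: compute a Shannon entropy per cell from its distances
# to each cluster centroid, then flag cells above a quantile cutoff.
# Assumptions: distances are converted to weights by taking their inverse and
# normalizing to sum to 1; `sketch_entropy` and the toy values are hypothetical.
sketch_entropy <- function(distances) {
    weights <- 1 / distances
    p <- weights / sum(weights)
    p <- p[p > 0]
    -sum(p * log(p))
}

# each row is a cell; each column is a distance to one of three centroids
toy_distances <-
    rbind(
        c(0.1, 50, 60), # close to one centroid only
        c(0.2, 40, 45),
        c(30, 0.3, 55),
        c(60, 45, 0.1),
        c(5, 5, 5) # equidistant from all centroids (ambiguous)
    )
toy_entropy <- apply(toy_distances, MARGIN = 1, FUN = sketch_entropy)

# flag cells whose entropy exceeds the 80th percentile of all entropies
toy_threshold <- stats::quantile(toy_entropy, probs = 0.8)
toy_flagged <- toy_entropy > toy_threshold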

}
