---
output:
  html_document
bibliography: ref.bib
---

# Problems with ambient RNA {#ambient-problems}

<script>
document.addEventListener("click", function (event) {
    if (event.target.classList.contains("rebook-collapse")) {
        event.target.classList.toggle("active");
        var content = event.target.nextElementSibling;
        if (content.style.display === "block") {
            content.style.display = "none";
        } else {
            content.style.display = "block";
        }
    }
})
</script>

<style>
.rebook-collapse {
  background-color: #eee;
  color: #444;
  cursor: pointer;
  padding: 18px;
  width: 100%;
  border: none;
  text-align: left;
  outline: none;
  font-size: 15px;
}

.rebook-content {
  padding: 0 18px;
  display: none;
  overflow: hidden;
  background-color: #f1f1f1;
}
</style>

## Background

Ambient contamination is a phenomenon that is generally most pronounced in massively multiplexed scRNA-seq protocols.
Briefly, extracellular RNA (most commonly released upon cell lysis) is captured along with each cell in its reaction chamber, contributing counts to genes that are not otherwise expressed in that cell (see [Advanced Section 7.2](http://bioconductor.org/books/3.23/OSCA.advanced/droplet-processing.html#qc-droplets)).
Differences in the ambient profile across samples are not uncommon under strong experimental perturbations, where high expression of a gene in a condition-specific cell type can "bleed over" into all other cell types in the same sample.
This is problematic for DE analyses between conditions, as DEGs detected for a particular cell type may be driven by differences in the ambient profiles rather than any intrinsic change in gene regulation. 
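
The mechanism can be made concrete with a minimal base R sketch.
All numbers here are hypothetical (the gene names, the ambient profiles and the 5% contamination rate are assumptions for illustration only).
A gene with no intrinsic expression in a cell type still acquires a large log-fold change between conditions when the ambient profiles differ.

``` r
# Hypothetical intrinsic expression of one cell type; identical in both conditions.
toy.true <- c(Hbb=0, Sox10=100, Actb=1000)

# Hypothetical ambient profiles; only the WT sample contains erythroid cells.
toy.ambient.wt <- c(Hbb=5000, Sox10=10, Actb=1000)
toy.ambient.ko <- c(Hbb=50, Sox10=10, Actb=1000)

# Assume that each pseudo-bulk profile picks up 5% of its sample's ambient profile.
rate <- 0.05
toy.obs.wt <- toy.true + rate * toy.ambient.wt
toy.obs.ko <- toy.true + rate * toy.ambient.ko

# Log-fold change of KO over WT, driven purely by the ambient difference.
toy.logFC <- log2(toy.obs.ko / toy.obs.wt)
round(toy.logFC, 2)
```

Only the gene that differs in the ambient profiles shows a non-zero log-fold change, even though its intrinsic expression is zero in both conditions.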

To illustrate, we consider the _Tal1_-knockout (KO) chimera data from @pijuansala2019single.
This is very similar to the WT chimera dataset we previously examined, only differing in that the _Tal1_ gene was knocked out in the injected cells.
_Tal1_ is a transcription factor that has known roles in erythroid differentiation; the aim of the experiment was to determine if blocking of the erythroid lineage diverted cells to other developmental fates.
(To cut a long story short: yes, it did.)


``` r
library(MouseGastrulationData)
sce.tal1 <- Tal1ChimeraData()
counts(sce.tal1) <- as(counts(sce.tal1), "CsparseMatrix") 

library(scuttle)
rownames(sce.tal1) <- uniquifyFeatureNames(
    rowData(sce.tal1)$ENSEMBL, 
    rowData(sce.tal1)$SYMBOL
)
sce.tal1
```

```
## class: SingleCellExperiment 
## dim: 29453 56122 
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(56122): cell_1 cell_2 ... cell_56121 cell_56122
## colData names(9): cell barcode ... pool sizeFactor
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):
```

We will perform a DE analysis between WT and KO cells labelled as "neural crest".
We observe that the strongest DEGs are the hemoglobins, which are downregulated in the injected cells.
This is rather surprising as these cells are distinct from the erythroid lineage and should not express hemoglobins at all. 
The most sober explanation is that the background samples contain more hemoglobin transcripts in the ambient solution due to leakage from erythrocytes (or their precursors) during sorting and dissociation.


``` r
library(scran)
summed.tal1 <- aggregateAcrossCells(sce.tal1, 
    ids=DataFrame(sample=sce.tal1$sample,
        label=sce.tal1$celltype.mapped)
)
summed.tal1$block <- summed.tal1$sample %% 2 == 0 # Add blocking factor.

# Subset to our neural crest cells.
summed.neural <- summed.tal1[,summed.tal1$label=="Neural crest"]
summed.neural
```

```
## class: SingleCellExperiment 
## dim: 29453 4 
## metadata(0):
## assays(1): counts
## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames: NULL
## colData names(13): cell barcode ... ncells block
## reducedDimNames(1): pca.corrected
## mainExpName: NULL
## altExpNames(0):
```

``` r
# Standard edgeR analysis, as described in previous chapters.
res.neural <- pseudoBulkDGE(summed.neural, 
    label=summed.neural$label,
    design=~factor(block) + tomato,
    coef="tomatoTRUE",
    condition=summed.neural$tomato)
summarizeTestsPerLabel(decideTestsPerLabel(res.neural))
```

```
##               -1     0   1    NA
## Neural crest 262 10009 379 18803
```

``` r
# Ordering the DE statistics by p-value to inspect the top DEGs.
tab.neural <- res.neural[[1]]
tab.neural <- tab.neural[order(tab.neural$PValue),]
head(tab.neural, 10)
```

```
## DataFrame with 10 rows and 5 columns
##                   logFC    logCPM         F      PValue         FDR
##               <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Hbb-bh1       -8.091036   9.15972 11467.738 1.74902e-32 1.86271e-28
## Hba-x         -7.724801   8.53284  8639.398 4.48170e-31 2.38650e-27
## Hbb-y         -8.415624   8.35705  8044.164 1.02903e-30 3.65304e-27
## Xist          -7.555706   8.21232  5308.545 1.17657e-28 3.13262e-25
## Hba-a1        -8.596670   6.74429  2946.948 8.55298e-26 1.82178e-22
## Hba-a2        -8.866231   5.81300  1376.524 1.22479e-22 2.17401e-19
## Cdkn1c        -8.864545   4.96097   773.323 3.65284e-20 5.55753e-17
## Uba52         -0.879666   8.38618   463.147 1.02832e-16 1.36895e-13
## Fdps           0.981419   7.21805   377.545 9.53933e-16 1.12882e-12
## Gt(ROSA)26Sor  1.481295   5.71617   369.127 1.21796e-15 1.29713e-12
```



As an aside, it is worth mentioning that the "replicates" in this study are more technical than biological,
so some exaggeration of the significance of the effects is to be expected.
Nonetheless, it is a useful dataset to demonstrate some strategies for mitigating issues caused by ambient contamination.

## Filtering out affected DEGs 

### By estimating ambient contamination

As shown above, the presence of ambient contamination makes it difficult to interpret multi-condition DE analyses.
To mitigate its effects, we need to obtain an estimate of the ambient "expression" profile from the raw count matrix for each sample.
We follow the approach used in `emptyDrops()` [@lun2018distinguishing] and consider all barcodes with total counts below 100 to represent empty droplets.
We then sum the counts for each gene across these barcodes to obtain an expression vector representing the ambient profile for each sample.


``` r
library(DropletUtils)
ambient <- vector("list", ncol(summed.neural))

# Looping over all raw (unfiltered) count matrices and
# computing the ambient profile based on its low-count barcodes.
# Turning off rounding, as we know this is count data.
for (s in seq_along(ambient)) {
    raw.tal1 <- Tal1ChimeraData(type="raw", samples=s)[[1]]
    counts(raw.tal1) <- as(counts(raw.tal1), "CsparseMatrix")
    ambient[[s]] <- ambientProfileEmpty(counts(raw.tal1), 
        good.turing=FALSE, round=FALSE)
}

# Cleaning up the output for pretty printing.
ambient <- do.call(cbind, ambient)
colnames(ambient) <- seq_len(ncol(ambient))
rownames(ambient) <- uniquifyFeatureNames(
    rowData(raw.tal1)$ENSEMBL, 
    rowData(raw.tal1)$SYMBOL
)
head(ambient)
```

```
##          1  2  3  4
## Xkr4     1  0  0  0
## Gm1992   0  0  0  0
## Gm37381  1  0  1  0
## Rp1      0  1  0  1
## Sox17   76 76 31 53
## Gm37323  0  0  0  0
```
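
The summation described above can be illustrated on a toy matrix.
This is only a sketch of the idea and not the implementation of `ambientProfileEmpty()`; the counts and the name `lower` are hypothetical, and we assume that barcodes with totals at or below the threshold are treated as empty.

``` r
# Hypothetical raw count matrix: 3 genes by 5 barcodes.
toy.raw <- matrix(c(
    1, 0, 2, 300, 0,
    0, 1, 0, 150, 1,
    3, 2, 1, 900, 0),
    nrow=3, byrow=TRUE,
    dimnames=list(c("g1", "g2", "g3"), paste0("bc", 1:5)))

# Barcodes with low total counts are assumed to be empty droplets.
lower <- 100
totals <- colSums(toy.raw)
is.empty <- totals <= lower

# Summing counts across the empty barcodes gives the ambient profile.
toy.profile <- rowSums(toy.raw[, is.empty, drop=FALSE])
toy.profile
```

Here, `bc4` is the only barcode with a total above the threshold, so it is excluded from the sum.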



For each sample, we determine the maximum proportion of the count for each gene that could be attributed to ambient contamination.
This is done by scaling the ambient profile in `ambient` to obtain a per-gene expected count from ambient contamination, with which we compute the $p$-value for observing a count equal to or lower than that in `summed.neural`. 
We perform this for a range of scaling factors and identify the largest factor that yields a $p$-value above a given threshold.
The scaled ambient profile represents the upper bound of the contribution to each sample from ambient contamination.
We deliberately use an upper bound so that our next step will aggressively remove any gene that is potentially problematic.


``` r
max.ambient <- ambientContribMaximum(counts(summed.neural), 
    ambient, mode="proportion")
head(max.ambient)
```

```
##           [,1]   [,2]  [,3] [,4]
## Xkr4       NaN    NaN   NaN  NaN
## Gm1992     NaN    NaN   NaN  NaN
## Gm37381    NaN    NaN   NaN  NaN
## Rp1        NaN    NaN   NaN  NaN
## Sox17   0.1775 0.1833 0.468    1
## Gm37323    NaN    NaN   NaN  NaN
```
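
The logic of the upper bound can be sketched in base R for a single sample.
This is a simplified illustration and not the implementation of `ambientContribMaximum()`: we assume Poisson-distributed counts, combine the per-gene $p$-values with Simes' method, and search over a fixed grid of scaling factors.
All numbers, the 0.1 threshold and the variable names are hypothetical.

``` r
# Hypothetical ambient profile and pseudo-bulk counts for four genes.
toy.ambient <- c(A=100, B=50, C=10, D=1)
toy.y <- c(A=20, B=200, C=300, D=5)

# For each candidate scaling factor, compute the per-gene probability of
# observing a count equal to or lower than 'toy.y' under a Poisson model
# (an assumption), then combine the p-values across genes.
s.grid <- seq(0.01, 2, by=0.01)
combined <- vapply(s.grid, function(s) {
    p <- ppois(toy.y, lambda=s * toy.ambient)
    min(p.adjust(p, method="BH")) # Simes' combined p-value.
}, numeric(1))

# Largest scaling factor that is not rejected at the chosen threshold.
threshold <- 0.1
s.max <- max(s.grid[combined > threshold])

# Upper bound on the proportion of each gene's count that is ambient-derived.
toy.max.prop <- pmin(1, s.max * toy.ambient / toy.y)
s.max
round(toy.max.prop, 3)
```

Gene `A` has a low count relative to its ambient abundance, so it limits how far the ambient profile can be scaled up before the observed counts become implausibly low.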

Genes in which over 10% of the counts are ambient-derived are subsequently discarded from our analysis.
For balanced designs, this threshold prevents ambient contribution from biasing the true fold-change by more than 10%, which is a tolerable margin of error for most applications.
(Unbalanced designs may warrant the use of a weighted average to account for sample size differences between groups.)
This approach yields a slightly smaller list of DEGs without the hemoglobins, which is encouraging as it suggests that any other, less obvious effects of ambient contamination have also been removed.


``` r
# Averaging the ambient contribution across samples.
contamination <- rowMeans(max.ambient, na.rm=TRUE)
non.ambient <- contamination <= 0.1
summary(non.ambient)
```

```
##    Mode   FALSE    TRUE    NA's 
## logical    1475   15306   12672
```

``` r
okay.genes <- names(non.ambient)[which(non.ambient)]
tab.neural2 <- tab.neural[rownames(tab.neural) %in% okay.genes,]

table(Direction=tab.neural2$logFC > 0, Significant=tab.neural2$FDR <= 0.05)
```

```
##          Significant
## Direction FALSE TRUE
##     FALSE  4907  229
##     TRUE   4882  352
```

``` r
head(tab.neural2, 10)
```

```
## DataFrame with 10 rows and 5 columns
##                   logFC    logCPM         F      PValue         FDR
##               <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Xist          -7.555706   8.21232  5308.545 1.17657e-28 3.13262e-25
## Uba52         -0.879666   8.38618   463.147 1.02832e-16 1.36895e-13
## Fdps           0.981419   7.21805   377.545 9.53933e-16 1.12882e-12
## Gt(ROSA)26Sor  1.481295   5.71617   369.127 1.21796e-15 1.29713e-12
## Grb10         -1.403141   6.58314   357.417 1.72628e-15 1.67135e-12
## Mcts2          1.137662   6.42689   346.482 2.41461e-15 2.14297e-12
## H13           -1.481663   5.90902   326.189 4.62548e-15 3.78934e-12
## Msmo1          1.493789   5.43923   310.076 7.96675e-15 5.83631e-12
## Snrpb          0.564202  10.18934   309.171 8.22016e-15 5.83631e-12
## Mest           0.549347  10.98269   305.278 9.41488e-15 6.26678e-12
```
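
For unbalanced designs, one possible form of the weighted average mentioned above is sketched here.
We assume that each sample should be weighted by the inverse of the number of samples in its group, so that both groups contribute equally; this is one interpretation rather than a prescribed method, and the proportions and group labels are hypothetical.

``` r
# Hypothetical per-sample contamination proportions (genes by samples).
toy.prop <- matrix(c(
    0.05, 0.05, 0.05, 0.40,
    0.90, 0.95, 0.92, NA),
    nrow=2, byrow=TRUE,
    dimnames=list(c("geneA", "geneB"), paste0("sample", 1:4)))
toy.group <- c("KO", "KO", "KO", "WT")

# Weight each sample by the inverse of its group size.
w <- as.numeric(1 / table(toy.group)[toy.group])

# NA-aware weighted mean per gene: missing entries receive zero weight.
wmat <- matrix(w, nrow(toy.prop), ncol(toy.prop), byrow=TRUE)
wmat[is.na(toy.prop)] <- 0
toy.weighted <- rowSums(toy.prop * wmat, na.rm=TRUE) / rowSums(wmat)

# Comparing to the unweighted average.
cbind(unweighted=rowMeans(toy.prop, na.rm=TRUE), weighted=toy.weighted)
```

For `geneA`, the single WT sample is no longer outvoted by the three KO samples, so the weighted average is higher than the unweighted one.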



A softer approach is to simply report the average contaminating percentage for each gene in the table of DE statistics.
Readers can then make up their own minds as to whether a particular DEG's effect is driven by ambient contamination.
Indeed, it is worth remembering that `ambientContribMaximum()` will report the maximum possible contamination rather than attempting to estimate the actual level of contamination, and filtering on the former may be too conservative.
This is especially true for cell populations that are contributing to the differences in the ambient pool; in the most extreme case, the reported maximum contamination would be 100% for cell types with an expression profile that is identical to the ambient pool.


``` r
tab.neural3 <- tab.neural
tab.neural3$contamination <- contamination[rownames(tab.neural3)]
head(tab.neural3)
```

```
## DataFrame with 6 rows and 6 columns
##             logFC    logCPM         F      PValue         FDR contamination
##         <numeric> <numeric> <numeric>   <numeric>   <numeric>     <numeric>
## Hbb-bh1  -8.09104   9.15972  11467.74 1.74902e-32 1.86271e-28     0.9900717
## Hba-x    -7.72480   8.53284   8639.40 4.48170e-31 2.38650e-27     0.9945348
## Hbb-y    -8.41562   8.35705   8044.16 1.02903e-30 3.65304e-27     0.9674483
## Xist     -7.55571   8.21232   5308.54 1.17657e-28 3.13262e-25     0.0605735
## Hba-a1   -8.59667   6.74429   2946.95 8.55298e-26 1.82178e-22     0.8626846
## Hba-a2   -8.86623   5.81300   1376.52 1.22479e-22 2.17401e-19     0.7351403
```

### With prior knowledge 

Another strategy for estimating the ambient proportions involves the use of prior knowledge of mutually exclusive gene expression profiles [@young2018soupx].
In this case, we assume (reasonably) that hemoglobins should not be expressed in neural crest cells and use this to estimate the contamination in each sample.
This is achieved with the `ambientContribNegative()` function, which scales the ambient profile so that the hemoglobin coverage is the same as that in the corresponding sample of `summed.neural`.
From these profiles, we compute proportions of ambient contamination that are used to mark or filter out affected genes in the same manner as described above.


``` r
is.hbb <- grep("^Hb[ab]-", rownames(summed.neural))
ctrl.ambient <- ambientContribNegative(counts(summed.neural), ambient,
    features=is.hbb,  mode="proportion")
head(ctrl.ambient)
```

```
##            [,1]    [,2]   [,3] [,4]
## Xkr4        NaN     NaN    NaN  NaN
## Gm1992      NaN     NaN    NaN  NaN
## Gm37381     NaN     NaN    NaN  NaN
## Rp1         NaN     NaN    NaN  NaN
## Sox17   0.06774 0.08798 0.4796    1
## Gm37323     NaN     NaN    NaN  NaN
```
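
The scaling step can be sketched in base R for a single sample.
This is an illustration of the idea rather than the implementation of `ambientContribNegative()`; the counts, the gene names and the choice of control features are hypothetical.

``` r
# Hypothetical pseudo-bulk counts and ambient profile for one sample.
toy.y <- c(Hbb=40, Hba=60, GeneX=500, GeneY=30)
toy.ambient <- c(Hbb=400, Hba=600, GeneX=100, GeneY=250)

# Control features that are assumed to be unexpressed in this cell type.
controls <- c("Hbb", "Hba")

# Scale the ambient profile so that the control totals match those in 'toy.y'.
scaling <- sum(toy.y[controls]) / sum(toy.ambient[controls])
toy.scaled <- scaling * toy.ambient

# Proportion of each gene's count that is attributed to the ambient solution.
toy.ctrl.prop <- pmin(1, toy.scaled / toy.y)
round(toy.ctrl.prop, 3)
```

The control features are fully attributed to contamination by construction, while `GeneX` is barely affected and `GeneY` would be discarded at a 10% threshold.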

``` r
ctrl.non.ambient <- rowMeans(ctrl.ambient, na.rm=TRUE) <= 0.1
summary(ctrl.non.ambient)
```

```
##    Mode   FALSE    TRUE    NA's 
## logical    1388   15393   12672
```

``` r
okay.genes <- names(ctrl.non.ambient)[which(ctrl.non.ambient)]
tab.neural4 <- tab.neural[rownames(tab.neural) %in% okay.genes,]
head(tab.neural4)
```

```
## DataFrame with 6 rows and 5 columns
##                   logFC    logCPM         F      PValue         FDR
##               <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Xist          -7.555706   8.21232  5308.545 1.17657e-28 3.13262e-25
## Uba52         -0.879666   8.38618   463.147 1.02832e-16 1.36895e-13
## Fdps           0.981419   7.21805   377.545 9.53933e-16 1.12882e-12
## Gt(ROSA)26Sor  1.481295   5.71617   369.127 1.21796e-15 1.29713e-12
## Grb10         -1.403141   6.58314   357.417 1.72628e-15 1.67135e-12
## Mcts2          1.137662   6.42689   346.482 2.41461e-15 2.14297e-12
```



Any highly expressed cell type-specific gene is a candidate for this procedure,
most typically in cell types that are highly specialized towards manufacturing a protein product.
Aside from hemoglobin, we could use immunoglobulins in populations containing B cells,
or insulin and glucagon in pancreas datasets ([Advanced Figure 6.3](http://bioconductor.org/books/3.23/OSCA.advanced/marker-detection-redux.html#fig:viol-gcg-lawlor)).
The experimental setting may also provide some genes that must only be present in the ambient solution;
for example, the mitochondrial transcripts can be used to estimate ambient contamination in single-nucleus RNA-seq,
while _Xist_ can be used for datasets involving mixtures of male and female cells
(where the contaminating percentages are estimated from the profiles of male cells only).

If appropriate control features are available, this approach allows us to obtain a more accurate estimate of the contamination in each pseudo-bulk sample compared to the upper bound provided by `ambientContribMaximum()`.
This avoids the removal of genuine DEGs due to overestimation of the ambient contamination by the latter.
However, the performance of this approach is fully dependent on the suitability of the control features: if a "control" feature is actually expressed in a cell type, the ambient contribution will be overestimated.
A simple mitigating strategy is to take the lower of the proportions from `ambientContribNegative()` and `ambientContribMaximum()`, with the idea being that the latter will avoid egregious overestimation when the control set is misspecified.
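
Taking the lower of the two estimates is a one-liner with `pmin()`.
The sketch below uses hypothetical per-gene proportions in place of the averaged outputs of `ambientContribNegative()` and `ambientContribMaximum()`.

``` r
# Hypothetical average contamination proportions from the two approaches.
toy.ctrl <- c(geneA=0.02, geneB=1.00, geneC=0.60)
toy.max <- c(geneA=0.10, geneB=0.35, geneC=0.90)

# Element-wise minimum, capping the control-based estimate at the upper bound.
toy.combined <- pmin(toy.ctrl, toy.max)

# Applying the same 10% threshold as before.
toy.keep <- toy.combined <= 0.1
data.frame(control=toy.ctrl, maximum=toy.max,
    combined=toy.combined, keep=toy.keep)
```

For `geneB`, the control-based estimate exceeds the upper bound, which would hint at a misspecified control set; the combined value falls back to the upper bound.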

### Without an ambient profile

An estimate of the ambient profile is rarely available for public datasets where only the per-cell count matrices are provided.
In such cases, we must instead use the rest of the dataset to infer something about the effects of ambient contamination.
The most obvious approach is to construct a proxy ambient profile by summing the counts for all cells from each sample, which can be used in place of the actual profile in the previous calculations.


``` r
proxy.ambient <- aggregateAcrossCells(summed.tal1,
    ids=summed.tal1$sample)

# Using 'proxy.ambient' instead of the estimated 'ambient'.
max.ambient.proxy <- ambientContribMaximum(counts(summed.neural), 
    counts(proxy.ambient), mode="proportion")
head(max.ambient.proxy)
```

```
##           [,1]   [,2]   [,3]   [,4]
## Xkr4       NaN    NaN    NaN    NaN
## Gm1992     NaN    NaN    NaN    NaN
## Gm37381    NaN    NaN    NaN    NaN
## Rp1        NaN    NaN    NaN    NaN
## Sox17   0.7427 0.9891 0.5283 0.9067
## Gm37323    NaN    NaN    NaN    NaN
```

``` r
con.ambient.proxy <- ambientContribNegative(counts(summed.neural), 
    counts(proxy.ambient), features=is.hbb,  mode="proportion")
head(con.ambient.proxy)
```

```
##         [,1] [,2]   [,3] [,4]
## Xkr4     NaN  NaN    NaN  NaN
## Gm1992   NaN  NaN    NaN  NaN
## Gm37381  NaN  NaN    NaN  NaN
## Rp1      NaN  NaN    NaN  NaN
## Sox17      1    1 0.6032    1
## Gm37323  NaN  NaN    NaN  NaN
```

This assumes equal contributions from all labels to the ambient pool, which is not entirely unrealistic (Figure \@ref(fig:proxy-ambience)) though some discrepancies can be expected due to the presence of particularly fragile cell types or extracellular RNA.


``` r
par(mfrow=c(2,2))
for (i in seq_len(ncol(proxy.ambient))) {
    true <- ambient[,i]
    proxy <- assay(proxy.ambient)[,i]
    logged <- edgeR::cpm(cbind(proxy, true), log=TRUE, prior.count=2)
    logFC <- logged[,1] - logged[,2]
    abundance <- rowMeans(logged)
    plot(abundance, logFC, main=paste("Sample", i))
}
```

<div class="figure">
<img src="ambient-problems_files/figure-html/proxy-ambience-1.png" alt="MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the _Tal1_ chimera dataset." width="672" />
<p class="caption">(\#fig:proxy-ambience)MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the _Tal1_ chimera dataset.</p>
</div>

Alternatively, we may choose to mitigate the effect of ambient contamination by focusing on label-specific DEGs.
Contamination-driven DEGs should be systematically present in comparisons for all labels, and thus can be eliminated by simply ignoring all genes that are significant in a majority of these comparisons (Section \@ref(cross-label-meta-analyses)).
The obvious drawback of this approach is that it discounts genuine DEGs that have a consistent effect in most/all labels, though one could perhaps argue that such "global" DEGs are not particularly interesting anyway.
It is also complicated by fluctuations in detection power across comparisons involving different numbers of cells - or replicates, after filtering pseudo-bulk profiles by the number of cells.


``` r
res.tal1 <- pseudoBulkSpecific(summed.tal1, 
    label=summed.tal1$label,
    design=~factor(block) + tomato,
    coef="tomatoTRUE",
    condition=summed.tal1$tomato)

# Inspecting our neural crest results again.
tab.neural.again <- res.tal1[["Neural crest"]]
head(tab.neural.again[order(tab.neural.again$PValue),], 10)
```

```
## DataFrame with 10 rows and 6 columns
##                   logFC    logCPM         F      PValue         FDR
##               <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Fdps           0.981419   7.21805  377.5448 9.53933e-16 1.01594e-11
## Msmo1          1.493789   5.43923  310.0759 1.17852e-14 6.27561e-11
## Hmgcs1         1.249854   5.70837  180.3105 2.35090e-12 8.34569e-09
## Idi1           1.173660   5.37688  164.6070 5.94334e-12 1.58241e-08
## Gt(ROSA)26Sor  1.481295   5.71617  369.1270 1.72586e-11 3.67609e-08
## Sox9           0.537554   7.17373  106.9820 4.12869e-10 7.32843e-07
## Insig1         1.257331   4.06887   90.9160 1.90366e-09 2.89628e-06
## Nkd1           0.719059   5.92690   98.2868 2.21584e-09 2.94984e-06
## Acat2          0.508867   6.80012   86.0689 3.15191e-09 3.72976e-06
## Fdft1          0.841061   5.32293   90.4714 4.18985e-09 4.46219e-06
##               OtherAverage
##                  <numeric>
## Fdps            -0.0910613
## Msmo1            0.0269792
## Hmgcs1          -0.0617747
## Idi1            -0.0820136
## Gt(ROSA)26Sor    0.5408970
## Sox9            -0.0424763
## Insig1          -0.2879371
## Nkd1             0.0333396
## Acat2           -0.0520665
## Fdft1            0.0336464
```

``` r
# By comparison, the hemoglobins are all the way at the bottom.
head(tab.neural.again[is.hbb,], 10)
```

```
## DataFrame with 8 rows and 6 columns
##             logFC    logCPM          F    PValue       FDR OtherAverage
##         <numeric> <numeric>  <numeric> <numeric> <numeric>    <numeric>
## Hbb-bt   -7.76723   1.33059    61.0321  1.000000  1.000000     -7.86818
## Hbb-bs   -5.84810   3.42835   269.7717  1.000000  1.000000     -8.15477
## Hbb-bh2        NA        NA         NA        NA        NA     -8.98450
## Hbb-bh1  -8.09104   9.15972 11467.7377  0.925765  1.000000     -8.08037
## Hbb-y    -8.41562   8.35705  8044.1641  0.420674  1.000000     -8.21261
## Hba-x    -7.72480   8.53284  8639.3978  0.083283  0.831009     -7.38501
## Hba-a1   -8.59667   6.74429  2946.9482  0.283210  1.000000     -8.05722
## Hba-a2   -8.86623   5.81300  1376.5242  0.260894  1.000000     -7.96513
```
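
The idea of ignoring genes that are significant in a majority of labels can also be sketched directly.
This is not the implementation of `pseudoBulkSpecific()`, only an illustration of the majority rule on a hypothetical matrix of per-label decisions (-1 for down, 1 for up, 0 for not significant, `NA` for untested); the 50% cutoff is an assumption.

``` r
# Hypothetical DE decisions: genes in rows, labels in columns.
toy.decisions <- matrix(c(
    -1, -1, -1, -1,   # Down in every label: likely contamination-driven.
     1,  0,  0,  0,   # Up in one label only: label-specific.
     0, NA, -1,  0),  # Not tested in one label.
    ncol=4, byrow=TRUE,
    dimnames=list(c("geneA", "geneB", "geneC"),
        c("labelA", "labelB", "labelC", "labelD")))

# Fraction of tested labels in which each gene is significant in each direction.
frac.up <- rowMeans(toy.decisions > 0, na.rm=TRUE)
frac.down <- rowMeans(toy.decisions < 0, na.rm=TRUE)

# Genes that change in the same direction in most labels are flagged as "global".
is.global <- pmax(frac.up, frac.down) > 0.5
is.global
```

Only `geneA` is flagged, as it is downregulated in all four labels.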



The common theme here is that, in the absence of an ambient profile, we are using all labels as a proxy for the ambient effect.
This can have unpredictable consequences as the results for each label are now dependent on the behavior of the entire dataset.
For example, the metrics are susceptible to the idiosyncrasies of clustering, where one cell type may be represented in multiple related clusters that distort cross-label summaries such as the average log-fold change in `OtherAverage`.
The metrics may also be invalidated in analyses of a subset of the data - for example, a subclustering analysis focusing on a particular cell type may mark all relevant DEGs as problematic because they are consistently DE in all subtypes.

## Subtracting ambient counts

It is worth commenting on the seductive idea of subtracting the ambient counts from the pseudo-bulk samples.
This may seem like the most obvious approach for removing ambient contamination, but unfortunately, subtracted counts have unpredictable statistical properties due to the distortion of the mean-variance relationship.
Minor relative fluctuations at very large counts become large fold-changes after subtraction, manifesting as spurious DE in genes where a substantial proportion of counts is derived from the ambient solution.
For example, several hemoglobin genes retain strong DE even after subtraction of the scaled ambient profile.


``` r
scaled.ambient <- ambientContribNegative(counts(summed.neural), ambient,
    features=is.hbb,  mode="profile")
subtracted <- counts(summed.neural) - scaled.ambient
subtracted <- round(subtracted)
subtracted[subtracted < 0] <- 0
subtracted[is.hbb,]
```

```
##         [,1] [,2] [,3] [,4]
## Hbb-bt     0    0    7   18
## Hbb-bs     1    2   31   42
## Hbb-bh2    0    0    0    0
## Hbb-bh1    2    0    0    0
## Hbb-y      0    0   39  107
## Hba-x      1    1    0    0
## Hba-a1     0    0  365  452
## Hba-a2     0    0  314  329
```
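
The arithmetic behind this effect is easy to reproduce with hypothetical numbers.
In the sketch below, a gene has a 2% difference between two samples at a large count, which is within the range of ordinary sampling noise; the scaled ambient estimate of 9950 is an assumption chosen for illustration.

``` r
# Hypothetical pseudo-bulk counts for one gene in two samples.
toy.counts <- c(sample1=10000, sample2=10200)

# Hypothetical scaled ambient contribution, accounting for most of the count.
toy.scaled.ambient <- c(sample1=9950, sample2=9950)

# Log-fold change before subtraction: negligible.
fc.before <- log2(toy.counts[["sample2"]] / toy.counts[["sample1"]])

# Log-fold change after subtraction: a five-fold difference.
toy.subtracted <- toy.counts - toy.scaled.ambient
fc.after <- log2(toy.subtracted[["sample2"]] / toy.subtracted[["sample1"]])

round(c(before=fc.before, after=fc.after), 3)
```

The absolute difference of 200 counts is unchanged by subtraction, but it is now measured against a baseline of 50 rather than 10000.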



Another tempting approach is to use interaction models to implicitly subtract the ambient effect during GLM fitting.
The assumption is that, for a genuine DEG, the log-fold change within cells is larger in magnitude than that in the ambient solution.
This is based on the expectation that any DE in the latter is "diluted" by contributions from cell types where that gene is not DE.
Unfortunately, this is not always the case; a DE analysis of the ambient counts indicates that the hemoglobin log-fold change is actually stronger in the neural crest cells compared to the ambient solution, which leads to the rather awkward conclusion that the WT neural crest cells are expressing hemoglobin beyond that explained by ambient contamination.
(This is probably an artifact of how cell calling is performed.)


``` r
library(edgeR)
y.ambient <- DGEList(ambient, samples=colData(summed.neural))
y.ambient <- y.ambient[filterByExpr(y.ambient, group=y.ambient$samples$tomato),]
y.ambient <- calcNormFactors(y.ambient)

design <- model.matrix(~factor(block) + tomato, y.ambient$samples)
y.ambient <- estimateDisp(y.ambient, design)
fit.ambient <- glmQLFit(y.ambient, design, robust=TRUE)
res.ambient <- glmQLFTest(fit.ambient, coef=ncol(design))

summary(decideTests(res.ambient))
```

```
##        tomatoTRUE
## Down         1856
## NotSig       7783
## Up           1599
```

``` r
topTags(res.ambient, n=10)
```

```
## Coefficient:  tomatoTRUE 
##          logFC logCPM     F    PValue       FDR
## Hbb-y   -5.267 12.803 14931 9.921e-45 1.115e-40
## Hbb-bh1 -5.075 13.725 13163 7.603e-44 3.679e-40
## Hba-x   -4.827 13.122 12956 9.820e-44 3.679e-40
## Hba-a1  -4.662 10.734 10531 2.789e-42 7.834e-39
## Hba-a2  -4.521  9.480  7864 3.105e-40 6.979e-37
## Blvrb   -4.319  7.649  3970 1.868e-35 3.498e-32
## Car2    -3.499  8.534  3893 2.556e-35 4.104e-32
## Xist    -4.376  7.484  3837 3.231e-35 4.539e-32
## Gypa    -5.138  7.213  3772 4.240e-35 5.295e-32
## Hbb-bs  -4.941  7.209  3482 1.531e-34 1.720e-31
```



<!--
(One possible explanation for this phenomenon is that erythrocyte fragments are present in the cell-containing libraries but are not used to estimate the ambient profile, presumably because the UMI counts are too high for fragment-containing libraries to be treated as empty.
Technically speaking, this is not incorrect as, after all, those libraries are not actually empty ([Advanced Section 7.2](http://bioconductor.org/books/3.23/OSCA.advanced/droplet-processing.html#qc-droplets)).
In effect, every cell in the WT sample is a fractional multiplet with partial erythrocyte identity from the included fragments, which results in stronger log-fold changes between genotypes for hemoglobin compared to those for the ambient solution.)
-->

In addition, there are other issues with implicit subtraction in the fitted GLM that warrant caution with its use.
This strategy precludes detection of DEGs that are common to all cell types as there is no longer a dilution effect being applied to the log-fold change in the ambient solution.
It requires inclusion of the ambient profiles in the model, which is cause for at least some concern as they are unlikely to have the same degree of variability as the cell-derived pseudo-bulk profiles.
Interpretation is also complicated by the fact that we are only interested in log-fold changes that are more extreme in the cells compared to the ambient solution; a non-zero interaction term is not sufficient for removing spurious DE.

<!--
Full interaction code, in case anyone's unconvinced.


``` r
s <- factor(rep(1:4, 2))
new.geno <- rep(rep(c("KO", "WT"), each=2), 2)
is.ambient <- rep(c("N", "Y"), each=4)
design.amb <- model.matrix(~0 + s + new.geno:is.ambient)

# Get to full rank:
design.amb <- design.amb[,!grepl("is.ambientY", colnames(design.amb))] 

# Syntactically valid colnames:
colnames(design.amb) <- make.names(colnames(design.amb)) 
design.amb
```


``` r
y.amb <- DGEList(cbind(counts(summed.neural), ambient))
y.amb <- y.amb[filterByExpr(y.amb, group=s),]
y.amb <- calcNormFactors(y.amb)
y.amb <- estimateDisp(y.amb, design.amb)
fit.amb <- glmQLFit(y.amb, design.amb, robust=TRUE)    

res.ko <- glmTreat(fit.amb, coef="new.genoKO.is.ambientN")
summary(decideTests(res.ko))
topTags(res.ko, n=10)

res.wt <- glmTreat(fit.amb, coef="new.genoWT.is.ambientN")
summary(decideTests(res.wt))
topTags(res.wt, n=10)

con <- makeContrasts(new.genoKO.is.ambientN - new.genoWT.is.ambientN, levels=design.amb)
res.amb <- glmTreat(fit.amb, contrast=con)
summary(decideTests(res.amb))
topTags(res.amb, n=10)
```


``` r
tab.exp <- res.exp$table
tab.amb <- res.amb$table
okay <- sign(tab.exp$logFC)==sign(tab.amb$logFC)
summary(okay)
iut.p <- pmax(tab.exp$PValue, tab.amb$PValue)
iut.p[!okay] <- 1
final <- data.frame(row.names=rownames(tab.exp),
    logFC=tab.exp$logFC, interaction=tab.amb$logFC,
    PValue=iut.p, FDR=p.adjust(iut.p, method="BH"))
final <- final[order(final$PValue),]
sum(final$FDR <= 0.05)
head(final, 10)
```
-->

See [Advanced Section 7.3](http://bioconductor.org/books/3.23/OSCA.advanced/droplet-processing.html#removing-ambient-contamination) for further comments on the removal of ambient contamination, mostly for visualization purposes.

## Session Info {-}

<button class="rebook-collapse">View session info</button>
<div class="rebook-content">
```
R Under development (unstable) (2025-10-20 r88955)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] edgeR_4.9.1                  limma_3.67.0                
 [3] DropletUtils_1.31.0          scran_1.39.0                
 [5] scuttle_1.21.0               MouseGastrulationData_1.25.0
 [7] SpatialExperiment_1.21.0     SingleCellExperiment_1.33.0 
 [9] SummarizedExperiment_1.41.0  Biobase_2.71.0              
[11] GenomicRanges_1.63.1         Seqinfo_1.1.0               
[13] IRanges_2.45.0               S4Vectors_0.49.0            
[15] BiocGenerics_0.57.0          generics_0.1.4              
[17] MatrixGenerics_1.23.0        matrixStats_1.5.0           
[19] BiocStyle_2.39.0             rebook_1.21.0               

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 httr2_1.2.2              
 [3] CodeDepends_0.6.6         rlang_1.1.6              
 [5] magrittr_2.0.4            otel_0.2.0               
 [7] compiler_4.6.0            RSQLite_2.4.5            
 [9] DelayedMatrixStats_1.33.0 dir.expiry_1.19.0        
[11] png_0.1-8                 vctrs_0.6.5              
[13] pkgconfig_2.0.3           crayon_1.5.3             
[15] fastmap_1.2.0             dbplyr_2.5.1             
[17] magick_2.9.0              XVector_0.51.0           
[19] rmarkdown_2.30            graph_1.89.1             
[21] purrr_1.2.0               bit_4.6.0                
[23] xfun_0.54                 bluster_1.21.0           
[25] cachem_1.1.0              beachmat_2.27.0          
[27] jsonlite_2.0.0            blob_1.2.4               
[29] rhdf5filters_1.23.3       DelayedArray_0.37.0      
[31] Rhdf5lib_1.33.0           BiocParallel_1.45.0      
[33] irlba_2.3.5.1             parallel_4.6.0           
[35] cluster_2.1.8.1           R6_2.6.1                 
[37] bslib_0.9.0               jquerylib_0.1.4          
[39] Rcpp_1.1.0.8.1            bookdown_0.46            
[41] knitr_1.50                R.utils_2.13.0           
[43] splines_4.6.0             Matrix_1.7-4             
[45] igraph_2.2.1              tidyselect_1.2.1         
[47] abind_1.4-8               yaml_2.3.12              
[49] codetools_0.2-20          curl_7.0.0               
[51] lattice_0.22-7            tibble_3.3.0             
[53] withr_3.0.2               KEGGREST_1.51.1          
[55] BumpyMatrix_1.19.0        evaluate_1.0.5           
[57] BiocFileCache_3.1.0       ExperimentHub_3.1.0      
[59] Biostrings_2.79.2         pillar_1.11.1            
[61] BiocManager_1.30.27       filelock_1.0.3           
[63] BiocVersion_3.23.1        sparseMatrixStats_1.23.0 
[65] glue_1.8.0                metapod_1.19.1           
[67] tools_4.6.0               AnnotationHub_4.1.0      
[69] BiocNeighbors_2.5.0       ScaledMatrix_1.19.0      
[71] locfit_1.5-9.12           XML_3.99-0.20            
[73] rhdf5_2.55.12             grid_4.6.0               
[75] AnnotationDbi_1.73.0      HDF5Array_1.39.0         
[77] BiocSingular_1.27.1       cli_3.6.5                
[79] rsvd_1.0.5                rappdirs_0.3.3           
[81] S4Arrays_1.11.1           dplyr_1.1.4              
[83] R.methodsS3_1.8.2         sass_0.4.10              
[85] digest_0.6.39             SparseArray_1.11.9       
[87] dqrng_0.4.1               rjson_0.2.23             
[89] R.oo_1.27.1               memoise_2.0.1            
[91] htmltools_0.5.9           lifecycle_1.0.4          
[93] h5mread_1.3.1             httr_1.4.7               
[95] statmod_1.5.1             bit64_4.6.0-1            
```
</div>
