# HCA human bone marrow (10X Genomics) 

<script>
document.addEventListener("click", function (event) {
    if (event.target.classList.contains("rebook-collapse")) {
        event.target.classList.toggle("active");
        var content = event.target.nextElementSibling;
        if (content.style.display === "block") {
            content.style.display = "none";
        } else {
            content.style.display = "block";
        }
    }
})
</script>

<style>
.rebook-collapse {
  background-color: #eee;
  color: #444;
  cursor: pointer;
  padding: 18px;
  width: 100%;
  border: none;
  text-align: left;
  outline: none;
  font-size: 15px;
}

.rebook-content {
  padding: 0 18px;
  display: none;
  overflow: hidden;
  background-color: #f1f1f1;
}
</style>

## Introduction

Here, we use an example dataset from the [Human Cell Atlas immune cell profiling project on bone marrow](https://preview.data.humancellatlas.org), which contains scRNA-seq data for 380,000 cells generated using the 10X Genomics technology.
This is a fairly big dataset that represents a good use case for the techniques in [Advanced Chapter 14](http://bioconductor.org/books/3.23/OSCA.advanced/dealing-with-big-data.html#dealing-with-big-data).

## Data loading

This dataset is loaded via the *[HCAData](https://bioconductor.org/packages/3.23/HCAData)* package, which provides a ready-to-use `SingleCellExperiment` object. 


``` r
library(HCAData)
sce.bone <- HCAData('ica_bone_marrow', as.sparse=TRUE)
sce.bone$Donor <- sub("_.*", "", sce.bone$Barcode)
```

We use symbols in place of IDs for easier interpretation later.


``` r
library(EnsDb.Hsapiens.v86)
rowData(sce.bone)$Chr <- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce.bone),
    column="SEQNAME", keytype="GENEID")

library(scater)
rownames(sce.bone) <- uniquifyFeatureNames(rowData(sce.bone)$ID,
    names = rowData(sce.bone)$Symbol)
```

## Quality control

Cell calling was not performed (see [here](https://s3.amazonaws.com/preview-ica-expression-data/Brief+ICA+Read+Me.pdf)) so we will perform QC using all metrics and block on the donor of origin during outlier detection.
We perform the calculation across multiple cores to speed things up.


``` r
library(BiocParallel)
bpp <- MulticoreParam(8)
sce.bone <- unfiltered <- addPerCellQC(sce.bone, BPPARAM=bpp,
    subsets=list(Mito=which(rowData(sce.bone)$Chr=="MT")))

qc <- quickPerCellQC(colData(sce.bone), batch=sce.bone$Donor,
    sub.fields="subsets_Mito_percent")
sce.bone <- sce.bone[,!qc$discard]
```


``` r
unfiltered$discard <- qc$discard

gridExtra::grid.arrange(
    plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, x="Donor", y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
    ncol=2
)
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-qc-1.png" alt="Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded." width="672" />
<p class="caption">(\#fig:unref-hca-bone-qc)Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded.</p>
</div>


``` r
plotColData(unfiltered, x="sum", y="subsets_Mito_percent", 
    colour_by="discard") + scale_x_log10()
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-mito-1.png" alt="Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded." width="672" />
<p class="caption">(\#fig:unref-hca-bone-mito)Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded.</p>
</div>

## Normalization

For a minor speed-up, we use already-computed library sizes rather than re-computing them from the column sums.


``` r
sce.bone <- logNormCounts(sce.bone, size_factors = sce.bone$sum)
```


``` r
summary(sizeFactors(sce.bone))
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0489  0.4699  0.6479  1.0000  0.8893 42.3813
```

## Variance modeling

We block on the donor of origin to mitigate batch effects during HVG selection.
We select a larger number of HVGs to capture any batch-specific variation that might be present.


``` r
library(scran)
set.seed(1010010101)
dec.bone <- modelGeneVarByPoisson(sce.bone, 
    block=sce.bone$Donor, BPPARAM=bpp)
top.bone <- getTopHVGs(dec.bone, n=5000)
```


``` r
par(mfrow=c(4,2))
blocked.stats <- dec.bone$per.block
for (i in colnames(blocked.stats)) {
    current <- blocked.stats[[i]]
    plot(current$mean, current$total, main=i, pch=16, cex=0.5,
        xlab="Mean of log-expression", ylab="Variance of log-expression")
    curfit <- metadata(current)
    curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-var-1.png" alt="Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances." width="576" />
<p class="caption">(\#fig:unref-hca-bone-var)Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances.</p>
</div>

## Data integration

Here we use multiple cores, randomized SVD^[The randomized SVD may give slightly different results on different systems, so the MNN-corrected values may themselves vary across systems.] and approximate nearest-neighbor detection to speed up this step.


``` r
library(batchelor)
library(BiocNeighbors)

set.seed(1010001)
merged.bone <- fastMNN(sce.bone, batch = sce.bone$Donor, subset.row = top.bone,
     BSPARAM=BiocSingular::RandomParam(deferred = TRUE), 
     BNPARAM=AnnoyParam(),
     BPPARAM=bpp)

reducedDim(sce.bone, 'MNN') <- reducedDim(merged.bone, 'corrected')
```

We use the percentage of variance lost as a diagnostic measure:


``` r
metadata(merged.bone)$merge.info$lost.var
```

```
##      MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7
## [1,]  0.007133  0.006508  0.000000  0.000000  0.000000  0.000000  0.000000
## [2,]  0.006314  0.006883  0.023528  0.000000  0.000000  0.000000  0.000000
## [3,]  0.005117  0.003096  0.005115  0.019703  0.000000  0.000000  0.000000
## [4,]  0.001991  0.001888  0.001890  0.001766  0.023451  0.000000  0.000000
## [5,]  0.002391  0.001914  0.001735  0.002805  0.002563  0.023692  0.000000
## [6,]  0.003053  0.003180  0.002958  0.002522  0.003211  0.003342  0.024807
## [7,]  0.001826  0.001591  0.002290  0.001881  0.001473  0.002174  0.001908
##      MantonBM8
## [1,]   0.00000
## [2,]   0.00000
## [3,]   0.00000
## [4,]   0.00000
## [5,]   0.00000
## [6,]   0.00000
## [7,]   0.03235
```

## Dimensionality reduction

We set `external_neighbors=TRUE` to replace the internal nearest neighbor search in the UMAP implementation with our parallelized approximate search.
We also set the number of threads to be used in the UMAP iterations.


``` r
set.seed(01010100)
sce.bone <- runUMAP(sce.bone, dimred="MNN",
    external_neighbors=TRUE, 
    BNPARAM=AnnoyParam(),
    BPPARAM=bpp,
    n_threads=bpnworkers(bpp))
```

## Clustering

Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with $k$-means.
We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method.
If more resolution is required, we can increase `centers` in addition to using a lower `k` during graph construction.


``` r
library(bluster)

set.seed(1000)
colLabels(sce.bone) <- clusterRows(reducedDim(sce.bone, "MNN"),
    TwoStepParam(KmeansParam(centers=1000), NNGraphParam(k=5)))

table(colLabels(sce.bone))
```

```
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 18859 15812 36360 47699 26528 10869 65650 18584 35321  8009 14930  3601  4206 
##    14    15    16 
##  3155  4824  2318
```

We observe mostly balanced contributions from different samples to each cluster (Figure \@ref(fig:unref-hca-bone-ab)), consistent with the expectation that all samples are replicates from different donors.


``` r
tab <- table(Cluster=colLabels(sce.bone), Donor=sce.bone$Donor)
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100))
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-ab-1.png" alt="Heatmap of log~10~-number of cells in each cluster (row) from each sample (column)." width="672" />
<p class="caption">(\#fig:unref-hca-bone-ab)Heatmap of log~10~-number of cells in each cluster (row) from each sample (column).</p>
</div>




``` r
# TODO: add scrambling option in scater's plotting functions.
scrambled <- sample(ncol(sce.bone))

gridExtra::grid.arrange(
    plotUMAP(sce.bone, colour_by="label", text_by="label"),
    plotUMAP(sce.bone[,scrambled], colour_by="Donor")
)
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-umap-1.png" alt="UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom)." width="672" />
<p class="caption">(\#fig:unref-hca-bone-umap)UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom).</p>
</div>

## Differential expression

We identify marker genes for each cluster while blocking on the donor.


``` r
markers.bone <- findMarkers(sce.bone, block = sce.bone$Donor, 
    direction = 'up', lfc = 1, BPPARAM=bpp)
```

We visualize the top markers for a randomly chosen cluster^[The exact cluster chosen varies across systems due to the MNN-corrected values themselves varying across systems.] using a heatmap in Figure \@ref(fig:unref-hca-bone-heatmap).
The presence of upregulated genes like _LYZ_, _S100A8_ and _VCAN_ is consistent with a monocyte identity for this cluster.




``` r
top.markers <- markers.bone[[cluster.choice]]
best <- top.markers[top.markers$Top <= 10,]
lfcs <- getMarkerEffects(best)

library(pheatmap)
pheatmap(lfcs, breaks=seq(-5, 5, length.out=101))
```

<div class="figure">
<img src="hca-bone-marrow_files/figure-html/unref-hca-bone-heatmap-1.png" alt="Heatmap of log~2~-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns)." width="672" />
<p class="caption">(\#fig:unref-hca-bone-heatmap)Heatmap of log~2~-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns).</p>
</div>



## Cell type classification

We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile. 
This is faster than the per-cell approaches described in Chapter \@ref(cell-type-annotation) at the cost of the resolution required to detect heterogeneity inside a cluster.
Nonetheless, it is often sufficient for a quick assignment of cluster identity, and indeed, cluster 4 is also identified as consisting of monocytes from this analysis.


``` r
se.aggregated <- sumCountsAcrossCells(sce.bone, id=colLabels(sce.bone), BPPARAM=bpp)

library(celldex)
hpc <- HumanPrimaryCellAtlasData()

library(SingleR)
anno.single <- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main,
    assay.type.test="sum")
anno.single
```

```
## DataFrame with 16 rows and 4 columns
##                             scores      labels delta.next pruned.labels
##                           <matrix> <character>  <numeric>   <character>
## 1   0.384401:0.751148:0.651234:...         GMP  0.0913786           GMP
## 2   0.343557:0.567261:0.479100:...     T_cells  0.4298632       T_cells
## 3   0.323043:0.647364:0.558334:...     T_cells  0.0959201       T_cells
## 4   0.299294:0.745584:0.535751:...    Monocyte  0.2935059      Monocyte
## 5   0.310761:0.672644:0.540285:...      B_cell  0.6024293        B_cell
## ...                            ...         ...        ...           ...
## 12  0.294203:0.707235:0.528198:...    Monocyte  0.3586359      Monocyte
## 13  0.343741:0.731258:0.600058:...    Monocyte  0.1019188            NA
## 14  0.369798:0.652467:0.582201:...      B_cell  0.1976631            NA
## 15  0.378580:0.690882:0.781190:...         MEP  0.0614135           MEP
## 16  0.333963:0.679341:0.559147:...         GMP  0.1114087           GMP
```



<!-- ```{r saveRDS, include=FALSE} -->
<!-- ## TEMP FIX -->
<!-- saveRDS(sce.bone, 'sce.bone.rds') -->
<!-- saveRDS(markers.bone, 'markers.bone.rds') -->

<!-- sce.bone <- readRDS('sce.bone.rds') -->
<!-- markers.bone <- readRDS('markers.bone.rds') -->
<!-- ``` -->

<!--
## Gene set Enrichment Analysis

To interpret differential expression (DE) results, gene set enrichment analysis (GSEA) can be performed to annotate DE results. The first step is to acquire a mapping of ENTREZID to gene symbol.


``` r
## Get mappings of ENTREZID to Symbol
library(org.Hs.eg.db)
keys_entrez <- keys(org.Hs.eg.db, 'ENTREZID')
mapping_es <- AnnotationDbi::select(org.Hs.eg.db,
                                    keys = keys_entrez,
                                    columns = c('ENTREZID', 'SYMBOL'),
                                    keytype = 'ENTREZID')
mapping_es$ENTREZID <- as.integer(mapping_es$ENTREZID)
```

The next is to parse the gene sets (here we use a "tidy" representation of MSigDB gene sets) and remap them.


``` r
## Get pathways of interest - convert to list with symbol
## devtools::install_github('stephenturner/msigdf') # "tidy" MSigDB genesets
library(msigdf)
library(dplyr)
mdb <- dplyr::inner_join(msigdf.human, mapping_es,
                         by = c('entrez' = 'ENTREZID')) %>%
    dplyr::filter(collection == 'c7') %>%
    dplyr::select(-collection, -entrez) %>%
    dplyr::group_nest(geneset)
pathways <- purrr::map(mdb$data, function(x) {
    as.character(x$SYMBOL)
})
names(pathways) <- mdb$geneset
```

Finally, the DE results from the `findMarkers()` function are used to test for differential gene set enrichment using the `fgsea` package. Here we compare cluster 3 against cluster 15, which are at opposite ends of a UMAP group.


``` r
## Get stats based on markers search
## Compare cluster 3 (Dendritic cells) and 15 (HSCs)
stats <- markers.bone[[3]]$logFC.15
names(stats) <- rownames(markers.bone[[3]])

## Run fast gene set enrichment analysis
## Returns a data.table object
library(fgsea)
gse <- fgsea(pathways = pathways, 
             stats = stats,
             minSize = 15,
             maxSize = 500,
             nperm = 5000)
```

The enrichment scores of the top up and down pathways can then be plotted as a table:


``` r
## Get top up and down pathways based on NES
gse.up <- gse[order(gse$NES, decreasing = TRUE)[1:2], ]$pathway
gse.dn <- gse[order(gse$NES)[1:2], ]$pathway
topPathways <- c(gse.up, rev(gse.dn))

## Plot the GSEA table
plotGseaTable(pathways[topPathways], stats,
              gse, 
              gseaParam = 0.5)
```

Or select pathways can also be plotted as a more traditional running score plot:


``` r
## Traditional GSEA plot
plotEnrichment(pathways[[topPathways[1]]],
               stats)
```

### Trajectory analysis

Slingshot relies on providing a set of clusters on which to construct a trajectory.


``` r
library(slingshot)

## Clusters to attempt trajectory on:
slc.clusters <- c(15, 22, 11, 7, 1, 3)

## Subsample dataset for speed
slc.dat <- sce.bone[, sce.bone$cluster %in% slc.clusters]
slc.subdat <- slc.dat[, sample(ncol(slc.dat), 20000)]

## Run slingshot
slc <- slingshot(slc.subdat,
                 clusterLabels = 'cluster',
                 reducedDim = 'UMAP')
```

The calculated principal curve can then be visualized on the UMAP space.


``` r
## Slingshot trajectory plot
library(RColorBrewer)
colors <- colorRampPalette(brewer.pal(11, 'Spectral')[-6])(100)
plotcol <- colors[cut(slc$slingPseudotime_1, breaks = 100)]
plot(reducedDims(slc)$UMAP, col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(slc), lwd = 2, col = 'black')
```
-->

## Session Info {-}

<button class="rebook-collapse">View session info</button>
<div class="rebook-content">
```
R Under development (unstable) (2025-10-20 r88955)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleR_2.13.0              celldex_1.21.0             
 [3] pheatmap_1.0.13             bluster_1.21.0             
 [5] BiocNeighbors_2.5.0         batchelor_1.27.0           
 [7] scran_1.39.0                BiocParallel_1.45.0        
 [9] scater_1.39.0               ggplot2_4.0.1              
[11] scuttle_1.21.0              EnsDb.Hsapiens.v86_2.99.0  
[13] ensembldb_2.35.0            AnnotationFilter_1.35.0    
[15] GenomicFeatures_1.63.1      AnnotationDbi_1.73.0       
[17] rhdf5_2.55.12               HCAData_1.27.0             
[19] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
[21] Biobase_2.71.0              GenomicRanges_1.63.1       
[23] Seqinfo_1.1.0               IRanges_2.45.0             
[25] S4Vectors_0.49.0            BiocGenerics_0.57.0        
[27] generics_0.1.4              MatrixGenerics_1.23.0      
[29] matrixStats_1.5.0           BiocStyle_2.39.0           
[31] rebook_1.21.0              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3        jsonlite_2.0.0           
  [3] CodeDepends_0.6.6         magrittr_2.0.4           
  [5] gypsum_1.7.0              ggbeeswarm_0.7.3         
  [7] farver_2.1.2              rmarkdown_2.30           
  [9] BiocIO_1.21.0             vctrs_0.6.5              
 [11] DelayedMatrixStats_1.33.0 memoise_2.0.1            
 [13] Rsamtools_2.27.0          RCurl_1.98-1.17          
 [15] htmltools_0.5.9           S4Arrays_1.11.1          
 [17] AnnotationHub_4.1.0       curl_7.0.0               
 [19] Rhdf5lib_1.33.0           SparseArray_1.11.9       
 [21] alabaster.base_1.11.1     sass_0.4.10              
 [23] bslib_0.9.0               httr2_1.2.2              
 [25] cachem_1.1.0              ResidualMatrix_1.21.0    
 [27] GenomicAlignments_1.47.0  igraph_2.2.1             
 [29] lifecycle_1.0.4           pkgconfig_2.0.3          
 [31] rsvd_1.0.5                Matrix_1.7-4             
 [33] R6_2.6.1                  fastmap_1.2.0            
 [35] digest_0.6.39             RSpectra_0.16-2          
 [37] dqrng_0.4.1               irlba_2.3.5.1            
 [39] ExperimentHub_3.1.0       RSQLite_2.4.5            
 [41] beachmat_2.27.0           labeling_0.4.3           
 [43] filelock_1.0.3            httr_1.4.7               
 [45] abind_1.4-8               compiler_4.6.0           
 [47] bit64_4.6.0-1             withr_3.0.2              
 [49] S7_0.2.1                  viridis_0.6.5            
 [51] DBI_1.2.3                 alabaster.ranges_1.11.0  
 [53] alabaster.schemas_1.11.0  HDF5Array_1.39.0         
 [55] rappdirs_0.3.3            DelayedArray_0.37.0      
 [57] rjson_0.2.23              tools_4.6.0              
 [59] vipor_0.4.7               otel_0.2.0               
 [61] beeswarm_0.4.0            glue_1.8.0               
 [63] h5mread_1.3.1             restfulr_0.0.16          
 [65] rhdf5filters_1.23.3       grid_4.6.0               
 [67] cluster_2.1.8.1           gtable_0.3.6             
 [69] metapod_1.19.1            BiocSingular_1.27.1      
 [71] ScaledMatrix_1.19.0       XVector_0.51.0           
 [73] ggrepel_0.9.6             BiocVersion_3.23.1       
 [75] pillar_1.11.1             limma_3.67.0             
 [77] dplyr_1.1.4               BiocFileCache_3.1.0      
 [79] lattice_0.22-7            rtracklayer_1.71.2       
 [81] bit_4.6.0                 tidyselect_1.2.1         
 [83] locfit_1.5-9.12           Biostrings_2.79.2        
 [85] knitr_1.50                gridExtra_2.3            
 [87] bookdown_0.46             ProtGenerics_1.43.0      
 [89] edgeR_4.9.1               xfun_0.54                
 [91] statmod_1.5.1             UCSC.utils_1.7.1         
 [93] lazyeval_0.2.2            yaml_2.3.12              
 [95] evaluate_1.0.5            codetools_0.2-20         
 [97] cigarillo_1.1.0           tibble_3.3.0             
 [99] alabaster.matrix_1.11.0   BiocManager_1.30.27      
[101] graph_1.89.1              cli_3.6.5                
[103] uwot_0.2.4                jquerylib_0.1.4          
[105] dichromat_2.0-0.1         Rcpp_1.1.0.8.1           
[107] GenomeInfoDb_1.47.2       dir.expiry_1.19.0        
[109] dbplyr_2.5.1              png_0.1-8                
[111] XML_3.99-0.20             parallel_4.6.0           
[113] blob_1.2.4                sparseMatrixStats_1.23.0 
[115] bitops_1.0-9              alabaster.se_1.11.0      
[117] viridisLite_0.4.2         scales_1.4.0             
[119] purrr_1.2.0               crayon_1.5.3             
[121] rlang_1.1.6               cowplot_1.2.0            
[123] KEGGREST_1.51.1          
```
</div>
