% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fdr_calculations.R
\name{calculate_and_add_fdr}
\alias{calculate_and_add_fdr}
\title{Calculate and add gene occupancy FDR}
\usage{
calculate_and_add_fdr(
  binding_data,
  occupancy_df,
  fdr_iterations = 50000,
  use_legacy_pvals = FALSE,
  BPPARAM = BiocParallel::bpparam(),
  seed = NULL
)
}
\arguments{
\item{binding_data}{A `GRanges` object of binding profiles, where metadata
columns represent samples. This is typically the `binding_profiles_data`
element from the list returned by `load_data_genes`.}

\item{occupancy_df}{A data frame of gene occupancies, typically the `occupancy`
element from the list returned by `load_data_genes`. It must contain
sample columns and a `nfrags` column.}

\item{fdr_iterations}{Integer. The number of sampling iterations used to build
the FDR null model. Higher values give a more accurate null model at the cost
of longer run time. Default is 50000. For quick tests, a much lower value
(e.g., 100) can be used.}

\item{use_legacy_pvals}{Logical. If `TRUE`, the unadjusted p-values are returned
in the FDR columns, matching the original Southall et al. (2013) paper, which
reported unadjusted p-values as 'FDR' scores. If `FALSE` (default), BH correction
is applied to return the more useful and expected corrected FDR scores for all
genes.}

\item{BPPARAM}{A `BiocParallelParam` instance specifying the parallel backend
to use for computation. See `BiocParallel::bpparam()`. For full
reproducibility, especially in examples or tests, use
`BiocParallel::SerialParam()`.}

\item{seed}{An optional integer. If provided, it is used to set the random
seed before the sampling process begins, ensuring that the FDR calculations
are fully reproducible. If `NULL` (default), results will vary between runs.}
}
\value{
An updated version of the input `occupancy_df` data frame, with new columns
appended. For each sample column (e.g., `SampleA`), a corresponding FDR
column (e.g., `SampleA_FDR`) is added.
}
\description{
This function calculates a False Discovery Rate (FDR) for gene occupancy scores.
In RNA Polymerase TaDa experiments, this FDR is often used as a proxy for
whether a gene is actively expressed. The function builds a sampling-based null
model from each sample's binding profile to determine empirical p-values,
applies the Benjamini-Hochberg (BH) adjustment, and adds the resulting FDR
values as new columns to the occupancy data frame.
}
\details{
The FDR calculation algorithm is refined and adapted from the method described in
Southall et al. (2013), Dev Cell, 26(1), 101-12, and re-implemented in the
`polii.gene.call` tool. It operates in several stages:
\enumerate{
  \item For each sample, a null distribution of mean occupancy scores is
    simulated by repeatedly sampling random fragments from the genome-wide
    binding profile. This is done for various numbers of fragments.
  \item A two-tiered regression model is fitted to the simulation results. This
    creates a statistical model that can predict the empirical p-value for any given
    occupancy score and fragment count.
  \item This predictive model is then applied to the actual observed mean
    occupancy and fragment count for each gene in the `occupancy_df`.
  \item The final set of p-values is adjusted using BH correction to generate
    FDR scores for each gene.
  \item The FDR value for each gene in each sample is appended to the
    `occupancy_df` in a new column.
}

Key differences from the original Southall et al. algorithm include fitting
linear models and sampling with replacement to generate the underlying null
distribution.

This function is typically not called directly by the user, but is instead
invoked via `load_data_genes(calculate_fdr = TRUE)`. Exporting it allows for
advanced use cases where FDR needs to be calculated on a pre-existing
occupancy table.
}
\examples{
if (requireNamespace("BiocParallel", quietly = TRUE)) {
    # Prepare sample binding data (GRanges object)
    # Here, we'll load one of the sample files included with the package
    # (this is a TF binding profile, which would not normally be used for
    # occupancy calculations)
    data_dir <- system.file("extdata", package = "damidBind")
    bgraph_file <- file.path(
        data_dir,
        "Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm.gatc.2L.bedgraph.gz"
    )

    # The internal function `import_bedgraph_as_df` is used here for convenience.
    # The column name 'score' will be replaced with the sample name.
    binding_df <- damidBind:::import_bedgraph_as_df(bgraph_file, colname = "Sample_1")

    binding_gr <- GenomicRanges::GRanges(
        seqnames = binding_df$chr,
        ranges = IRanges::IRanges(start = binding_df$start, end = binding_df$end),
        Sample_1 = binding_df$Sample_1
    )

    # Create a mock occupancy data frame.
    # In a real analysis, this would be generated by `calculate_occupancy()`.
    mock_occupancy <- data.frame(
        name = c("geneA", "geneB", "geneC"),
        nfrags = c(5, 10, 2),
        Sample_1 = c(1.5, 0.8, -0.2),
        row.names = c("geneA", "geneB", "geneC")
    )

    # Calculate FDR.
    # We use a low fdr_iterations for speed, and a seed for reproducibility.
    # BiocParallel::SerialParam() is used to ensure deterministic, single-core execution.
    occupancy_with_fdr <- calculate_and_add_fdr(
        binding_data = binding_gr,
        occupancy_df = mock_occupancy,
        fdr_iterations = 100,
        BPPARAM = BiocParallel::SerialParam(),
        seed = 42
    )

    # View the result, which now includes an FDR column for Sample_1.
    print(occupancy_with_fdr)
}
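
# A hedged, base-R sketch of the idea outlined in Details: build a null
# distribution by sampling fragment scores with replacement, derive an
# empirical p-value, then apply BH adjustment. This is an illustration only,
# not the package's implementation: the package fits regression models to the
# simulation results, whereas this sketch compares against the raw null values.
# All names and values below (toy_scores, toy_null, toy_observed, toy_pvals)
# are hypothetical, and the one-sided comparison is an assumption.
set.seed(1)
toy_scores <- rnorm(1000) # stand-in for genome-wide fragment scores
toy_null <- replicate(
    200,
    mean(sample(toy_scores, size = 5, replace = TRUE))
) # null mean occupancies for a gene spanning 5 fragments
toy_observed <- 0.8
toy_p <- mean(toy_null >= toy_observed) # empirical p-value for one gene
print(toy_p)

# BH adjustment of a made-up vector of p-values using stats::p.adjust().
toy_pvals <- c(geneA = 0.001, geneB = 0.04, geneC = 0.5)
toy_fdr <- stats::p.adjust(toy_pvals, method = "BH")
print(toy_fdr) # geneA = 0.003, geneB = 0.06, geneC = 0.5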

}
\seealso{
[load_data_genes()] which uses this function internally.
}
