\name{callVariants}
\alias{callVariantsPaired}
\alias{vcConfParams}
\alias{callDeletionsPaired}
\title{Variant calling}
\description{
These functions implement simple paired (case / control) calling of
single nucleotide variants and deletions from tally data.
}
\usage{
callVariantsPaired( data, sampledata, cl = vcConfParams() )

vcConfParams(
  minStrandCov = 5,
  maxStrandCov = 200,
  minStrandAltSupport = 2,
  maxStrandAltSupportControl = 0,
  minStrandDelSupport = minStrandAltSupport,
  maxStrandDelSupportControl = maxStrandAltSupportControl,
  minStrandInsSupport = minStrandAltSupport,
  maxStrandInsSupportControl = maxStrandAltSupportControl,
  minStrandCovControl = 5,
  maxStrandCovControl = 200,
  bases = 5:8,
  returnDataPoints = TRUE,
  annotateWithBackground = TRUE,
  mergeCalls = TRUE,
  mergeAggregator = mean,
  pValueAggregator = max
)
}
\arguments{
\item{data}{A \code{list} with elements \code{Counts} (a 4d
  \code{integer} array of size [1:12, 1:2, 1:k, 1:n]), 
  \code{Coverages} (a 3d \code{integer} array of size [1:2, 1:k, 1:n]),
  \code{Deletions} (a 3d \code{integer} array of size [1:2, 1:k, 1:n]),
  \code{Reference} (a 1d \code{integer} vector of size [1:n]) -- see Details.}
\item{sampledata}{A \code{data.frame} with \code{k} rows (one for each
  sample) and columns \code{Type}, \code{Column} and (\code{SampleGroup}
  or \code{Patient}). The tally file should contain this information as
  a group attribute, see \code{getSampleData} for an example.}
\item{cl}{A list with parameters used by the variant calling
  functions. Such a list can be produced, for instance, by a call to
  \code{vcConfParams}.}
\item{minStrandCov}{Minimum coverage per strand in the case sample.}
\item{maxStrandCov}{Maximum coverage per strand in the case sample.}
\item{minStrandCovControl}{Minimum coverage per strand in the control sample.}
\item{maxStrandCovControl}{Maximum coverage per strand in the control sample.}
\item{minStrandAltSupport}{Minimum support for the alternative allele
  per strand in the case sample. This should be 1 or higher.}
\item{maxStrandAltSupportControl}{Maximum support for the alternative allele
  per strand in the control sample. This should usually be 0.}
\item{minStrandDelSupport}{Minimum support for the deletion
  per strand in the case sample. This should be 1 or higher.}
\item{maxStrandDelSupportControl}{Maximum support for the deletion
  per strand in the control sample. This should usually be 0.}
\item{minStrandInsSupport}{Minimum support for the insertion
  per strand in the case sample. This should be 1 or higher.}
\item{maxStrandInsSupportControl}{Maximum support for the insertion
  per strand in the control sample. This should usually be 0.}
\item{bases}{Indices used to subset the bases dimension of the
  \code{Counts} array. The default \code{5:8} extracts only those calls
  made in the middle sequencing cycle bin.}
\item{returnDataPoints}{Boolean flag to specify that a \code{data.frame}
  with the variant calls should be returned. If
  \code{returnDataPoints == FALSE}, only the variant positions are
  returned (see Value).}
\item{annotateWithBackground}{Boolean flag to specify that the
  background mismatch / deletion frequency estimated from all control
  samples in the cohort should be added to the output. A simple binomial
  test will be performed as well. Only useful if
  \code{returnDataPoints == TRUE}.}
\item{mergeCalls}{Boolean flag to specify that adjacent calls should be
  merged where appropriate (used by \code{callDeletionsPaired}).
  Only applied if \code{returnDataPoints == TRUE}.}
\item{mergeAggregator}{Aggregator function for merging adjacent calls.
  Defaults to \code{mean}, which means that a deletion larger than 1bp
  will be annotated with the means of the counts and coverages.}
\item{pValueAggregator}{Aggregator function for combining the p-values
  of adjacent calls when merging. Defaults to \code{max}. Only
  applied if \code{annotateWithBackground == TRUE}.}
}
\details{

  \code{data} is a list of datasets that must contain at least the
  \code{Counts} and \code{Coverages} datasets for variant calling, or
  the \code{Deletions} dataset for deletion calling. This list is usually
  generated by a call to the \code{h5dapply} function, in which the tally
  file, chromosome, datasets and regions within the datasets are
  specified. See \code{?h5dapply} for specifics. For
  \code{callVariantsPaired} to return the correct locations of the
  variants, \code{data} must also contain the \code{h5dapplyInfo} slot.
  This slot is itself a list (added automatically by \code{h5dapply} and
  \code{h5readBlock}) and contains the slots \code{Group} (location in
  the HDF5 file) and \code{Blockstart}, which are used to set the
  chromosome and the genomic positions of variants.
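
  As a rough illustration of the layout described above, the following
  sketch builds an empty \code{data} list by hand. The sizes \code{k}
  and \code{n}, the \code{Group} path and the \code{Blockstart} value are
  placeholders chosen for this example; real input would normally come
  from \code{h5dapply}.

\preformatted{
k <- 2L   # number of samples (placeholder)
n <- 10L  # number of genomic positions in the block (placeholder)
data <- list(
  Counts    = array(0L, dim = c(12L, 2L, k, n)),  # [1:12, 1:2, 1:k, 1:n]
  Coverages = array(0L, dim = c(2L, k, n)),       # [1:2, 1:k, 1:n]
  Deletions = array(0L, dim = c(2L, k, n)),       # [1:2, 1:k, 1:n]
  Reference = integer(n),                         # [1:n]
  # assumed shape of the slot that h5dapply adds automatically
  h5dapplyInfo = list(Group = "/ExampleStudy/16", Blockstart = 29979129L)
)
str(data, max.level = 1)
}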

  \code{vcConfParams} is a helper function that builds a set of variant
  calling parameters as a list. This list is provided to the calling
  functions, e.g. \code{callVariantsPaired}, and influences their behavior.
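
  A minimal sketch of overriding selected defaults follows. It assumes
  that the returned list uses the argument names from the usage section
  as element names, which is not stated explicitly here.

\preformatted{
library(h5vc)
# stricter support requirement in the case sample, no background annotation
cl <- vcConfParams(minStrandAltSupport = 4, annotateWithBackground = FALSE)
# minStrandDelSupport and minStrandInsSupport default to minStrandAltSupport
str(cl[c("minStrandAltSupport", "minStrandDelSupport", "minStrandInsSupport")])
}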

  \code{callVariantsPaired} implements a simple pairwise variant
  calling approach that applies the filters specified in \code{cl}. It
  can additionally compute an estimate of the background mismatch
  rate (the mean mismatch rate of all samples labeled as 'Control' in
  the \code{sampledata}) and annotate the calls with p-values from a
  \code{binom.test} of the observed mismatch counts and coverage in each
  of the samples labeled as 'Case'.
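
  The background test can be reproduced with base R. The numbers below
  are invented for illustration and are not taken from the example data.

\preformatted{
caseCountFwd           <- 6     # reads supporting the mismatch (made up)
caseCoverageFwd        <- 40    # coverage on the same strand (made up)
backgroundFrequencyFwd <- 0.01  # mean mismatch rate of the controls (made up)
# one-sided test: is the mismatch rate in the case above the background?
pValueFwd <- binom.test(caseCountFwd, caseCoverageFwd,
                        p = backgroundFrequencyFwd,
                        alternative = "greater")$p.value
pValueFwd
}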

}
\value{
The result is either a list of positions with SNVs / deletions or a
  \code{data.frame} containing the calls themselves which might contain
  annotations. Adjacent calls might be merged and calls might be
  annotated with p-values depending on configuration parameters.
  
  When the configuration parameter \code{returnDataPoints} is \code{FALSE}, the functions return the positions of potential variants as a list containing one integer vector of positions for each sample. If no positions were found for a sample, the list will contain \code{NULL} instead. When \code{returnDataPoints == TRUE}, the functions return either \code{NULL} if no positions were found or a \code{data.frame} with the following columns:
  \item{Chrom}{The chromosome the potential variant / deletion is on}
  \item{Start}{The starting position of the variant / deletion}
  \item{End}{The end position of the variant / deletion (equal to Start for SNVs and single basepair deletions)}
  \item{Sample}{The \code{Case} sample in which the variant was observed}
  \item{altAllele}{The alternate allele for SNVs (skipped for deletions, would be \code{"-"})}
  \item{refAllele}{The reference allele for SNVs (skipped for deletions since the tally file might not contain all the information necessary to extract it)}
  \item{caseCountFwd}{Support for the variant in the \code{Case} sample on the forward strand}
  \item{caseCountRev}{Support for the variant in the \code{Case} sample on the reverse strand}
  \item{caseCoverageFwd}{Coverage of the variant position in the \code{Case} sample on the forward strand}
  \item{caseCoverageRev}{Coverage of the variant position in the \code{Case} sample on the reverse strand}
  \item{controlCountFwd}{Support for the variant in the \code{Control} sample on the forward strand}
  \item{controlCountRev}{Support for the variant in the \code{Control} sample on the reverse strand}
  \item{controlCoverageFwd}{Coverage of the variant position in the \code{Control} sample on the forward strand}
  \item{controlCoverageRev}{Coverage of the variant position in the \code{Control} sample on the reverse strand}
  
  If the \code{annotateWithBackground} option is set, the following extra columns are returned:
  \item{backgroundFrequencyFwd}{The averaged frequency of mismatches / deletions at the position of all samples of type \code{Control} on the forward strand}
  \item{backgroundFrequencyRev}{The averaged frequency of mismatches / deletions at the position of all samples of type \code{Control} on the reverse strand}
  \item{pValueFwd}{The \code{p.value} of the test \code{binom.test( caseCountFwd, caseCoverageFwd, p = backgroundFrequencyFwd, alternative = "greater")}}
  \item{pValueRev}{The \code{p.value} of the test \code{binom.test( caseCountRev, caseCoverageRev, p = backgroundFrequencyRev, alternative = "greater")}}
  
  The function \code{callDeletionsPaired} merges adjacent single-base deletion calls if the option \code{mergeCalls} is set to \code{TRUE}. In that case the counts and coverages (e.g. \code{caseCountFwd}) are aggregated using the function supplied in the \code{mergeAggregator} option of the configuration list (defaults to \code{mean}), and the p-values \code{pValueFwd} and \code{pValueRev} (if \code{annotateWithBackground} is \code{TRUE}) are aggregated using the function supplied in the \code{pValueAggregator} option (defaults to \code{max}).
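
  The effect of the two aggregators can be illustrated on a hypothetical
  set of three adjacent single-base deletion calls. This sketch only
  mimics the documented behaviour with the default aggregators; it is
  not the internal implementation of \code{callDeletionsPaired}, and all
  values are made up.

\preformatted{
calls <- data.frame(
  Start           = c(100L, 101L, 102L),
  caseCountFwd    = c(4, 6, 5),
  caseCoverageFwd = c(40, 42, 44),
  pValueFwd       = c(1e-4, 3e-3, 2e-5)
)
merged <- data.frame(
  Start           = min(calls$Start),
  End             = max(calls$Start),
  caseCountFwd    = mean(calls$caseCountFwd),     # mergeAggregator = mean
  caseCoverageFwd = mean(calls$caseCoverageFwd),  # mergeAggregator = mean
  pValueFwd       = max(calls$pValueFwd)          # pValueAggregator = max
)
merged
}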
}
\author{
Paul Pyl
}
\examples{
  library(h5vc) # loading library
  tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
  sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
  position <- 29979629
  windowsize <- 1000
  vars <- h5dapply( # Calling Variants
    filename = tallyFile,
    group = "/ExampleStudy/16",
    blocksize = 500,
    FUN = callVariantsPaired,
    sampledata = sampleData,
    cl = vcConfParams(returnDataPoints=TRUE),
    names = c("Coverages", "Counts", "Reference", "Deletions"),
    range = c(position - windowsize, position + windowsize)
  )
  vars <- do.call( rbind, vars ) # merge the results from all blocks by row
  vars # We did find a variant
}
