% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/methods.R
\name{term_IC}
\alias{term_IC}
\title{Information content}
\usage{
term_IC(
  dag,
  method,
  terms = NULL,
  control = list(),
  verbose = simona_opt$verbose
)
}
\arguments{
\item{dag}{An \code{ontology_DAG} object.}

\item{method}{An IC method. All available methods are in \code{\link[=all_term_IC_methods]{all_term_IC_methods()}}.}

\item{terms}{A vector of term names. If it is set, the returned vector is restricted to these terms.}

\item{control}{A list of parameters passed to individual methods. See the subsections below.}

\item{verbose}{Whether to print messages.}
}
\value{
A numeric vector.
}
\description{
Calculates the information content (IC) of terms in an ontology DAG with a selected method.
}
\section{Methods}{

\subsection{IC_offspring}{

Denote \code{k} as the number of offspring terms plus the term itself, and \code{N} as the same value for the root (i.e. the total number of terms in the DAG).
The information content is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log(k/N)
}\if{html}{\out{</div>}}
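
As a minimal worked sketch of this formula in base R, the snippet below uses offspring counts tallied by hand for the small DAG in the Examples section (edges \code{a->b}, \code{a->c}, \code{b->c}, \code{b->d}, \code{c->e}, \code{d->f}). The counts are illustrative assumptions and are not produced by calling \code{term_IC()}:

\if{html}{\out{<div class="sourceCode">}}\preformatted{# offspring counts including the term itself (hand-counted, illustrative)
k = c(a = 6, b = 5, c = 2, d = 2, e = 1, f = 1)
N = k[["a"]]      # value for the root, i.e. the total number of terms
IC = -log(k/N)    # the root gets 0, the leaves e and f get log(6)
}\if{html}{\out{</div>}}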
}


\subsection{IC_height}{

For a term \code{t} in the DAG, denote \code{d} as the maximal distance from root (i.e. the depth) and \code{h} as the maximal distance to leaves (i.e. the height),
the relative position \code{p} on the longest path from root to leaves via term \code{t} is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{p = (h + 1)/(h + d + 1)
}\if{html}{\out{</div>}}

Adding 1 in the formula ensures that \code{p} is never 0. The information content is then:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log(p) 
   = -log((h+1)/(h+d+1))
}\if{html}{\out{</div>}}
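
The calculation can be sketched in base R as follows. The depths and heights are tallied by hand for the small DAG in the Examples section (edges \code{a->b}, \code{a->c}, \code{b->c}, \code{b->d}, \code{c->e}, \code{d->f}) and are assumptions made for illustration only; the variable names are hypothetical:

\if{html}{\out{<div class="sourceCode">}}\preformatted{# maximal distance from the root (hand-counted, illustrative)
depth  = c(a = 0, b = 1, c = 2, d = 2, e = 3, f = 3)
# maximal distance to the leaves (hand-counted, illustrative)
height = c(a = 3, b = 2, c = 1, d = 1, e = 0, f = 0)
p  = (height + 1)/(height + depth + 1)   # relative position, always > 0
IC = -log(p)                             # the root gets 0, the leaves get log(4)
}\if{html}{\out{</div>}}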
}


\subsection{IC_annotation}{

Denote \code{k} as the number of items annotated to a term \code{t}, and \code{N} as the number of items annotated to the root (which is
the total number of items annotated to the DAG). The IC for term \code{t} is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log(k/N)
}\if{html}{\out{</div>}}

Other tools are inconsistent in how they define \code{k} and \code{N}.
Please see \code{\link[=n_annotations]{n_annotations()}} for an explanation.

\code{NA} is assigned to terms with no annotated items.
}


\subsection{IC_universal}{

It measures the probability of a term getting full transmission from the root. Each term is associated with a p-value, and the root has
a p-value of 1.

For example, assume an intermediate term \code{t} has two parent terms \code{parent1} and \code{parent2}, where \code{parent1} has \code{k1} children
and \code{parent2} has \code{k2} children. If a parent transmits information equally to all its children, then \code{parent1} transmits only \code{1/k1} and
\code{parent2} only \code{1/k2} of its content to term \code{t}. Equivalently, the probability of reaching \code{t} from a parent is \code{1/k1} or \code{1/k2}.
Let \code{p1} and \code{p2} be the accumulated contents from the root to \code{parent1} and \code{parent2} respectively (i.e. the probabilities
of the two parent terms getting full transmission from the root). The probability of reaching \code{t} via a full transmission graph through \code{parent1}
is then the product of \code{p1} and \code{1/k1}, i.e. \code{p1/k1}, and likewise \code{p2/k2} for \code{parent2}. If the transmissions from \code{parent1} and
\code{parent2} are independent, the probability of \code{t} (denoted as \code{p_t}) being transmitted from both parents is:

\if{html}{\out{<div class="sourceCode">}}\preformatted{p_t = (p1/k1) * (p2/k2)
}\if{html}{\out{</div>}}

Since the two parents are the full set of \code{t}'s parents, \code{p_t} is the probability of \code{t} getting full transmission from the root. The final
information content is:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log(p_t)
}\if{html}{\out{</div>}}
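
The propagation from the root can be sketched in base R as below, assuming the product rule generalizes to any number of parents and that terms are visited in a topological order (parents before children). The helper \code{parents_of()} and the hard-coded visiting order are hypothetical and only serve this illustration on the DAG from the Examples section; this is not necessarily how \code{term_IC()} is implemented:

\if{html}{\out{<div class="sourceCode">}}\preformatted{parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
# number of children of each parent term: a = 2, b = 2, c = 1, d = 1
n_children = lengths(split(children, parents))
# hypothetical helper: the parent terms of a given term
parents_of = function(term) parents[children == term]
p = c(a = 1)   # the root has a p-value of 1
# visit the remaining terms so that parents always come before children
for (term in c("b", "c", "d", "e", "f"))
    p[term] = prod(p[parents_of(term)]/n_children[parents_of(term)])
IC = -log(p)
}\if{html}{\out{</div>}}

Here term \code{c} has the parents \code{a} and \code{b}, so its p-value is \code{(1/2) * (0.5/2) = 0.125}.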

Paper link: \doi{10.1155/2012/975783}.
}


\subsection{IC_Zhang_2006}{

It measures the number of ways from a term to reach leaf terms. E.g. in the following DAG:

\if{html}{\out{<div class="sourceCode">}}\preformatted{     a      upstream
    /|\\
   b | c
     |/
     d      downstream
}\if{html}{\out{</div>}}

Term \code{a} has three ways to reach a leaf: \code{a->b}, \code{a->d} and \code{a->c->d}.

Denote \code{k} as the number of ways for term \code{t} to reach leaves and \code{N} as the maximal value of \code{k}, which
is associated with the root term. The information content is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log(k/N) 
   = log(N) - log(k)
}\if{html}{\out{</div>}}
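
A minimal base R sketch for the four-term DAG drawn above: the number of ways to reach leaves is accumulated from the leaves upwards, where each leaf counts as one way and each non-leaf term sums the values of its children. The hard-coded visiting order (children before parents) is an assumption made for this toy example:

\if{html}{\out{<div class="sourceCode">}}\preformatted{parents  = c("a", "a", "a", "c")
children = c("b", "d", "c", "d")
k = c(b = 1, d = 1)   # each leaf reaches itself in one way
# visit non-leaf terms so that children always come before parents
for (term in c("c", "a"))
    k[term] = sum(k[children[parents == term]])
N  = k[["a"]]         # value for the root, here 3
IC = -log(k/N)
}\if{html}{\out{</div>}}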

Paper link: \doi{10.1186/1471-2105-7-135}.
}


\subsection{IC_Seco_2004}{

It is based on the number of offspring terms of term \code{t}.
The information content is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = 1 - log(k+1)/log(N)
}\if{html}{\out{</div>}}

where \code{k} is the number of offspring terms of \code{t}, so \code{k+1} counts \code{t}'s offspring terms plus \code{t} itself.
\code{N} is the total number of terms in the DAG.
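
As a small worked sketch in base R, with offspring counts tallied by hand for the DAG in the Examples section (an illustrative assumption, not output of \code{term_IC()}):

\if{html}{\out{<div class="sourceCode">}}\preformatted{# offspring counts excluding the term itself (hand-counted, illustrative)
k = c(a = 5, b = 4, c = 1, d = 1, e = 0, f = 0)
N = 6                         # total number of terms
IC = 1 - log(k + 1)/log(N)    # ranges from 0 (root) to 1 (leaves)
}\if{html}{\out{</div>}}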

Paper link: \doi{10.5555/3000001.3000272}.
}


\subsection{IC_Zhou_2008}{

It is a correction of \emph{IC_Seco_2004} which considers the depth of a term in the DAG.
The information content is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = 0.5*IC_Seco + 0.5*log(depth)/log(max_depth)
}\if{html}{\out{</div>}}

where \code{depth} is the depth of term \code{t} in the DAG, defined as the maximal distance from the root, and \code{max_depth} is the largest depth in the DAG.
So the IC is composed of two parts: the number of offspring terms and the position in the DAG.

Paper link: \doi{10.1109/FGCNS.2008.16}.
}


\subsection{IC_Seddiqui_2010}{

It is also a correction to \emph{IC_Seco_2004}, but it considers the number of relations connecting a term (i.e. the number of parent terms and child terms).
The information content is defined as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = (1-sigma)*IC_Seco + sigma*log(n_parents + n_children + 1)/log(total_edges + 1)
}\if{html}{\out{</div>}}

where \code{n_parents} and \code{n_children} are the numbers of parents and children of term \code{t}. The tuning factor \code{sigma} is defined as

\if{html}{\out{<div class="sourceCode">}}\preformatted{sigma = log(total_edges+1)/(log(total_edges) + log(total_terms))
}\if{html}{\out{</div>}}

where \code{total_edges} is the number of all relations (all parent-child relations)
and \code{total_terms} is the number of all terms in the DAG.

Paper link: \doi{10.5555/1862330.1862343}.
}


\subsection{IC_Sanchez_2011}{

It measures the average contribution of term \code{t} to leaf terms. First denote \code{zeta} as the number of leaf terms that
can be reached from term \code{t} (i.e. \code{t}'s offspring that are leaves). Since all of \code{t}'s ancestors can also
reach \code{t}'s leaves, the contribution of \code{t} to leaf terms is scaled by \code{n_ancestors}, which is the number of \code{t}'s ancestor terms.
The final information content is normalized by the total number of leaves in the DAG, which is the maximal possible value of \code{zeta}.
The complete definition of information content is:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = -log( (zeta/n_ancestors) / n_all_leaves)
}\if{html}{\out{</div>}}

Paper link: \doi{10.1016/j.knosys.2010.10.001}.
}


\subsection{IC_Meng_2012}{

It has a more complex form that takes into account the depth of the term and its downstream terms.
The first factor is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{f1 = log(depth)/log(max_depth)
}\if{html}{\out{</div>}}

The second factor is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{f2 = 1 - log(1 + sum_\{x in t's offspring\}(1/depth_x))/log(total_terms)
}\if{html}{\out{</div>}}

In the equation, the summation goes over \code{t}'s offspring terms.

The final information content is the multiplication of \code{f1} and \code{f2}:

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = f1 * f2
}\if{html}{\out{</div>}}

Paper link: \url{http://article.nadiapub.com/IJGDC/vol5_no3/6.pdf}.

There is one parameter \code{correct}. If it is set to \code{TRUE}, the first factor \code{f1} is calculated as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{f1 = log(depth + 1)/log(max_depth + 1)
}\if{html}{\out{</div>}}

\code{correct} can be set as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{term_IC(dag, method = "IC_Meng_2012", control = list(correct = TRUE))
}\if{html}{\out{</div>}}
}


\subsection{IC_Wang_2007}{

Each relation is weighted by a value less than 1 based on the semantic relation, e.g. 0.8 for "is_a" and 0.6 for "part_of".
For a term \code{t} and one of its ancestor terms \code{a}, it first calculates an "S-value", which corresponds to the path from \code{a} to \code{t} where
the accumulated product of weights along the path is maximal:

\if{html}{\out{<div class="sourceCode">}}\preformatted{S(a->t) = max_\{path\}(prod_\{edge on the path\}(w))
}\if{html}{\out{</div>}}

Here \code{max} goes over all possible paths from \code{a} to \code{t}, and \code{prod()} multiplies edge weights in a certain path.

The formula can be transformed as (we simply rewrite \code{S(a->t)} to \code{S}):

\if{html}{\out{<div class="sourceCode">}}\preformatted{1/S = min(prod(1/w))
log(1/S) = log(min(prod(1/w)))
         = min(sum(log(1/w)))
}\if{html}{\out{</div>}}

Since \code{w < 1}, \code{log(1/w)} is positive. According to the equation, the path (\code{a->...->t}) is actually the shortest path from \code{a} to \code{t} by taking
\code{log(1/w)} as the weight, and \code{log(1/S)} is the weighted shortest distance.

If \code{S(a->t)} is thought of as the maximal semantic contribution from \code{a} to \code{t}, the information content is calculated
as the sum over all of \code{t}'s ancestors (including \code{t} itself):

\if{html}{\out{<div class="sourceCode">}}\preformatted{IC = sum_\{a in t's ancestors + t\}(S(a->t))
}\if{html}{\out{</div>}}
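
The equivalence between the maximal product of weights and the weighted shortest distance can be checked numerically with a short base R sketch. It assumes a hypothetical term \code{c} with ancestors \code{a} and \code{b}, edges \code{a->b}, \code{a->c} and \code{b->c}, and a weight of 0.8 on every edge; all variable names are illustrative:

\if{html}{\out{<div class="sourceCode">}}\preformatted{# edge weights along the two paths from a to c: a->c and a->b->c
path_weights = list(c(0.8), c(0.8, 0.8))
# S-value as the maximal product of weights over all paths
S_a = max(sapply(path_weights, prod))
# same value via the shortest distance with log(1/w) as edge weight
dist_a = min(sapply(path_weights, function(w) sum(log(1/w))))
S_a_from_dist = exp(-dist_a)
S_b = 0.8   # single path b->c
S_c = 1     # the term itself, an empty product
IC_c = S_a + S_b + S_c
}\if{html}{\out{</div>}}

Under these assumptions both routes give an S-value of 0.8 for \code{a}, and the information content of \code{c} is \code{0.8 + 0.8 + 1 = 2.6}.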

Paper link: \doi{10.1093/bioinformatics/btm087}.

The contribution of different semantic relations can be set with the \code{contribution_factor} parameter. The value should be a named numeric
vector where names should cover the relations defined in \code{relations} set in \code{\link[=create_ontology_DAG]{create_ontology_DAG()}}. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for \code{contribution_factor} can be set as:

\if{html}{\out{<div class="sourceCode">}}\preformatted{term_IC(dag, method = "IC_Wang_2007",
    control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
}\if{html}{\out{</div>}}

Note the \strong{IC_Wang_2007} method is normally used within the \strong{Sim_Wang_2007} semantic similarity method.
}
}

\examples{
parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_IC(dag, "IC_annotation")
}
