% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/GSSTDA.R
\name{gsstda}
\alias{gsstda}
\title{Gene Structure Survival using Topological Data Analysis (GSSTDA).}
\usage{
gsstda(
  full_data,
  survival_time,
  survival_event,
  case_tag,
  control_tag = NA,
  gamma = NA,
  gen_select_type = "Top_Bot",
  percent_gen_select = 10,
  num_intervals = 5,
  percent_overlap = 40,
  distance_type = "correlation",
  clustering_type = "hierarchical",
  num_bins_when_clustering = 10,
  linkage_type = "single",
  optimal_clustering_mode = NA,
  silhouette_threshold = 0.25,
  na.rm = TRUE
)
}
\arguments{
\item{full_data}{Input matrix whose columns correspond to the patients and
rows to the genes.}

\item{survival_time}{Numerical vector of the same length as the number of
columns of \code{full_data}. Patients must be in the same order as in
\code{full_data}. For patients whose sample is pathological, give the time
between disease diagnosis and the event (death, relapse or other). If the
event has not occurred, give the time until the end of follow-up. Patients
whose sample is from healthy tissue must have an \code{NA} value.}

\item{survival_event}{Numerical vector of the same length as the number of
columns of \code{full_data}. Patients must be in the same order as in
\code{full_data}. For patients with a pathological sample, indicate
whether the event has occurred (1) or not (0). Only these values are valid;
patients with a healthy sample must have an \code{NA} value.}

\item{case_tag}{Character vector of the same length as the number of
columns of \code{full_data}. Patients must be in the same order as in
\code{full_data}. It indicates, for each patient, whether the sample is
from pathological or healthy tissue. One value must mark healthy samples
and another value must mark pathological samples; only these two values
may appear in the vector. The user will then be asked which of the two
values denotes healthy samples.}

\item{control_tag}{Tag of the healthy sample, e.g. "T".}

\item{gamma}{A parameter that indicates the magnitude of the noise assumed in
the flat data matrix for the generation of the Healthy State Model. If it
takes the value \code{NA} the magnitude of the noise is assumed to be unknown.
By default gamma is unknown.}

\item{gen_select_type}{How to select the genes to be used in the mapper.
Choose "Abs" to select the genes with the highest absolute value, or
"Top_Bot" to select half of the genes from those with the highest value
(positive value, i.e. worst survival prognosis) and the other half from
those with the lowest value (negative value, i.e. best prognosis). The
default is "Top_Bot".}

\item{percent_gen_select}{Percentage (from 0 to 100) of genes selected for
use in the mapper. The default is 10.}

\item{num_intervals}{Parameter for the mapper algorithm. Number of
intervals used to create the first sample partition based on
filtering values. The default is 5.}

\item{percent_overlap}{Parameter for the mapper algorithm. Overlap
between intervals, expressed as a percentage. The default is 40.}

\item{distance_type}{Parameter for the mapper algorithm.
Type of distance to be used for clustering. Choose between correlation
("correlation") and Euclidean ("euclidean"). The default is "correlation".}

\item{clustering_type}{Parameter for the mapper algorithm. Type of
clustering method. Choose between "hierarchical" and "PAM"
(partitioning around medoids). The default is "hierarchical".}

\item{num_bins_when_clustering}{Parameter for the mapper algorithm.
Number of bins used to build the histogram employed by the standard
method for finding the optimal number of clusters. Not needed if
\code{optimal_clustering_mode} is "silhouette" or \code{clustering_type}
is "PAM". The default is 10.}

\item{linkage_type}{Parameter for the mapper algorithm. Linkage criterion
used in hierarchical clustering. Choose between "single" for single-linkage
clustering, "complete" for complete-linkage clustering or "average" for
average-linkage clustering (UPGMA). Only needed for hierarchical
clustering. The default is "single".}

\item{optimal_clustering_mode}{Method for selecting the optimal number of
clusters. It is only needed if the chosen clustering algorithm is
hierarchical. In this case, choose between "standard" (the method used
in the original mapper article) or "silhouette". For the PAM algorithm,
the method is always "silhouette".}

\item{silhouette_threshold}{Minimum value of \eqn{\overline{s}}{s-bar} that a set of
clusters must have to be chosen as optimal. Within each interval of the
filter function, the average silhouette values \eqn{\overline{s}}{s-bar} are computed
for all possible partitions from \eqn{2} to \eqn{n-1} clusters, where \eqn{n} is the
number of samples within a specific interval. The number of clusters that
produces the highest value of \eqn{\overline{s}}{s-bar}, provided that value exceeds the
threshold, is selected as the optimum. If no partition produces an
\eqn{\overline{s}}{s-bar} exceeding the chosen threshold, all samples are assigned to a
single cluster. The default value is \eqn{0.25}, chosen based on standard
practice as a moderate value that reflects adequate separation and cohesion
within clusters.}

\item{na.rm}{\code{logical}. If \code{TRUE}, \code{NA} rows are omitted.
If \code{FALSE}, an error occurs in case of \code{NA} rows. The default is
\code{TRUE}.}
}
\value{
A \code{gsstda} object. It contains:
\itemize{
\item the matrix with the normal space \code{normal_space},
\item the matrix of the disease components \code{matrix_disease_component},
\item a matrix with the results of the application of proportional hazard models
for each gene (\code{cox_all_matrix}),
\item the genes selected for mapper \code{genes_disease_componen},
\item the matrix of the disease components with information from these genes only
\code{genes_disease_component},
\item and a \code{mapper_obj} object. This \code{mapper_obj} object contains the
values of the intervals (\code{interval_data}), the samples included in each
interval (\code{sample_in_level}), information about the cluster to which the
individuals in each interval belong (\code{clustering_all_levels}), a list including
the individuals contained in each detected node (\code{node_samples}), their size
(\code{node_sizes}), the average of the filter function values of the individuals
of each node (\code{node_average_filt}) and the adjacency matrix linking the nodes
(\code{adj_matrix}). Moreover, information is provided on the number of nodes,
the average node size, the standard deviation of the node size, the number
of connections between nodes, the proportion of connections to all possible
connections and the number of ramifications.
}
}
\description{
Gene Structure Survival using Topological Data Analysis.
This function implements an analysis for expression array data
based on the \emph{Progression Analysis of Disease} developed by Nicolau
\emph{et al.} (doi: 10.1073/pnas.1102826108) that allows the information
contained in an expression matrix to be condensed into a combinatorial graph.
The novelty is that survival information is integrated into the analysis.

The analysis consists of three parts: preprocessing of the data, gene
selection together with the filter function, and the mapper algorithm. The
preprocessing is the Disease Specific Genomic Analysis (proposed by Nicolau
\emph{et al.}), which uses linear models to remove the part of the data that
is considered "healthy" and keep only the component that is due to the
disease. The genes are then selected according to their variability and
their relation to survival, and the value of the filter function for each
patient is calculated taking into account the survival associated with each
gene. Finally, the mapper algorithm is applied to the disease component
matrix and the values of the filter function to obtain a combinatorial graph.
}
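\details{
The selection rule described for \code{silhouette_threshold} can be
illustrated with a minimal sketch. It is not the package's internal code: it
assumes base R hierarchical clustering (\code{hclust}, \code{cutree}) and
\code{cluster::silhouette}, and the simulated points and all object names
are hypothetical.

\preformatted{
# Hypothetical samples falling inside one interval of the filter function
set.seed(1)
toy_points <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
                    matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
toy_dist <- dist(toy_points)
toy_tree <- hclust(toy_dist, method = "single")

# Average silhouette width for every partition from 2 to n - 1 clusters
n_samples <- nrow(toy_points)
candidate_k <- 2:(n_samples - 1)
avg_sil <- sapply(candidate_k, function(k) {
  memberships <- cutree(toy_tree, k = k)
  mean(cluster::silhouette(memberships, toy_dist)[, "sil_width"])
})

# Keep the best partition only if it exceeds the threshold;
# otherwise all samples go to a single cluster
silhouette_threshold <- 0.25
best_k <- if (max(avg_sil) > silhouette_threshold) {
  candidate_k[which.max(avg_sil)]
} else {
  1
}
}

Here \code{best_k} is the number of clusters kept for the interval, and it
falls back to 1 when no candidate partition exceeds the threshold.
}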
\examples{
\donttest{
gsstda_object <- gsstda(full_data, survival_time, survival_event, case_tag,
                        gamma = NA, gen_select_type = "Top_Bot",
                        percent_gen_select = 10, num_intervals = 4,
                        percent_overlap = 50, distance_type = "euclidean",
                        num_bins_when_clustering = 8,
                        clustering_type = "hierarchical",
                        linkage_type = "single")}
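
# A minimal sketch of how the four inputs are expected to line up, following
# the argument descriptions above. The toy_* objects, the labels "healthy"
# and "disease" and the random values are hypothetical; they are not part of
# the package data and are not meant to give meaningful results.
set.seed(42)
toy_case_tag <- c(rep("healthy", 3), rep("disease", 5))

# Genes in rows, patients in columns
toy_full_data <- matrix(rnorm(6 * 8), nrow = 6,
                        dimnames = list(paste0("gene", 1:6),
                                        paste0("patient", 1:8)))

# Healthy samples carry NA for both survival vectors
toy_is_healthy <- toy_case_tag == "healthy"
toy_survival_time <- ifelse(toy_is_healthy, NA, round(runif(8, 1, 60)))
toy_survival_event <- ifelse(toy_is_healthy, NA, rbinom(8, 1, 0.5))

# Consistency checks matching the documented requirements
stopifnot(
  length(toy_survival_time) == ncol(toy_full_data),
  length(toy_survival_event) == ncol(toy_full_data),
  length(toy_case_tag) == ncol(toy_full_data),
  length(unique(toy_case_tag)) == 2,
  all(is.na(toy_survival_time[toy_is_healthy])),
  all(is.na(toy_survival_event[toy_is_healthy])),
  all(is.element(toy_survival_event[!toy_is_healthy], c(0, 1)))
)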
}
