Correlation Page Annotation

Explanations of Different Types of Correlations

Literature correlations are calculated using the Semantic Gene Organizer (SGO). The SGO software uses a concept-based vector space model called latent semantic indexing (LSI) to automatically extract gene-gene relations from titles and abstracts in MEDLINE citations (Homayouni et al. 2005).

These LSI literature correlations are all positive and range from 0 to 1. They were initially computed in 2005 and 2007 by Dr. Ramin Homayouni and colleagues and entered into GeneNetwork by Nick Furlotte. All values were updated in late 2016 by Ramin Homayouni and Sujoy Roy (University of Memphis) using the complete 2016 PubMed collection for mouse genes and abstracts. Although computed using mouse data, the literature correlation feature also works well for human and rat expression data sets. The 2016 matrix of gene-to-gene literature correlations is approximately 20k x 20k / 2. This matrix was converted by Lei Yan into a MySQL table called LCorrRamin3 that contains 204,980,628 rows, each a simple tuple: GeneId1, GeneId2, value.
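The row count quoted above is consistent with storing each unordered gene pair exactly once. As a quick arithmetic check, and assuming that self-pairs are not stored (an inference, not something stated in this page), solving n(n-1)/2 = rows for n gives a whole number of genes. The function name below is hypothetical.

```python
import math

ROWS = 204_980_628  # row count of LCorrRamin3 quoted in the text


def genes_from_pair_count(rows: int) -> int:
    """Solve n*(n-1)/2 == rows for n; fail if rows is not such a pair count."""
    disc = 1 + 8 * rows          # discriminant of n^2 - n - 2*rows = 0
    root = math.isqrt(disc)
    if root * root != disc:
        raise ValueError("row count is not of the form n*(n-1)/2")
    return (1 + root) // 2


n_genes = genes_from_pair_count(ROWS)
print(n_genes)  # 20248, in line with the "approximately 20k" matrix dimension
```

If each pair is indeed stored in one order only, a lookup for two genes may need to try both (GeneId1, GeneId2) orderings; the ordering convention is not documented here.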

Vector space modeling is a classical information retrieval technique used to identify conceptually related documents, whereby the semantic structure of a document is represented as a vector in word space and the degree of similarity between documents is calculated by the angle between document vectors. LSI improves retrieval by using a singular value decomposition (or principal component analysis) to create a subspace of concepts in which text documents are represented as vectors.

Each gene is represented as a vector in word or concept space. The cosine of the angle between the query gene vector and all other gene vectors is used to rank related genes. In general a cosine can range from -1 to 1, where a value of 1 denotes the highest similarity; the literature correlations stored in GeneNetwork fall between 0 and 1.
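The ranking step can be sketched in a few lines. This is a minimal illustration, not SGO code: the gene names and three-dimensional vectors are made up, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def rank_related(query, gene_vectors):
    """Return (gene, cosine) pairs for all other genes, most similar first."""
    q = gene_vectors[query]
    scores = [(g, cosine(q, v)) for g, v in gene_vectors.items() if g != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors in a 3-dimensional "concept space" (hypothetical values).
vectors = {
    "GeneA": [1.0, 2.0, 0.0],
    "GeneB": [2.0, 4.0, 0.0],  # same direction as GeneA
    "GeneC": [0.0, 0.0, 3.0],  # orthogonal to GeneA
    "GeneD": [1.0, 0.0, 1.0],
}
ranking = rank_related("GeneA", vectors)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Because the cosine depends only on direction, GeneB ranks first even though its vector is twice as long as that of GeneA.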

An important advantage of LSI over other vector-based retrieval methods is that relations can be derived even if a direct link between genes has not been established in the literature. The fewer factors that are used for query matching, the more conceptual the relations, and vice versa. Therefore, genes may be conceptually related even if they have not been studied together directly. This property makes LSI well suited to investigating the functional significance of gene associations identified in discovery-oriented genomic studies.

SGO literature correlation values may be used to rapidly identify both the known relations and the latent relations between co-regulated genes, based on the current literature.

Methods

Gene abstract documents are first compiled using titles and abstracts in MEDLINE citations cross-referenced for each mouse gene and its human and rat homologs. These gene documents are then assembled and parsed into a dictionary of terms (tokens) and weighted frequencies that are required for the sparse term-by-gene document matrix. In effect, each gene document is viewed as a bag of words upon which operations can be performed. There are a number of different word weighting schemes that can be used in vector space modeling (Baeza-Yates and Ribeiro-Neto, 1999). The aim of any scheme is to measure similarity within a document while at the same time measuring the dissimilarity of a gene document from the other gene documents. In SGO, we use a log entropy weighting scheme to decrease the weight of high-frequency words, while giving distinguishing words higher weights (Berry and Browne, 1999).
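To make the weighting step concrete, here is a small sketch of one commonly used form of log entropy weighting: a local weight log2(1 + f) multiplied by a global weight of 1 plus the normalized entropy of the term across documents. The exact variant used by SGO is not specified on this page, so treat the formula, the function name, and the toy tokens as assumptions.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """docs maps a gene name to its list of tokens.

    Returns a mapping gene -> {term: weight} using
    local weight  = log2(1 + count of term in the gene document)
    global weight = 1 + sum_j p_j * log2(p_j) / log2(n_documents)
    where p_j is the share of the term's total count found in document j.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two gene documents")
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    totals = Counter()
    for c in counts.values():
        totals.update(c)

    global_w = {}
    for term, total in totals.items():
        entropy = 0.0
        for c in counts.values():
            f = c.get(term, 0)
            if f:
                p = f / total
                entropy += p * math.log2(p)
        global_w[term] = 1.0 + entropy / math.log2(n_docs)

    return {
        name: {t: math.log2(1 + f) * global_w[t] for t, f in c.items()}
        for name, c in counts.items()
    }


# Toy gene documents (hypothetical tokens).
docs = {
    "GeneA": ["cell", "kinase", "kinase"],
    "GeneB": ["cell", "synapse"],
}
weights = log_entropy_weights(docs)
print(weights)
```

In this toy case "cell" is spread evenly over both documents and receives a weight of zero, whereas "kinase" and "synapse" each occur in only one document and keep their full local weight.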

Term and document vectors for the LSI model deployed by SGO were generated by truncating the singular value decomposition (SVD) of the term-by-gene document matrix to s factors (i.e., only s columns of the orthogonal matrices U and V are used). LSI therefore produces a rank-reduced space in which to compare two gene documents at different conceptual levels. In practice, the maximum number of factors is limited by the number of documents in the collection. Fewer factors may be used for broad (more conceptual) comparisons, whereas a larger number of factors may be used for specific (more literal) comparisons. Other studies have demonstrated that for large document collections the optimal number of factors is approximately 300 (Landauer et al., 2004).
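The effect of the number of factors can be illustrated with a tiny term-by-gene matrix. The sketch below uses NumPy's dense SVD and scales each gene's coordinates by the singular values; whether SGO scales vectors this way is an assumption, and the matrix and function names are made up for illustration. In the toy matrix, the first and third genes share no terms directly but are both linked to the second gene.

```python
import numpy as np


def lsi_gene_vectors(term_by_gene, s):
    """Return one s-dimensional vector per gene (column) from a truncated SVD."""
    _, sigma, vt = np.linalg.svd(term_by_gene, full_matrices=False)
    s = min(s, len(sigma))
    # Keep the first s right singular vectors, scaled by their singular values.
    return (vt[:s, :] * sigma[:s, None]).T


def cosine_matrix(vectors):
    """Pairwise cosines between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty vectors
    unit = vectors / norms
    return unit @ unit.T


# Rows are terms, columns are genes (toy counts).
# Gene 0 and gene 2 share no terms; both share a term with gene 1.
A = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

literal = cosine_matrix(lsi_gene_vectors(A, s=3))     # all factors kept
conceptual = cosine_matrix(lsi_gene_vectors(A, s=1))  # one factor kept

print(round(float(literal[0, 2]), 3))     # no direct overlap between genes 0 and 2
print(round(float(conceptual[0, 2]), 3))  # related through the shared neighbour
```

With all factors retained, the cosines match those of the original columns, so genes 0 and 2 appear unrelated. With a single factor they become indistinguishable, which is the extreme case of the broad, conceptual comparison described above.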

For more information on SGO please refer to https://grits.eecs.utk.edu/sgo/sgo.html

Tissue Correlation

The tissue correlation is an estimate of the similarity of expression of two genes or transcripts across different cells, tissues, or organs. Tissue correlations were generated by analyzing gene expression in multiple tissues taken from single animals (C57BL/6J, DBA/2J mice, and BN rats). Both Pearson product-moment correlations and Spearman rank order correlations have been computed for all pairs of genes using data from a set of tissue samples. Both correlation types (r and rho) and their associated p values are displayed in Trait Correlation pages to the far right. While we used mouse tissues to compute the tissue correlations, we display these values even in tables generated for rat and human transcripts and genes.

This tissue correlation analysis was carried out by Drs. Xusheng Wang, Lu Lu, and Robert W. Williams at the University of Tennessee Health Science Center in collaboration with Illumina Inc. (Jan and Feb 2008) using the MouseWG-6 v2.0 array. The GN interface was created by Xiaodong Zhou. We generated data from approximately 60 samples. The correlations in GeneNetwork were computed for a subset of 25 tissues or tissue pools that have moderately independent expression patterns. We merged many CNS samples into a single pooled value. We also merged data for ileum, jejunum, and duodenum.

In many cases, the expression of a single gene is estimated by multiple probes or probe sets, multiple exons, or alternative transcripts. In the case of the Illumina array that we used, there are typically two to three probes per gene, and all may be equally valid estimates of different aspects of the expression of a gene. To provide an approximate first-order summary of joint expression of genes across tissues, we simply selected, for each gene, the probe with the highest estimate of expression averaged across multiple tissues. [Dec 2008, RWW].
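The probe selection rule described above can be sketched as follows. The probe identifiers, gene names, expression values, and function name are all hypothetical; only the rule itself (keep the probe with the highest mean expression across tissues) comes from the text.

```python
from statistics import mean


def pick_representative_probes(probe_gene, expression):
    """Choose, for each gene, the probe with the highest mean expression.

    probe_gene: probe id -> gene symbol
    expression: probe id -> list of expression values, one per tissue
    """
    best = {}  # gene -> (probe id, mean expression)
    for probe, gene in probe_gene.items():
        avg = mean(expression[probe])
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe, avg)
    return {gene: probe for gene, (probe, _) in best.items()}


# Toy data: two probes for GeneA, one for GeneB (hypothetical values).
probe_gene = {"probe_1": "GeneA", "probe_2": "GeneA", "probe_3": "GeneB"}
expression = {
    "probe_1": [5.0, 6.0, 7.0],
    "probe_2": [8.0, 9.0, 7.5],
    "probe_3": [4.0, 4.5, 5.0],
}
chosen = pick_representative_probes(probe_gene, expression)
print(chosen)
```

This keeps one value per gene per tissue, at the cost of discarding whatever the lower-expressed probes measure, such as alternative transcripts.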

Tissue Correlations: Pearson's r and Spearman's rho

Conventional Pearson product-moment correlations (r) or Spearman rank order correlations (rho) were computed across approximately 25 different organs and tissue types. The rank order correlations will be less dependent on the distribution of expression estimates or the particular set of 25 tissue types.
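The difference between the two statistics is easiest to see on skewed data. The sketch below implements both from their textbook definitions (Spearman's rho as Pearson's r on average ranks) and applies them to made-up expression values in which one tissue has a very high value; it is an illustration, not the code used to populate GeneNetwork.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank order correlation: Pearson's r computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy expression values across six tissues (hypothetical).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 8.0, 16.0, 32.0, 640.0]  # same ordering, one extreme tissue

r = pearson(gene_a, gene_b)
rho = spearman(gene_a, gene_b)
print(round(r, 3), round(rho, 3))
```

Both genes rise in the same tissue order, so rho is 1, while r is pulled well below 1 by the single extreme value.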

The Tissue P (r) is the probability associated with the Pearson product-moment correlation. The Tissue P (rho) is the corresponding probability associated with the Spearman rank order statistic. Both p values are currently computed for an n of 25 organs and tissue types. The rank order correlation will be more conservative. The Pearson p value may be appropriate if the points in the scatterplot are normally distributed along both the x and y axes.

Sample Correlation: Pearson's r

Pearson's Sample Correlation, r, is computed using trait values measured across a population of genetically diverse cases (individuals or strains). The correlation is generated by a combination of shared genetic, environmental, and experimental factors. In other words, this is a correlation of phenotypes across a population. It is only a good estimate of a genetic correlation when developmental, environmental, technical, and error variance in the sample is low. In the case of sets of recombinant inbred strains it is possible to reduce non-genetic sources of variance by pooling samples and by resampling genetically identical individuals.

p Value of Sample Correlation (Pearson's r): The p value associated with the Pearson product-moment correlation type described above. The p value takes into account differences in the sample size. Correlations and traits are usually ranked with the smallest p value (most significant) on the top.
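To see how sample size enters the p value, the sketch below uses the Fisher z-transformation with a normal approximation. This is a standard approximation chosen here because it needs only the Python standard library; it is not necessarily the calculation GeneNetwork performs, and the function name is hypothetical.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p value for a correlation r from n samples.

    Uses Fisher's z-transformation: atanh(r) * sqrt(n - 3) is treated as a
    standard normal deviate. Requires n > 3.
    """
    if n <= 3:
        raise ValueError("the approximation needs more than 3 samples")
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# The same correlation becomes more significant as the sample grows.
for n in (10, 25, 100):
    print(n, approx_p_value(0.5, n))
```

For a fixed r of 0.5, the approximate p value is above 0.05 with 10 samples and below it with 25, which is why rankings by p value and by r can differ when traits have different numbers of cases.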

Sample Correlation: Spearman's Rank Order, rho

Spearman's Sample Correlation, rho, is the rank order correlation computed using trait values measured across a population of genetically diverse cases (individuals or strains). This correlation is not unduly affected by outliers, and should also generally be used when the sample size is small (less than 20). The correlation is generated by a combination of genetic, environmental, and experimental factors. It is only a good estimate of a genetic correlation when developmental, environmental, technical, and error variance in the sample is low. In the case of sets of recombinant inbred strains it is possible to reduce non-genetic sources of variance by pooling samples and by resampling genetically identical individuals.

p Value of Sample Correlation (Spearman's rho): The p value associated with the Spearman rank order correlation type described above. The p value takes into account differences in the sample size. Correlations and traits are usually ranked with the smallest p value (most significant) on the top.

Service initiated June 15, 2001. Page maintained by Hongqiang Li, Fan Zhang, and Robert W. Williams. Site built by Jintao Wang, Kenneth Manly, RWW, and many colleagues.
NIAAA Integrative Neuroscience Initiative on Alcoholism (U01AA13499, U24AA13513); a Human Brain Project funded jointly by the NIDA, NIMH, and NIAAA (P20-DA 21131); NCI MMHCC (U01CA105417); Biomedical Informatics Research Network (BIRN), NCRR (U24 RR021760)