Literature Correlation

The literature correlations are calculated using the Semantic Gene Organizer (SGO). The SGO software uses a concept-based vector space model called latent semantic indexing (LSI) to automatically extract gene-gene relations from titles and abstracts in MEDLINE citations (Homayouni et al. 2005).

These LSI literature correlations are all positive and range from 0 to 1. They were computed in mid-2005 using the complete PubMed collection.

Vector space modeling is a classical information retrieval technique used to identify conceptually related documents, whereby the semantic structure of a document is represented as a vector in word space and the degree of similarity between documents is calculated by the angle between document vectors. LSI improves retrieval by using a singular value decomposition (or principal component analysis) to create a subspace of concepts in which text documents are represented as vectors.

Each gene is represented as a vector in word or concept space. The cosine of the angle between the query gene vector and every other gene vector is used to rank related genes. In general, cosine values range from -1 to 1, where a value of 1 denotes the highest similarity.
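
The ranking step described above can be illustrated with a minimal sketch. It assumes the gene vectors are already available as rows of a NumPy array; the function name rank_by_cosine and the toy vectors are hypothetical and are not taken from SGO.

```python
import numpy as np

def rank_by_cosine(gene_vectors, query_index):
    """Rank all other genes by the cosine of their angle to the query gene."""
    X = np.asarray(gene_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0              # leave all-zero vectors at cosine 0
    unit = X / norms[:, None]            # scale every gene vector to unit length
    cosines = unit @ unit[query_index]   # dot products of unit vectors are cosines
    order = np.argsort(-cosines, kind="stable")
    order = order[order != query_index]  # drop the query gene itself
    return order, cosines[order]

# Toy data (illustrative only): four genes in a three-dimensional space.
vectors = np.array([
    [1.0, 0.0, 0.0],   # gene 0, used as the query
    [2.0, 0.0, 0.0],   # gene 1, same direction as the query
    [1.0, 1.0, 0.0],   # gene 2, at 45 degrees to the query
    [0.0, 0.0, 3.0],   # gene 3, orthogonal to the query
])
order, scores = rank_by_cosine(vectors, query_index=0)
```

Because the cosine depends only on direction, gene 1 ranks first even though its vector is twice as long as the query vector.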

An important advantage of LSI over other vector-based retrieval methods is that relations can be derived even if a direct link between genes has not been established in the literature. The fewer factors that are used for query matching, the more conceptual the relations, and vice versa. Therefore, genes may be conceptually related even if they have not been studied together directly. This property makes LSI well suited to investigating the functional significance of gene associations identified in discovery-oriented genomic studies.

SGO literature correlation values may be used to rapidly identify both known and latent relations between co-regulated genes, based on the current literature.

Methods

Gene abstract documents are first compiled from the titles and abstracts of MEDLINE citations cross-referenced to each mouse gene and its human and rat homologs. These gene documents are then parsed into a dictionary of terms (tokens) and the weighted frequencies required for the sparse term-by-gene document matrix. In effect, each gene document is viewed as a bag of words upon which operations can be performed. A number of word weighting schemes can be used in vector space modeling (Baeza-Yates and Ribeiro-Neto, 1999). The aim of any scheme is to measure similarity within a document while at the same time measuring the dissimilarity of a gene document from the other gene documents. In SGO, we use a log-entropy weighting scheme to decrease the weight of high-frequency words while giving distinguishing words higher weights (Berry and Browne, 1999).
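
As an illustration of log-entropy weighting, the sketch below applies one common textbook formulation: a local weight log(1 + f) for each raw count f, multiplied by a global weight of 1 plus the normalized entropy term of the word across documents. The exact variant used in SGO (for example, the base of the logarithm) is not stated here, so the formula, the function name log_entropy_weight, and the toy counts are all assumptions.

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a term-by-document count matrix (rows = terms, columns = gene documents)."""
    F = np.asarray(counts, dtype=float)
    n_docs = F.shape[1]
    if n_docs < 2:
        raise ValueError("at least two documents are needed for the entropy term")
    row_totals = F.sum(axis=1, keepdims=True)
    safe_totals = np.where(row_totals > 0, row_totals, 1.0)
    p = F / safe_totals                   # share of each term's occurrences per document
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])   # treat 0 * log(0) as 0
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)   # 0 = evenly spread, 1 = concentrated
    local_w = np.log1p(F)                 # dampens high raw frequencies
    return local_w * global_w[:, None]

# Toy counts (illustrative only): three terms across three gene documents.
counts = np.array([
    [2, 2, 2],   # spread evenly over all documents: carries no distinguishing power
    [3, 0, 0],   # confined to one document: highly distinguishing
    [1, 1, 0],   # intermediate case
])
W = log_entropy_weight(counts)
```

In this toy matrix the evenly spread term receives a weight of zero in every document, while the term confined to a single document keeps its full local weight.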

Term and document vectors for the LSI model deployed by SGO were generated by truncating the singular value decomposition (SVD) of the term-by-gene document matrix to s factors (i.e., only s columns of the orthogonal matrices U and V are used). LSI therefore produces a rank-reduced space in which to compare two gene documents at different conceptual levels. In practice, the maximum number of factors is limited by the number of documents in the collection. Fewer factors may be used for broad (more conceptual) comparisons, whereas a larger number of factors may be used for specific (more literal) comparisons. Other studies have demonstrated that for large document collections the optimal number of factors is approximately 300 (Landauer et al., 2004).
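
The effect of the number of factors can be seen in a small dense example. The sketch below is not the SGO implementation: it uses NumPy's dense SVD on a hypothetical seven-term, five-gene matrix with unweighted counts, and it represents each gene document by its row of V scaled by the retained singular values, which is one common convention. The names lsi_gene_vectors and cosine are illustrative. A real collection would call for a sparse solver.

```python
import numpy as np

def lsi_gene_vectors(term_by_gene, s):
    """Return one s-dimensional vector per gene document from a truncated SVD."""
    A = np.asarray(term_by_gene, dtype=float)
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    if not 1 <= s <= len(sigma):
        raise ValueError("s must be between 1 and the number of singular values")
    return Vt[:s].T * sigma[:s]   # keep s factors, scale by the singular values

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy matrix (illustrative only): rows are terms, columns are genes.
# Genes 0 and 2 share no term, but each shares a term with gene 1.
# Genes 3 and 4 form a separate group with their own vocabulary.
A = np.array([
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
])

broad = lsi_gene_vectors(A, s=2)     # few factors: conceptual comparison
literal = lsi_gene_vectors(A, s=5)   # all factors: literal comparison

latent = cosine(broad[0], broad[2])      # close to 1: related through gene 1
direct = cosine(literal[0], literal[2])  # close to 0: no shared terms
```

With two factors, genes 0 and 2 appear closely related although they share no terms, because both are linked through gene 1. With all five factors the comparison becomes literal and their cosine falls to zero.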