WebQTL
 

Literature Correlation

The literature correlations are calculated using the Semantic Gene Organizer (SGO). The SGO software uses a concept-based vector space model called latent semantic indexing (LSI) to automatically extract gene-gene relations from titles and abstracts in MEDLINE citations (Homayouni et al. 2005).

These LSI literature correlations are all positive and range from 0 to 1. They were computed in mid-2005 using the complete PubMed collection.

Vector space modeling is a classical information retrieval technique used to identify conceptually related documents, whereby the semantic structure of a document is represented as a vector in word space and the degree of similarity between documents is calculated by the angle between document vectors. LSI improves retrieval by using a singular value decomposition (or principal component analysis) to create a subspace of concepts in which text documents are represented as vectors.

Each gene is represented as a vector in word or concept space. The cosine of the angle between the query gene vector and every other gene vector is used to rank related genes. Cosine values lie between -1 and 1, where a value of 1 denotes the highest similarity.
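The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the SGO implementation: the gene names, the three-dimensional vectors, and the helper names `cosine` and `rank_related` are all hypothetical, and the choice to return 0 for a zero-length vector is an assumption made here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an empty vector as unrelated
    return dot / (norm_u * norm_v)


def rank_related(query, gene_vectors):
    """Return (gene, cosine) pairs, most similar to the query gene first."""
    q = gene_vectors[query]
    scores = [(g, cosine(q, v)) for g, v in gene_vectors.items() if g != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical concept-space vectors, made up for illustration only.
gene_vectors = {
    "GeneA": [1.0, 0.0, 0.0],
    "GeneB": [1.0, 1.0, 0.0],
    "GeneC": [0.0, 0.0, 1.0],
}

ranking = rank_related("GeneA", gene_vectors)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Here GeneB shares one dimension with the query and ranks ahead of GeneC, whose vector is orthogonal to it.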

An important advantage of LSI over other vector-based retrieval methods is that relations can be derived even if a direct link between genes has not been established in the literature. The fewer factors that are used for query matching, the more conceptual the relations, and vice versa. Therefore, genes may be conceptually related even if they have not been studied together directly. This utility of LSI makes it ideal for investigating the functional significance of gene associations identified in discovery-oriented genomic studies.

SGO literature correlation values may be used to rapidly identify both known and latent relations between co-regulated genes, based on the current literature.

Methods

Gene abstract documents were first compiled using titles and abstracts in MEDLINE citations cross-referenced for each mouse gene and its human and rat homologs. These gene documents were then assembled and parsed into a dictionary of terms (tokens) and weighted frequencies, which are required for the sparse term-by-gene document matrix. In effect, each gene document is viewed as a bag of words upon which operations can be performed. A number of different word weighting schemes can be used in vector space modeling (Baeza-Yates and Ribeiro-Neto, 1999). The aim of any scheme is to measure similarity within a document while at the same time measuring the dissimilarity of a gene document from the other gene documents. In SGO, we use a log entropy weighting scheme to decrease the weight of high frequency words, while giving distinguishing words higher weights (Berry and Browne, 1999).
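To make the weighting step concrete, here is a minimal sketch of one common log-entropy formulation: a local weight log2(1 + f) for a term that occurs f times in a gene document, multiplied by a global weight of 1 plus the term's normalized entropy across documents. The text above does not specify the exact variant or logarithm base used by SGO, so those choices, the function name `log_entropy_weights`, and the toy gene documents are assumptions for illustration.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """docs maps gene name -> list of tokens; returns gene -> {term: weight}."""
    n = len(docs)
    counts = {gene: Counter(tokens) for gene, tokens in docs.items()}

    # Total occurrences of each term across all gene documents.
    global_freq = Counter()
    for c in counts.values():
        global_freq.update(c)

    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n), with p_ij = f_ij / gf_i.
    global_weight = {}
    for term, gf in global_freq.items():
        entropy = 0.0
        for c in counts.values():
            f = c.get(term, 0)
            if f:
                p = f / gf
                entropy += p * math.log(p)
        if n > 1:
            global_weight[term] = 1.0 + entropy / math.log(n)
        else:
            global_weight[term] = 1.0  # assumption: single-document edge case

    # Final weight: local log weight times global entropy weight.
    return {
        gene: {t: math.log2(1 + f) * global_weight[t] for t, f in c.items()}
        for gene, c in counts.items()
    }


# Hypothetical bag-of-words gene documents, made up for illustration only.
docs = {
    "GeneA": ["protein", "synapse"],
    "GeneB": ["protein", "kinase"],
    "GeneC": ["protein", "synapse"],
}
weights = log_entropy_weights(docs)
```

In this toy collection, "protein" occurs equally in every document and receives a weight of essentially zero, "kinase" occurs in only one document and receives the highest weight, and "synapse" falls in between.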

Term and document vectors for the LSI model deployed by SGO were generated by truncating the singular value decomposition (SVD) of the term-by-gene document matrix to s factors (i.e., only s columns of the orthogonal matrices U and V are used). LSI therefore produces a rank-reduced space in which to compare two gene documents at different conceptual levels. In practice, the maximum number of factors is limited by the number of documents in the collection. Fewer factors may be used for broad (more conceptual) comparisons, whereas a larger number of factors may be used for specific (more literal) comparisons. Other studies have demonstrated that for large document collections the optimal number of factors is approximately 300 (Landauer et al., 2004).
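In symbols, the truncation can be written as follows. The notation is an assumption introduced here for illustration: A is the weighted term-by-gene document matrix with m terms and n gene documents, e_j is the j-th standard basis vector, and scaling the document coordinates by the singular values is one common LSI convention that the text above does not confirm for SGO.

```latex
A \approx A_s = U_s \Sigma_s V_s^{T}, \qquad
U_s \in \mathbb{R}^{m \times s},\;
\Sigma_s = \operatorname{diag}(\sigma_1, \dots, \sigma_s),\;
V_s \in \mathbb{R}^{n \times s},\;
s \le \min(m, n)

\hat{d}_j = \Sigma_s V_s^{T} e_j, \qquad
\cos\theta_{ij} = \frac{\hat{d}_i^{T} \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Choosing a smaller s keeps only the dominant singular directions, so two gene documents can have a high cosine in the reduced space even when they share few literal terms.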

For more information on SGO please refer to http://shad.cs.utk.edu/sgo

CITG Web services initiated January, 1994 as Portable Dictionary of the Mouse Genome; June 15, 2001 as WebQTL; and Jan 5, 2005 as GeneNetwork. This site is currently operated by Rob Williams, Pjotr Prins, Zachary Sloan, and Arthur Centeno. Design and code by Pjotr Prins, Zach Sloan, Arthur Centeno, Danny Arends, Christian Fischer, Sam Ockman, Lei Yan, Xiaodong Zhou, Christian Fernandez, Ning Liu, Rudi Alberts, Elissa Chesler, Sujoy Roy, Evan G. Williams, Alexander G. Williams, Kenneth Manly, Jintao Wang, Robert W. Williams, and colleagues.
GeneNetwork support from:
  • The UT Center for Integrative and Translational Genomics
  • NIGMS Systems Genetics and Precision Medicine project (R01 GM123489, 2017-2021)
  • NIDA NIDA Core Center of Excellence in Transcriptomics, Systems Genetics, and the Addictome (P30 DA044223, 2017-2022)
  • NIA Translational Systems Genetics of Mitochondria, Metabolism, and Aging (R01AG043930, 2013-2018)
  • NIAAA Integrative Neuroscience Initiative on Alcoholism (U01 AA016662, U01 AA013499, U24 AA013513, U01 AA014425, 2006-2017)
  • NIDA, NIMH, and NIAAA (P20-DA 21131, 2001-2012)
  • NCI MMHCC (U01CA105417), NCRR, BIRN, (U24 RR021760)