Date of Original Version



Conference Proceeding

Abstract or Description

The precise relationship between a primary protein sequence, its three-dimensional structure, and its function in a complex cellular environment is one of the most fundamental unanswered questions in biology. Unprecedented amounts of genomic and proteomic data create an opportunity for attacking the sequence-structure-function mapping problem with data-driven methods. The mapping of biological sequences to the form and function of proteins is conceptually similar to the mapping of words to meaning. This analogy is being studied by a growing body of research ([1] and references therein). Accordingly, n-gram analysis (statistical analysis of the co-occurrence of words in a text) has been applied to biological sequences, using various types of “vocabulary”, for example nucleotides and amino acids. Here, we investigate n-gram statistics in whole-genome sequences to address the following questions: How characteristic is the amino acid n-gram distribution for specific organisms? Do different organisms tend to use different “phrases”? What is the “meaning” of a rare sequence in a protein? The long-term goal is to provide a useful starting point for deriving language models with defined vocabulary, phrase preferences, and grammatical rules for protein sequences of different organisms.
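
As an illustration of the kind of n-gram statistics described above, the following minimal Python sketch counts overlapping amino acid n-grams in a set of protein sequences and compares two organisms. It is not the authors' implementation: the function names `count_ngrams` and `relative_frequencies` are hypothetical, the sequences are made-up toy strings rather than real proteomes, and treating each amino acid as a "word" with n-grams confined to a single protein is an assumption.

```python
from collections import Counter


def count_ngrams(sequences, n):
    """Count overlapping amino acid n-grams over a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width n along each protein; because every
        # sequence is processed separately, no n-gram spans two proteins.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw n-gram counts to relative frequencies (empty dict if no counts)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


# Toy sequences, invented for illustration only (not real proteome data).
organism_a = ["MKVLAAGIV", "MKKLLAAG"]
organism_b = ["MSTNPKPQR", "MSTPKKQR"]

freq_a = relative_frequencies(count_ngrams(organism_a, 2))
freq_b = relative_frequencies(count_ngrams(organism_b, 2))

# 2-grams ("phrases") observed in one toy organism but not in the other.
only_in_a = sorted(set(freq_a) - set(freq_b))
only_in_b = sorted(set(freq_b) - set(freq_a))
```

With real data, the sequence lists would be read from whole-proteome files, and the frequency tables for different organisms could then be compared to look for organism-specific or unusually rare n-grams.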