Mobydick

Motifs in biological sequence data can be defined as strings whose probability of occurrence greatly exceeds that expected for background. The problem is to decide what constitutes background and what the natural limits of a motif are, since large enough pieces of a motif will themselves show up in a list of improbable strings. An algorithm to resolve both issues has been constructed by analogy with the statistical mechanics of disordered systems and has been usefully applied to decode all the regulatory sequence in yeast.
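The naive form of this definition, before either issue is resolved, can be sketched in a few lines of Python. This is not the dictionary algorithm itself: it assumes the simplest possible background (independent letters with the frequencies seen in the input), and the function name, the `min_ratio` cutoff and the returned tuple layout are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import prod

def overrepresented_kmers(seq, k, min_ratio=2.0):
    """List k-letter strings seen more often than an independent-letter
    background predicts. Illustrative baseline only, not Mobydick."""
    n = len(seq)
    # Assumed background: each letter drawn independently with its
    # observed frequency in the input sequence.
    letter_freq = {c: cnt / n for c, cnt in Counter(seq).items()}
    windows = n - k + 1
    counts = Counter(seq[i:i + k] for i in range(windows))
    results = []
    for kmer, observed in counts.items():
        expected = windows * prod(letter_freq[c] for c in kmer)
        ratio = observed / expected
        if ratio >= min_ratio:
            results.append((kmer, observed, expected, ratio))
    # Most over-represented strings first.
    results.sort(key=lambda r: r[3], reverse=True)
    return results
```

Run on real text, a list like this tends to contain fragments of frequent words alongside the words themselves, which is the ambiguity about motif limits described above.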

The algorithm was tested on the eponymous novel by Melville. Random letters were inserted between the words and the result was reduced to a string of lowercase letters. The code was then asked to recover the English dictionary (or the subset used by Melville, which was substantial). A sampling of the dictionaries that were created as longer and longer strings were searched is shown as plain text files.
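A minimal sketch of how such a test string might be prepared is given below. The number of random letters inserted at each word boundary is not stated here, so the `max_insert` parameter, the seed and the function name are assumptions made for illustration.

```python
import random
import string

def scramble_text(text, max_insert=3, seed=0):
    """Lowercase the text, drop everything but a-z, and put random
    letters between consecutive words.

    Assumption: 1..max_insert filler letters per word boundary; the
    actual number used in the original test is not given.
    """
    rng = random.Random(seed)
    words = []
    for raw in text.split():
        letters = "".join(ch for ch in raw.lower() if ch in string.ascii_lowercase)
        if letters:
            words.append(letters)
    pieces = words[:1]
    for word in words[1:]:
        n_fill = rng.randint(1, max_insert)
        filler = "".join(rng.choice(string.ascii_lowercase) for _ in range(n_fill))
        pieces.append(filler)
        pieces.append(word)
    return "".join(pieces)
```

Calling `scramble_text("Call me Ishmael.")` yields a single run of lowercase letters that begins with `call` and ends with `ishmael`, with the word boundaries hidden by filler.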

  1. Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis H. Bussemaker, H. Li, and E.D. Siggia, PNAS 97 10096-10100 (2000).

  2. Regulatory Element Detection using a Probabilistic Segmentation Model H.J. Bussemaker, H. Li, E.D. Siggia, Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, Aug 2000, R. Altman et al. (eds), AAAI Press, Menlo Park.

Enteric Bacteria (E.coli and relatives)

There are intrinsic limits to what can be inferred from a single genome by probabilistic methods. The cell classifies sequence motifs with proteins whose DNA binding specificity we cannot calculate. Given only sequence, we have to cluster similar patterns together, which for sparse data is much harder. To circumvent this limitation, we do what the cell cannot do, namely compare the regulation of homologous genes from related organisms. Mathematically this provides more samples from the same distribution and thus makes clusters visible. Here is a compilation of Inferred E.coli Regulons.

There are approximately 10 sequenced species of enteric bacteria that are close enough to E.coli to share regulatory motifs. We have designed algorithms to measure how fast minimally constrained regulatory sequence evolves and then, relative to this rate, to quantify the significance of motifs that evolved less rapidly. The transcription factors themselves evolve at a rate determined by the number of genes they regulate. The results from our Genome Research paper are displayed here: E.coli Regulatory Comparisons

  1. The Evolution of DNA Regulatory Regions for Proteo-gamma Bacteria by Interspecies Comparisons N. Rajewsky, N. Socci, M. Zapotocky and E.D. Siggia, Genome Research 12 298-308 (2002).

  2. Probabilistic Clustering of Sequences: Inferring new bacterial regulons by comparative genomics E. van Nimwegen, M. Zavolan, N. Rajewsky, E.D. Siggia, PNAS 99 7323-8 (2002).

  3. Identification of the binding sites of Regulatory Proteins in Bacterial Genomes H. Li, V. Rhodius, C. Gross, and E.D. Siggia, PNAS 99 11772-7 (2002).


Gram Positive Bacteria

B.subtilis is the second most intensively studied bacterium, and it was of interest to apply the algorithms we developed for E.coli to it. Because of its proximity to B.anthracis, there is now a cluster of related genomes on which to explore comparative algorithms. More distant species such as the Streptococcaceae and Staphylococcus aureus have become antibiotic resistant and are thus a serious medical problem, but they provide interesting data for evolutionary studies.

Genome wide identification of regulatory motifs in Bacillus subtilis M.M. Mwangi and E.D. Siggia, BMC Bioinformatics 4 18, 2003.


Patterning Fly Embryos

There has been a very productive convergence between evolutionary biology and development around the idea that most evolutionary novelty is due to changes in the regulation of existing genes rather than the production of new genes. Our understanding of regulatory evolution will progress in tandem with better algorithms to recognize and parse regulatory sequence. In collaboration with Ulrike Gaul's lab, we are testing algorithms that enable us to identify cis-regulatory modules (~500 bp regions with binding sites for multiple factors) in the fly genome using collections of known binding sites. Alternatively, binding motifs can be found from intervals of sequence that are known to be functional. One key test is the segmentation gene hierarchy, a prototype of combinatorial control, where we have been quite successful in finding new blastoderm patterned genes and new binding motifs: Ahab.
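The idea of a module as a ~500 bp region dense in sites for several factors can be illustrated with a simple window scan. The sketch below is not Ahab, which is not described in detail here. It assumes that candidate sites have already been located and are supplied as hypothetical `(position, factor)` pairs, and the thresholds `min_sites` and `min_factors` are arbitrary illustrative defaults.

```python
def dense_windows(sites, window=500, min_sites=3, min_factors=2):
    """Report stretches shorter than `window` bp that hold at least
    `min_sites` sites from at least `min_factors` distinct factors.

    `sites` is an iterable of (position, factor_name) pairs. This is a
    plain counting heuristic for illustration, not a probabilistic model.
    """
    ordered = sorted(sites)
    hits = []
    start = 0
    for end, (pos, _) in enumerate(ordered):
        # Advance the left edge until the span fits inside one window.
        while pos - ordered[start][0] >= window:
            start += 1
        in_window = ordered[start:end + 1]
        factors = {name for _, name in in_window}
        if len(in_window) >= min_sites and len(factors) >= min_factors:
            hits.append((ordered[start][0], pos, len(in_window), sorted(factors)))
    return hits
```

Each hit is a tuple of first site position, last site position, number of sites and the factors involved. Overlapping hits are reported separately, so a real pipeline would still need to merge and score them.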

More recent work uses both sequenced Drosophila genomes in the search and as a byproduct can screen for homologous regulatory modules that have changed between the two species. A more challenging task will be to dissect the regulatory cascade that gives rise to glial cells, a case where there is a known master regulator (Gcm) but with very few direct targets.

Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo N. Rajewsky, M. Vergassola, U. Gaul and E.D. Siggia, BMC Bioinformatics 3 30, 2002.

Transcriptional Control in the Segmentation Gene Network of Drosophila Mark D. Schroeder, Michael Pearce, John Fak, HongQing Fan, Ulrich Unnerstall, Eldon Emberly, Nikolaus Rajewsky, Eric D. Siggia, and Ulrike Gaul, PLoS Biology 2 e271, 2004.