CENTIPEDE: Transcription factor footprinting and binding site prediction

CENTIPEDE applies a hierarchical Bayesian mixture model to infer which regions of the genome are bound by particular transcription factors (TFs). It starts from a set of candidate binding sites (e.g., sites that match a given position weight matrix (PWM)) and then classifies each site as bound or not bound by the TF. CENTIPEDE is an unsupervised learning algorithm: it discriminates between these two types of motif instances (bound and unbound) using as much relevant information as possible.
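To make the idea of unsupervised two-class classification concrete, here is a minimal sketch in Python. It is not the CENTIPEDE model and does not use the R package: it fits a plain two-component Poisson mixture by EM to made-up read counts at candidate motif sites, and it ignores everything else the full model in [1] draws on. The function names, the toy counts, and the choice of a Poisson likelihood are all assumptions made for illustration.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k reads under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_two_class_mixture(counts, n_iter=50):
    """Fit a two-component Poisson mixture by EM (illustrative only).

    Returns (posteriors, rate_unbound, rate_bound, prior_bound), where
    posteriors[i] is the estimated probability that site i is bound.
    """
    n = len(counts)
    mean = sum(counts) / n
    # Start the two rates on either side of the overall mean.
    rate_unbound, rate_bound = max(0.5 * mean, 1e-3), max(2.0 * mean, 1e-2)
    prior_bound = 0.5
    posteriors = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each site belongs to the
        # "bound" class, computed in log space for numerical stability.
        posteriors = []
        for k in counts:
            log_b = math.log(prior_bound) + poisson_logpmf(k, rate_bound)
            log_u = math.log(1.0 - prior_bound) + poisson_logpmf(k, rate_unbound)
            top = max(log_b, log_u)
            w_b, w_u = math.exp(log_b - top), math.exp(log_u - top)
            posteriors.append(w_b / (w_b + w_u))
        # M-step: re-estimate the class prior and the two rates from the
        # posterior-weighted counts (clamped to keep the logs finite).
        total_b = sum(posteriors)
        prior_bound = min(max(total_b / n, 1e-6), 1.0 - 1e-6)
        weighted_b = sum(p * k for p, k in zip(posteriors, counts))
        weighted_u = sum((1.0 - p) * k for p, k in zip(posteriors, counts))
        rate_bound = max(weighted_b / max(total_b, 1e-12), 1e-6)
        rate_unbound = max(weighted_u / max(n - total_b, 1e-12), 1e-6)
    return posteriors, rate_unbound, rate_bound, prior_bound


# Hypothetical read counts around 17 candidate motif sites (made-up data).
toy_counts = [0, 1, 2, 1, 0, 3, 2, 1, 1, 0, 2, 4, 35, 42, 38, 51, 47]
posteriors, rate_unbound, rate_bound, prior_bound = fit_two_class_mixture(toy_counts)

# Report sites whose posterior probability of being bound exceeds 0.99.
called_bound = [i for i, p in enumerate(posteriors) if p > 0.99]
print(called_bound)
```

No site is labeled in advance: the two classes emerge from the data, and each site receives a posterior probability that can then be thresholded.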

[1] Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. "Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data", Genome Research. 2011 Mar;21(3):447-55.

  R Package

The CENTIPEDE R package can be downloaded from R-Forge (a binary package is available only for the latest R version):

install.packages("CENTIPEDE", repos="http://R-Forge.R-project.org")

Some very basic instructions to get started are [here]

If the automatic installation from R-Forge does not work, the package can be manually downloaded from here.

  Transcription factor map for lymphoblastoid cell-lines

Here we report the map that we generated for our paper [1]. The raw sequencing reads are available in GEO for two of the LCL lines for which we generated DNase-I data (GSE25341), and for one additional ENCODE cell-line (GM12878, GSE19622) generated by the Crawford group.

CENTIPEDE map with all TF-bound sites predicted on LCLs (posterior probability > 0.99):

(Note that it may take up to a minute to load the data into the UCSC Genome Browser.)

DNase-I footprints and PWMs estimated by CENTIPEDE (grouped by overlap):

  • All TRANSFAC and JASPAR PWMs that passed the conservation filter and all novel PWMs estimated from 10-mer words: Link
  • All TRANSFAC and JASPAR PWMs with all 10-mer words (both novel and repeating known PWMs) that passed the conservation filter: Link

Co-localization of motif binding sites predicted by CENTIPEDE:

  • All TRANSFAC and JASPAR PWMs that passed the conservation filter and all novel PWMs estimated from 10-mer words (ordered by direct overlap so that diagonal boxes represent self co-occurrence): Link

  Application to 15 ENCODE cell-lines

We also generated maps for 15 cell-lines from data produced by the Greg Crawford lab for the ENCODE project. Please check the ENCODE Consortium Data Release Policy if you plan to use any of the transcription factor binding maps derived by CENTIPEDE:

Browse results here

  Acknowledgments

  • This work has been supported by grants from the National Institutes of Health, by the Howard Hughes Medical Institute, by the Chicago Fellows Program, by the American Heart Association, and by the NIH Genetics and Regulation Training grant.
  • We also thank the ENCODE Project, supported by NHGRI, for making data available pre-publication (in particular the Bernstein, Crawford, Myers and Snyder groups and the UCSC Genome Browser).
  • We thank Greg Crawford for assistance in constructing our DNase-I libraries.
  • We thank other members of the Pritchard, Przeworski and Stephens labs for helpful comments and discussions.

  Contact information
  For questions related to the CENTIPEDE algorithm, the R-package, or the generated maps, please contact the maintainers: