CENTIPEDE: Transcription factor footprinting and binding site prediction
CENTIPEDE applies a hierarchical Bayesian mixture model to infer which regions of the genome are bound by particular transcription factors (TFs). It starts from a set of candidate binding sites, e.g., sites that match a given position weight matrix (PWM), and then classifies each site as bound or not bound by the TF. CENTIPEDE is an unsupervised learning algorithm: it discriminates between these two types of motif instances using as much relevant information as possible.
[1] Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. "Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data". Genome Research. 2011 Mar;21(3):447-55.
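To make the bound / not-bound classification concrete, here is a minimal sketch in base R. It is a toy two-component Poisson mixture fitted by EM on simulated cut counts, not the CENTIPEDE model itself (which is hierarchical and uses more information per site). The simulated data, the function name fit_two_poisson, and all parameter values are assumptions made for illustration only.

```r
# Toy illustration only: a two-component Poisson mixture, NOT the CENTIPEDE model.
set.seed(1)

# Hypothetical data: total DNase-I cuts around 500 candidate motif sites,
# about 30% of which are simulated as bound (higher cut counts).
n_sites  <- 500
is_bound <- rbinom(n_sites, 1, 0.3)
cuts     <- rpois(n_sites, ifelse(is_bound == 1, 25, 3))

# EM for a mixture of two Poisson distributions (hypothetical helper).
fit_two_poisson <- function(x, n_iter = 200, tol = 1e-8) {
  pi_bound <- 0.5                                   # prior probability of "bound"
  lambda   <- c(unbound = mean(x) / 2, bound = mean(x) * 2)
  post     <- rep(0.5, length(x))
  for (i in seq_len(n_iter)) {
    # E-step: posterior probability that each site is bound (computed on the log scale)
    log_b    <- log(pi_bound)     + dpois(x, lambda[["bound"]],   log = TRUE)
    log_u    <- log(1 - pi_bound) + dpois(x, lambda[["unbound"]], log = TRUE)
    new_post <- 1 / (1 + exp(log_u - log_b))
    # M-step: update the mixing proportion and the two Poisson rates
    pi_bound            <- mean(new_post)
    lambda[["bound"]]   <- sum(new_post * x) / sum(new_post)
    lambda[["unbound"]] <- sum((1 - new_post) * x) / sum(1 - new_post)
    converged <- max(abs(new_post - post)) < tol
    post      <- new_post
    if (converged) break
  }
  list(posterior = post, pi_bound = pi_bound, lambda = lambda)
}

fit <- fit_two_poisson(cuts)

# Call a site "bound" only when the posterior is very high,
# mirroring the > 0.99 cutoff used for the maps reported on this page.
called_bound <- fit$posterior > 0.99
```

No labels are used during fitting, which is what "unsupervised" means here: is_bound is kept only so that the calls can be compared with the simulated truth afterwards, e.g. with table(called_bound, is_bound).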
The CENTIPEDE R package can be installed from R-Forge (a binary package is only available for the latest R version):
# Install from the R-Forge repository
install.packages("CENTIPEDE", repos="http://R-Forge.R-project.org")
# Load the package and run the example shipped with fitCentipede
library(CENTIPEDE)
example(fitCentipede)
Some very basic instructions to get started are [here].
If the automatic installation from R-Forge does not work, the package can be downloaded manually from here.
Transcription factor map for lymphoblastoid cell-lines
Here we report the map that we generated for our paper [1]. The raw sequencing reads are available in GEO for two of the LCLs for which we generated DNase-I data (GSE25341) and for one additional ENCODE cell line (GM12878, GSE19622) generated by the Crawford group.
CENTIPEDE map with all TF-bound sites predicted on LCLs (posterior probability > 0.99):
(Note that it may take up to a minute to load the data into the UCSC browser.)
DNase-I footprints and PWMs estimated by CENTIPEDE (grouped by overlap):
- All TRANSFAC and JASPAR PWMs that passed the conservation filter and all novel PWMs estimated from 10-mer words:
Link
- All TRANSFAC and JASPAR PWMs with all 10-mer words (both novel and repeating known PWMs) that passed the conservation filter:
Link
Co-localization of motif binding sites predicted by CENTIPEDE:
- All TRANSFAC and JASPAR PWMs that passed the conservation filter and all novel PWMs estimated from 10-mer words (ordered by direct overlap so that diagonal boxes represent self co-occurrence):
Link
Application to 15 ENCODE cell-lines
- This work has been supported by grants from the National Institutes of Health, by the Howard Hughes Medical Institute, by the Chicago Fellows Program, by the American Heart Association, and by the NIH Genetics and Regulation Training grant.
- We also thank the ENCODE Project, supported by NHGRI, for making data available pre-publication (in particular the Bernstein, Crawford, Myers and Snyder groups and the UCSC Genome Browser).
- We thank Greg Crawford for assistance in constructing our DNaseI libraries.
- We thank other members of the Pritchard, Przeworski and Stephens labs for helpful comments and discussions.
Links to related resources
For questions related to the CENTIPEDE algorithm, the R-package, or the generated maps, please contact the maintainers: