This web site is organized into two levels (steps):
  1. Sequence data input/creation. In this step, the user creates the sequence data file in the appropriate format.
  2. Programs to analyse input sequences and display results. In this step, the sequences are input in the correct format (either from Step (1) or directly), and various analysis programs are run on the sequences.

Project Name: This is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed). It identifies the folder where the sequence data and its associated results are stored.

Sequence file Name: This is an arbitrary name of the user's choosing (the Project Name restrictions apply) that identifies the current set of sequences being input, e.g. the regulatory sequences for all the genes controlled by a common set of factors should reside in one sequence file. A sequence file (and its associated results) resides in a project.
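The naming restriction above can be checked before submitting a form. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits only; the function name is hypothetical and the check the site actually performs may differ.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric.
_NAME_RE = re.compile(r"[A-Za-z0-9]+")


def is_valid_name(name: str) -> bool:
    """Return True if name is non-empty and contains only letters and digits."""
    return _NAME_RE.fullmatch(name) is not None
```

Under this assumption a name such as "demo" passes, while "my project" (contains a space) and "run_1" (contains an underscore) do not.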

Input protein sequence(s) in FASTA format. Multiple sequences are allowed but uploading >5 sequences may lead to a server timeout.

File format: Please provide protein sequences in FASTA format.

Pre-computed sequence ids: We have pre-computed domain matches for all DNA-binding proteins from several organisms. Using protein sequence id lookups is more computationally efficient than the ab initio calculations with user-uploaded protein sequences.
Organism Name    Website                                                            Local FASTA file
A.gossypii       http://agd.unibas.ch                                               FASTA
N.crassa         http://www.broad.mit.edu/annotation/fungi/fgi/                     FASTA
U.maydis         http://www.broad.mit.edu/annotation/fungi/fgi/                     FASTA
S.pombe          http://www.sanger.ac.uk/Projects/S_pombe/                          FASTA
Y.lipolytica     http://cbi.labri.fr/Genolevures/index.php                          FASTA
C.glabrata       http://cbi.labri.fr/Genolevures/index.php                          FASTA
C.albicans       http://genolist.pasteur.fr/CandidaDB/                              FASTA
D.hansenii       http://cbi.labri.fr/Genolevures/index.php                          FASTA
K.lactis         http://cbi.labri.fr/Genolevures/index.php                          FASTA
A.nidulans       http://www.broad.mit.edu/annotation/fungi/fgi/                     FASTA
K.waltii         http://www.broad.mit.edu/seq/YeastDuplication/                     FASTA
S.bayanus        ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
S.castellii      ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
S.cerevisiae     ftp://genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence  FASTA
S.kluyveri       ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
S.kudriavzevii   ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
S.mikatae        ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
S.paradoxus      ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes    FASTA
D.discoideum     http://dictybase.org/                                              FASTA
A.gambiae        http://www.ensembl.org                                             FASTA
D.melanogaster   http://flybase.bio.indiana.edu                                     FASTA
D.pseudoobscura  http://flybase.bio.indiana.edu                                     FASTA
H.sapiens        http://www.ensembl.org                                             FASTA
M.musculus       http://www.ensembl.org                                             FASTA

Max number of structural homologs to be displayed for each domain match.

E value for HMMER: use a lower cutoff to discard less significant matches.
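To illustrate the effect of the cutoff, here is a small sketch that filters a list of domain matches by E-value. The (domain, E-value) pairs and the function name are hypothetical, and whether the site treats the cutoff as inclusive is an assumption.

```python
def filter_by_evalue(matches, cutoff):
    """Keep (domain, evalue) pairs whose E-value does not exceed the cutoff.

    Assumption: the cutoff is inclusive. Results are ordered from the most
    significant (smallest E-value) to the least significant match.
    """
    kept = [m for m in matches if m[1] <= cutoff]
    return sorted(kept, key=lambda m: m[1])
```

Lowering the cutoff from 1.0 to 0.01 in this sketch would drop a match with E-value 0.5 but keep one with E-value 1e-3.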

Detailed score report:

  1. Sequence-structure: displays query sequence - template structure alignment scores sorted by the total interface score (sc.tot). The entries are:
       N: internal numbering (please disregard).
       1abc_D: the Protein Data Bank id followed by the one-letter protein chain id.
       [r1,r2]: the residue (amino acid) numbers of the HMMER match to the PDB protein chain (all residue numbers are one-based).
       ID.m.all: the number of mutated residues in the domain alignment.
       ID.all: the length of the domain alignment (= r2-r1+1).
       ID.m.bb: the number of mutated residues which contact the DNA phosphate backbone/sugar ring.
       ID.bb: the total number of phosphate backbone/sugar ring contacts.
       ID.m.sc: the number of mutated residues which contact DNA base(s).
       ID.sc: the total number of DNA base contacts.
       sc.all, sc.bb and sc.sdch: the PET91 amino acid substitution scores for the whole domain, phosphate backbone contacts and DNA base contacts, respectively.
       sc.tot = 0.5*sc.bb + 0.5*sc.sdch: the total template score.
  2. Orthologs: displays query sequence - ortholog sequence alignment scores sorted by the total interface score (sc.tot). ortID is the sequence id of the ortholog, [r1,r2] are the residue (amino acid) numbers of the HMMER match to the orthologous protein chain (all residue numbers are one-based). All other entries are explained in section 1, but here refer to the alignment between the query protein and the orthologous protein at the interface defined by the structural template.
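The total score formula from the report can be reproduced by hand. The sketch below computes sc.tot from sc.bb and sc.sdch and orders report rows by it; the dictionary layout and function names are hypothetical, and sorting from highest to lowest score is an assumption based on higher scores indicating better matches.

```python
def total_template_score(sc_bb: float, sc_sdch: float) -> float:
    """sc.tot = 0.5*sc.bb + 0.5*sc.sdch, as defined in the score report."""
    return 0.5 * sc_bb + 0.5 * sc_sdch


def rank_by_total_score(rows):
    """Sort report rows (dicts with 'sc.bb' and 'sc.sdch') by sc.tot.

    Assumption: best (highest) score first.
    """
    return sorted(
        rows,
        key=lambda r: total_template_score(r["sc.bb"], r["sc.sdch"]),
        reverse=True,
    )
```

For example, a template with sc.bb = 80 and sc.sdch = 60 has sc.tot = 70.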

PDB information: Information about the structure from the Protein Data Bank.

PDB file: biological unit coordinates available for download in the PDB format.

Weight matrix: Structure-based position specific weight matrix (PWM) prediction. We use the following format for these (e.g.):

>1dgc  10
   0.400	   0.200	   0.200	   0.200	
   0.062	   0.062	   0.062	   0.812	
   0.188	   0.188	   0.438	   0.188	
   0.588	   0.138	   0.138	   0.138	
   0.087	   0.738	   0.087	   0.087	
   0.087	   0.087	   0.738	   0.087	
   0.138	   0.138	   0.138	   0.588	
   0.188	   0.438	   0.188	   0.188	
   0.812	   0.062	   0.062	   0.062	
   0.200	   0.200	   0.200	   0.400	
<
The columns correspond to ACGT. Each row sums to one (up to rounding). The prediction is made using the number of amino acid - DNA base atomic contacts as a measure of binding specificity of the consensus base (Morozov, A.V., Havranek, J.J., Baker, D. and Siggia, E.D. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005 Oct 24; 33(18): 5781-5798).
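A downloaded weight matrix in this format can be read with a few lines of code. The sketch below is one possible reader, not the site's own parser: it assumes the block starts with a '>' header line whose first field is the matrix name, that any further header fields (such as the 10 in the example, which appears to be the number of rows) can be ignored, and that a line starting with '<' closes the block.

```python
def parse_pwm(text):
    """Parse one '>name ...' / rows / '<' block into (name, rows).

    Each row is a list of four floats in A, C, G, T order.
    """
    name, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
        elif line.startswith("<"):
            break  # end of the matrix block
        else:
            row = [float(x) for x in line.split()]
            if len(row) != 4:
                raise ValueError("expected 4 values (A, C, G, T) per row")
            rows.append(row)
    return name, rows


def rows_sum_to_one(rows, tol=0.01):
    """Check that every row sums to one within a rounding tolerance."""
    return all(abs(sum(row) - 1.0) <= tol for row in rows)
```

The tolerance is needed because the values are printed with three decimals, so a row such as 0.062 0.062 0.062 0.812 sums to 0.998 rather than exactly one.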

WebLogo image of the structure-based PWM prediction.

Protein viewer: View biological unit PDBs using a Jmol molecular viewer. WARNING: your browser cache should be set to a large value (e.g. 100 MB) for the viewer to work efficiently.

Start: first residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).

End: last residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).

Score: Template score (same as sc.tot in the full score report). The score is in the [0,100] range, with higher values indicating a better match.

Method: Experimental method by which the structure was solved: X-ray diffraction (X-RAY) or nuclear magnetic resonance (NMR).

Resolution: Resolution of X-ray diffraction (if applicable).

  1. Alignment of the query protein sequence to the protein sequence from the structure. First line: HMMER consensus sequence for the protein domain, second line: query sequence, third line: sequence from the structure.
  2. Alignment of the query protein sequence to the orthologous protein sequence and the protein sequence from the structure. First line: HMMER consensus sequence for the protein domain, second line: query sequence, third line: ortholog protein sequence, fourth line: sequence from the structure.
'b' symbols mark phosphate backbone/sugar ring contacts, 's' symbols mark DNA base contacts. All contacts are computed using a 4.5 Å distance cutoff between amino acid and DNA atoms.
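The contact definition can be sketched as a simple distance test. In the example below the coordinate layout, the "backbone"/"base" labels and the function name are all hypothetical; treating a distance exactly equal to the cutoff as a contact is also an assumption.

```python
import math

CUTOFF = 4.5  # distance cutoff in Angstroms, as stated above


def contact_symbols(residue_atoms, dna_atoms, cutoff=CUTOFF):
    """Return the set of contact symbols ('b' and/or 's') for one residue.

    residue_atoms: list of (x, y, z) tuples for the amino acid's atoms.
    dna_atoms: list of (kind, (x, y, z)) where kind is "backbone"
               (phosphate backbone/sugar ring) or "base".
    """
    symbols = set()
    for kind, dna_xyz in dna_atoms:
        if any(math.dist(xyz, dna_xyz) <= cutoff for xyz in residue_atoms):
            symbols.add("b" if kind == "backbone" else "s")
    return symbols
```

A residue with an atom 3 Å from a backbone atom and 10 Å from the nearest base atom would be marked 'b' only.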

Max number of orthologous proteins to be displayed for each domain match in the query sequence and each structural template.

Ortholog id: sequence id of the orthologous protein.

Motif Name: this is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed) that identifies the current motif information being input.

Motif: a probabilistic binding profile defined either by the alignment of binding sites or through a position-specific weight matrix (PWM).

Weight matrix should be in the following format:

>pwm1
   0.400	   0.200	   0.200	   0.200	
   0.062	   0.062	   0.062	   0.812	
   0.188	   0.188	   0.438	   0.188	
   0.588	   0.138	   0.138	   0.138	
   0.087	   0.738	   0.087	   0.087	
   0.087	   0.087	   0.738	   0.087	
   0.138	   0.138	   0.138	   0.588	
   0.188	   0.438	   0.188	   0.188	
   0.812	   0.062	   0.062	   0.062	
   0.200	   0.200	   0.200	   0.400	
<
The columns correspond to ACGT. The rows do not have to sum up to one - if the absolute counts are provided they will be normalized.

Pseudo Count: The value will be added to all the counts from the sequence alignment. The default value is zero. Pseudocounts are often used to allow for a small probability of any DNA base at every position in the alignment.
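The effect of a pseudocount is easiest to see on a small example. The sketch below adds the pseudocount to every count and then normalizes each row so that it sums to one; the function name is hypothetical and this is not necessarily how the server performs the calculation internally.

```python
def counts_to_frequencies(counts, pseudocount=0.0):
    """Convert per-position base counts (A, C, G, T) to frequencies.

    The pseudocount is added to every count before normalization, so that
    no base ends up with probability zero when pseudocount > 0.
    """
    freqs = []
    for row in counts:
        padded = [c + pseudocount for c in row]
        total = sum(padded)
        if total <= 0:
            raise ValueError("each position needs a positive total count")
        freqs.append([v / total for v in padded])
    return freqs
```

With counts of 4, 0, 0, 0 at a position, a pseudocount of zero gives frequencies 1, 0, 0, 0, whereas a pseudocount of one gives 0.625, 0.125, 0.125, 0.125.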

Upload aligned binding site sequences: both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. Two formats are allowed:

>aln1
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
OR
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt

Multiple binding site alignment should be in the following format: one site per line, both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. All sequences should be of the same length (counting gaps):

AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
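An alignment in this format can be turned into per-position base counts as follows. This is a sketch under stated assumptions: gap characters are assumed to contribute no count at their position (the site may treat them differently), an optional '>' header line is skipped, and the function name is hypothetical.

```python
BASES = "ACGT"


def alignment_to_counts(lines):
    """Count A, C, G, T at each alignment position (one site per line)."""
    sites = [ln.strip() for ln in lines]
    sites = [s for s in sites if s and not s.startswith(">")]
    if not sites:
        raise ValueError("no binding sites found")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length (counting gaps)")
    counts = [[0, 0, 0, 0] for _ in range(length)]
    for site in sites:
        for pos, ch in enumerate(site.upper()):
            if ch == "-":
                continue  # assumption: gaps add no count
            if ch not in BASES:
                raise ValueError("unexpected symbol: " + ch)
            counts[pos][BASES.index(ch)] += 1
    return counts
```

For the four-site example above, the first position yields one count for each base, while the second position has no T count because the fourth site has a gap there.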

Parameters for finding the best matched motifs:

Offset: a non-negative integer that lets the user compute p-values for partial PWM matches. The minimum allowed length of the PWM to PWM match is computed as min(L1,L2) - offset, where L1 and L2 are the lengths of the two PWMs being compared. For example, if we compare a PWM of length 6 to the structure-based PWM of length 8, 3 alignments will be tried with offset=0:

111111
22222222

 111111
22222222

  111111
22222222
When offset is >0, we test the partial alignments as well. For example, with offset=1, the alignments
111111
 22222222

   111111
22222222
also become allowed. For all valid alignments, the lowest p-value (Bonferroni corrected for the number of tested alignments) is displayed along with the PWM alignment that yields it (in actual calculations we test both the PWM and its reverse complement). Here, the p-value is defined as the probability that the two PWMs are not correlated.
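The alignment rule can be checked with a short sketch. Here a shift is the position of the first PWM's first row relative to the second PWM (negative values mean it overhangs on the left). The function names are hypothetical, and whether the number of tested alignments used in the Bonferroni correction includes the reverse-complement comparisons is not stated, so the correction simply multiplies by the number of p-values it is given.

```python
def allowed_shifts(len1, len2, offset=0):
    """List shifts of PWM 1 against PWM 2 whose overlap is long enough.

    An alignment is allowed if its overlap is at least min(len1, len2) - offset.
    """
    min_overlap = max(min(len1, len2) - offset, 1)
    shifts = []
    for shift in range(-(len1 - 1), len2):
        overlap = min(shift + len1, len2) - max(shift, 0)
        if overlap >= min_overlap:
            shifts.append(shift)
    return shifts


def reverse_complement(rows):
    """Reverse-complement a PWM whose columns are in A, C, G, T order."""
    return [row[::-1] for row in reversed(rows)]


def bonferroni_min(p_values):
    """Smallest p-value multiplied by the number of tests, capped at 1."""
    return min(1.0, min(p_values) * len(p_values))
```

For PWMs of lengths 6 and 8, this gives the three shifts 0, 1, 2 with offset=0 and the five shifts -1, 0, 1, 2, 3 with offset=1, matching the diagrams above.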

WebLogo image of the PWM uploaded by the user.

Protein sequences in FASTA format: The user may upload multiple sequences in FASTA format. WARNING: Uploading more than 5 sequences at a time may result in a server timeout!

>seqID1
sequence1
>seqID2
sequence2
>seqID3
sequence3
...
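Because of the timeout warning, it can help to split a large FASTA file into smaller uploads. The following sketch parses FASTA text and groups the records into batches of at most five; the function names are hypothetical and the batch size simply mirrors the warning above.

```python
MAX_SEQUENCES_PER_UPLOAD = 5  # taken from the timeout warning above


def parse_fasta(text):
    """Parse FASTA text into a list of (seq_id, sequence) pairs."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id, chunks = line[1:].strip(), []
        elif seq_id is not None:
            chunks.append(line)  # sequences may span several lines
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def batches(records, size=MAX_SEQUENCES_PER_UPLOAD):
    """Split records into consecutive groups of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]
```

Seven sequences would be split into one batch of five and one batch of two, which can then be uploaded separately.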

User id: A user's machine is assigned a unique identifier when the user first accesses the web site from that machine. This identifier is displayed at the top of the web page and is also stored on the user's machine as a "cookie". If the user plans to access their data from other machines, or to delete the machine's cookies in the future, this identifier should be written down. To present their user id to the site, and thus gain access to all the data and results associated with it, the user must choose the "authenticate" link on the welcome page and enter the user id. The user id has the form '0.dddddddddddddd', i.e. 14 digits preceded by '0.'; only ids of this form are recognized by the system.
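Before entering a user id on the "authenticate" page, its form can be checked with a short sketch such as the one below. The function name is hypothetical; the pattern follows the description above ('0.' followed by exactly 14 digits).

```python
import re

# '0.' followed by exactly 14 decimal digits, e.g. 0.12345678901234
_USER_ID_RE = re.compile(r"0\.[0-9]{14}")


def is_valid_user_id(user_id: str) -> bool:
    """Return True if user_id has the form 0.dddddddddddddd (14 digits)."""
    return _USER_ID_RE.fullmatch(user_id) is not None
```

This only checks the form of the id; it cannot tell whether the id actually exists on the server.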

Quick Start.

FAQ:

How do I make Jmol work with Firefox on Linux?
You need to install Java and make sure that there is a symbolic link to the Java plugin in the firefox/plugins directory. For example, on my machine I ran the following command from within that directory:

ln -s /usr/java/j2re1.5.0/plugin/i386/ns7/libjavaplugin_oji.so
to create the link. Please restart Firefox after the plugin is installed.

How do I download the data from my runs?
All data from the runs is stored locally in a directory whose name is the concatenation of the project name (defaults to "demo") and the user id (the random number assigned by the computer to your project). Some of the files in this directory are named after the motif name or the sequence file name you provided (defaults to "test"), while the others have fixed names. When you click on the "Download" button, the files created in your directory are packed into a TAR archive, gzipped, and made available for download.

Is my data private?
A random number is created the first time you access the Protein-DNA Explorer from your computer, appended to your project name, and saved in a cookie by your browser. As long as you are using the same computer you will get the same random number. If you are sharing the computer with other users, you should delete the cookie named 'protein_dna' in your browser.