This web site is organized into two levels (steps):

1. Sequence data input/creation. In this step, the user creates the sequence data file in the appropriate format.
2. Programs to analyse input sequences and display results. In this step, the sequences are input in the correct format (either from Step 1 or directly), and various analysis programs are run on the sequences.

Project Name: This is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed). It identifies the folder where the sequence data and its associated results are stored.

Sequence file Name: This is an arbitrary name of the user's choosing (Project Name restrictions apply) that identifies the current set of sequences being input, e.g. the regulatory sequences for all the genes controlled by a common set of factors should reside in one sequence file. A sequence file (and its associated results) resides in a project.

Input protein sequence(s) in FASTA format. Multiple sequences are allowed but uploading >5 sequences may lead to a server timeout.

File format: Please provide protein sequences in FASTA format.

Pre-computed sequence ids: We have pre-computed domain matches for all DNA-binding proteins from several organisms. Using protein sequence id lookups is more computationally efficient than the ab initio calculations with user-uploaded protein sequences.

Organism Name	Website	Local FASTA file
A.gossypii	http://agd.unibas.ch	FASTA
N.crassa	http://www.broad.mit.edu/annotation/fungi/fgi/	FASTA
U.maydis	http://www.broad.mit.edu/annotation/fungi/fgi/	FASTA
S.pombe	http://www.sanger.ac.uk/Projects/S_pombe/	FASTA
Y.lipolytica	http://cbi.labri.fr/Genolevures/index.php	FASTA
C.glabrata	http://cbi.labri.fr/Genolevures/index.php	FASTA
C.albicans	http://genolist.pasteur.fr/CandidaDB/	FASTA
D.hansenii	http://cbi.labri.fr/Genolevures/index.php	FASTA
K.lactis	http://cbi.labri.fr/Genolevures/index.php	FASTA
A.nidulans	http://www.broad.mit.edu/annotation/fungi/fgi/	FASTA
K.waltii	http://www.broad.mit.edu/seq/YeastDuplication/	FASTA
S.bayanus	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
S.castellii	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
S.cerevisiae	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence	FASTA
S.kluyveri	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
S.kudriavzevii	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
S.mikatae	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
S.paradoxus	ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes	FASTA
D.discoideum	http://dictybase.org/	FASTA
A.gambiae	http://www.ensembl.org	FASTA
D.melanogaster	http://flybase.bio.indiana.edu	FASTA
D.pseudoobscura	http://flybase.bio.indiana.edu	FASTA
H.sapiens	http://www.ensembl.org	FASTA
M.musculus	http://www.ensembl.org	FASTA

Max number of structural homologs to be displayed for each domain match.

E value for HMMER: use a lower cutoff to discard less significant matches.

Detailed score report:

Sequence-structure: displays query sequence - template structure alignment scores sorted by the total interface score (sc.tot). N refers to internal numbering (please disregard); 1abc_D is the Protein Data Bank id followed by one-letter protein chain id, respectively; [r1,r2] are the residue (amino acid) numbers of the HMMER match to the PDB protein chain (all residue numbers are one-based); ID.m.all is the number of mutated residues in the domain alignment; ID.all is the length of the domain alignment (= r2-r1+1); ID.m.bb is the number of mutated residues which contact DNA phosphate backbone/sugar ring; ID.bb is the total number of phosphate backbone/sugar ring contacts; ID.m.sc is the number of mutated residues which contact DNA base(s); ID.sc is the total number of DNA base contacts; sc.all, sc.bb and sc.sdch are the PET91 amino acid substitution scores for the whole domain, phosphate backbone and DNA base contacts, respectively; sc.tot = 0.5*sc.bb + 0.5*sc.sdch is the total template score.
Orthologs: displays query sequence - ortholog sequence alignment scores sorted by the total interface score (sc.tot). ortID is the sequence id of the ortholog; [r1,r2] are the residue (amino acid) numbers of the HMMER match to the orthologous protein chain (all residue numbers are one-based). All other entries are explained under Sequence-structure above, but here they refer to the alignment between the query protein and the orthologous protein at the interface defined by the structural template.
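
The two derived quantities in the score report, ID.all = r2-r1+1 and sc.tot = 0.5*sc.bb + 0.5*sc.sdch, can be illustrated with a short Python sketch. The `Hit` class, its field names, and all numeric values below are made up for illustration; only the two formulas come from the text above, and the descending sort is an assumption based on higher scores indicating better matches.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One row of a score report (hypothetical container, not the server's format)."""
    template: str   # PDB id plus chain id, e.g. "1abc_D"
    r1: int         # first residue of the HMMER match (one-based)
    r2: int         # last residue of the HMMER match (one-based)
    sc_bb: float    # substitution score over phosphate backbone/sugar contacts
    sc_sdch: float  # substitution score over DNA base contacts

    @property
    def id_all(self) -> int:
        # Length of the domain alignment: ID.all = r2 - r1 + 1
        return self.r2 - self.r1 + 1

    @property
    def sc_tot(self) -> float:
        # Total template score: equal-weight average of the two interface scores
        return 0.5 * self.sc_bb + 0.5 * self.sc_sdch


# Made-up example values, for illustration only.
hits = [Hit("1abc_D", 10, 66, 80.0, 90.0), Hit("2xyz_A", 5, 60, 100.0, 100.0)]

# Assumption: best (highest) total interface score first.
ranked = sorted(hits, key=lambda h: h.sc_tot, reverse=True)
for hit in ranked:
    print(hit.template, hit.id_all, hit.sc_tot)
```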

PDB information: Information about the structure from the Protein Data Bank.

PDB file: biological unit coordinates available for download in the PDB format.

Weight matrix: Structure-based position specific weight matrix (PWM) prediction. We use the following format for these (e.g.):

>1dgc  10
   0.400	   0.200	   0.200	   0.200	
   0.062	   0.062	   0.062	   0.812	
   0.188	   0.188	   0.438	   0.188	
   0.588	   0.138	   0.138	   0.138	
   0.087	   0.738	   0.087	   0.087	
   0.087	   0.087	   0.738	   0.087	
   0.138	   0.138	   0.138	   0.588	
   0.188	   0.438	   0.188	   0.188	
   0.812	   0.062	   0.062	   0.062	
   0.200	   0.200	   0.200	   0.400	
<

The columns correspond to ACGT. All rows sum up to one. The prediction is made using the number of amino acid - DNA base atomic contacts as a measure of binding specificity of the consensus base (Morozov, A.V., Havranek, J.J., Baker, D. and Siggia, E.D. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005 Oct 24; 33(18): 5781-5798).
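
For users who want to post-process downloaded matrices, the following Python sketch reads one record in the format shown above. The function name `parse_pwm` is hypothetical, and two points are assumptions inferred from the examples rather than stated rules: the optional second field of the header is treated as the number of rows, and rows are accepted as summing to one within a tolerance of 0.01 because the printed values are rounded to three decimals.

```python
def parse_pwm(text, tolerance=0.01):
    """Parse one '>name [rows]' ... '<' record into (name, rows).

    Each row holds four probabilities in the column order A, C, G, T.
    """
    name = None
    declared = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no matrix name")
            name = fields[0]
            # Assumption: a second header field, if present, is the row count.
            declared = int(fields[1]) if len(fields) > 1 else None
        elif line == "<":
            break  # end-of-record marker
        else:
            values = [float(x) for x in line.split()]
            if len(values) != 4:
                raise ValueError("expected 4 columns (A, C, G, T): " + line)
            if abs(sum(values) - 1.0) > tolerance:
                raise ValueError("row does not sum to one: " + line)
            rows.append(values)
    if name is None:
        raise ValueError("missing '>' header line")
    if declared is not None and declared != len(rows):
        raise ValueError("header declares %d rows, found %d" % (declared, len(rows)))
    return name, rows


# Two rows copied from the example above, under a hypothetical header.
example = """>demo 2
   0.400\t   0.200\t   0.200\t   0.200
   0.062\t   0.062\t   0.062\t   0.812
<
"""
name, rows = parse_pwm(example)
print(name, len(rows))
```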

WebLogo image of the structure-based PWM prediction.

Protein viewer: View biological unit PDBs using a Jmol molecular viewer. WARNING: your browser cache should be set to a large value (e.g. 100 MB) for the viewer to work efficiently.

Start: first residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).

End: last residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).

Score: Template score (same as sc.tot in the full score report). The score is in the [0,100] range, with higher values indicating a better match.

Method: Experimental method by which the structure was solved: X-ray diffraction (X-RAY) or nuclear magnetic resonance (NMR).

Resolution: Resolution of X-ray diffraction (if applicable).

Alignment of the query protein sequence to the protein sequence from the structure. First line: HMMER consensus sequence for the protein domain, second line: query sequence, third line: sequence from the structure.
Alignment of the query protein sequence to the orthologous protein sequence and the protein sequence from the structure. First line: HMMER consensus sequence for the protein domain, second line: query sequence, third line: ortholog protein sequence, fourth line: sequence from the structure.

'b' symbols mark phosphate backbone/sugar ring contacts, 's' symbols mark DNA base contacts. All contacts are computed using a 4.5 Å distance cutoff between amino acid and DNA atoms.
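
The contact rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's implementation: the function name `contact_label`, the list of backbone/sugar atom names (standard PDB nucleotide atom names), the inclusive comparison at exactly 4.5 Å, and the choice to report 's' when a residue touches both a base and the backbone are all assumptions.

```python
import math

# Assumption: standard PDB names for phosphate and sugar-ring atoms of DNA.
BACKBONE_ATOMS = {"P", "OP1", "OP2", "O5'", "C5'", "C4'", "O4'",
                  "C3'", "O3'", "C2'", "C1'"}
CUTOFF = 4.5  # distance cutoff in angstroms, as stated in the text


def contact_label(residue_atoms, dna_atoms, cutoff=CUTOFF):
    """Label one amino acid residue as 's', 'b', or '' (no contact).

    residue_atoms: list of (x, y, z) coordinates of the residue's atoms.
    dna_atoms: list of (atom_name, (x, y, z)) for the DNA atoms.
    """
    touches_backbone = False
    for atom_name, dna_xyz in dna_atoms:
        for xyz in residue_atoms:
            if math.dist(xyz, dna_xyz) <= cutoff:
                if atom_name in BACKBONE_ATOMS:
                    touches_backbone = True
                else:
                    # Assumption: a base contact takes precedence.
                    return "s"
    return "b" if touches_backbone else ""


# Toy coordinates, for illustration only.
print(contact_label([(0.0, 0.0, 0.0)], [("N7", (3.0, 0.0, 0.0))]))
print(contact_label([(0.0, 0.0, 0.0)], [("P", (4.0, 0.0, 0.0))]))
```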

Max number of orthologous proteins to be displayed for each domain match in the query sequence and each structural template.

Ortholog id: sequence id of the orthologous protein.

Motif Name: this is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed) that identifies the current motif information being input.

Motif: a probabilistic binding profile defined either by the alignment of binding sites or through a position-specific weight matrix (PWM).

Weight matrix should be in the following format:

>pwm1
   0.400	   0.200	   0.200	   0.200	
   0.062	   0.062	   0.062	   0.812	
   0.188	   0.188	   0.438	   0.188	
   0.588	   0.138	   0.138	   0.138	
   0.087	   0.738	   0.087	   0.087	
   0.087	   0.087	   0.738	   0.087	
   0.138	   0.138	   0.138	   0.588	
   0.188	   0.438	   0.188	   0.188	
   0.812	   0.062	   0.062	   0.062	
   0.200	   0.200	   0.200	   0.400	
<

The columns correspond to ACGT. The rows do not have to sum up to one - if the absolute counts are provided they will be normalized.

Pseudo Count: The value will be added to all the counts from the sequence alignment. The default value is zero. Pseudocounts are often used to allow for a small probability of any DNA base at every position in the alignment.
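
The effect of a pseudocount on the normalized matrix can be seen in a small Python sketch. The function name `normalize_counts` is hypothetical; the sketch assumes the pseudocount is added to each of the four base counts at every position before each row is divided by its total.

```python
def normalize_counts(rows, pseudocount=0.0):
    """Turn rows of A, C, G, T counts into rows of probabilities.

    The pseudocount is added to every entry before normalization, so with a
    positive value no base is left with zero probability.
    """
    normalized = []
    for row in rows:
        padded = [value + pseudocount for value in row]
        total = sum(padded)
        if total <= 0:
            raise ValueError("row has no counts; use a positive pseudocount")
        normalized.append([value / total for value in padded])
    return normalized


counts = [[8, 0, 0, 2]]  # made-up counts for a single position
print(normalize_counts(counts))                 # zero entries stay at zero
print(normalize_counts(counts, pseudocount=1))  # every base gets some weight
```

With the default pseudocount of zero, a base that never occurs at a position keeps probability zero, which is what a small positive pseudocount is meant to avoid.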

Upload aligned binding site sequences: both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. Two formats are allowed, with or without a header line starting with '>':

>aln1
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt

AAA--AA
CCCCCCC
GGGGGGG
t-ttttt

Multiple binding site alignment should be in the following format: one site per line, both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. All sequences should be of the same length (counting gaps):

AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
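
As an illustration of how such an alignment maps onto a count matrix, here is a minimal Python sketch. The function name `alignment_to_counts` is hypothetical, and the treatment of gaps (a gap adds no count at its position) is an assumption, since the text above does not say how gaps are scored.

```python
def alignment_to_counts(text):
    """Count A, C, G, T occurrences per column of an aligned set of sites.

    Accepts one site per line, upper or lower case, with an optional
    '>' header line. Returns one [A, C, G, T] count row per position.
    """
    sites = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            sites.append(line.upper())
    if not sites:
        raise ValueError("no binding sites found")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length (counting gaps)")
    counts = [[0, 0, 0, 0] for _ in range(length)]
    for site in sites:
        for position, symbol in enumerate(site):
            if symbol == "-":
                continue  # assumption: gaps contribute no count
            index = "ACGT".find(symbol)
            if index < 0:
                raise ValueError("unexpected symbol: " + symbol)
            counts[position][index] += 1
    return counts


example = """>aln1
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
"""
counts = alignment_to_counts(example)
print(counts[0], counts[1], counts[3])
```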

Parameters for finding the best matched motifs:

Offset: a non-negative integer that lets the user compute p-values for partial PWM matches. The minimum allowed length of the PWM-to-PWM match is computed as min(L1,L2) - offset, where L1 and L2 are the lengths of the two PWMs being compared. For example, if we compare a PWM of length 6 to a structure-based PWM of length 8, 3 alignments will be tried with offset=0: the shorter PWM is placed at each of the 3 positions where it lies entirely within the longer one.

When offset is >0, partial alignments are tested as well. For example, with offset=1 the 2 alignments in which the shorter PWM overhangs either end of the longer PWM by one position also become allowed. For all valid alignments, the lowest p-value (Bonferroni corrected for the number of tested alignments) is displayed along with the PWM alignment that yields it (in the actual calculations we test both the PWM and its reverse complement). Here, the p-value is defined as the probability that the 2 PWMs are not correlated.
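
The counting of allowed alignments can be checked with a short Python sketch. The function names and the `shift` convention are hypothetical. The `bonferroni` helper shows only the standard correction (multiply by the number of tests, cap at one); whether the server counts reverse-complement comparisons as separate tests is not stated here, so the number of tests is left as an input.

```python
def allowed_shifts(len_query, len_target, offset=0):
    """List (shift, overlap) pairs that satisfy the minimum-overlap rule.

    shift is the column of the target PWM on which the first column of the
    query PWM is placed; negative shifts overhang the left end of the target.
    """
    min_overlap = min(len_query, len_target) - offset
    if min_overlap < 1:
        raise ValueError("offset leaves no columns to compare")
    alignments = []
    for shift in range(-(len_query - 1), len_target):
        # Number of columns shared by the two PWMs at this shift.
        overlap = min(len_target, shift + len_query) - max(0, shift)
        if overlap >= min_overlap:
            alignments.append((shift, overlap))
    return alignments


def bonferroni(p_value, n_tests):
    """Bonferroni-correct a p-value for n_tests comparisons, capped at 1."""
    return min(1.0, p_value * n_tests)


print(len(allowed_shifts(6, 8, offset=0)))  # 3 full-length alignments
print(len(allowed_shifts(6, 8, offset=1)))  # 5: the same 3 plus 2 partial ones
```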

WebLogo image of the PWM uploaded by the user.

Protein sequences in FASTA format: The user may upload multiple sequences in FASTA format. WARNING: Uploading more than 5 sequences at a time may result in a server timeout!

>seqID1
sequence1
>seqID2
sequence2
>seqID3
sequence3
...
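
Before uploading, a file can be checked locally with a small Python sketch such as the one below. The function name `read_fasta` and its `max_sequences` parameter are hypothetical; the limit of 5 comes from the warning above, and the sketch only flags larger inputs rather than rejecting them.

```python
def read_fasta(text, max_sequences=5):
    """Parse FASTA text into {sequence id: sequence}.

    Returns (sequences, too_many), where too_many is True when the input
    holds more sequences than max_sequences.
    """
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no sequence id")
            current = fields[0]  # id is the first word after '>'
            if current in chunks:
                raise ValueError("duplicate sequence id: " + current)
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[current].append(line)
    sequences = {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
    return sequences, len(sequences) > max_sequences


sequences, too_many = read_fasta(">seqID1\nMKLV\nGGA\n>seqID2\nKDPAAL\n")
print(sequences, too_many)
```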

User id: A user's machine is assigned a unique identifier when the user first accesses the web site from that machine. This identifier is displayed at the top of the web page, and is also stored on the user's machine as a "cookie". If the user plans to access their data from other machines, or plans to delete the machine's cookies in the future, this identifier should be written down. To present their user id to the site, and thus gain access to all the data and results associated with that user id, the user must choose the "authenticate" link on the welcome page and enter this user id. The user id has the form '0.dddddddddddddd', i.e. 14 digits preceded by '0.', and only ids of this form are recognized by the system.
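
A user id that was written down can be sanity-checked against this form with a one-line regular expression. The Python sketch below is only a local format check under that description; the function name is hypothetical and the check says nothing about whether the id exists on the server.

```python
import re

# '0.' followed by exactly 14 decimal digits, as described above.
USER_ID_PATTERN = re.compile(r"0\.[0-9]{14}")


def is_valid_user_id(user_id):
    """Return True if user_id has the form 0.dddddddddddddd."""
    return USER_ID_PATTERN.fullmatch(user_id) is not None


print(is_valid_user_id("0.12345678901234"))  # True
print(is_valid_user_id("0.1234"))            # False
```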

Quick Start.

Protein-DNA homology modeling.
Suppose you want to find structural homologs (protein-DNA complexes) for the following protein sequence:
```
>1ysa_C 225-281
KDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
```
This is chain C of the yeast transcription factor GCN4 of the leucine zipper type for which the structure has been solved (PDB code: 1ysa), so we should see an exact match in the structural database.
1. Go to the Upload Sequence Page.
2. Create a project name, e.g. demo.
3. Give a name to the sequence file that will be created, e.g. test.
4. Paste the sequence shown above (including the header) into the "Paste sequences" window.
5. In the "Specify the parameters" section, choose 4 for the number of predicted homologs, and specify the HMMER E-value (the default of 0.1 should be fine, but it will allow weaker matches to be displayed).
6. Click on "GO".
7. Two hits to DNA binding domains are found: bZIP_1 (e-value = 2.5e-18) and bZIP_2 (e-value = 7.7e-10). The height of each colored bar is proportional to the significance of the hit. Both of these domains are of the leucine zipper type. The positions of the colored bars along the protein sequence (shown as the grey bar on top) show the coordinates of the hit. The one-based coordinates are also displayed in the start and end fields.
8. Now, click on the "Best Templates" link.
9. For each Pfam family (bZIP_1 and bZIP_2 in this case), 4 structural homologs with the highest interface scores are displayed. Detailed information about each column can be found by clicking on the links in the table header. We see that for all bZIP_1 homologs the interface score is 100.0, revealing that there are multiple GCN4 structures in the database. Hits 3 and 4 are from 1ysa chains C and D, where our example protein sequence came from.
10. Now suppose you wish to find GCN4 orthologs in other species. Click on the "Find Ortholog" link on the left. A new window will open: choose "A.gambiae" in the pulldown menu and specify 4 for the number of orthologs to be displayed (sorted by the interface score). Note the "pdb id" entry which specifies the structure used to define the DNA binding interface. The interface is used to compute homology scores between the original protein sequence and all bZIP_1 sequences in A.gambiae.
11. Click on "GO".
12. 4 leucine zippers in A.gambiae whose DNA binding interfaces are closest to GCN4 (have the highest homology scores) are displayed. As before, detailed information about each column can be found by clicking on the links in the table header.

Assigning DNA sequence motifs to protein-DNA structure-based weight matrix predictions.
Suppose you have a weight matrix, or an alignment of transcription factor binding sites (i.e. some definition of a regulatory motif), and you would like to find out the identity of the transcription factor. You can try to do so by correlating your motif against our database of weight matrix predictions based on the protein-DNA structures. Let us use the following weight matrix as an example:
```
>1ysa  11
   0.225           0.225           0.225           0.325
   0.438           0.188           0.188           0.188
   0.075           0.075           0.075           0.775
   0.188           0.188           0.438           0.188
   0.738           0.087           0.087           0.087
   0.075           0.775           0.075           0.075
   0.062           0.062           0.062           0.812
   0.150           0.550           0.150           0.150
   0.588           0.138           0.138           0.138
   0.163           0.163           0.163           0.512
   0.237           0.287           0.237           0.237
<
```
This is actually our weight matrix prediction for the yeast leucine zipper factor GCN4 (for which the homology modeling was carried out in the previous example), so the algorithm should find a very good match in our database.
1. Go to the Upload Motif Page.
2. Create a project name, e.g. demo.
3. Give a name to the sequence file that will be created, e.g. test.
4. Make sure that the "Specify by Position Weight Matrix" option is on, and paste the PWM shown above (including the header) into the "Paste PWM" window.
5. Set offset=2 in the "Specify the Searching Parameters" window. This will allow the partial PWM matches to be considered.
6. Make sure that "Select transcription factors" is on in the "choose the Pfam family" window. This will restrict the search to approximately 40 transcription factor families (though the user can choose any subset of Pfam families to search against).
7. Choose 4 in the "number of hits" window - this will display 4 top matches for each selected Pfam family.
8. Click on "GO".
9. You will see a WebLogo image of your input motif.
10. Click on "GO" again.
11. 4 motifs per Pfam family are displayed, with all families sorted by the p-value of the best hit (hits with the lowest p-value, which indicates the closest match, are displayed first). As expected, the bZIP_1 leucine zipper family is the top hit. Within it, 1ysa and 2c9n are the closest (P-value: 0.000e+00). The 1ysa (GCN4) PWM was used as the input motif, while 2c9n is a low-resolution (3.5 Å) structure of the Epstein-Barr virus zebra protein, which binds DNA in a similar way. By clicking on the "raw report" link below each PWM image, the user can examine the raw output of the program that computes PWM-to-PWM correlations (typically, this is for experts and developers only). The "PDB file" link allows the user to download the structure in the standard Protein Data Bank format. Finally, the "Protein Viewer" link displays the protein-DNA structure in Jmol. The next best family is bZIP_2, which actually overlaps bZIP_1 and thus returns similar structures (with 1ysa as the top hit). Below bZIP_2 the zf-C4 family is displayed, in which some factors happen to have similar binding profiles. The best p-value in this family is 7.295e-06, far less significant than the top leucine zipper match (though in practice the situation may not be so clear-cut, of course).

FAQ:

How do I make Jmol work with Firefox on Linux?
You need to install Java and make sure that there is a soft link to the Java plugin in the firefox/plugins directory. For example, on my machine I ran the following command from within that directory:

ln -s /usr/java/j2re1.5.0/plugin/i386/ns7/libjavaplugin_oji.so

to create the link. Please restart Firefox after the plugin is installed.

How do I download the data from my runs?
All data from the runs is stored locally in a directory whose name is created using the concatenation of the project name (defaults to "demo") and the userid (the random number assigned by the computer to your project). Some of the files in this directory will be made using the motif name or the sequence file name you provided (defaults to "test"), while the others will have fixed names. When you click on the "Download" button the files created in your directory will be made into a TAR archive, gzipped, and made available for download.

Is my data private?
A random number is created the first time you access the Protein-DNA Explorer from your computer, appended to your project name, and saved in a cookie by your browser. As long as you are using the same computer you will get the same random number. If you are sharing the computer with other users, you should delete the cookie named 'protein_dna' in your browser.
