Project Name: This is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed). It identifies the folder where the sequence data and its associated results are stored.
Sequence file Name: This is an arbitrary name of the user's choosing (the Project Name restrictions apply) that identifies the current set of sequences being input. For example, the regulatory sequences for all the genes controlled by a common set of factors should reside in one sequence file. A sequence file (and its associated results) resides in a project.
Input protein sequence(s) in FASTA format. Multiple sequences are allowed but uploading >5 sequences may lead to a server timeout.
File format: Please provide protein sequences in FASTA format.
Pre-computed sequence ids: We have pre-computed domain matches for all DNA-binding proteins from several organisms. Using protein sequence id lookups is more computationally efficient than the ab initio calculations with user-uploaded protein sequences.
Max number of structural homologs to be displayed for each domain match.
E value for HMMER: use a lower cutoff to discard less significant matches.
PDB information: Information about the structure from the Protein Data Bank.
PDB file: biological unit coordinates available for download in the PDB format.
Weight matrix: Structure-based position-specific weight matrix (PWM) prediction. We use the following format, for example:
>1dgc 10
0.400 0.200 0.200 0.200
0.062 0.062 0.062 0.812
0.188 0.188 0.438 0.188
0.588 0.138 0.138 0.138
0.087 0.738 0.087 0.087
0.087 0.087 0.738 0.087
0.138 0.138 0.138 0.588
0.188 0.438 0.188 0.188
0.812 0.062 0.062 0.062
0.200 0.200 0.200 0.400
<
The columns correspond to ACGT. All rows sum up to one. The prediction is made using the number of amino acid - DNA base atomic contacts as a measure of binding specificity of the consensus base (Morozov, A.V., Havranek, J.J., Baker, D. and Siggia, E.D. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005 Oct 24; 33(18): 5781-5798).
WebLogo image of the structure-based PWM prediction.
Protein viewer: View biological unit PDBs using a Jmol molecular viewer. WARNING: your browser cache should be set to a large value (e.g. 100 MB) for the viewer to work efficiently.
Start: first residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).
End: last residue in the structure to be aligned to the query protein sequence (residues are numbered starting from one).
Score: Template score (same as sc.tot in the full score report). The score is in the [0,100] range, with a higher value indicating a better match.
Method: Experimental method by which the structure was solved: X-ray diffraction (X-RAY) or nuclear magnetic resonance (NMR).
Resolution: Resolution of X-ray diffraction (if applicable).
Max number of orthologous proteins to be displayed for each domain match in the query sequence and each structural template.
Ortholog id: sequence id of the orthologous protein.
Motif Name: this is an arbitrary name of the user's choosing (no spaces or non-alphanumeric characters allowed) that identifies the current motif information being input.
Motif: a probabilistic binding profile defined either by the alignment of binding sites or through a position-specific weight matrix (PWM).
Weight matrix should be in the following format:
>pwm1
0.400 0.200 0.200 0.200
0.062 0.062 0.062 0.812
0.188 0.188 0.438 0.188
0.588 0.138 0.138 0.138
0.087 0.738 0.087 0.087
0.087 0.087 0.738 0.087
0.138 0.138 0.138 0.588
0.188 0.438 0.188 0.188
0.812 0.062 0.062 0.062
0.200 0.200 0.200 0.400
<
The columns correspond to ACGT. The rows do not have to sum up to one: if absolute counts are provided, they will be normalized.
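As an illustration of the format and of the row normalization described above, the following is a minimal Python sketch, not the server's actual code. The function name parse_pwm is hypothetical. The sketch assumes that an integer placed right after the identifier (as in the ">1dgc 10" header of the predicted matrices) gives the number of rows and can be skipped.

```python
def parse_pwm(text):
    """Parse one weight matrix and return (name, rows).

    Each row holds four values in A, C, G, T order and is rescaled
    so that it sums to one (absolute counts are therefore accepted).
    """
    tokens = text.split()
    if not tokens or not tokens[0].startswith(">"):
        raise ValueError("expected a header starting with '>'")
    name = tokens[0][1:]

    # Collect numbers up to the closing '<' marker.
    values = []
    for token in tokens[1:]:
        if token.startswith("<"):
            break
        values.append(float(token))

    # Assumption: one extra leading number equal to the row count is a
    # length field (as in '>1dgc 10'), not matrix data.
    if len(values) % 4 == 1 and values[0] == (len(values) - 1) // 4:
        values = values[1:]

    if not values or len(values) % 4 != 0:
        raise ValueError("matrix data must come in groups of four (A, C, G, T)")

    rows = []
    for start in range(0, len(values), 4):
        row = values[start:start + 4]
        total = sum(row)
        if total <= 0:
            raise ValueError("row %d has no positive entries" % (start // 4 + 1))
        rows.append([value / total for value in row])
    return name, rows
```

For instance, a row of counts "2 1 1 0" would be rescaled to 0.5, 0.25, 0.25, 0.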
Pseudo Count: The value will be added to all the counts from the sequence alignment. The default value is zero. Pseudocounts are often used to allow for a small probability of any DNA base at every position in the alignment.
Upload aligned binding site sequences: both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. Two formats are allowed:
>aln1
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
OR
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
Multiple binding site alignment should be in the following format: one site per line, both uppercase [ACGT-] and lowercase [acgt-] symbols are allowed, including gaps. All sequences should be of the same length (counting gaps):
AAA--AA
CCCCCCC
GGGGGGG
t-ttttt
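To show how an alignment of this kind can be turned into a weight matrix with a pseudocount, here is a short Python sketch. It is an illustration under stated assumptions rather than the server's implementation: the function name alignment_to_pwm is hypothetical, and gaps are assumed to be ignored when counting bases in a column (the server may treat them differently).

```python
def alignment_to_pwm(sites, pseudocount=0.0):
    """Convert aligned binding sites into rows of A, C, G, T frequencies.

    sites: iterable of strings, one site per entry; an optional
    '>name' header is skipped. The pseudocount is added to every
    base count in every column before normalization.
    """
    cleaned = [s.strip().upper() for s in sites
               if s.strip() and not s.strip().startswith(">")]
    if not cleaned:
        raise ValueError("no binding sites given")
    length = len(cleaned[0])
    if any(len(site) != length for site in cleaned):
        raise ValueError("all sites must have the same length (counting gaps)")

    pwm = []
    for position in range(length):
        counts = {base: float(pseudocount) for base in "ACGT"}
        for site in cleaned:
            symbol = site[position]
            if symbol in counts:
                counts[symbol] += 1.0
            elif symbol != "-":
                raise ValueError("unexpected symbol %r" % symbol)
            # Assumption: a gap contributes nothing to the column.
        total = sum(counts.values())
        if total == 0:
            raise ValueError("column %d contains only gaps" % (position + 1))
        pwm.append([counts[base] / total for base in "ACGT"])
    return pwm
```

With the four example sites and a pseudocount of zero, the second column has no T (the fourth site has a gap there), so T gets probability zero at that position. A positive pseudocount would make it small but non-zero.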
Parameters for finding the best matched motifs:
Offset: a non-negative integer that lets the user compute p-values for partial PWM matches. The minimum allowed length of the PWM to PWM match is computed as: min(L1,L2) - offset, where L1 and L2 are the lengths of the two PWMs being compared. For example, if we compare a PWM of length 6 to the structure-based PWM of length 8, 3 alignments will be tried with offset=0:
111111
22222222

 111111
22222222

  111111
22222222

When offset is >0, we test the partial alignments as well. For example, with offset=1
111111
 22222222

   111111
22222222

also become allowed. For all valid alignments the lowest p-value (Bonferroni corrected for the number of tested alignments) will be displayed along with the PWM alignment that yields it (in actual calculations we test both the PWM and its reverse-complement). Here, p-value is defined as the probability that the two PWMs are not correlated.
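The alignment count in this example can be checked with a small Python sketch. The function names allowed_shifts and bonferroni are hypothetical, and the code only illustrates the rule stated above. Whether the reverse-complement comparisons are included in the number of tests used for the correction is not stated here, so the caller supplies that number.

```python
def allowed_shifts(len1, len2, offset=0):
    """List the shifts of PWM 1 relative to PWM 2 that are allowed.

    A shift s places the first column of PWM 1 at column s of PWM 2
    (negative s means PWM 1 overhangs on the left). A shift is allowed
    when the overlap is at least min(len1, len2) - offset columns.
    """
    if offset < 0:
        raise ValueError("offset must not be negative")
    min_overlap = min(len1, len2) - offset
    if min_overlap < 1:
        raise ValueError("offset leaves no overlapping columns")
    shifts = []
    for shift in range(-(len1 - 1), len2):
        overlap = min(shift + len1, len2) - max(shift, 0)
        if overlap >= min_overlap:
            shifts.append(shift)
    return shifts


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at one."""
    return min(1.0, p_value * n_tests)
```

For PWMs of length 6 and 8, allowed_shifts(6, 8, 0) gives the three shifts 0, 1, 2, and allowed_shifts(6, 8, 1) adds the two partial alignments at shifts -1 and 3.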
WebLogo image of the PWM uploaded by the user.
Protein sequences in FASTA format: The user may upload multiple sequences in FASTA format. WARNING: Uploading more than 5 sequences at a time may result in a server timeout!
>seqID1
sequence1
>seqID2
sequence2
>seqID3
sequence3
...
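Users with larger sequence sets may want to split them before uploading. The Python sketch below reads FASTA text and groups the records into batches of at most 5, the limit mentioned in the warning above. The names read_fasta and split_into_batches are hypothetical, and the sketch runs on the user's machine; it is not part of the web site.

```python
MAX_SEQUENCES = 5  # upload limit taken from the warning above


def read_fasta(text):
    """Return a list of (sequence id, sequence) pairs from FASTA text."""
    records = []
    seq_id = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id = line[1:].strip()
            chunks = []
        elif seq_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def split_into_batches(records, size=MAX_SEQUENCES):
    """Group records into consecutive batches of at most `size` entries."""
    return [records[i:i + size] for i in range(0, len(records), size)]
```

Each batch can then be written back out in FASTA format and uploaded separately.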
User id: A user's machine is assigned a unique identifier when the user first accesses the web site from that machine. This identifier is displayed at the top of the web page, and is also stored on the user's machine as a "cookie". If the user plans to access their data from other machines, or plans to delete the machine's cookies in the future, this identifier should be written down. To present their user id to the site, and thus gain access to all the data and results associated with that user id, the user must choose the "authenticate" link on the welcome page and enter this user id. The user id is of the form '0.dddddddddddddd', i.e. 14 digits preceded by '0.', and only ids of this form are recognized by the system.
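Before typing a saved user id into the "authenticate" form, its shape can be checked against the pattern described above. This is a minimal Python sketch with a hypothetical function name. It only tests the format ('0.' followed by exactly 14 digits) and says nothing about whether the id exists on the server.

```python
import re

# '0.' followed by exactly 14 ASCII digits, as described above.
USER_ID_PATTERN = re.compile(r"0\.[0-9]{14}")


def looks_like_user_id(text):
    """Return True when text has the documented user id format."""
    return USER_ID_PATTERN.fullmatch(text.strip()) is not None
```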
Suppose you want to find structural homologs (protein-DNA complexes) for the following protein sequence:
>1ysa_C 225-281
KDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
This is chain C of the yeast transcription factor GCN4 of the leucine zipper type for which the structure has been solved (PDB code: 1ysa), so we should see an exact match in the structural database.
Suppose you have a weight matrix, or an alignment of transcription factor binding sites (i.e. some definition of a regulatory motif), and you would like to find out the identity of the transcription factor. You can try to do so by correlating your motif against our database of weight matrix predictions based on the protein-DNA structures. Let us use the following weight matrix as an example:
>1ysa 11
0.225 0.225 0.225 0.325
0.438 0.188 0.188 0.188
0.075 0.075 0.075 0.775
0.188 0.188 0.438 0.188
0.738 0.087 0.087 0.087
0.075 0.775 0.075 0.075
0.062 0.062 0.062 0.812
0.150 0.550 0.150 0.150
0.588 0.138 0.138 0.138
0.163 0.163 0.163 0.512
0.237 0.287 0.237 0.237
<
This is actually our weight matrix prediction for the yeast leucine zipper factor GCN4 (for which the homology modeling was carried out in the previous example), so the algorithm should find a very good match in our database.
How do I make Jmol work with Firefox on Linux?
You need to install Java, and to make sure that there is a softlink to the Java plugin in the firefox/plugins directory.
For example, on my machine I used the command:
ln -s /usr/java/j2re1.5.0/plugin/i386/ns7/libjavaplugin_oji.so
to create the link. Please restart Firefox after the plugin is installed.
How do I download the data from my runs?
All data from the runs is stored locally in a directory whose name is created using the concatenation of the project name (defaults to "demo")
and the userid (the random number assigned by the computer to your project).
Some of the files in this directory will be made using the motif name or the sequence file name you provided (defaults to "test"),
while the others will have fixed names.
When you click on the "Download" button the files created in your directory will be made into a TAR archive, gzipped, and made available
for download.
Is my data private?
A random number is created the first time you access the Protein-DNA Explorer from your computer, appended to your project name,
and saved in a cookie by your browser. As long as you are using the same computer you will get the same random number. If you are sharing the computer with
other users, you should delete the cookie named 'protein_dna' in your browser.