Match™: Frequently Asked Questions (FAQ)




Download and Installation
 
Match™ seems to install properly, but it is not working. All we get is a blank results screen with "Thank you for using Match™ - Could not calculate output".

This problem occurs when the Match™ binary cannot run on your operating system. We have now created several TRANSFAC® distribution packages with Match™ versions for different operating systems (Windows NT, Irix, Solaris, Linux). If you are using the correct version for your operating system but the line "Could not calculate output" still appears, please contact our support at: support@biobase.de
 
 
When are the files from the download supposed to be saved to a temporary directory? What I don't understand is:
  • when you click on the file, will it ask you where to save them so that you can specify a temporary directory OR
  • will the download automatically save somewhere and then you have to move the files to a temporary directory afterwards?

That depends on your browser settings. To be on the safe side, hold down the Shift key and click on the link. You will then be asked where to save the files. Just make sure that the distribution file (e.g. match_1.1.tar.gz) and the installation/update script (e.g. install.pl/update.pl) are in the same temporary directory when you start the installation by typing "perl install.pl" or the update by typing "perl update.pl".
 
 
During the installation the script asks for the directory of the TRANSFAC® installation. Is this the directory which contains the files from the installation/update download that a person has just finished or is this the directory where the original TRANSFAC® Professional files are stored?

The script asks for the directory of the TRANSFAC® Professional version that is already running (original TRANSFAC® files). The questions a user is asked during the installation are very similar to those which are asked during the TRANSFAC® Professional installation.
 
 
Theoretical Background
 
How can it be that in some cases the multiple matrices for one and the same factor are so different from each other, and which one is the best to use?

The experimental evidence underlying the data differs markedly. The matrices whose identifier ends in just a number (_01, _02, ...) were taken from the cited reference. Often (but not always, as in V$CEBP_01 for example) they were derived by random binding site selection, where a specific, often recombinant factor or its isolated binding domain was studied. We built the other matrices by compiling genomic (and sometimes artificial) binding sites for orthologous factors of a broad taxonomic group, for example vertebrates, insects, plants or fungi. Their suffixes _Q1, _Q2, ... refer to the 'quality' of the sites used, i.e. the certainty with which it could be concluded that the binding activity shown was identical with the suggested factor. (Still other matrices, e.g. those ending in _C, were built using specific programs such as CONSIND.) The derived matrices can differ depending on the experimental material and method/conditions on the one hand, and on the choice and alignment of binding sites on the other, i.e. which and how many of the possible 'manifestations' of a 'binding site' were selected/compiled.

There are no general rules as to which matrices are the best ones to use. You can restrict Match™ to the use of so-called "high quality matrices" only. For a high quality matrix, the frequency of matches found in exon2 sequences drops below 0.01 when using a high cut-off (e.g. 1.00). These matrices are thus likely to produce few false positive matches. A better estimate of the quality of a matrix can be derived from two sets of test sequences, a positive and a negative one, and we have calculated it where possible. However, since a positive set was not available for all matrices, this value is not available for all matrices.
 
 
Are all TRANSFAC® SITE entries represented in (at least) one of the matrices in the matrix library of Match™?

No. To build a matrix you need several binding sites for a transcription factor. For some factors there are not enough sites in TRANSFAC® at this point, but we are working on it. The Match™ matrix library is identical to the TRANSFAC® MATRIX table.
 
 
Are all matrix core sequences 5 in length?
Are the core sequences always consecutive nucleotides?

Yes, we use the five most conserved, consecutive nucleotides as core sequences for all matrices.
 
 
How is the score for the core/matrix similarity actually calculated?

Match™ searches for subsequences x of an input sequence s that are good matches to a TRANSFAC® matrix. The quality of a match is described by two values: the core similarity and the matrix similarity. The score for the matrix similarity of a subsequence x of sequence s with length L is calculated in the following way:

[Formula: matrix similarity score]

where:
 (b1,...,bL) are the nucleotides of x,
 f(i,B) is the frequency of nucleotide B at position i,
and the formula is built from the following terms:
 current frequency
 minimum frequency
 maximum frequency
 information vector

The score for the core similarity is calculated similarly to the matrix similarity.
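
The terms listed above can be made concrete with a short sketch. It assumes the following forms, which are our assumption here and are not spelled out in this text: the information vector is I(i) = sum over B of f(i,B) * ln(4 * f(i,B)); the current, minimum and maximum values are the sums over all positions i of I(i) multiplied by f(i,b_i), by the lowest frequency at position i, and by the highest frequency at position i, respectively; and the score is (current - minimum) / (maximum - minimum). The function names and the toy matrix are hypothetical. The core is taken as the five consecutive positions with the highest summed information, which is one possible reading of "most conserved".

```python
import math

def information_vector(matrix):
    """Per-position information I(i) = sum_B f(i,B) * ln(4 * f(i,B)).

    `matrix` is a list of columns, each a dict mapping A, C, G, T to a
    frequency (the four values of a column sum to 1). Zero frequencies
    contribute nothing to the sum.
    """
    return [sum(f * math.log(4 * f) for f in column.values() if f > 0)
            for column in matrix]

def similarity(matrix, site):
    """Score in [0, 1]: (current - minimum) / (maximum - minimum), weighted by I(i)."""
    info = information_vector(matrix)
    current = sum(w * col[b] for w, col, b in zip(info, matrix, site.upper()))
    minimum = sum(w * min(col.values()) for w, col in zip(info, matrix))
    maximum = sum(w * max(col.values()) for w, col in zip(info, matrix))
    if maximum == minimum:  # the matrix carries no information at all
        return 0.0
    return (current - minimum) / (maximum - minimum)

def core_start(matrix, width=5):
    """0-based start of the `width` consecutive positions with the highest summed information."""
    info = information_vector(matrix)
    return max(range(len(matrix) - width + 1),
               key=lambda i: sum(info[i:i + width]))

# Hypothetical 3-position matrix (not a TRANSFAC entry).
toy = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully conserved
    {"A": 0.6, "C": 0.2, "G": 0.0, "T": 0.2},      # partly conserved
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative
]
print(similarity(toy, "AAG"))  # the best possible site for this matrix
print(similarity(toy, "ATG"))  # a weaker site, somewhere between 0 and 1
```

Under these assumptions, the core similarity would be obtained by calling similarity() on the five core columns and the corresponding five nucleotides only.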
 
 
Could you please tell me how I should cite Match™ when I use it for analysis?

So far we have not published a paper on Match™. However, you can refer to the latest poster, which was published at the German Conference on Bioinformatics 2001 (available at http://www.bioinfo.de/isb/gcb01/poster/index.html). Additionally, please refer to our company as the creator of this tool.


 
 
Promoter Analysis with Match™
 
What do I need for promoter analysis?

TRANSFAC® Professional and the Match™ tool provided with it allow an initial promoter analysis. The Match™ program uses a library of mononucleotide weight matrices from TRANSFAC® Professional to search user-supplied sequences for potential transcription factor binding sites. By including the information on the tissue specificity of the transcription factors stored in TRANSFAC®, conclusions can be drawn about where and when the investigated promoter is likely to be active.
 
 
How can I confirm that my sequence is actually a DNA binding site?

Match™ is designed to identify transcription factor binding sites in uncharacterized sequences by comparing them to a library of distribution matrices that are linked to the respective entries in the TRANSFAC® matrix table. As a result set you should get a list of binding matrices indicating where they match your sequence and how good the match is.

These distribution matrices have been derived from sites within the DNA for which binding of a specific transcription factor was shown. The binding sites for a group of orthologous transcription factors were aligned and then, at each position, the frequency of the four nucleotides (A, C, G, T) was counted. The derived distribution matrix contains more information than a simple IUPAC consensus, where nucleotides that are found at lower frequencies are neglected. (For example, at a position where 60% of the sites showed an A, 20% a T and 20% a C, an A would appear in the IUPAC consensus, just as for a position where an A was found in 100% of all sites, thus suggesting that both positions within the site are equally conserved.)

As matrices contain more information than an IUPAC consensus, sequence comparisons based on matrices are usually slower. To enhance performance, Match™ compares sequences in two steps: in the first step, the 'core' of a matrix is compared to the sequence given by the user, and only where the core similarity is higher than the initially chosen threshold is the whole matrix compared. The matrix 'core' used by Match™ consists of the five consecutive nucleotide positions that together yield the highest conservation value. Thus, the 5-bp core (capital letters in the result set) just serves to speed up the calculations; it cannot sufficiently define the whole binding matrix/site on its own.

Let's say you are looking for potential binding sites in the following 26-bp sequence: cgtgatcgacgtcagtcccgggatgc.
Scanning this sequence for matches to the subset of vertebrate matrices in the library using Match™ (with matrix similarity cut-off = 0.86, core similarity cut-off = 0.96) will give the following result set:
   matrix             position  core   matrix sequence        factor name
   identifier         (strand)  match  match

 1 V$GATA1_01            1 (+)  0.995  0.989  cgtGATCGac      GATA-1
 2 V$GATA2_01            1 (+)  0.979  0.969  cgtGATCGac      GATA-2
 3 V$CDPCR3HD_01         2 (-)  1.000  0.879  gtgATCGAcg      CDP CR3+HD
 4 V$ATF_01              4 (-)  1.000  0.977  gatcgaCGTCAgtc  ATF
 5 V$ATF_B               4 (-)  1.000  0.943  gatcgaCGTCAg    ATF
 6 V$CREB_Q4             5 (-)  1.000  0.941  atcgaCGTCAgt    CREB
 7 V$CREB_Q2             5 (-)  1.000  0.919  atcgaCGTCAgt    CREB
 8 V$CREBP1_Q2           5 (-)  1.000  0.915  atcgaCGTCAgt    CRE-BP1
 9 V$AP1FJ_Q2            6 (-)  0.983  0.954  tcgacGTCAGt     AP-1
10 V$AP1_Q2              6 (-)  0.967  0.941  tcgaCGTCAgt     AP-1
11 V$CREB_01             7 (-)  1.000  0.981  cgaCGTCA        CREB
12 V$CREB_02             7 (-)  1.000  0.930  cgaCGTCAgtcc    CREB
13 V$CREBP1CJUN_01       7 (+)  1.000  0.862  cGACGTca        CRE-BP1/c-Jun
14 V$CREBP1CJUN_01       7 (-)  1.000  0.885  cgACGTCa        CRE-BP1/c-Jun
16 V$GEN_INI2_B         10 (+)  0.998  0.992  cgtCAGTC        GEN_INI
17 V$GEN_INI_B          10 (+)  0.999  0.991  cgtCAGTC        GEN_INI
18 V$GEN_INI3_B         10 (+)  0.996  0.989  cgtCAGTC        GEN_INI
19 V$CAP_01             12 (+)  1.000  0.999  TCAGTccc        cap
20 V$IK2_01             12 (-)  0.978  0.941  tcagTCCCGgga    Ik-2
21 V$CAP_01             19 (-)  0.973  0.970  cggGATGC        cap

For each match, the matrix identifier, the position (within the above 26-bp sequence), the orientation and the similarity (of the core and of the whole matrix) are given. The second to last column shows the fragment of the sequence that matched the matrix (note the orientation!), with the positions of the matrix core in capital letters. Clicking on the matrix name (ID) takes you to the matrix entry in TRANSFAC®, where you can get information about the matrix and its binding factor.

At first glance the list may look rather long. But if you take a closer look you will see that some of the matrices belong to the same factor, e.g. 6, 7, 11 and 12 are all matrices for the vertebrate cAMP-responsive element binding protein (CREB). (Depending on which set of binding sites was used to build a matrix, different matrices were derived; for details see the TRANSFAC® documentation.) It is rather likely that binding of at least some of the factors found (belonging to the listed matrices) to the above sequence could be demonstrated in vitro. Drawing conclusions about the in vivo regulation of a promoter containing the above sequence is more problematic, as this depends on the presence of other sites and on the interaction of the factors binding to them.
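
The two-step comparison described in this answer (core first, whole matrix only where the core passes) can be sketched as follows. This is a simplified illustration, not the Match™ implementation: it scans the (+) strand only, the function names are hypothetical, and the scorer is a toy identity score against a made-up consensus that stands in for the real core and matrix similarity. A core score equal to the cut-off is accepted here so that a cut-off of 1.0 can still produce matches.

```python
def identity(consensus):
    """Toy scorer: fraction of positions identical to `consensus` (stand-in for a matrix score)."""
    def score(fragment):
        return sum(a == b for a, b in zip(fragment.upper(), consensus)) / len(consensus)
    return score

def two_step_scan(sequence, length, core_offset, core_score, matrix_score,
                  core_cutoff, matrix_cutoff, core_width=5):
    """Scan the + strand; run the full comparison only where the core passes."""
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        core = window[core_offset:core_offset + core_width]
        c = core_score(core)
        if c < core_cutoff:
            continue  # step 1 failed: skip the more expensive step 2
        m = matrix_score(window)
        if m >= matrix_cutoff:
            # Show the core in capitals and the flanks in lower case.
            shown = (window[:core_offset].lower() + core.upper()
                     + window[core_offset + core_width:].lower())
            hits.append((start + 1, c, m, shown))  # 1-based position
    return hits

sequence = "cgtgatcgacgtcagtcccgggatgc"
consensus = "TGACGTCA"  # hypothetical CRE-like pattern, not a TRANSFAC matrix
hits = two_step_scan(sequence, len(consensus), 3,
                     identity(consensus[3:8]), identity(consensus),
                     core_cutoff=1.0, matrix_cutoff=0.85)
print(hits)
```

Because most windows fail the cheap five-position test, the full-length comparison runs for only a small fraction of positions, which is the point of the core step.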
 
 
The program ignores the parameters that I am selecting. (For example, I want to limit my search to core similarity = 1.0 and matrix similarity = 0.75, but the program uses "Cut-offs minimized for false negatives" regardless.)

That is just a matter of selecting the right combination of radio buttons:
  • Selecting a sequence:
    Here you have three options:
    • 1. Select one of your stored sequences.
      Please select one of your sequences from the list AND click on the respective radio button (first button in the left column of the second box on the left side of the page).
    • 2. Select an example sequence.
      You also have to click on the respective radio button. (second button in the left column of the second box on the left side of the page)
    • 3. Enter a new sequence:
      Enter a new sequence into the respective textbox and, optionally, enter a sequence name (if you do not enter one, the new sequence will be called "default"), AND click on the respective radio button (third button in the left column of the second box on the left side of the page).
  • Selecting matrices and cut-off values (right grey box):
    • 1a. Selecting a group of matrices:
      (little grey box within the right grey box) Select the group of matrices you want to use.
    • 1b. Selecting cut-offs for a group of matrices:
      (little grey box within the right grey box) When defining cut-offs for this group of matrices, you can choose among cut-offs to minimize false positive matches, to minimize false negative matches and to minimize the sum of both error rates. The fourth option is to enter cut-offs yourself. To select any of these four options, you have to click on the respective radio button in front of the option you want to use.
    • 2. Selecting profiles:
      Select the profile you want to use from the lists AND click on the respective radio buttons (lower left corner of the right grey box).
      No extra selection of cut-offs is needed, because they are already included in the profile.
 
 
What can I do if Match™ does not find all promoter elements listed in the "misc_features" of a GenBank report?

If you are looking for a binding site for a particular factor, please make sure that there is a matrix in TRANSFAC® for this factor. If matrices exist for this factor, the cut-off selection may be the problem. If you use fairly high cut-offs, e.g. cut-offs to minimize false positive matches, you might miss sites. Cut-offs to minimize false positive matches try to filter out all possible random matches, but they do not guarantee that all "real sites" are found. A cut-off that finds all "real" binding sites and filters out all random matches would be optimal, but in most cases these two sets of sites cannot be separated that easily. Cut-offs to minimize the sum of both error rates are just the best possible approximation.
Therefore, to make sure that you do not miss any real sites, use cut-offs to minimize false negative matches.
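
The trade-off between the three kinds of cut-off can be illustrated with a small sketch. It assumes two hypothetical lists of matrix similarity scores, one for known binding sites and one for random matches, and picks a cut-off from the observed scores. The scores, the function names and the selection rule are illustrative assumptions; the cut-offs shipped with Match™ are not necessarily derived this way.

```python
def error_rates(cutoff, site_scores, random_scores):
    """Return (false negative rate, false positive rate) for a given cut-off."""
    fn = sum(s < cutoff for s in site_scores) / len(site_scores)
    fp = sum(s >= cutoff for s in random_scores) / len(random_scores)
    return fn, fp

def choose_cutoffs(site_scores, random_scores):
    """Pick three cut-offs from the observed scores."""
    candidates = sorted(set(site_scores) | set(random_scores))
    # Highest cut-off that still keeps every known site.
    min_fn = min(site_scores)
    # Lowest candidate that rejects every random match (None if there is none).
    min_fp = next((c for c in candidates
                   if error_rates(c, site_scores, random_scores)[1] == 0), None)
    # Candidate with the smallest sum of both error rates.
    min_sum = min(candidates,
                  key=lambda c: sum(error_rates(c, site_scores, random_scores)))
    return {"minFN": min_fn, "minFP": min_fp, "minSum": min_sum}

# Hypothetical scores, for illustration only.
site_scores = [0.80, 0.86, 0.88, 0.90, 0.95]
random_scores = [0.70, 0.75, 0.82, 0.93]
cutoffs = choose_cutoffs(site_scores, random_scores)
print(cutoffs)
```

In this toy data one random match scores higher than most real sites, so rejecting every random match costs four of the five known sites, while keeping every known site lets half of the random matches through.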

Here is one example, which shows that it is possible to find all known promoter elements with Match™. The promoter of the human angiotensinogen gene (GenBank accession: X15323) was searched with Match™ using cut-offs to minimize false negative matches. The list below shows the misc_features of the GenBank entry and the respective matches found by Match™. For each match, the matrix identifier, position and orientation, core similarity score, matrix similarity score, the matching sequence and the name of the binding factor are given.
misc_feature    384..390 /note="cAMP-responsive element":
V$CREB_02    383 (-)  1.000  0.905  ctgCGTCActtg         CREB

misc_feature    complement(548..553) /note="glucocorticoid binding core":
V$GR_Q6      546 (-)  1.000  0.922  acaGAACAgcacatctttc  GR
V$GR_Q6      551 (-)  0.986  0.907  acaGCACAtctttcaatgc  GR

misc_feature    complement(649..662) /note="heat-shock element":
V$HSF1_01    651 (+)  0.974  0.956  GGAAActtcc           HSF1
V$HSF1_01    651 (-)  0.974  0.963  ggaaaCTTCC           HSF1
V$HSF2_01    651 (-)  0.997  0.986  ggaaaCTTCC           HSF2

misc_feature    complement(886..899) /note="estrogen responsive element":
V$ER_Q6      883 (-)  1.000  0.927  ctgGGTCAgaaggcctggg  ER

misc_feature    complement(945..953) /note="acute phase-responsive element":
V$STAT_01    945 (-)  1.000  0.984  ttctGGGAA            STATx

misc_feature    complement(1093..1098) /note="glucocorticoid binding core":
V$GR_Q6     1099 (+)  0.878  0.847  tctggccagccTGTGGtct  GR

misc_feature    complement(1160..1172) /note="hepatocyte-specific promoter element":
V$CEBP_01   1159 (+)  0.806  0.806  agCCTGGgaacag        C/EBP

TATA_signal     complement(1192..1197)
V$TATA_01   1191 (+)  1.000  0.976  ctATAAAtagggcct      TATA
V$MTATA_B   1189 (+)  1.000  0.916  agctATAAAtagggcct    Muscle TATA box

The disadvantage of this approach is that one gets a huge number of false positive matches. To reduce this number you can restrict Match™ to use high quality matrices only.
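
When comparing a long match list against annotated features, a simple overlap test can help. The sketch below assumes that the position reported for a match is the 1-based leftmost position of the matched fragment, regardless of strand, and that feature ranges are inclusive; both conventions are assumptions made for this illustration, and the helper names are hypothetical. The coordinates are taken from the example above.

```python
def match_span(start, matched_sequence):
    """Inclusive (start, end) of a match, assuming `start` is its 1-based leftmost position."""
    return start, start + len(matched_sequence) - 1

def overlaps(feature, match):
    """True if two inclusive (start, end) ranges share at least one position."""
    return feature[0] <= match[1] and match[0] <= feature[1]

cre_feature = (384, 390)                          # cAMP-responsive element
tata_feature = (1192, 1197)                       # TATA_signal
creb_match = match_span(383, "ctgCGTCActtg")      # V$CREB_02
tata_match = match_span(1191, "ctATAAAtagggcct")  # V$TATA_01

print(overlaps(cre_feature, creb_match))    # CREB match against the CRE feature
print(overlaps(tata_feature, tata_match))   # TATA match against the TATA signal
print(overlaps(tata_feature, creb_match))   # unrelated pair
```

A test like this only says that a match and a feature share positions; it does not say that the match is the annotated element.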
 
 
How can I reduce the number of false positive matches and make sure that I do not lose any "real" sites at the same time?

Cut-offs to minimize false positive matches try to filter out all possible random matches, but they have the disadvantage that some "real sites" are also missed. "Real sites" do not necessarily have the highest matrix similarity score, because the binding affinity of a factor does not depend only on the sequence of its binding site. An optimal cut-off would find all "real" binding sites and filter out all random matches, but it is rarely possible to separate these two sets of sites so cleanly. Cut-offs to minimize the sum of both error rates are the best possible approximation. If you do not want to lose any real sites, you should use cut-offs to minimize false negative matches. To reduce the number of false positive matches, you can then restrict your search to high quality matrices only. For a high quality matrix, the frequency of matches found in exon2 sequences drops below 0.01 when using a high cut-off (e.g. 1.00). So, for example, short matrices, which produce a large number of random matches, are filtered out.

However, the large number of false positive matches is in fact the general limitation of this type of analysis, in which one just tries to identify all possible subsequences that might be potential transcription factor binding sites. What Match™ actually identifies, with a great deal of certainty, is the potential for various transcription factors to bind to the DNA, a potential that could be realized under specific cellular conditions and in a specific PROMOTER CONTEXT.
Analyzing the overall structure of promoters to understand the promoter context seems to be a more promising approach than just searching a promoter for single binding sites. Above all, this concerns certain combinations of TF sites that are specific for particular types of promoters. Searching for such combinations is much more specific and produces fewer false positives. You may want to take a look at our paper on composite elements in immune responsive genes: Kel et al., JMB (1999) 288, 353-376.
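
As a rough illustration of searching for combinations rather than single sites, the sketch below lists pairs of matches for two chosen factors whose start positions lie within a given distance of each other. The function name, the distance threshold and the choice of factor pair are hypothetical; the positions are taken from the angiotensinogen example above, and the pair shown is not claimed to be a real composite element.

```python
def site_pairs(hits, factor_a, factor_b, max_distance):
    """Return (position_a, position_b) pairs whose start positions are at most `max_distance` apart."""
    a_sites = [pos for factor, pos in hits if factor == factor_a]
    b_sites = [pos for factor, pos in hits if factor == factor_b]
    return [(a, b) for a in a_sites for b in b_sites
            if abs(a - b) <= max_distance]

# (factor name, match position) taken from the example match list.
hits = [("CREB", 383), ("GR", 546), ("GR", 551), ("HSF1", 651),
        ("ER", 883), ("STATx", 945), ("C/EBP", 1159), ("TATA", 1191)]

print(site_pairs(hits, "C/EBP", "TATA", 50))   # one pair, 32 bp apart
print(site_pairs(hits, "CREB", "GR", 100))     # no pair within 100 bp
```

Requiring two sites in a fixed arrangement removes most isolated random matches, which is why such a search tends to be more specific than a search for single sites.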


How do I search a promoter DNA sequence for potential transcription factor binding sites?

To search a DNA sequence for potential transcription factor binding sites, you can use one of the following programs, which are supplied with TRANSFAC® Professional: Match™ or Patch™. Patch™ compares the sequence with the site entries in TRANSFAC®, whereas Match™ compares it with the nucleotide distribution matrices. As a result you get a list of those sites or matrices, respectively, that matched your sequence. In addition to the tabular view, you can view your results graphically. Normally Match™ will be the preferred program, not least because its output (search result) should be more concise and therefore easier to read and interpret. Patch™ is indispensable for those factors for which only a few single binding sites, but no binding matrices, are available in TRANSFAC®.
Regardless of whether you use Match™ or Patch™, please keep in mind that a single potential binding site or matrix match for one factor or another in a promoter or other sequence is not very significant on its own; such matches need to be seen in context.


I would like to receive more information about the tissue-specific profiles:
How were they constructed?

  1. We selected a number of genes described in the TRANSFAC® Gene table that are highly inducible in different cells of a certain tissue. Both human and mouse genes were selected.
  2. We created a list of transcription factors (TFs) that have been experimentally shown to bind specific DNA sites in the promoters of those genes and to regulate their transcription. Thus, widely expressed TFs that play an important role in the transcriptional regulation of genes within a certain tissue are also included in the list.
  3. TRANSFAC® matrices for those TFs were selected.
  4. For some of the TFs there are several matrices in TRANSFAC® (for instance, GATA, Oct). In these cases, only the best matrix in a group was selected for the profile.
  5. The default cut-offs are those that minimize false negative matches.



I would like to receive more information about the tissue-specific profiles:
What kind of matrices have you included in it?

The user can easily find a list of all matrices that are used to construct each profile in the following way. The button "Goto Match Profiler" is located in the upper right corner of the first page of Match™. Press this button to get to the next page. The second line at the top is "Select one of our predefined profiles". The user can select one and press the "VIEW" button. This immediately produces a list of the matrices that are used to construct the selected profile.

The user can then go to each matrix to get more information about the matrix itself.
On the same page, the user can modify our predefined profile and thereby construct their own. To include additional matrices, select them from the list below and press the "Select" button.

The selected matrices are listed on the next page. Mark those you want to include in your profile and press the "include" button. On the next page you can select cut-offs, choose a name for your profile and save it. The next step is to "Restart Profiler". Following this step, you will find your profile included in the selection box "Select one of your predefined profiles". Alternatively, instead of "Restart Profiler" you can choose "Goto Match", which will take you to the Match query form. Here you will find your profile in the selection box "your profiles".


I would like to receive more information about the tissue-specific profiles:
How did you build these matrices (for which species)? Is there a specific human matrix in this profile?
Why don't you make a human-specific profile?

For profile construction we have used only vertebrate matrices from TRANSFAC® (V$*). Many of them are built from human, mouse and rat DNA sites and are therefore mammalian matrices.
DNA-binding domains of orthologous mammalian factors (for example, mouse and human E2F-1) are homologous and are able to recognize the same binding sites on DNA. Moreover, rat or mouse recombinant factors are often used to study transcriptional regulation of human genes, and vice versa. We therefore suggest that mammalian matrices are useful for searching DNA sequences of any mammalian species.


What workflow would you recommend to analyze differential gene expression data with Match™?

  1. Get the 5' regions of the relevant genes (~2000 bp each).
  2. Identify transcription factor binding sites with Match™. For Match™ it is recommended that you prepare your own profile ("Goto Match™ Profiler"). In your profile, first enter matrices for factors that you know; good suggestions for such matrices are given in the profile "best_selection", and/or you can use the function-specific profiles (liver, cell-cycle). Good hints as to which factors to include in the profile can also be found by searching the TRANSFAC® Factor table using keywords related to your experiment (e.g. factors inducible by a certain inducer, or tissue-specific factors). Use high cut-offs (minFP) first (this is the default in the Match™ Profiler).
  3. Run the search. If the number of sites found is not sufficient for the classification, decrease the cut-offs a bit and/or add more matrices.
  4. Grouping the genes and comparing them with the expression data is best done manually. However, we are currently developing an automatic system for that purpose. Some programs of this system are now available in-house for service-based evaluation of customer data.