How to use Match™, Match™ Profiler and the Matrix Generation Page


 
 
Introduction

The Match™ tool is designed to search for potential transcription factor (TF) binding sites in any sequence of interest. Match™ uses a library of mononucleotide weight matrices from TRANSFAC®.
As a further feature, Match™ allows you to narrow your search by using profiles. We use the term "profile" for a specific subset of weight matrices from the TRANSFAC® library, together with a core similarity cut-off and a matrix similarity cut-off for each matrix.

The Match™ Profiler provides a tool for creating, editing and deleting matrix profiles. For example, you can create tissue-specific matrix profiles that include weight matrices for tissue-specific transcription factors.

You can create your own matrices from a set of aligned sequences on the Matrix Generation Page. Optimized cut-off values are also estimated for these matrices, so that they can be used directly in a Match™ search or included in profiles on the Match™ Profiler page.


How to cite Match™

So far we have not published a paper on Match™. However, you can refer to the latest poster, which was presented at the German Conference on Bioinformatics 2001 (available at http://www.bioinfo.de/isb/gcb01/poster/index.html). Additionally, please refer to our company as the creator of this tool.

Match™
 
Viewing or deleting previous results
The results of each search you perform with Match™ are stored in your user directory. You can view or delete these results.
 
Deleting a previously stored sequence
Each sequence you enter into the Match™ form is stored (see also below). At the top of the page, you can delete the sequences you no longer want to keep.
 
Starting a new search

1. Enter a name for your search
You should first enter a name for your search, since Match™ will store your search result under that name. If you do not enter a name, Match™ uses "default" as the result name.

2. Select a sequence
You have three options for selecting a sequence you would like to search:
 
a) Select one of your stored sequences:
If you select this option, you can choose among the sequences you have entered for a previous search.
 
b) Select an example:
If you choose this option, an example sequence will be used for your search. It is the 5' flank of the rat tyrosine aminotransferase (TAT) gene (EMBL: M34257).
 
c) Enter a new sequence:
To run the search with a new sequence, you should first enter a name for it. The sequence will be stored under that name so that you can use it again for a later search. Next, you can insert your sequence.
The following formats are accepted: FASTA, TRANSFAC, EMBL, GenBank, IG and RAW (RAW format means the pure sequence). Examples of each format are given below. The IUPAC code characters 'B', 'D', 'H', 'K', 'M', 'R', 'S', 'V', 'W' and 'Y' within a sequence are changed to 'N'. You can enter one or several sequences at a time, provided that all sequences use the same format, with one exception: in RAW format only one sequence can be entered at a time.
The maximum length of a sequence you can search with Match™ is 300,000 bp.
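
The two input rules above (ambiguity codes become 'N', and a sequence may not exceed 300,000 bp) can be pre-checked locally before pasting a sequence into the form. The following is a minimal sketch, not the program's own code: the function and constant names are hypothetical, and preserving lower case (writing 'n' for a lower-case ambiguity code) is an assumption made for readability.

```python
# Hypothetical helper; only the 300,000 bp limit and the list of
# ambiguity codes are taken from the text above.
MAX_LENGTH = 300_000
AMBIGUITY_CODES = "BDHKMRSVWY"

# Assumption: lower-case ambiguity codes become 'n', upper-case ones 'N'.
_TABLE = str.maketrans(
    AMBIGUITY_CODES + AMBIGUITY_CODES.lower(),
    "N" * len(AMBIGUITY_CODES) + "n" * len(AMBIGUITY_CODES),
)


def normalize_sequence(raw: str) -> str:
    """Strip whitespace, replace ambiguity codes and check the length."""
    # Remove all newlines and whitespace, as described for RAW format.
    seq = "".join(raw.split())
    # Replace each IUPAC ambiguity code with N (or n).
    seq = seq.translate(_TABLE)
    if len(seq) > MAX_LENGTH:
        raise ValueError(
            f"sequence has {len(seq)} bp; the limit is {MAX_LENGTH} bp"
        )
    return seq
```

For example, normalize_sequence("acgt RYk\nTT") yields "acgtNNnTT".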
 
RAW format:
(All newlines and whitespace will be ignored.)
acacgtagctagctagctgatcgtagctagtcgatcgtagctagctagctgatcgatgctagctgatcgtagctagtcgatag
tctagctagctagtcgatcgtagctagtcgatgctagctagctgtgtgtagctagtcgatcgatgctagctgatcgatcgtaa
gtctgatctagctagctagcgatcgtagctgatcgtagctagcatgctagtcgatgca


FASTA format:
>seq1
acagctagctacgatgatcgatcgatgctacgtcgtagtacgatcgtacg


TRANSFAC format:
(Only the fields essentially needed to recognize an entry in TRANSFAC format are shown. More fields may be included.)
AC  R00000
XX
ID  EXAMPLE$EXAMPLE_02
XX
SQ  CTCCATGGGAGTTTCTGAAGAACCTTAGTTAATAATTTTCACAGCTGTGCAC.
XX
//


EMBL format:
(Only the fields essentially needed to recognize an entry in EMBL format are shown. More fields may be included.)
ID   EXAMPLE standard; DNA; PLN; 360 BP.
XX 
SQ   Sequence 360 BP; 63 A; 92 C; 97 G; 108 T; 0 other;
     ctgcagcccc ggtttcgcaa agttaataat tttcagccgc gcacgtggtt ggccaaaccg 60
     caccctcctt cccgtcgttt cccatctctt cctcctttag agctaccact atataaatca 120
     gggctcattt tctcgctcct cacaggctca tctcgctttg gatcgattgg tttcgtaact 180
     ggtgagggac tgagggtctc ggagtggatt gatttgggat tctgttcgaa gatttgcgga 240
     ggggggcaat ggcgaccgcg gggaaggtga tcaagtgcaa aggtccgcct tgtttctcct 300
     ctgtctcttg atctgactaa tcttggttta tgattcgttg agtaattttg gggaaagctt 360
//


GenBank format:
(Only the fields essentially needed to recognize GenBank format are shown. You may include more fields.)
LOCUS     EXAMPLE 360 bp DNA PLN 13-JUN-1996
ACCESSION K00000 
            1 ctgcagcccc ggtttcgcaa agttaataat tttcagccgc gcacgtggtt gcccacaggc
           61 caccctcctt cccgtcgttt cccatctctt cctcctttag agctaccact atataaatca
          121 gggctcattt tctcgctcct cacaggctca tctcgctttg gatcgattgg tttcgtaact
          181 ggtgagggac tgagggtctc ggagtggatt gatttgggat tctgttcgaa gatttgcgga
          241 ggggggcaat ggcgaccgcg gggaaggtga tcaagtgcaa aggtccgcct tgtttctcct
          301 ctgtctcttg atctgactaa tcttggttta tgattcgttg agtaattttg gggaaagctt
//


IG format:
;seq_1
seq_1
acagctagtcgatcgatcgatgctagctgatcgtagctgatcgtagctaacgtgtagctagtcgacgtagctacgg1
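
The six examples above differ in their first line, which makes it easy to check which format a file is in before submitting it. The sketch below guesses the format from the first non-blank line only. It is an illustration based solely on the examples shown here; how Match™ itself recognizes formats is not described in this document, and the function name is hypothetical.

```python
def guess_format(text: str) -> str:
    """Guess the sequence format from the first non-blank line.

    Assumption: entries start the way the examples in this help page do
    (e.g. a TRANSFAC entry begins with an AC line, an EMBL entry with ID).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    first = lines[0]
    if first.startswith(">"):
        return "FASTA"
    if first.startswith(";"):
        return "IG"
    if first.startswith("LOCUS"):
        return "GenBank"
    if first.startswith("AC "):
        return "TRANSFAC"
    if first.startswith("ID "):
        return "EMBL"
    # Anything else is treated as a pure sequence.
    return "RAW"
```

Because only the first line is inspected, this check cannot tell whether the rest of the entry is well formed.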


3. Select a group of matrices or a profile to run Match™
 
a) Matrix selection

You can select the groups of matrices from the library that are of interest to you. Several groups can be combined in one search; just mark those you would like to use. The following groups are available: vertebrates, insects, plants, fungi, bacteria and nematodes. You can also run Match™ with all matrices in the library.

There are two additional options for specifying the set of matrices you would like to use. The first option is to restrict the search to high quality matrices only. The high quality criterion is defined as follows: when a matrix is used with a cut-off that allows a false negative rate of 50%, the frequency of matches found in exon 3 sequences (the false positive rate) must be below a certain threshold. This threshold is set so that the matrices producing the highest numbers of false positive matches (about 30% of the TRANSFAC® matrices) are classified as low quality matrices.

The second option is to include user-defined matrices. If you have created matrices on the Matrix Generation page, you can now include them in your search. Please make sure that the group of matrices you have selected for your search contains your matrix.

For any group of matrices, you can specify the cut-offs for core similarity and matrix similarity. The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequence. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix (i.e. the five most conserved positions within the matrix) and a part of the input sequence. A match has to contain the "core sequence" of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only those matches that score higher than or equal to the matrix similarity cut-off appear in the output.
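
The two-threshold rule described above can be written down in a few lines. The sketch below only illustrates the acceptance rule (both scores must reach their cut-offs); it does not compute the scores themselves, and the class, field and function names as well as the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate match; the field names are illustrative only."""
    matrix_id: str
    position: int
    strand: str
    core_score: float
    matrix_score: float


def passes_cutoffs(hit: Hit, core_cutoff: float, matrix_cutoff: float) -> bool:
    # Both scores must be higher than or equal to their cut-offs.
    return hit.core_score >= core_cutoff and hit.matrix_score >= matrix_cutoff


def filter_hits(hits, core_cutoff: float, matrix_cutoff: float):
    """Keep only the hits that would appear in the output."""
    return [h for h in hits if passes_cutoffs(h, core_cutoff, matrix_cutoff)]


# Made-up example values, not real search results.
hits = [
    Hit("MX_A", 10, "+", 1.00, 0.95),  # passes both cut-offs
    Hit("MX_A", 42, "-", 0.70, 0.99),  # fails the core cut-off
    Hit("MX_B", 7, "+", 0.90, 0.80),   # passes: equality is accepted
]
kept = filter_hits(hits, core_cutoff=0.75, matrix_cutoff=0.80)
```

Note that a hit with a very high matrix similarity is still discarded if its core similarity falls below the core cut-off.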

Selecting appropriate cut-offs is very important and depends largely on the user's objectives. Exact matches between matrix and sequence can lack any biological relevance, while some transcription factors have low-affinity binding sites of biological significance. We have therefore calculated three different kinds of cut-offs, each serving a different purpose.
You can use cut-offs to minimize false positive matches, to minimize false negative matches or to minimize the sum of both error rates. It is also possible to define a core and a matrix similarity cut-off which are used for all matrices of the selected group.


b) Profile selection

Instead of selecting a larger group of matrices, you can also use a predefined profile. Each profile includes a subset of matrices with defined cut-offs. You can either use one of our predefined profiles or one you have created yourself on the Match™ Profiler page.

Your predefined profiles:
If you come from the Match™ Profiler page, you will find your new profile in the list of your predefined profiles.
The profiles you created using the TRANSFAC® search engine are also listed there. You can recognize them by their names, which are constructed as follows: "month_day_hour-min-sec.prf"

To create a profile with the TRANSFAC® search engine, please follow these steps:
  1. Please use the TRANSFAC® query form "MATRIX SEARCH" to search for specific matrices in TRANSFAC®. For example, you can enter AP-1 in the text field "Search Term" and then select "Binding Factor" in the "Quick Search Fields". When you then press the "Submit Query" button, you will receive a list of AP-1 matrices. Next to each site entry you will find a box.
  2. Please mark the boxes for those entries that you would like to include in a Match™ search.
  3. Then scroll to the bottom of the list. Here you will find a box with the text "Run Match™ with marked entries". Please mark this box as well.
  4. Now please click on "Show marked entries/Start MATCH™". Match™ will then be started and you will find your selection of sites among the user-defined profiles.
Profiles created in this way will always contain minFP cut-offs.

Predefined profiles provided by Match™:
We mainly provide tissue-specific profiles. For each profile, groups of transcription factors known to be active in a particular tissue were collected with the help of information from the TRANSFAC® database. Matrices linked to these transcription factors in TRANSFAC® were then collected. When more than one matrix was linked to a transcription factor, we had to decide which matrix to include in the profile, using the following criterion: where possible (i.e. if such a matrix existed), only matrices that fulfilled the "high quality criterion" were accepted. If several matrices fulfilling the criterion were linked to the same transcription factor, we estimated how many of the genomic binding sites in TRANSFAC® for this particular transcription factor could be recognized with each of these matrices. The profile includes the matrix that has the lowest level of false positive matches in exon 2 and exon 3 sequences while still identifying 90% of the set of genomic binding sites. The cut-offs used in the tissue-specific profiles are those that minimize false negative matches.

We offer the following tissue-specific profiles:
  • immune cell-specific profile
    This profile is designed to search for potential binding sites within regulatory regions of genes whose transcription is induced upon immune response in T cells, B cells, mast cells, myeloid cells, natural killer cells and macrophages. Click here to view a list of factors known to be active in this tissue.

  • cell cycle-specific profile
    This profile is designed to search for potential binding sites within regulatory regions of genes whose expression is dependent on the stage of cell cycle. Click here to view a list of factors known to be active in this tissue.

  • muscle-specific profile
    This profile is designed to search for potential binding sites within regulatory regions of muscle-specific genes. Click here to view a list of factors known to be active in this tissue.

  • liver-specific profile
    This profile is designed to search for potential binding sites within regulatory regions of liver-enriched genes. Click here to view a list of factors known to be active in this tissue.

All gene and factor lists were created with the help of the TRANSFAC® database.

We also offer an additional profile called the "best selection profile". This profile was constructed in the following way: TRANSFAC® contains several groups of matrices for the same transcription factor. We selected some of these groups and included just one matrix from each such group in the "best selection profile". To decide which matrix to use, we applied the criteria described above for the tissue-specific profiles. For this profile we chose cut-offs that minimize false negative matches.


4. Submit the form

Press the Submit button and a results page will appear.





The results page

The results page displays a listing of all matches found in the input sequence. The output of the program is limited to 500,000 matches per sequence. The results are presented in a table with the following columns:

  1. identifier of the respective matrix
Each identifier is linked to the TRANSFAC® entry of its matrix or, if it is a user-defined matrix, the respective matrix is displayed.
  2. position of the match in the input sequence and the strand ((+) or (-)) in which it can be found
  3. score for core similarity (core match)
  4. score for matrix similarity (matrix match)
  5. matching sequence
Capital letters indicate the positions in the sequence that match the core sequence of the matrix, while lower-case letters refer to the remaining positions of the matrix.
  6. name of the factor whose binding site is represented by the matrix
If an entry exists, the factor name is linked to a selection of the TRANSFAC® factor table, showing all entries of this factor mentioned in the respective matrix entry (TRANSFAC® matrix table).

It is also possible to view a graphic output of the results. Here the identifier of a matching matrix is "aligned" to the sequence being searched.

When you use the "Back" button of your browser to return to the Match™ page, please press the Reset button. The new results will then appear in the list of your results.



The last three lines of the results page give the total length of all sequences which have been searched, the total number of sites that have been found and the frequency of sites per nucleotide.
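
The three summary values are related by a simple division: the frequency of sites per nucleotide is the total number of sites divided by the total length searched. A minimal sketch of that arithmetic follows; the function name and the dictionary keys are hypothetical and do not reflect the layout of the results page.

```python
def summarize(sequence_lengths, site_count: int) -> dict:
    """Compute the three summary values from per-sequence lengths."""
    total_length = sum(sequence_lengths)
    if total_length <= 0:
        raise ValueError("no sequence was searched")
    return {
        "total_length": total_length,
        "total_sites": site_count,
        # Frequency of sites per nucleotide.
        "sites_per_nucleotide": site_count / total_length,
    }


# Made-up example: two sequences of 360 bp and 640 bp with 25 sites in total.
summary = summarize([360, 640], 25)
```

In this made-up example the total length is 1000 bp, so 25 sites correspond to a frequency of 0.025 sites per nucleotide.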

A flatfile version of the results can be found in the directory:
"<CGI-BIN-DIRECTORY-OF-WEB-SERVER>/biobase/transfac/<VERSION>/match/etc/usr/<USER-LOGIN>/"
The flatfiles have the extension ".out".
 
 
Match™ Profiler
 
Create a new profile

1. Matrix selection:

To create a new profile, you should first select the matrices which might be of interest to you. This can be done in two ways:
  The first possibility is to select them directly from the list (on the left) by clicking on one or several matrices that you would like to have a closer look at and then pressing the Select button. The list can be sorted either by accession number or by factor name by pressing the respective button under the list of matrices.
  The second possibility is to search by accession number or factor name (on the right). You can enter either a list of accession numbers or a list of factor names, separated by commas. Then press the Search button.
After pressing one of the two buttons, the matrices will be listed in the middle of the page. The name of the factor whose binding site is described by the matrix, the accession number and the quality of the matrix are given. (If you click on an accession number, you can view the respective matrix.) In addition, the false positive frequencies for each selected matrix are listed for cut-offs that allow a false negative rate of 10%, 30%, 50%, 70% or 90%. The false positive rate was estimated on exon 3 sequences, while sets of generated oligonucleotides were used to calculate the false negative rate. This new feature helps you to select appropriate matrices for a certain task. When there are several matrices for the same factor, it is generally recommended to include in the profile the matrix that produces the fewest false positive matches at the desired false negative rate.
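
The recommendation above amounts to taking a minimum over the listed false positive frequencies. The sketch below shows this for a hand-typed table; the function name, the accession placeholders and all numbers are made up for illustration and are not taken from TRANSFAC®.

```python
def pick_matrix(candidates: dict, fn_rate: int) -> str:
    """Return the accession with the lowest false positive frequency.

    candidates maps an accession to a dict of
    {allowed false negative rate in percent: false positive frequency}.
    """
    return min(candidates, key=lambda acc: candidates[acc][fn_rate])


# Hypothetical matrices for the same factor, with made-up frequencies.
candidates = {
    "MX_A": {10: 0.004, 50: 0.001},
    "MX_B": {10: 0.002, 50: 0.003},
}
best_at_10 = pick_matrix(candidates, 10)  # "MX_B"
best_at_50 = pick_matrix(candidates, 50)  # "MX_A"
```

As the example shows, the preferred matrix can change with the false negative rate you are willing to accept, so it is worth deciding on that rate first.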

2. Include matrices in a profile:

Once you have had a closer look at the matrices, please mark those you would like to include in your profile. Now press the Include button.

3. Cut-off selection

After pressing the Include button, another page will appear, where you can specify individual cut-offs for every matrix. You can choose among cut-offs to minimize false positive matches (minFP), to minimize false negative matches (minFN) and to minimize the sum of both error rates (minSUM). The cut-offs allowing a false negative rate of 10% (FN10), 30% (FN30), 50% (FN50), 70% (FN70) and 90% (FN90) are also given. Please keep in mind that minFN and FN10 are identical. The number of false positive matches for these cut-offs, which was estimated on exon 3 sequences, is given in brackets. If you like, you can enter your own cut-offs in the field "current". If you are editing an existing profile, this field shows the cut-offs that are currently stored in the profile.
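
Conceptually, the result of this step is a profile: a set of matrices, each with one core similarity cut-off and one matrix similarity cut-off. The sketch below models that as a plain dictionary and fills it from one of the three named cut-off kinds. The data layout, the names and the numbers are assumptions for illustration; the file format in which the Profiler stores profiles is not described here.

```python
CUTOFF_KINDS = ("minFP", "minFN", "minSUM")


def build_profile(cutoff_table: dict, selected, kind: str) -> dict:
    """Map each selected matrix to its (core, matrix) cut-off pair.

    cutoff_table maps accession -> kind -> (core_cutoff, matrix_cutoff).
    """
    if kind not in CUTOFF_KINDS:
        raise ValueError(f"unknown cut-off kind: {kind!r}")
    return {acc: cutoff_table[acc][kind] for acc in selected}


# Hypothetical accessions and made-up cut-off values.
cutoff_table = {
    "MX_A": {"minFP": (1.00, 0.98), "minFN": (0.75, 0.80), "minSUM": (0.90, 0.90)},
    "MX_B": {"minFP": (0.95, 0.97), "minFN": (0.70, 0.78), "minSUM": (0.85, 0.88)},
}
profile = build_profile(cutoff_table, ["MX_A", "MX_B"], "minFN")
```

Individual entries of the resulting dictionary can then be overwritten by hand, which mirrors entering your own values in the field "current".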

Mark the matrices you would like to save to your profile and make sure that you have specified cut-offs for them. Then please enter a name for your profile and press the Save button.

If the profile was saved successfully, it will be displayed. You can now run Match™ with your profile. Just go to Match™, mark the option "user-defined profile" and select your profile from the list.



Edit or delete an existing profile

Delete an existing profile

At the top of the page you can delete any predefined profiles you no longer need. Just select the profile and press the Delete button.

Edit an existing profile
You can also edit one of our profiles or one of your predefined profiles. If you choose one of our profiles, your edited version will be stored in your directory, which means it will then be one of your predefined profiles. After pressing the Edit button, the matrices of the profile will be displayed in the middle of the page in the same way as described above. You can now add further matrices to your profile by performing the steps described above.