|
How to use Match™, Match™ Profiler and the Matrix Generation Page
|
|
Introduction |
The Match™ tool is designed to search for potential transcription factor (TF) binding sites in any sequence of interest. Match™ uses a library of mononucleotide weight matrices from TRANSFAC®.
As a further feature, Match™ allows you to refine your search by using profiles. We use the term "profile" for a specific subset of weight matrices from the TRANSFAC® library, with a core similarity cut-off and a matrix similarity cut-off for each matrix.
The Match™ Profiler is a tool for creating, editing and deleting matrix profiles. For example, you can create tissue-specific matrix profiles that include weight matrices for tissue-specific transcription factors.
You can create your own matrices from a set of aligned sequences on the Matrix Generation Page. Optimized cut-off values are also estimated for these matrices, so that they can be used directly for a search in Match™ or included in profiles on the Match™ Profiler page.
|
How to cite Match™ |
So far, we have not published a paper on Match™. However, you can refer to the latest poster, which was published at the German Conference on Bioinformatics 2001 (available at http://www.bioinfo.de/isb/gcb01/poster/index.html). Additionally, please refer to our company as the creator of this tool.
|
Match™ |
|
Viewing or deleting previous results |
The results of each search you perform with Match™ are stored in your user directory. You can view or delete these results.
|
Deleting a previously stored sequence |
Each sequence you enter into the Match™ form is stored (see also below). At the top of the page, you can delete the sequences you no longer want to keep.
|
Starting a new search
|
1. Enter a name for your search
You should first enter a name for your search, since Match™ stores your search result under that name. If you do not enter a name, Match™ uses "default" as the result name.
|
2. Select a sequence
You have three options for selecting a sequence you would like to search:
|
|
a) Select one of your stored sequences:
If you select this option, you can choose among the sequences you have entered for a previous search.
|
|
b) Select an example:
If you choose this option, an example sequence will be used for your search. It is the 5' flank of the rat tyrosine aminotransferase (TAT) gene (EMBL: M34257).
|
|
c) Enter a new sequence:
To run the search with a new sequence, first enter a name for it. The sequence is stored under that name so that you can use it again in a later search. Next, insert your sequence.
The following formats are accepted: FASTA, TRANSFAC, EMBL, GenBank, IG and RAW. (RAW format means the pure sequence.) Examples of each format are given below. The IUPAC code characters 'B', 'D', 'H', 'K', 'M', 'R', 'S', 'V', 'W' and 'Y' within a sequence are changed to 'N'. Provided that all sequences use the same format, you can enter one or several sequences at a time, with one exception: in RAW format, only one sequence can be entered at a time.
The maximum length of a sequence you can search with Match™ is 300,000 bp.
|
RAW format: (All newlines and whitespace will be ignored.)
|
acacgtagctagctagctgatcgtagctagtcgatcgtagctagctagctgatcgatgctagctgatcgtagctagtcgatag
tctagctagctagtcgatcgtagctagtcgatgctagctagctgtgtgtagctagtcgatcgatgctagctgatcgatcgtaa
gtctgatctagctagctagcgatcgtagctgatcgtagctagcatgctagtcgatgca
|
FASTA format: |
>seq1
acagctagctacgatgatcgatcgatgctacgtcgtagtacgatcgtacg
|
TRANSFAC format: (Only the fields essentially needed to recognize an entry in TRANSFAC format are shown. More fields may be included.) |
AC R00000
XX
ID EXAMPLE$EXAMPLE_02
XX
SQ CTCCATGGGAGTTTCTGAAGAACCTTAGTTAATAATTTTCACAGCTGTGCAC.
XX
//
|
EMBL format: (Only the fields essentially needed to recognize an entry in EMBL format are shown. More fields may be included.) |
ID EXAMPLE standard; DNA; PLN; 360 BP.
XX
SQ Sequence 360 BP; 63 A; 92 C; 97 G; 108 T; 0 other;
ctgcagcccc ggtttcgcaa agttaataat tttcagccgc gcacgtggtt ggccaaaccg 60
caccctcctt cccgtcgttt cccatctctt cctcctttag agctaccact atataaatca 120
gggctcattt tctcgctcct cacaggctca tctcgctttg gatcgattgg tttcgtaact 180
ggtgagggac tgagggtctc ggagtggatt gatttgggat tctgttcgaa gatttgcgga 240
ggggggcaat ggcgaccgcg gggaaggtga tcaagtgcaa aggtccgcct tgtttctcct 300
ctgtctcttg atctgactaa tcttggttta tgattcgttg agtaattttg gggaaagctt 360
//
|
GenBank format: (Only the fields essentially needed to recognize GenBank format are shown. You may include more fields.) |
LOCUS EXAMPLE 360 bp DNA PLN 13-JUN-1996
ACCESSION K00000
1 ctgcagcccc ggtttcgcaa agttaataat tttcagccgc gcacgtggtt gcccacaggc
61 caccctcctt cccgtcgttt cccatctctt cctcctttag agctaccact atataaatca
121 gggctcattt tctcgctcct cacaggctca tctcgctttg gatcgattgg tttcgtaact
181 ggtgagggac tgagggtctc ggagtggatt gatttgggat tctgttcgaa gatttgcgga
241 ggggggcaat ggcgaccgcg gggaaggtga tcaagtgcaa aggtccgcct tgtttctcct
301 ctgtctcttg atctgactaa tcttggttta tgattcgttg agtaattttg gggaaagctt
//
|
IG format: |
;seq_1
seq_1
acagctagtcgatcgatcgatgctagctgatcgtagctgatcgtagctaacgtgtagctagtcgacgtagctacgg1
|
| |
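The input rules described above (whitespace is ignored, the listed IUPAC ambiguity codes become 'N', and a sequence may not exceed 300,000 bp) can be illustrated with a minimal Python sketch. It is not the Match™ parser: the function names `normalize` and `read_fasta` are hypothetical, only RAW and FASTA input are handled, and preserving the letter case of the remaining characters is an assumption.

```python
from typing import Dict, List

MAX_LENGTH = 300_000             # sequence length limit stated in the text
AMBIGUITY_CODES = "BDHKMRSVWY"   # IUPAC codes that are changed to 'N'

# Map upper- and lower-case ambiguity codes to 'N' (case handling is an assumption).
_TO_N = str.maketrans({c: "N" for c in AMBIGUITY_CODES + AMBIGUITY_CODES.lower()})


def normalize(raw: str) -> str:
    """Clean one RAW sequence: drop whitespace, replace ambiguity codes, check length."""
    seq = "".join(raw.split())   # newlines and other whitespace are ignored
    seq = seq.translate(_TO_N)
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"sequence has {len(seq)} bp, limit is {MAX_LENGTH}")
    return seq


def read_fasta(text: str) -> Dict[str, str]:
    """Read one or several FASTA records and normalize each sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = normalize("".join(chunks))
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = normalize("".join(chunks))
    return records
```

For example, `read_fasta(">seq1\nacgr\ntt\n")` yields one record named `seq1` in which the ambiguity code `r` has been replaced by `N`.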
3. Select a group of matrices or a profile to run Match™ |
|
a) Matrix selection
You can select the groups of matrices from the library that are of interest to you. Several groups can be combined for one search; just mark those you would like to use. The following groups are available: vertebrates, insects, plants, fungi, bacteria and nematodes. You can also run Match™ with all matrices in the library.
There are two additional options for specifying the set of matrices you would like to use. The first option is to restrict the search to high quality matrices. The high quality criterion is defined as follows: when a matrix is used with a cut-off that allows a false negative rate of 50%, the frequency of matches found in exon3 sequences (the false positive rate) must fall below a certain threshold. This threshold is set so that the matrices producing the highest numbers of false positive matches (about 30% of the TRANSFAC® matrices) are classified as low quality matrices.
The second option is to include user-defined matrices. If you have created matrices on the Matrix Generation page, you can now include them in your search. Please make sure that the group of matrices you have selected for your search contains your matrix.
For any group of matrices, you can specify the cut-offs for core similarity and matrix similarity. The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequences. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix (i.e. the five most conserved positions within a matrix) and a part of the input sequence. A match has to contain the "core sequence" of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only those matches that score higher than or equal to the matrix similarity cut-off appear in the output.
Selecting appropriate cut-offs is very important and depends largely on the user's objectives. Requiring exact matches between matrix and sequence is not always biologically meaningful, since some transcription factors have low-affinity binding sites of biological significance. We have therefore calculated three different kinds of cut-offs, each serving a different purpose.
You can use cut-offs that minimize false positive matches, cut-offs that minimize false negative matches, or cut-offs that minimize the sum of both error rates. It is also possible to define one core similarity cut-off and one matrix similarity cut-off that are used for all matrices of the selected group.
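The two-step filtering described above (first the core similarity cut-off, then the matrix similarity cut-off) can be sketched in Python as follows. This is a simplified illustration, not the Match™ implementation: the score used here is a plain min-max normalization of matrix counts, which is an assumption, as are the function names, the toy matrix, and the treatment of the core as five consecutive positions starting at `core_start`. Only the forward strand is scanned.

```python
from typing import Dict, List, Tuple

Column = Dict[str, float]              # base -> count at one matrix position
Hit = Tuple[int, float, float, str]    # (position, core score, matrix score, site)


def similarity(columns: List[Column], site: str) -> float:
    """Min-max normalized score in [0, 1] (a simplification, see the text above)."""
    lowest = sum(min(col.values()) for col in columns)
    highest = sum(max(col.values()) for col in columns)
    if highest == lowest:
        return 0.0
    # Bases missing from a column (e.g. 'N') are scored like the worst base.
    current = sum(col.get(base, min(col.values())) for col, base in zip(columns, site))
    return (current - lowest) / (highest - lowest)


def scan(sequence: str, matrix: List[Column], core_start: int,
         core_cutoff: float, matrix_cutoff: float, core_length: int = 5) -> List[Hit]:
    """Report windows that pass the core cut-off and then the matrix cut-off."""
    seq = sequence.upper()
    width = len(matrix)
    core_end = core_start + core_length
    hits: List[Hit] = []
    for offset in range(len(seq) - width + 1):
        site = seq[offset:offset + width]
        core_score = similarity(matrix[core_start:core_end], site[core_start:core_end])
        if core_score < core_cutoff:
            continue                   # the core sequence must match first
        matrix_score = similarity(matrix, site)
        if matrix_score >= matrix_cutoff:
            hits.append((offset + 1, core_score, matrix_score, site))  # 1-based position
    return hits


# Hypothetical toy matrix with consensus TGACTC; the counts are invented.
TOY_MATRIX: List[Column] = [
    {"A": 1, "C": 1, "G": 1, "T": 7},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]

hits = scan("aaTGACTCaa", TOY_MATRIX, core_start=1, core_cutoff=0.8, matrix_cutoff=0.8)
```

Checking the core first keeps the sketch close to the description: a window whose core scores below the core cut-off is discarded without computing its matrix similarity.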
|
|
b) Profile selection
Instead of selecting a larger group of matrices, you can also use a predefined profile. Each profile includes a subset of matrices with defined cut-offs. You can either use one of our predefined profiles or one you have created yourself on the Match™ Profiler page.
Your predefined profiles:
If you come from the Match™ Profiler page, you will find your new profile in the list of your predefined profiles. The profiles you created using the TRANSFAC® search engine are also listed there. You can recognize them by their names, which have the following form: "month_day_hour-min-sec.prf"
To create a profile with the TRANSFAC® search engine, please follow these steps:
- Use the TRANSFAC® query form "MATRIX SEARCH" to search for specific matrices in TRANSFAC®. For example, you can enter AP-1 in the text field "Search Term" and then select "Binding Factor" in the "Quick Search Fields". When you then press the "Submit Query" button, you will receive a list of AP-1 matrices. Next to each entry you will find a box.
- Mark the boxes for those entries that you would like to include in a Match™ search.
- Then scroll to the bottom of the list. Here you will find a box with the text "Run Match™ with marked entries". Mark this box as well.
- Now click on "Show marked entries/Start MATCH™". Match™ will then be started and you will find your selection among your user-defined profiles.
Profiles created in this way always contain minFP cut-offs.
Predefined profiles provided by Match™:
We mainly provide tissue-specific profiles. For each profile, groups of transcription factors known to be active in a particular tissue were collected with the help of information from the TRANSFAC® database. Matrices linked to these transcription factors in TRANSFAC® were then collected. When more than one matrix was linked to a transcription factor, we had to decide which matrix to include in the profile. We used the following criterion:
If possible (i.e. if such a matrix existed), only matrices that fulfilled the "high quality criterion" were accepted. If several matrices fulfilling the criterion were linked to the same transcription factor, we estimated how many of the genomic binding sites in TRANSFAC® for this particular transcription factor could be recognized with each of these matrices. The profile includes the matrix that has the lowest level of false positive matches in exon2 and exon3 sequences while identifying 90% of the set of genomic binding sites. The cut-offs used in the tissue-specific profiles are those that minimize false negative matches.
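The selection criterion above can be sketched as a small Python function. The `Candidate` record, its field names and the example values are hypothetical and are not taken from TRANSFAC®. The sketch also assumes that, when no high quality matrix exists for a factor, the same false positive comparison is applied to the remaining matrices, a case the text does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Candidate:
    accession: str                  # matrix identifier (hypothetical values below)
    factor: str                     # transcription factor the matrix is linked to
    high_quality: bool              # fulfils the "high quality criterion"
    fp_at_90_recognition: float     # false positive level when 90% of sites are found


def pick_per_factor(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Choose one matrix per factor: prefer high quality, then the lowest false positive level."""
    by_factor: Dict[str, List[Candidate]] = {}
    for cand in candidates:
        by_factor.setdefault(cand.factor, []).append(cand)

    chosen: Dict[str, Candidate] = {}
    for factor, group in by_factor.items():
        good = [c for c in group if c.high_quality]
        pool = good or group        # fall back to all matrices if none is high quality
        chosen[factor] = min(pool, key=lambda c: c.fp_at_90_recognition)
    return chosen


# Invented example data, for illustration only.
EXAMPLE = [
    Candidate("MX1", "AP-1", True, 0.4),
    Candidate("MX2", "AP-1", True, 0.1),
    Candidate("MX3", "AP-1", False, 0.01),
    Candidate("MX4", "HNF-1", False, 0.3),
]
selection = pick_per_factor(EXAMPLE)
```

In this example, `MX3` has the lowest false positive level for AP-1 but is not selected, because high quality matrices take precedence.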
We offer the following tissue-specific profiles:
- immune cell-specific profile
This profile is designed to search for potential binding sites within regulatory regions of genes whose transcription is induced upon immune response in T cells, B cells, mast cells, myeloid cells, natural killer cells and macrophages. Click here to view a list of factors known to be active in this tissue.
- cell cycle-specific profile
This profile is designed to search for potential binding sites within regulatory regions of genes whose expression is dependent on the stage of cell cycle. Click here to view a list of factors known to be active in this tissue.
- muscle-specific profile
This profile is designed to search for potential binding sites within regulatory regions of muscle-specific genes. Click here to view a list of factors known to be active in this tissue.
- liver-specific profile
This profile is designed to search for potential binding sites within regulatory regions of liver-enriched genes. Click here to view a list of factors known to be active in this tissue.
All gene and factor lists were created with the help of the TRANSFAC® database.
We also offer an additional profile called the "best selection profile". This profile has been constructed in the following way:
TRANSFAC® contains several groups of matrices for the same transcription factor. We selected some of these groups and included just one matrix from each such group in the "best selection profile". We used the criteria described above for the tissue-specific profiles to decide which matrix to use. For this profile, we chose cut-offs that minimize false negative matches.
|
|
4. Submit the form
Press the Submit button and a results page will appear.
|
The results page |
The results page displays a list of all matches found in the input sequence. The output of the program is limited to 500,000 matches per sequence. The results are presented in a table with the following columns: |
|
1. identifier of the respective matrix
Each identifier is linked to the TRANSFAC® entry of its matrix or, if it is a user-defined matrix, the
respective matrix is displayed.
|
|
2. position of the match in the input sequence and the strand, (+) or (-), on which it is found |
|
3. score for core similarity (core match)
|
|
4. score for matrix similarity (matrix match)
|
|
5. matching sequence
The capital letters indicate the positions in the sequence that match the core sequence of the matrix, while the lowercase letters refer to the remaining positions of the matrix.
|
|
6. name of the factor whose binding site is represented by the matrix
If an entry exists, the factor name is linked to a selection of the TRANSFAC® factor table showing all entries of this factor mentioned in the respective matrix entry (TRANSFAC® matrix table).
It is also possible to view a graphic output of the results. Here the identifier of a matching matrix is "aligned" to the sequence being searched.
When you use the "Back" button of your browser to return to the Match™ page, please press the Reset button. The new results can then be found in the list of your results.
|
The last three lines of the results page give the total length of all sequences which have been searched, the total number of sites that have been found and the frequency of sites per nucleotide.
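Two details of the results page can be illustrated with a short Python sketch: the letter-case convention of the matching sequence column and the three summary values. The function names are hypothetical, and the core is assumed to be a run of five consecutive positions starting at `core_start`.

```python
from typing import List, Tuple


def show_core(site: str, core_start: int, core_length: int = 5) -> str:
    """Write the core positions in capital letters and the remaining positions in lowercase."""
    core_end = core_start + core_length
    return (site[:core_start].lower()
            + site[core_start:core_end].upper()
            + site[core_end:].lower())


def summarize(sequence_lengths: List[int], sites_found: int) -> Tuple[int, int, float]:
    """Return total sequence length, total number of sites and sites per nucleotide."""
    total_length = sum(sequence_lengths)
    frequency = sites_found / total_length if total_length else 0.0
    return total_length, sites_found, frequency
```

For example, `show_core("tgactcag", 1)` marks positions 2 to 6 of the site as the core, and `summarize([360, 640], 25)` reports 25 sites in 1,000 nucleotides.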
A flatfile version of the results can be found in the directory:
"<CGI-BIN-DIRECTORY-OF-WEB-SERVER>/biobase/transfac/<VERSION>/match/etc/usr/<USER-LOGIN>/"
The flatfiles have the ending ".out".
Match™ Profiler
|
Create a new profile |
1. Matrix selection:
To create a new profile, you should first select the matrices which might be of interest to you. This can be done in two ways:
|
| |
The first possibility is to select them directly from the list (on the left) by clicking on one or several matrices that you would like to have a closer look at and then pressing the Select button. It is possible to sort the list either by accession number or by factor name by pressing the respective button under the list of matrices.
|
| |
The second possibility is to search by accession number or factor name (on the right). You can enter either a list of accession numbers or a list of factor names, separated by commas. Then press the Search button.
|
|
After pressing one of the two buttons, the matrices are listed in the middle of the page. The name of the factor whose binding site is described by the matrix, the accession number and the quality of the matrix are given. (If you click on an accession number, you can view the respective matrix.) In addition, the false positive frequencies for each selected matrix at false negative rates of 10%, 30%, 50%, 70% and 90% are listed. The false positive rate was estimated on exon3 sequences, while sets of generated oligonucleotides were used to calculate the false negative rate. This feature helps you to select appropriate matrices for a given task. When there are several matrices for the same factor, it is generally recommended to include in the profile the matrix that produces the smallest number of false positive matches at the desired false negative rate.
|
2. Include matrices in a profile:
Once you have had a closer look at the matrices, please mark those you would like to include in your profile. Now press the Include button.
|
3. Cut-off selection
After pressing the Include button, another page will appear where you can specify individual cut-offs for every matrix. You can choose among cut-offs to minimize false positive matches (minFP), to minimize false negative matches (minFN) and to minimize the sum of both error rates (minSUM). The cut-offs allowing a false negative rate of 10% (FN10), 30% (FN30), 50% (FN50), 70% (FN70) and 90% (FN90) are also given. Please keep in mind that minFN and FN10 are identical. The number of false positive matches for these cut-offs, which was estimated on exon3 sequences, is given in brackets. If you like, you can enter your own cut-offs in the field "current". If you are editing an existing profile, this field shows the cut-offs currently stored in the profile.
Mark the matrices you would like to save to your profile and make sure that you have specified cut-offs for them. Then please enter a name for your profile and press the Save button.
If the save was successful, your new profile will be displayed. You can now run Match™ with your profile. Just go to Match™, mark the option "user-defined profile" and select your profile from the list.
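The structure of a profile, a named set of matrices with one core similarity cut-off and one matrix similarity cut-off each, can be sketched in Python as follows. This does not describe the profile file format used by Match™. The class names, the example accession and all cut-off values are hypothetical, and the assumption that cut-offs lie between 0 and 1 is not stated in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Cutoffs = Tuple[float, float]   # (core similarity cut-off, matrix similarity cut-off)


@dataclass
class ProfileEntry:
    accession: str
    precomputed: Dict[str, Cutoffs]        # keyed by "minFP", "minFN" or "minSUM"
    current: Optional[Cutoffs] = None      # the cut-offs that will be saved

    def choose(self, kind: str) -> None:
        """Use one of the precomputed cut-off pairs."""
        self.current = self.precomputed[kind]

    def set_custom(self, core: float, matrix: float) -> None:
        """Enter your own cut-offs (range 0..1 is an assumption of this sketch)."""
        if not (0.0 <= core <= 1.0 and 0.0 <= matrix <= 1.0):
            raise ValueError("cut-offs are assumed to lie between 0 and 1")
        self.current = (core, matrix)


@dataclass
class Profile:
    name: str
    entries: Dict[str, ProfileEntry] = field(default_factory=dict)

    def include(self, entry: ProfileEntry) -> None:
        """Add a matrix; cut-offs must have been specified first."""
        if entry.current is None:
            raise ValueError(f"no cut-offs specified for {entry.accession}")
        self.entries[entry.accession] = entry


# Invented values, for illustration only.
entry = ProfileEntry("MX1", {"minFP": (1.0, 0.95), "minFN": (0.75, 0.70), "minSUM": (0.85, 0.80)})
entry.choose("minSUM")
profile = Profile("my_liver_profile")
profile.include(entry)
```

Refusing to include an entry without cut-offs mirrors the instruction to make sure cut-offs have been specified before pressing the Save button.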
|
Edit or delete an existing profile |
Delete an existing profile |
At the top of the page, you can delete any of your predefined profiles that you no longer need. Just select the profile and press the Delete button.
|
Edit an existing profile
|
You can also edit one of our profiles or one of your predefined profiles. If you choose one of our profiles, your edited version will be stored in your directory, which means it will then be one of your predefined profiles. After pressing the Edit button, the matrices of the profile will be displayed in the middle of the page in the same way as described above. You can now add further matrices to your profile by performing the steps described above.
| | |
|