Match(TM) command line documentation

Content
  * INTRODUCTION
  * HOW TO USE MATCH(TM) FROM COMMAND LINE
  * SEQUENCE INPUT FORMAT
  * PROFILES (MATRIX SETS WITH CUT-OFFS)
  * CUT-OFFS
  * MATCH RESULT OUTPUT
  * HOW TO CITE MATCH(TM)

For additional information, please see also the online documentation in
your TRANSFAC(r) (BKL) online account and the reference at the end of
this file.

**************************************************************************
* INTRODUCTION

The Match(TM) tool is designed for searching potential binding sites for
transcription factors (TF binding sites) in any DNA sequence of interest.
Match(TM) uses a library of mononucleotide weight matrices from
TRANSFAC(r).

Match(TM) allows you to specify your search by using profiles. We use the
term "profile" for a specific subset of weight matrices from the
TRANSFAC(r) library with core similarity cut-off values and matrix
similarity cut-off values for each matrix.

From release to release, results obtained with Match(TM) can change, due
to the addition of new matrices, matrix and profile updates, and improved
cut-offs. Therefore, please always note the release / version number of
TRANSFAC(r) Professional / Match(TM) which you used for your analysis.
Although we advise always using the latest release of TRANSFAC(r)
Professional, if you are doing a large analysis for which the conditions
must stay constant, it is advisable to do the whole analysis with one and
the same locally installed version (same Match(TM) binary and same data
files, i.e. same matrix.dat as well as same profiles).

Match(TM) is designed for binding site search in single sequences, i.e.
all sequences (from the submitted FASTA file) are analyzed individually.
Comparative analysis for overrepresented binding sites in a set of
sequences vs. a background set is provided in the ExPlain(TM) Analysis
System.
For information on the ExPlain(TM) Analysis System, please see the BIOBASE
website: http://www.biobase-international.com/

**************************************************************************
* HOW TO USE MATCH(TM) FROM COMMAND LINE

Please note: Currently only the binding site search of Match(TM) is
provided as a command line tool. For accessory functionalities like the
creation of user-defined profiles or matrices, please refer to the online
version of Match(TM) or ExPlain(TM).

Command line use can be subject to change without notice. For the full
parameter list of your current version, type 'match' without parameters
and check the description printed on the command line.

Currently the following Match(TM) command line versions are available
(see the bin subdirectory in the match directory, after you have
downloaded and extracted the TRANSFAC flat file download package):

  match          (for Linux)
  match_linux64  (for Linux 64)
  match.exe      (for Windows)
  match_irix     (for Irix)
  match_solaris  (for Solaris)
  match_true64   (for Tru64)

The synopsis for command line usage is:
(For additional parameters, please see the parameter list printed when you
type 'match', i.e. without any parameters, on the command line.)

  match <mxlib> <seq> <out> <mxprf>

  match:  Match(TM) binary
  mxlib:  library of transcription factor weight matrices (matrix.dat)
          in TFP__data/match/data/matrix.dat
  seq:    input sequence in EMBL format or FASTA (the input file may
          contain several sequences); for testing you can use the
          following example file:
          TFP__data/match/data/default.seq.EXAMPLE
  out:    file to which Match(TM) shall write its output
          (If the specified file already exists, it will be overwritten
          without any notice!)
  mxprf:  file containing a selection of matrices (profile) with defined
          cut-offs

For command line use of Match(TM) the cut-offs for the matrices have to be
specified in the profile (please see below)!

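When Match(TM) is called from a script, the four positional arguments
described above can be assembled programmatically. The following is a
minimal sketch, assuming the binary is reachable as 'match'; the function
names build_match_command and run_match are hypothetical helpers and not
part of Match(TM). Because Match(TM) overwrites an existing output file
without notice, the wrapper refuses to run in that case unless told
otherwise.

```python
import subprocess
from pathlib import Path


def build_match_command(mxlib, seq, out, mxprf, binary="match"):
    """Return the argument list: binary, mxlib, seq, out, mxprf (in this order)."""
    return [str(binary), str(mxlib), str(seq), str(out), str(mxprf)]


def run_match(mxlib, seq, out, mxprf, binary="match", overwrite=False):
    """Run Match(TM) once; hypothetical wrapper, not shipped with Match(TM)."""
    # Match(TM) itself would silently overwrite 'out', so check first.
    if Path(out).exists() and not overwrite:
        raise FileExistsError(f"{out} already exists")
    subprocess.run(build_match_command(mxlib, seq, out, mxprf, binary), check=True)


# Paths taken from the example call in the MATCH RESULT OUTPUT section.
cmd = build_match_command(
    "../../data/matrix.dat",
    "../data/default.seq.EXAMPLE",
    "result",
    "../data/minFP_good.prf",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths
that contain spaces.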
**************************************************************************
* SEQUENCE INPUT FORMAT

FASTA format:

>seq1
acagctagctacgatgatcgatcgatgctacgtcgtagtacgatcgtacg
ctagctacgatgatcgatcgatgacagctagctacgatgatcgatcgatg
>seq2
ctagctacgatgatcgatcgatgacagctagctacgatgatcgatcgatg
acagctagctacgatgatcgatcgatgctacgtcgtagtacgatcgtacg
>seq3
acagctagctacgatgatcgatcgatgctacgtcgtagtacgatcgtacg
ctagctacgatgatcgatcgatgacagctagctacgatgatcgatcgatg

EMBL format:
(Only the fields essentially needed to recognize an entry in EMBL format
are shown. More fields may be included.)

ID   EXAMPLE    standard; DNA; PLN; 360 BP.
XX
SQ   Sequence 360 BP; 63 A; 92 C; 97 G; 108 T; 0 other;
     ctgcagcccc ggtttcgcaa agttaataat tttcagccgc gcacgtggtt ggccaaaccg        60
     caccctcctt cccgtcgttt cccatctctt cctcctttag agctaccact atataaatca       120
     gggctcattt tctcgctcct cacaggctca tctcgctttg gatcgattgg tttcgtaact       180
     ggtgagggac tgagggtctc ggagtggatt gatttgggat tctgttcgaa gatttgcgga       240
     ggggggcaat ggcgaccgcg gggaaggtga tcaagtgcaa aggtccgcct tgtttctcct       300
     ctgtctcttg atctgactaa tcttggttta tgattcgttg agtaattttg gggaaagctt       360
//

**************************************************************************
* PROFILES (MATRIX SETS WITH CUT-OFFS)

We use the term "profile" for a specific subset of weight matrices from
the TRANSFAC(r) library with core similarity cut-off values and matrix
similarity cut-off values for each matrix.

Match(TM) will search only matrices that are specified in the profile. It
will return only hits with scores that are equal to or higher than the
cut-offs specified in the profile.

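The reporting rule stated above can be illustrated with a few lines of
code. This is a sketch of the rule only, not of how Match(TM) computes
its scores; the function name reported_hits, the data layout and all
score values are assumptions made for this illustration.

```python
def reported_hits(candidates, profile):
    """Keep candidates whose matrix is in the profile and reaches both cut-offs.

    candidates: iterable of (matrix_id, core_score, matrix_score) tuples
    profile:    dict mapping matrix_id -> (core_cutoff, matrix_cutoff)
    """
    kept = []
    for matrix_id, core_score, matrix_score in candidates:
        if matrix_id not in profile:
            continue  # matrices outside the profile are not searched
        core_cutoff, matrix_cutoff = profile[matrix_id]
        # "Equal to or higher than" the cut-off, for both scores.
        if core_score >= core_cutoff and matrix_score >= matrix_cutoff:
            kept.append((matrix_id, core_score, matrix_score))
    return kept


# Hypothetical profile and candidate scores, for illustration only.
profile = {"V$SP1_01": (0.819, 0.957)}
candidates = [
    ("V$SP1_01", 1.000, 0.957),  # kept: both scores reach the cut-offs
    ("V$SP1_01", 1.000, 0.950),  # dropped: matrix similarity too low
    ("V$SP1_01", 0.800, 0.990),  # dropped: core similarity too low
    ("V$TATA_C", 1.000, 1.000),  # dropped: matrix not in the profile
]
print(reported_hits(candidates, profile))
```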
Match(TM) offers a number of predefined profiles with cut-offs

 - to minimize false negative matches:
     match/data/minFN.prf        (all TRANSFAC matrices)
     match/data/minFN_good.prf   (only high quality matrices)
 - to minimize false positive matches:
     match/data/minFP.prf        (all TRANSFAC matrices)
     match/data/minFP_good.prf   (only high quality matrices)
 - to minimize the sum of both error rates:
     match/data/minSUM.prf       (all TRANSFAC matrices)
     match/data/minSUM_good.prf  (only high quality matrices)

In the directory match/data/prfs you will also find profiles for different
taxa (vertebrates, plants, fungi, insects, nematodes) and a number of
tissue and condition/process-specific profiles for vertebrates. The
cut-offs in these profiles are minFN (to minimize the number of false
negatives). To change the cut-off or the composition of a profile, please
see below.

You can generate your own profiles with the help of the profile generation
tool in the web interface of the online version of Match(TM) in the
following ways:

 - Based on a matrix search result in TRANSFAC(r) (BKL) a profile can be
   created: Select the matrices in the database search result list to be
   included in the profile and then click on the "MATCH PROFILE" button
   on top of the table to be redirected to the Profile Generation Tool in
   the Match(TM) interface, where the cut-offs for the profile can be
   defined.
 - New profiles can be created in the Profile Generation Tool of the
   Match(TM) interface by selecting or searching from/in the listed
   matrices.
 - Based on already existing profiles new ones can be created in the
   Match(TM) interface, by adding or removing matrices and/or by changing
   the cut-offs.
   For details, please see the respective sections in the Match(TM)
   online documentation.
   To download a profile for command line use in the Match(TM)
   interface, please click the "Download" button on top of the page after
   you have generated a new profile, or go to the Profile generation
   page, select one of your previously saved profiles and then click on
   "Download".
 - Profiles for command line use can also be created in and exported
   from the ExPlain(TM) Analysis System.
 - Finally, new profiles can be created or existing ones modified with
   the help of a text editor. (For example, vertebrate profiles can be
   generated by using the search/replace function of your editor to
   remove all lines/matrices with identifiers other than "V$..." from
   the provided profiles minFP.prf, minSUM.prf, ...)

A profile should have the following format:
________________________________
TATA box and SP-1 sites
tata.prf
 MIN_LENGTH 300
0.0
 1.000000 0.6 0.5 M00216 V$TATA_C
 1.000000 1.0 0.5 M00008 V$SP1_01
//
________________________________

Description:
 line 1:      profile description
 line 2:      profile name
 line 3:      relevant for the use with other programs, but should also
              not be left out for Match(TM).
 line 4:      relevant for the use with other programs, but should be
              contained in every profile.
 lines 5-n:   one line for each of the included matrices, containing the
              following information:
                1.000000: needed by Match
                core similarity cut-off
                matrix similarity cut-off
                TRANSFAC(r) accession number of the matrix
                TRANSFAC(r) identifier of the matrix
 last line:   symbol for the profile end (//)

**************************************************************************
* CUT-OFFS

Cut-offs are defined in the profile. Cut-offs can be changed with the help
of the Profile Generation Tool in the Match(TM) online version by
modifying an already existing profile and then saving it under a new
name. After it has been saved, the new profile can be exported via the
"Download" button on top of the page.

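Because the cut-offs live in the profile file, they can also be inspected
with a short script before a run. The following is a minimal sketch,
assuming whitespace-separated fields laid out exactly as in the tata.prf
example above; the names ProfileEntry and parse_profile are hypothetical
and not part of Match(TM).

```python
from dataclasses import dataclass


@dataclass
class ProfileEntry:
    core_cutoff: float
    matrix_cutoff: float
    accession: str
    identifier: str


def parse_profile(text):
    """Return (description, name, entries) for a profile in the format above."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    description, name = lines[0], lines[1]
    entries = []
    # Lines 3 and 4 are kept in the file but not interpreted here.
    for ln in lines[4:]:
        if ln == "//":  # symbol for the profile end
            break
        _needed_by_match, core, matrix, accession, identifier = ln.split()
        entries.append(ProfileEntry(float(core), float(matrix), accession, identifier))
    return description, name, entries


EXAMPLE = """TATA box and SP-1 sites
tata.prf
 MIN_LENGTH 300
0.0
 1.000000 0.6 0.5 M00216 V$TATA_C
 1.000000 1.0 0.5 M00008 V$SP1_01
//
"""

description, name, entries = parse_profile(EXAMPLE)
for entry in entries:
    print(entry.identifier, entry.core_cutoff, entry.matrix_cutoff)
```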
Cut-offs for core and matrix similarity:

MSS (Matrix Similarity Score)
  The matrix similarity score describes the quality of a match between a
  matrix and an arbitrary part of the input sequences.

CSS (Core Similarity Score)
  The core similarity score denotes the quality of a match between the
  core sequence of a matrix (i.e. the five consecutive most conserved
  positions within a matrix) and a part of the input sequence. A match
  has to contain the "core sequence" of a matrix, i.e. the core sequence
  has to match with a score higher than or equal to the core similarity
  cut-off. In addition, only those matches which score higher than or
  equal to the matrix similarity threshold appear in the output.

For the minFP, minFN, and minSUM cut-offs, first the core similarity score
is calculated, and then using this core similarity score the matrix
similarity score is calculated.

Precalculated cut-offs:

minFN (Cut-off to minimize false negative matches):
  The false negative rate was measured, as far as available, on known
  genomic binding sites for the transcription factors. Where fewer than
  10 genomic binding sites were available, SELEX sites or sets of
  generated oligonucleotides were used for estimating the cut-offs to
  minimize the false negative rate, using actual weight matrices to
  calculate the probability of a nucleotide occurring at a certain
  position of a binding site. For each matrix we applied the Match(TM)
  algorithm to the test sequences without using any matrix similarity
  cut-offs. Then we set the cut-off to a value that provides recognition
  of at least 90% of oligonucleotides, i.e. we decided to tolerate an
  error rate of 10%. We call this set of cut-offs minFN (=FN10) cut-offs.
  Applying the minFN cut-offs, the user will find most genomic binding
  sites, but a high rate of false positives should be expected as well.
  The minFN cut-offs are useful for the detailed analysis of relatively
  short DNA fragments.

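The step "set the cut-off to a value that provides recognition of at
least 90% of oligonucleotides" can be pictured as picking a rank in the
sorted list of site scores. The following is only one possible reading of
that sentence, not the procedure actually used to produce the shipped
cut-offs; the function name and the ten scores are made up for
illustration.

```python
def cutoff_for_recognition(site_scores, recognized_percent=90):
    """Highest cut-off at which at least recognized_percent of the sites score
    equal to or higher than the cut-off."""
    if not site_scores:
        raise ValueError("at least one site score is required")
    ranked = sorted(site_scores, reverse=True)
    # Integer ceiling of len * percent / 100, avoiding float rounding.
    needed = max((len(ranked) * recognized_percent + 99) // 100, 1)
    return ranked[needed - 1]


# Ten made-up matrix similarity scores of known binding sites.
scores = [0.99, 0.97, 0.95, 0.93, 0.91, 0.90, 0.88, 0.86, 0.85, 0.70]
cutoff = cutoff_for_recognition(scores)
recognized = sum(1 for s in scores if s >= cutoff)
print(cutoff, recognized)
```

Under the same assumption, recognized_percent=70 would correspond to a
cut-off that tolerates 30% false negatives.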
minFP (Cut-off to minimize false positive matches):
  In order to estimate this cut-off, which will reduce the number of
  random sites found by Match(TM), we applied the Match(TM) algorithm to
  promoter sequences from TRANSPRO. The score which gives 1% of hits in
  these sequences, relative to the number of hits received when using the
  minFN score (calculated above), is defined as minFP. When a minFP
  cut-off is applied for searching a DNA sequence, the algorithm will
  find a relatively low number of matches per nucleotide. In the output
  the user will only find putative sites with a good similarity to the
  weight matrix; however, some known genomic binding sites may not be
  recognized. This kind of cut-off is useful, for example, for searching
  the most promising potential binding sites in extended genomic DNA
  sequences.

minSUM (Cut-off to minimize the sum of both error rates):
  We compute a sum of both error rates to find cut-offs that give an
  optimal number of false positives and false negatives. To do so, we
  compute the number of matches found in promoter sequences for each
  matrix using a cut-off allowing 10% of false negative matches
  (minFN=FN10). This number is defined as 100% of false positives. The
  sum of corresponding percentages for false positives and false
  negatives is then computed for every cut-off ranging from minFN (FN10)
  to minFP. We refer to the cut-off that gives the minimum sum as the
  minSUM cut-off.

FN10 (=minFN):
  This cut-off allows a false negative rate of 10%. For calculation,
  please see minFN above.
FN30:
  This cut-off allows a false negative rate of 30%.
FN50:
  This cut-off allows a false negative rate of 50%.
FN70:
  This cut-off allows a false negative rate of 70%.
FN90:
  This cut-off allows a false negative rate of 90%.

Matrix "Quality":
  Matrices producing fewer than 10 hits (FP) per 1000 nucleotides (in
  sequences 10,000 to 5,000 nucleotides upstream of the transcription
  start sites) at minSUM are defined as "high quality matrices".
  About 5% of the current matrices produce a higher FP rate and can be
  excluded as "highly abundant" / "low quality"; these 5% of matrices
  give about 50% of all FP hits.

The cut-offs and the quality (high/low) of a matrix are stored in the
"index" file.

Example line from the index file:

M00008|V$SP1_01|Sp1|T00759;T08484|0.819|0.957|0.819|0.851|0.819|0.914|high|0|0.851|7.161|0.887|2.486|0.922|0.790|0.954|0.117|0.973|0.018|

Description (fields in order):
   1. matrix accession number
   2. matrix identifier
   3. matrix name
   4. factor list
   5. minFP core cut-off
   6. minFP matrix cut-off
   7. minFN core cut-off
   8. minFN matrix cut-off
   9. minSUM core cut-off
  10. minSUM matrix cut-off
  11. matrix quality (high, low)
  12. matrix type (not in use, always 0)
  13. matrix cut-off of false negative rate 10%
  14. FP hits (per 1kb nucleotides) at false negative rate of 10%
  15. matrix cut-off of false negative rate 30%
  16. FP hits (per 1kb nucleotides) at false negative rate of 30%
  17. matrix cut-off of false negative rate 50%
  18. FP hits (per 1kb nucleotides) at false negative rate of 50%
  19. matrix cut-off of false negative rate 70%
  20. FP hits (per 1kb nucleotides) at false negative rate of 70%
  21. matrix cut-off of false negative rate 90%
  22. FP hits (per 1kb nucleotides) at false negative rate of 90%

These values are also shown when you view or generate a profile in the web
interface of the online Match(TM).

**************************************************************************
* MATCH RESULT OUTPUT

If you run Match(TM) in the match/bin directory in the following way:

  match ../../data/matrix.dat ../data/default.seq.EXAMPLE result ../data/minFP_good.prf

you will get a results file similar to the following one:

--------------------------------------------------------------------------
Search for sites by WeightMatrix library: ../../data/matrix.dat
Sequence file: ../data/default.seq.EXAMPLE
Site selection profile: ../data/minFP_good.prf prf to minimize false positives, high qual.

Inspecting sequence ID   RNTATFL

 V$VMYB_01         |  1143 (-) |  1.000 |  0.966 | aaCCGTTact
 V$VMYB_01         | 11517 (-) |  1.000 |  0.961 | taCCGTTgtc
 V$ELK1_01         |  7227 (+) |  1.000 |  0.931 | ccagcaGGAAGttcat
 V$ELK1_01         |  9432 (+) |  1.000 |  0.931 | ataacaGGAAGcccaa
 I$KR_01           |  2676 (-) |  1.000 |  1.000 | ttAACCCgtt
 I$KR_01           |  7833 (-) |  1.000 |  0.966 | ttAACCCact
 F$MATA1_01        |  9800 (-) |  1.000 |  0.996 | atgtaCATCA
 V$VMAF_01         |  6289 (+) |  0.910 |  0.931 | tgatGATGActgagcaggg
 V$VMAF_01         |  8318 (-) |  1.000 |  0.889 | agcttctgcgTCAGCgcca
 V$NFKAPPAB65_01   |   331 (-) |  1.000 |  1.000 | GGAAAttccc
 V$CREL_01         |   331 (-) |  1.000 |  0.990 | GGAAAttccc
 V$NFKAPPAB_01     |   331 (-) |  1.000 |  1.000 | ggaaaTTCCC
 V$MYOGNF1_01      |  7526 (+) |  0.929 |  0.813 | ctgaagttacagTTGGTtgtgagccaact
 V$TAL1BETAE47_01  |  5124 (+) |  1.000 |  0.994 | ggaaaCAGATggtgcg
 V$TAL1ALPHAE47_01 |  5124 (+) |  1.000 |  0.994 | ggaaaCAGATggtgcg
 V$TAL1BETAITF2_01 |  5124 (+) |  1.000 |  0.997 | ggaaaCAGATggtgcg
 V$EVI1_04         |   834 (+) |  0.842 |  0.788 | atataaaacAAGTTa
 V$EVI1_04         |  1900 (-) |  0.842 |  0.767 | tATTTTattatttta
 ...

 Total sequences length=11973
 Total number of found sites=1155
 Frequency of sites per nucleotide=0.096467
--------------------------------------------------------------------------

Result output description

The first three lines of the results file show which matrix library file,
which sequence file and which profile file have been used for this search.

Following this is a list of matches found in the searched sequence. The
first column gives the TRANSFAC(r) identifier of the matching matrix,
the second column the position and the strand where the respective match
has been found. The core similarity score is given in column three, the
matrix similarity score in column four. The last column gives the
matching sequence.

If the input sequence file contains several sequences, such a listing of
matches will be given for each sequence in the input file. Each listing
starts with the respective sequence identifier.

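For downstream processing, each match line can be split into the five
columns described above. The following is a minimal sketch, assuming the
columns are separated by "|" as in the example listing; the names Hit and
parse_hit_line are hypothetical and not part of Match(TM).

```python
import re
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    matrix_id: str
    position: int
    strand: str
    core_score: float
    matrix_score: float
    sequence: str


# Second column, e.g. "7227 (+)": position followed by the strand in brackets.
_POSITION = re.compile(r"^(\d+)\s*\(([+-])\)$")


def parse_hit_line(line: str) -> Optional[Hit]:
    """Return a Hit for a match line, or None for any other line."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 5:
        return None
    m = _POSITION.match(parts[1])
    if m is None:
        return None
    return Hit(parts[0], int(m.group(1)), m.group(2),
               float(parts[2]), float(parts[3]), parts[4])


# Two lines copied from the example listing, plus one non-match line.
lines = [
    " V$ELK1_01 | 7227 (+) | 1.000 | 0.931 | ccagcaGGAAGttcat",
    " V$VMYB_01 | 1143 (-) | 1.000 | 0.966 | aaCCGTTact",
    " Total sequences length=11973",
]
hits = [h for h in map(parse_hit_line, lines) if h is not None]
for hit in sorted(hits, key=lambda h: h.position):
    print(hit.position, hit.strand, hit.matrix_id, hit.matrix_score)
```

Sorting the parsed hits by position, as in the last lines, gives a
listing ordered by location within one sequence.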
The last three lines of the file give the total length of all sequences
which have been searched, the total number of sites that have been found
and the frequency of sites per nucleotide.

Note: In contrast to the Match(TM) online version, the raw result on the
command line lists the matches by matrix, not by location. For additional
results and graphical outputs, please see the online Match(TM) tool and
the Match(TM) functionality and related tools in the ExPlain(TM) Analysis
System.

**************************************************************************
* HOW TO CITE MATCH(TM)

Kel, A. E.; Goessling, E.; Reuter, I.; Cheremushkin, E.; Kel-Margoulis,
O. V.; Wingender, E. (2003) "Match(TM): A tool for searching
transcription factor binding sites in DNA sequences" Nucleic Acids Res.
31, 3576-3579.

Additionally, please refer to BIOBASE GmbH as the creator of this tool.

**************************************************************************