Max-Planck-Institut für Informatik

MethMarker F.A.Q.

Installation

What is the difference between MethMarker Java Web Start and local installation?

There is no functional difference between the two: the same program is simply started in two different ways. Java Web Start requires no installation procedure, which should be the most convenient option for almost all users. If your machine doesn't support Java Web Start, or the program shows unexpected behaviour, you can install MethMarker locally instead. Note that the Web Start version doesn't include the demonstration dataset, which is needed if you want to follow the tutorials or try MethMarker out. You can, however, use the Web Start and download the demonstration data separately.

How do I install MethMarker?

Please see the download page, which contains the installation instructions. The Windows installer for MethMarker is a native .exe file, while the platform-independent MacOS / Linux / Unix installer is a .jar file. After unpacking the ZIP archive, launch it with the following command: "java -jar install_MethMarkerUnix.jar".

What are the requirements for the use of MethMarker?

  • Platform-independent due to Java technology,
  • Java 1.5 or higher to start the program,
  • 512 MB RAM or more,
  • if you use MethMarker for the design of DNA methylation assays, you need a DNA sequence in fasta format and optionally primer sequences for this DNA sequence,
  • if you use MethMarker for the development of DNA methylation biomarkers, you need a DNA sequence (target sequence) and bisulfite sequenced training samples (preprocessed or generated with BiQ Analyzer, QUMA or EpiTYPER),
  • if you use MethMarker to classify samples with an existing PMML-model, you need the PMML-file and your sample data (i.e. the methylation grades of the corresponding CpG sites),
  • to use MethMarker's online tools (all optional - such as SNP finding via UCSC Genome Browser), you need an unrestricted internet connection.

How do I run MethMarker?

The MethMarker installer creates shortcuts (Start menu entry, desktop icon) that you can click to start the program. Alternatively, after installation you can start MethMarker from the program folder with the command java -jar -Xmx512m MethMarker.jar -d 0.

How do I uninstall MethMarker?

MethMarker comes with an uninstaller. You should have access to it via the shortcut "uninstall MethMarker". Alternatively, you can directly call the uninstaller located in the program folder of MethMarker. Go to the subfolder "uninstall" and enter java -jar uninstaller.jar.


Usage of MethMarker

What is MethMarker and when should I use it?

MethMarker is a bioinformatic application that has been designed to automate the generation and validation of DNA methylation biomarker candidates. The idea for MethMarker was to facilitate the process of designing a clinically usable biomarker based on a previously identified gene or genomic region.
Assume you have a DNA sequence whose methylation pattern has been proven to distinguish between cancer and non-cancer cells (or to make a similar distinction), but only when analyzed by bisulfite sequencing, which is expensive and not applicable to low-quality DNA (often error prone). To enable diagnostic use, a biomarker has to be developed for the region that is based on a high-throughput DNA methylation analysis method, such as COBRA, bisulfite SNuPE, bisulfite pyrosequencing, MSP, MethyLight or MeDIP-qPCR. MethMarker supports you in translating the marker DNA sequence into biomarker candidates that can be measured with these methods.

During the software implementation, MethMarker gained more and more features and uses: MethMarker can be used ...

  • ... for the design of DNA methylation analysis assays for a given genomic region,
  • ... for investigating bisulfite sequenced sample data (both visually and statistically),
  • ... as generator of DNA-methylation biomarkers (as logistic regression models), based on bisulfite sequenced training data,
  • ... for the application of DNA methylation biomarkers to classify new samples with such markers,
  • ... to save biomarkers as XML-based PMML files, a standardized file format for logistic regression models.

General

Can MethMarker be used offline?

Yes. All important functionalities are implemented in MethMarker. For special tasks, you can optionally use online services (such as "show in UCSC", "find primers with eprimer" or "retrieve gene information online") for which an unrestricted internet connection is required.

I get a "heap space error".

This problem occurs when importing a very large number of samples into MethMarker or using overly loose constraints for MSP assay design or MethyLight assay design. It means that the Java environment of MethMarker doesn't have enough working memory for calculations. There are several solutions to this problem:

  • Restart MethMarker and try again.
  • Repeat the analysis restricted to those steps that are indispensable.
  • Repeat the analysis on a different computer or operating system (e.g. Windows emulations on Linux machines can have limited user memory).
  • Restart MethMarker with the following command in a shell (in the MethMarker directory): java -jar -Xmx512m MethMarker.jar . This sets the working memory available to MethMarker to 512 MB. You can increase this number as needed, provided your computer has enough working memory.

Genomic target sequence

Which file formats are supported for the genomic sequence?

The genomic sequence is expected to be in fasta format, simple sequence format (simple text file), GenBank format or EMBL format. In fasta, the first row starts with ">", followed by the sequence annotation. The second row is the genomic sequence. In simple sequence format, the first row of the fasta format is omitted, the remainder is identical to fasta. Sequences in the other formats are automatically downloaded from the corresponding services. Examples of all formats are included in the demonstration data set.
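The difference between the fasta and simple sequence formats can be illustrated with a minimal Python sketch. It is not MethMarker's own parser: the function name read_sequence is hypothetical, and accepting sequences that span several lines is an assumption made here for convenience.

```python
def read_sequence(text):
    """Return (annotation, sequence) from fasta or simple-sequence text.

    Hypothetical helper for illustration; not part of MethMarker.
    """
    # Drop empty lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    annotation = ""
    # In fasta, the first row starts with ">" and holds the annotation.
    if lines and lines[0].startswith(">"):
        annotation = lines[0][1:].strip()
        lines = lines[1:]
    # Everything else is sequence (assumption: it may span several lines).
    sequence = "".join(lines).upper()
    return annotation, sequence


print(read_sequence(">my region\nACGT\nTTGA\n"))
print(read_sequence("acgtttga"))
```

Both calls return the same sequence; only the annotation differs, since the simple sequence format has no header row.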

How can SNPs be included in the genomic target sequence?

SNPs can be included online via "Retrieve Gene Information Online" (button on the right side in MethMarker). These SNPs come from the UCSC database. However, SNPs can also be included manually in the DNA sequence: the genomic sequence consists of the four letters 'A', 'C', 'G' and 'T'. Additionally, the one-letter code for SNPs is recognized (e.g. 'B' for 'C', 'G', 'T'). If you are not familiar with this code, you can also use the following syntax: e.g. '[C/G/T]' for a SNP of 'C', 'G', 'T'. If you want to insert SNPs manually into the sequence, just replace the corresponding base with the SNP in the genomic sequence file and load the file into MethMarker.
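The relationship between the two SNP notations can be sketched in Python as follows. The one-letter codes are the standard IUPAC nucleotide ambiguity codes; the function name brackets_to_iupac is hypothetical, and MethMarker accepts both notations directly, so this conversion is only meant to show how they correspond.

```python
import re

# Standard IUPAC ambiguity codes for sets of two to four bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def brackets_to_iupac(seq):
    """Replace bracketed SNPs such as '[C/G/T]' by their one-letter code."""
    def repl(match):
        alleles = frozenset(match.group(1).upper().split("/"))
        if len(alleles) == 1:
            # Degenerate case such as '[A/A]': keep the plain base.
            return next(iter(alleles))
        return IUPAC[alleles]

    return re.sub(r"\[([ACGTacgt](?:/[ACGTacgt])+)\]", repl, seq)


print(brackets_to_iupac("AC[C/G/T]GT"))
```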

I cannot browse the sequence in UCSC!

You may not have write permission on your PC. To browse the sequence in UCSC, MethMarker writes an HTML file to your PC (program folder) and opens it. The file is deleted when the program is closed. Try to open MethMarker with administrator rights.

Why should I add primers to the sequence?

Some CpG sites near the 3' or 5' end can be analyzed only if the genomic target sequence has primers attached on both ends. In that case, the sequence around these CpG sites is longer, and MethMarker can generate primer assays for the analysis methods SNuPE, Pyrosequencing, MSP, MethyLight and MeDIP-qPCR.

Samples

Which file formats are supported for the samples?

The samples are the bisulfite sequenced methylation profile data. They can come from:

  • BiQ Analyzer: One output HTML file of BiQ Analyzer is one sample, consisting of several clones. It should comprise the same DNA region as the loaded genomic target region.
  • QUMA: One output CSV file of QUMA is one sample, consisting of several clones. It should comprise the same DNA region as the loaded genomic target region.
  • EpiTYPER: One output CSV file of EpiTYPER comprising all samples, each consisting of several clones. EpiTYPER files only reflect CpG sites and may not comprise more CpG sites than the genomic target region. However, EpiTYPER files may include several clones per sample.
  • Custom Methylation Profile: If you don't use these programs to generate methylation profiles, you can also import custom methylation profiles of samples. These are simple TXT or CSV files that contain the methylation information of each CpG per sample.
Examples of each file type are included in the demonstration dataset.

How do I know the class of a sample?

You don't need to know. If you know a sample's class, you can load the sample into MethMarker via the corresponding button ("Add Case (+)" or "Add Control (-)"). If you are not sure, you can use any of these buttons and let MethMarker decide which class the samples belong to via "Sample" -> "Classify All...". Finally, a sample's class is indicated by a red (positive, case), green (negative, control) or grey (unknown) symbol, respectively, on the left of each sample.

Can I assign more than two classes to my samples?

MethMarker is designed to generate biomarkers that distinguish between methylated and unmethylated samples. Therefore, two classes is the default number of classes. However, you can use the class "unknown" and assign it to special samples as a third class. MethMarker will not use samples with "unknown" class to train biomarkers, but it will also try to classify these samples with a generated classifier. In this way, you can see how a classifier (biomarker) would handle the samples with "unknown" class.

Does MethMarker align the samples to the genomic target region?

Yes, a Needleman-Wunsch sequence alignment algorithm is used to align the samples to the genomic target region.
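For readers unfamiliar with the algorithm, the following Python sketch shows a textbook Needleman-Wunsch global alignment. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; the scoring scheme that MethMarker actually uses is not documented here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_target, aligned_sample, total = needleman_wunsch("ACGT", "AGT")
print(aligned_target)
print(aligned_sample)
print(total)
```

A gap character "-" in the aligned sample marks a position of the target sequence that the sample does not cover.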

How is the p-value (Fisher's exact test) calculated?

MethMarker calculates the p-value for the Fisher's exact test for 2x2 contingency tables. There are two groups of samples: methylated and unmethylated samples. For each CpG site, the number of methylated clones and unmethylated clones in both groups constitute the 2x2 contingency table. See here for explanation of Fisher's exact test.
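As an illustration of the calculation for a single CpG site, here is a minimal Python sketch of Fisher's exact test on a 2x2 table. It computes a two-sided p-value by summing the probabilities of all tables that are at most as likely as the observed one; whether MethMarker reports a one-sided or two-sided value is not stated above, so the two-sided variant is an assumption, as are the function and argument names.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Example layout (assumed for illustration):
      a = methylated clones in group 1,   b = unmethylated clones in group 1
      c = methylated clones in group 2,   d = unmethylated clones in group 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


print(fisher_exact_two_sided(1, 9, 11, 3))
```

A small p-value suggests that the methylation state of this CpG site differs between the two groups of samples.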

How is the p-value (Mann-Whitney test) calculated?

The Mann-Whitney test is calculated if there are at least 2 methylated and 2 unmethylated samples. This test is implemented by the Java Statistical Classes (JSC). See here for more information.
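The statistic behind the test can be sketched in a few lines of Python. This is not the JSC implementation: it only computes the U statistic, not the p-value, and it assumes that the values being compared are methylation grades of the two groups of samples. The function name and the choice to return None for too-small groups are illustrative.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic of group_a versus group_b, or None if a group is too small."""
    # The test is only calculated with at least 2 samples in each group.
    if len(group_a) < 2 or len(group_b) < 2:
        return None
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0      # a ranks above b
            elif a == b:
                u += 0.5      # ties count half
    return u


methylated = [0.8, 0.9]      # hypothetical methylation grades
unmethylated = [0.1, 0.2]
print(mann_whitney_u(methylated, unmethylated))
```

U ranges from 0 to len(group_a) * len(group_b); values near either extreme indicate that the two groups are well separated.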

COBRA, SNuPE, Pyrosequencing, MSP, MethyLight and MeDIP-qPCR

I don't get any biomarkers in the Model View after clicking on "Score Biomarker Candidates".

  • Check if CpG sites are selected (highlighted in blue) in the Biomarker View. Only selected CpG sites are regarded as potential biomarkers.
  • If CpG sites are selected and you still don't get any biomarkers in the Model View, the (Pearson) correlation of these CpG sites to the sample's overall methylation grade might be NaN ("not a number", i.e. undefined). If so, these CpG sites are not listed in the Model View. This can be the case if the CpG methylation grade does not change across the samples (e.g. if too few samples are loaded in MethMarker), in which case biomarkers cannot be evaluated in terms of their correlation. MSP and MethyLight biomarkers are especially prone to this problem, since they take several CpG sites into account that all have to be methylated at the same time to represent "methylation". This phenomenon, a small number of loaded samples and missing values in the bisulfite sequencing data can all lead to an undefined (Pearson) correlation.

I cannot connect to Primer3.

For this service, MethMarker (and Java) needs unrestricted internet access. This is a problem for lab computers that are located behind restrictive firewalls. If the connection fails, try a different machine.

What is a non-informative SNP or an informative SNP?

A non-informative SNP is a C/T to T/T or T/C to T/T SNP. Informative are all other SNPs and all SNPs within a CpG.

I don't find any primers with SNuPE, Pyrosequencing, MSP, MethyLight or MeDIP-qPCR.

  • This might be an individual problem for each method. Go through the options for the methods and be sure that the criteria can be fulfilled.
  • It may be that the target sequence (Sequence View) has too many SNPs or deletions / insertions with respect to the loaded samples.
  • Try out Primer3 to find primers on the sequence (available in the options for each method).

Does MethMarker check for primer-dimers and other regions in the sequence where the primers might bind?

MethMarker checks each primer for uniqueness. If there is another region in the sequence to which the primer would bind with at least 80 percent, the primer is omitted.
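The idea of such a uniqueness check can be sketched in Python as below. This is a simplified illustration, not MethMarker's implementation: it uses the fraction of identical positions as a stand-in for binding, scans only the given strand, and ignores gaps. The function name and parameters are hypothetical.

```python
def is_unique(primer, sequence, own_start, threshold=0.8):
    """Return False if the primer matches another region at >= threshold identity."""
    k = len(primer)
    for start in range(len(sequence) - k + 1):
        if start == own_start:
            continue  # skip the primer's intended binding site
        window = sequence[start:start + k]
        identity = sum(p == w for p, w in zip(primer, window)) / k
        if identity >= threshold:
            return False
    return True


target = "ACGTACGTAC" + "TTTTTTTTTT" + "ACGTACGTAA"
print(is_unique("ACGTACGTAC", target, own_start=0))
```

In this example the last ten bases differ from the primer at only one position (90 percent identity), so the primer would be rejected.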

How do I know which direction the primers are running in?

The primers are visualized with small 3' and 5' numbers running from 5' to 3'.

Biomarkers

How is the correlation between DNA methylation at individual CpG sites and the overall DNA methylation status calculated?

A percentage of methylation is known for each CpG site. For example, assume that a sample has 10 clones, 6 of which are methylated at a certain CpG site, indicating that this CpG site is 60% methylated. In addition, the sample exhibits an overall percentage of DNA methylation, which is the number of methylated CpGs in all clones divided by the total number of CpGs in all clones. For a biomarker consisting of one CpG site, both numbers are correlated over all loaded samples. A high correlation indicates that this CpG site is highly methylated when the whole sample is highly methylated. Such CpG sites "represent" the methylation state of the sample.

Note that the percentage of methylation of combined CpG sites (e.g. two CpG sites in one restriction site) is calculated as the number of clones in which these CpG sites are simultaneously methylated, divided by the number of clones in the sample.

Note that the percentage of methylation of independent CpG sites (e.g. two restriction sites with one CpG site each) is calculated as the average of the individual methylation grades.
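The quantities described above can be written down as a short Python sketch. It assumes that each clone is a list of 0/1 values (one per CpG site) without missing values; all function names are hypothetical and the code is not taken from MethMarker.

```python
from math import sqrt, nan, isnan


def site_methylation(clones, site):
    """Fraction of clones methylated at one CpG site."""
    return sum(clone[site] for clone in clones) / len(clones)


def combined_methylation(clones, sites):
    """Fraction of clones in which all given sites are methylated at once."""
    return sum(all(clone[s] for s in sites) for clone in clones) / len(clones)


def independent_methylation(clones, sites):
    """Average of the individual methylation grades of the given sites."""
    return sum(site_methylation(clones, s) for s in sites) / len(sites)


def overall_methylation(clones):
    """Methylated CpGs in all clones divided by all CpGs in all clones."""
    return sum(sum(clone) for clone in clones) / sum(len(clone) for clone in clones)


def pearson(xs, ys):
    """Pearson correlation; NaN if either variable does not vary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return nan
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


sample = [[1, 1], [1, 0], [0, 0], [1, 1]]   # 4 clones, 2 CpG sites
print(site_methylation(sample, 0), combined_methylation(sample, [0, 1]))
print(independent_methylation(sample, [0, 1]), overall_methylation(sample))
```

To score a one-site biomarker, pearson would be called with the site's methylation grade in every loaded sample as xs and the overall methylation of every sample as ys. If the site's grade is identical in all samples, the result is NaN.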

How can I inspect a biomarker to obtain more information?

After creating logistic regression models, go to the Model View and double-click on a biomarker that has "calc." written in the column "Model" (last column). This opens a new window with additional information about the biomarker. Alternatively, click on the green button "Results >>" in the lower right corner to open the biomarker result window. In this result window, you can switch from one biomarker candidate to another by double-clicking on a biomarker candidate on the left side.

Is it possible to save a biomarker?

Yes.

  • Right-click on the plot in a biomarker window to edit and save the plot.
  • Use the "Save as PMML..." button in a biomarker result window to save the biomarker as PMML (logistic regression model).
  • Use the "PDF Report..." button in a biomarker result window to save a PDF report about the biomarker.

What is the "Error Test"?

After creating a logistic regression model for a biomarker, MethMarker tries to classify all samples that are loaded into MethMarker according to this model. This is shown in a colored plot in the biomarker window. A score of a logistic regression model above 0 indicates a methylated sample, below 0 an unmethylated sample. The basis for this classification is provided by the CpG sites of the model with their respective weights, combined with the DNA methylation profiles of the samples derived from bisulfite sequencing. In the Error Test, MethMarker repeats this classification. This time, however, it does not take the original methylation profile, but an "erroneous" profile, in which some CpG sites get a random methylation grade between 0 and 1, representing errors in the sample data. You cannot determine which CpG sites have errors, since they are chosen randomly. But you can set the error rate (beside the "Error Test" button). An error rate of five percent, for example, means that five percent of all CpG sites have a random methylation grade rather than their original methylation grade. With this test, you can validate the robustness of your model to erroneous data.
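The perturbation step of the Error Test can be sketched in Python as follows. The sketch assumes that each CpG site is replaced independently with a probability equal to the error rate; MethMarker may instead replace an exact fraction of sites. The function name and the use of a seeded generator are illustrative choices.

```python
import random


def add_random_errors(profile, error_rate, rng=None):
    """Return a copy of profile in which some grades are replaced at random.

    profile    -- list of methylation grades between 0 and 1
    error_rate -- probability that a grade is replaced (e.g. 0.05)
    rng        -- optional random.Random instance for reproducible runs
    """
    rng = rng or random.Random()
    noisy = []
    for grade in profile:
        if rng.random() < error_rate:
            noisy.append(rng.random())   # random grade in [0, 1)
        else:
            noisy.append(grade)          # keep the original grade
    return noisy


original = [0.9, 0.8, 0.1, 0.0, 1.0]
print(add_random_errors(original, 0.05, random.Random(42)))
```

The perturbed profile is then classified with the same model, and the result is compared with the classification of the original profile.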

What about "unknown (?)" samples regarding the training of logistic regression models?

For the training of models, you can choose the samples that should act as training data. Only samples with determined class are allowed. Samples with unknown class are not taken into account for training. However, these samples are later classified with the trained model as seen in the biomarker plot.

How to interpret the logistic regression formula of a biomarker?

In the formula, you just plug in the methylation scores of the CpG sites used in this biomarker. These scores should be obtained by applying the corresponding method (i.e. COBRA for a "CO" biomarker, SNuPE for an "SN" biomarker, Pyrosequencing for a "PY" biomarker, MSP for an "MS" biomarker, MethyLight for an "ML" biomarker and MeDIP-qPCR for an "MD" biomarker). The formula then yields the biomarker score for the corresponding sample. A score below 0 indicates an unmethylated sample, whereas a score above 0 indicates a methylated one.
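As a worked example of plugging values into such a formula, consider the following Python sketch. The intercept and weights are made-up numbers, not taken from any real biomarker, and the general form (intercept plus weighted sum of methylation scores) is the usual linear part of a logistic regression model; the exact formula for your biomarker is the one MethMarker displays. How a score of exactly 0 is handled is not specified above, so treating it as unmethylated is an assumption.

```python
def biomarker_score(intercept, weights, grades):
    """Linear score of a logistic regression model: b0 + sum(w_i * x_i)."""
    return intercept + sum(w * g for w, g in zip(weights, grades))


def classify(score):
    """Score above 0 means methylated; otherwise unmethylated (assumption for 0)."""
    return "methylated" if score > 0 else "unmethylated"


# Hypothetical model with two CpG sites.
intercept = -3.0
weights = [4.0, 2.0]

score_a = biomarker_score(intercept, weights, [0.9, 0.8])   # -3 + 3.6 + 1.6
score_b = biomarker_score(intercept, weights, [0.1, 0.2])   # -3 + 0.4 + 0.4
print(score_a, classify(score_a))
print(score_b, classify(score_b))
```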

HINT: You may be able to identify stricter bounds for the classification. Look at the plot in which MethMarker classified all samples according to this biomarker. Perhaps all unmethylated samples have a score < -10, suggesting that a score below -10 strictly indicates an unmethylated sample. In the same way, all methylated samples may have a score > 30, suggesting that a score above 30 strictly indicates a methylated sample.

Note that the regression formula is trained on DNA methylation profiles obtained from bisulfite sequencing, which results in precise methylation profiles/scores for single CpG sites. Applying another method, like COBRA, SNuPE, Pyrosequencing, MSP, MethyLight or MeDIP-qPCR, may result in less accurate CpG score data, such that a biomarker model is less predictive. Experimental validation of a logistic regression formula is essential.


Webservices, Data Security & Confidentiality

What happens with my data when I choose to calculate primers with Primer3?

The raw target region sequence (and nothing else, especially not the experimental data) is sent to ePrimer3 at EMBOSS webservice (http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/eprimer3.html), where the primers are calculated and passed back to the client.

What happens with my data when I choose to browse the sequence via UCSC?

The raw target region sequence (and nothing else, especially not the experimental data) is sent to UCSC Genome Browser (http://genome.ucsc.edu/).

What happens with my data when I click on "Retrieve Gene Information Online"?

The raw target region sequence (and nothing else, especially not the experimental data) is sent to UCSC Genome Browser (http://genome.ucsc.edu/), where chromosome and start/stop positions are determined. This information is then used to retrieve SNPs (SNP) and gene annotations such as transcription start sites and exons (TSS) via the UCSC Genome Browser. MethMarker uses the following databases:

  • Human: SNP: hg18.snp130; TSS: hg18.refGene
  • Mouse: SNP: mm9.snp128; TSS: mm9.refGene
  • Rat: SNP: rn4.snp125; TSS: rn4.refGene

I cannot connect to the UCSC server to get sequence information!

To get sequence information from the UCSC server, you need unrestricted internet access. Problems may occur especially when you work behind a firewall. Adding the following access rules to the firewall might solve the problems:

  • HGW8.CSE.UCSC.EDU Port 80
  • HGNFS2.CSE.UCSC.EDU Port 3306