Sequence Comparison

This is a simple tool for comparing DNA sequences, optimized somewhat for virus-length genomes such as SARS-CoV-2. Some more sophisticated tools (although possibly less oriented to SARS-CoV-2) are listed on Wikipedia. This document is a tutorial describing what this visualization can show, what the components mean, and how to interact with it.

Details of some peripheral topics are hidden by default, and can be shown by clicking on the topic. The line below is an example:

Why would you want to compare sequences?

There are many reasons, vastly more than I will discuss here.

You may be wondering how similar different coronavirus strains are, or how the difference between two human-infecting coronavirus strains compares to the difference between a human-infecting strain and a strain affecting some other animal. In this case the first visualization described here is more relevant.

Alternatively, you may be interested in the specific differences between two different samples of some well-studied virus like SARS-CoV-2. In this case the second visualization described here is more useful.

Selecting the sequences to compare

The first thing to do is select the sequences you want to compare. This is done by entering their GenBank or RefSeq accession numbers in the boxes marked "Accession 1" and "Accession 2". Version suffixes (such as the ".2" in NC_045512.2) are optional. Assuming the sequence is available, the page will show a description of that GenBank record on the right. This may take a second or two if the record needs to be retrieved from GenBank.
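
To illustrate what an optional version means, here is a minimal sketch that splits an accession into its base part and version. The function name split_accession is hypothetical and only illustrates the format; it is not how this site handles the input boxes.

```python
def split_accession(text):
    """Split 'NC_045512.2' into ('NC_045512', 2).

    The version is None when no numeric version suffix is given.
    """
    text = text.strip()
    base, dot, version = text.partition(".")
    if dot and version.isdigit():
        return base, int(version)
    return text, None


print(split_accession("NC_045512.2"))
print(split_accession("MT568634"))
```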

The default sequences (shown at the top of the image below) are NC_045512.2, the base RefSeq sequence for SARS-CoV-2, and MT568634.1, a different sample of SARS-CoV-2.

Below the two accession numbers is a pop-up menu labeled "Orthocononavirinae". This is a selection of RefSeq sequences for coronaviruses. To select one of these, click on the triangle to the right of "Orthocononavirinae". A list will pop up. Select one of the sequences by clicking on it, or close the pop-up with the X in the upper right corner. A selected sequence will replace the accession number in the accession box you selected most recently. These are only suggestions; the list is nowhere near exhaustive. See NCBI Virus for another three million suggestions.

Dot plot comparing NC_045512.2 and MT568634.1

Dot Plot

This is a standard method to compare sequences that may be very different (e.g. human and yeast).

Imagine that you have a great big grid, with each column corresponding to a single base in one sequence, and each row to a single base in the other sequence. Color a point white if the subsequence of some fixed length (say 20 bases) starting at that column of the first sequence is identical to the subsequence of the same length starting at that row of the second sequence. Otherwise color it black.
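
The grid described above can be sketched in a few lines of Python. This is only an illustration, not the code this site runs: the function name dot_plot_matches and the toy sequences are made up, and a fragment length of 4 is used so the example stays small.

```python
def dot_plot_matches(seq1, seq2, fragment_length=20):
    """Return the set of (column, row) grid points that would be white.

    A point is white when the fragment of seq1 starting at `column`
    equals the fragment of seq2 starting at `row`.
    """
    # Index every fragment of seq2 by the rows at which it starts.
    rows_by_fragment = {}
    for row in range(len(seq2) - fragment_length + 1):
        fragment = seq2[row:row + fragment_length]
        rows_by_fragment.setdefault(fragment, []).append(row)

    # Look up every fragment of seq1 in that index.
    white = set()
    for column in range(len(seq1) - fragment_length + 1):
        fragment = seq1[column:column + fragment_length]
        for row in rows_by_fragment.get(fragment, []):
            white.add((column, row))
    return white


a = "ACGTACGTTTGACC"
b = "ACGTACGATTGACC"  # differs from `a` at one position
points = dot_plot_matches(a, b, fragment_length=4)
print(sorted(points))
```

Most of the points lie on the diagonal (column equal to row), with a break around the single difference. The point (4, 0) is an off-diagonal match caused by the repeated ACGT in the first sequence.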

Of course, as DNA sequences are long, the resulting image would generally be too large to display on a computer monitor, so we look at a scaled-down version in which a pixel's brightness is determined by the number of white grid elements that the pixel represents.
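
As a rough sketch of that scaling step, the function below counts how many white grid elements fall into each pixel. The name downscale and the sample points are hypothetical, and how this site actually turns a count into a displayed brightness is not shown here.

```python
def downscale(white_points, n_columns, n_rows, width, height):
    """Count the white grid elements that fall into each image pixel.

    white_points holds (column, row) grid positions; the grid has
    n_columns x n_rows elements and the image has width x height pixels.
    """
    image = [[0] * width for _ in range(height)]
    for column, row in white_points:
        x = column * width // n_columns
        y = row * height // n_rows
        image[y][x] += 1
    return image


# A perfect diagonal on a 100 x 100 grid, plus one stray match.
sample_points = {(i, i) for i in range(100)} | {(90, 10)}
image = downscale(sample_points, 100, 100, 10, 10)
for pixel_row in image:
    print(pixel_row)
```

Each pixel on the diagonal collects ten white grid elements, while the stray match contributes a count of one to a single pixel, which is why random matches look faint next to the diagonal.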

If the two sequences are very similar, we would expect to see a diagonal white line where the sequences match, with perhaps some random matches off the line, which is indeed what we do see in the image above.

The first sequence is represented by a (purely decorative) cartoon of a single-stranded helix above the black square, on which one can zoom (more details later), with a nucleotide position ruler and annotations below the black square. The second sequence is represented similarly, except that its cartoon is to the left of the black square and its ruler and annotations are to the right.

SARS1 is still a close match, but not as close; changing the second accession to "NC_004718.3" produces a fainter diagonal line with some gaps. See image below.

Dot plot comparing SARS1 and SARS2

Parameters

The length of the nucleotide fragments that must match perfectly is given in the "Fragment Length" box, and can be edited. Longer fragments will of course generally produce fewer (fainter) matches, and generally less noise. Reasonable values are roughly 10 to 50.

The "Use Peptides" checkbox means that, instead of looking for perfect nucleotide matches, the comparison checks that each codon of three nucleotides encodes the same amino acid in both sequences. This means that some variations (those that don't change what protein a gene would produce) still count as perfect matches. This will generally make the picture brighter, and is generally more biologically meaningful. When using peptides, the "Fragment Length" is still given in nucleotides, and partial codons are ignored, so 12, 13, and 14 all produce the same results.

Lastly, brightness is a scaling factor; higher numbers generally make the image brighter. It has no biological meaning, and is there solely to make the image easier to interpret.
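
To make the "Use Peptides" comparison concrete, here is a minimal sketch using the standard genetic code. The function names translate and fragments_match are hypothetical, and the sketch assumes codons are counted from the start of each fragment; the reading frame this site actually uses is not specified here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(fragment):
    """Translate whole codons only; a trailing partial codon is ignored."""
    whole = len(fragment) - len(fragment) % 3
    return "".join(CODON_TABLE[fragment[i:i + 3]] for i in range(0, whole, 3))


def fragments_match(frag1, frag2, use_peptides):
    """Compare two fragments by amino acids or by exact nucleotides."""
    if use_peptides:
        return translate(frag1) == translate(frag2)
    return frag1 == frag2


first = "ATGTTAGGTAAA"
second = "ATGCTGGGCAAG"  # different nucleotides, same amino acids
print(translate(first), translate(second))
print(fragments_match(first, second, use_peptides=False))
print(fragments_match(first, second, use_peptides=True))
```

Because translate drops any trailing partial codon, fragments of 12, 13, or 14 nucleotides all translate to the same four amino acids. The sketch only handles upper-case A, C, G, and T; anything else (such as "N") would need extra handling.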

Zooming

You can zoom in on some region of the image by holding down the shift key, pressing the mouse button down at one corner of the rectangle you wish to zoom in on, dragging the mouse (holding the button down) to the opposite corner, and then releasing the mouse button. While you drag the mouse, the region that will be excluded is covered in purple. Horizontal and vertical zooming are largely independent; one can zoom in just one dimension by doing the same shift-drag combination over the annotation region below or to the right of the black rectangle.

When zoomed, the region of the sequence one has zoomed in on will be shown pictorially above and to the left of the black rectangle. See image below.

Zoom in on dot plot

One can move the zoom region around without changing its size by clicking and dragging (without the shift key) on the black rectangle, or move it in just one dimension by dragging on the annotations and zoom regions around the black rectangle.

You can zoom out in either dimension by clicking the zoom out icon (magnifying glass) on the upper left corner of the black rectangle.

Annotations

The annotations are mostly colored rectangles representing genes or other notable structures, and mostly come from the NCBI records. The arrows represent open reading frames, with their direction being the reading direction. Zooming in will give sequence details (and codons/peptides for coding sequences). Holding the mouse over them will give some more detailed information.

Colors on the image display

If a sequence matches more frequently than is vaguely plausible, pixels will be colored in red rather than white. This generally comes from repeated short sequences.

If a sequence contains "N" elements, the corresponding rows and columns will be greyed out.

Sequence Graph

Switching the Accession 2 back to the very similar "MT568634.1", we can compare the two sequences in a different way: by looking at individual base changes. To do this, click on the "Sequence Graph" tab to the right of "Dot Plot".

A graph comparing the two sequences will be shown, with every mismatch drawn. Clearly this is not workable for anything other than very close sequences, such as two different strains of SARS-CoV-2, as in this example (image below).

Sequence graph comparing two sequences

Each of the two colored lines represents one of the sequences. When the lines run together, they have identical sequences. When they are separated, they have different sequences. Path segments are labeled with the sequence (if short) or the length (if long).

The image above indicates that the NC_045512.2 sequence (grey) contains 12 nucleotides at the start that are missing in the MT568634.1 sequence (brown), followed by 1417 identical nucleotides, followed by an "a" in NC_045512.2 and a "g" in MT568634.1, followed by 15104 identical nucleotides ...
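
The idea of shared and split path segments can be sketched with Python's standard difflib module. This is a stand-in for illustration only: difflib is not the method this site uses, and the names path_segments and describe, along with the short toy sequences, are made up.

```python
from difflib import SequenceMatcher


def describe(segment, max_shown=8):
    """Label a segment with its sequence if short, else with its length."""
    return segment if len(segment) <= max_shown else str(len(segment))


def path_segments(seq1, seq2):
    """List the stretches where two sequences run together or apart."""
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            segments.append(("shared", describe(seq1[i1:i2])))
        else:
            segments.append(
                ("split", describe(seq1[i1:i2]), describe(seq2[j1:j2]))
            )
    return segments


# Toy data: seq1 has two extra leading bases and one substitution.
seq1 = "CC" + "ACGTTGCAAGCT" + "A" + "TTGACCGTAGGA"
seq2 = "ACGTTGCAAGCT" + "G" + "TTGACCGTAGGA"
for segment in path_segments(seq1, seq2):
    print(segment)
```

A "shared" entry corresponds to the two lines running together, and a "split" entry gives the label on each of the two separated paths (an empty label means that sequence has nothing there).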

This is very similar to the graph structure described in more detail in the tutorial for Show Graph.

Annotations on the first sequence are shown above the comparison graph; annotations on the second sequence are shown below the comparison graph. The meaning is the same as in the Dot Plot.

As elsewhere, you can zoom by holding the shift key down and dragging the mouse over the region you wish to zoom in on. Dragging the mouse without the shift key will move the zoom region.

Graph computation method

There are a couple of different ways of computing the graph displayed here. Parameters are disabled by default as you don't want to modify them unless you know exactly what you are doing.

The computation here starts off by finding runs of exactly matching nucleotides of length at least 20. A simplest path through these matches is found that aligns the two genomes, and the mismatched regions in between are then analyzed using the relatively computationally expensive Smith-Waterman algorithm.
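
For readers unfamiliar with Smith-Waterman, here is a minimal sketch of the algorithm with a linear gap penalty. The scoring values (match, mismatch, gap) are assumptions chosen for illustration, not the values this site uses, and the exact-match and path-finding steps are not shown.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGGG"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two lengths. That is why it makes sense to apply it only to the short mismatched regions between exact matches rather than to whole genomes.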

Saving images

The "Save Image" button allows saving the graphs and annotations in SVG (vector) format. It is less appropriate for the Dot Plot, as the Dot Plot is itself pixelated. The image directly above, without the surrounding browser, is the result of this button.

More than two sequences

You can compare more than two sequences by adding their accession numbers to the "Extra Accessions" box. Multiple accessions are separated by spaces. The image below comes from setting the "Extra Accessions" text field to "MT974069.1 MW012266.1", which are two more SARS-CoV-2 sequences. You now get an image with four graphs. The additional two sequences have their annotations at the bottom of the picture.

Graph of four sequences

Sequences that are quite different

Sequences that have more than a handful of differences will be unusable on this display - it will take a long time to compute them, you will be confused about what is happening, and then the picture will be horribly crowded. Don't do it, please!

Miscellaneous

You can send URLs to other people or bookmark them. When you change options, the changes are reflected in the URL, so anyone opening it will see the same thing you do.

You can use the sequence comparison images generated by this site as you wish, although citing this website is appreciated.

This website is created as a hobby; care but no responsibility is taken. The server is running on a computer in a basement with a home internet connection so large loads may slow it down.