Sequence Graph Visualizations

This is a prototype of a sequence graph visualization. It was originally designed for human genomes, but recent events made me try it on SARS-CoV-2 (the virus causing COVID-19). The latter is currently displayed here. This document is a tutorial describing what this visualization can show, what the components mean, and how to interact with it.

Some peripheral issues have the details hidden but can be clicked on for more information. An example is the line below this:

What is a sequence graph? Why would you want one?

The sequence of a DNA or RNA molecule is generally represented as a long linear sequence of nucleotides, each abbreviated by a letter, such as ACGTTGCTTGACAACCAAAACA.... RNA uses U instead of T, although often for convenience the letter T is used in such sequences anyway. If the molecule is circular, it is usually cut at some arbitrary point and then represented as a linear sequence, with the understanding that the ends are actually attached.

The sequence of a population of molecules (e.g. a species' genetic diversity) is less easy to represent. For convenience it is typical to choose a single exemplar of the species, and call that sequence the sequence of the species, with the understanding that actual members of the species will probably have similar but slightly different sequences.

A more accurate description of a species' genome includes some representation of the alternatives available. This is usually imperfect, as the full range of genetic diversity is not known, the interactions between different variations are unknown, and many of these variations are difficult to describe.

A common, simple, and remarkably useful approach is to attempt to catalog all the observed small genetic variations, based upon the observation that these seem to account for a lot of the genetic variation in many genomes. Furthermore, they are easy to describe, visualize, measure, and analyze. For instance, two sequences might be the same at all bases except one, where one nucleotide is different (a SNP), or there may be a nucleotide inserted or deleted at some place (an indel).

Large variations are harder to deal with in many ways. They are harder to discover and characterize (genome read alignment to a reference exemplar often fails near them), they are harder to describe and visualize, and they often interact with a large number of other variations, making analysis more complex. They also appear to be relatively infrequent, and so one can often go a long way just ignoring them. However, they are there and ignoring them is not ideal.

A sequence graph is a description and visualization technique that easily deals with small variations, and can describe some larger variations. It can also to some extent describe interactions of nearby variations. Instead of representing the sequence as a linear list of nucleotides, the sequence is represented using a graph (the network type of graph, not the plotting points on a chart type of graph). Such a graph would look something like a train network, where the tracks between stations consist of a linear list of nucleotides, and the different routes to get from point A to point B represent different possible sequences in the genome.

See the next section for an example which will be helpful. Note that there is little standardization of genome graphs so far.

An example of a sequence graph visualization

The image below shows a simple extract of a graph that describes the first hundred SARS-CoV-2 genomes published at NCBI.

Sequence graph extract showing conserved runs of 235, 17, 91 and 39 nucleotides separated by variations

The graph above, read from left to right, says that the genome contains (in this subsection) a series of 235 nucleotides that are the same in all sequenced strains. Then there is a choice between either a "t" or an "a". The thicker line for the "t" choice means that the "t" is more common. After this SNP come another 17 conserved nucleotides, the same in all strains. After that comes a more complex choice. By taking the lower loop one can completely skip the next 15 nucleotides. Alternatively (and more frequently), one can have the sequence "tggtca", followed by usually a "t" but occasionally a "c", followed by "gttatggt". After this complex variation come another 91 conserved nucleotides, a g/a SNP, and another 39 conserved nucleotides.
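As a rough illustration, the extract described above can be written down as a small directed acyclic graph. This is only a sketch: the node numbering, the EDGES table and the paths helper are names invented for this example, not the data structures this site uses, and the conserved runs are abbreviated to placeholders such as <235> because their actual bases are not listed here.

```python
from typing import Dict, List, Tuple

# Each edge is (target node, label). Conserved runs are abbreviated as "<N>".
# Node numbers are arbitrary; they only need to increase from left to right.
EDGES: Dict[int, List[Tuple[int, str]]] = {
    0: [(1, "<235>")],
    1: [(2, "t"), (2, "a")],        # SNP: t (more common) or a
    2: [(3, "<17>")],
    3: [(4, "tggtca"), (6, "")],    # the empty edge is the loop skipping 15 bases
    4: [(5, "t"), (5, "c")],        # SNP inside the complex variation
    5: [(6, "gttatggt")],
    6: [(7, "<91>")],
    7: [(8, "g"), (8, "a")],        # SNP: g or a
    8: [(9, "<39>")],
    9: [],                          # end of the extract
}

def paths(node: int = 0) -> List[str]:
    """Return every sequence spelled by a route from `node` to the end."""
    if not EDGES[node]:
        return [""]
    result = []
    for target, label in EDGES[node]:
        for rest in paths(target):
            result.append(label + rest)
    return result

all_paths = paths()
print(len(all_paths))   # 12 = 2 (first SNP) x 3 (deletion, t, c) x 2 (last SNP)
```

Each of the 12 routes is one sequence that the extract can represent; the graph itself does not say which combinations were actually observed together.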

The different colors of the paths represent, in this case, the mean sample date for the samples that take that branch. Green is late, blue is early. This is subject to change.

Are all sequence graphs like this? What can't this show?

No. There is little standardization of sequence graphs yet. They are not well enough understood for the trade-offs to be settled. More general graphs are more complex and harder to visualize, but can capture more of the genetic diversity.

This representation aims for simplicity at the expense of power. It uses a particular class of simple graphs called Directed Acyclic Graphs (DAGs). These can be represented in a left to right flowing manner that can be straightforwardly visualized. It can clearly represent SNPs and indels. It can represent simple multi-way choices. It can represent large structural variants, such as the alt-contigs in the human genome reference GRCh38.

This structure cannot represent some things that would be nice. It is messy to indicate some homology between alt-contigs. It cannot represent loops, which would be convenient for some repeating sequences. It cannot represent retrograde steps, which would be convenient for some gene shuffling. It is not good for showing evolutionary relationships between species such as chimp and human, due to the large number of rearrangements. There are many other desirable properties that are sacrificed for the sake of simplicity.

I do not expect that this is the best representation, but I feel that more tools that let people play with different representations will give them a feeling for the relative merits of the various trade-offs, with the hope that more consensus develops.

Genome Browser

Such a graph by itself is only part of a genome browser. The picture below shows the sequence graph with the SARS-CoV-2 RefSeq reference sequence, NC_045512.2, and annotation for that sequence including genes and other sequence features.

The top cartoon of a single-stranded spiral (not a particularly appropriate shape, but it is a cartoon) represents the full genome. The picture is zoomed in on a portion represented by the white rectangle just below the spiral cartoon. The length and position of this rectangle represent the portion of the genome being shown below it.

Annotated Genome Browser Picture

The sequence graph is on the bottom, showing a large number of SNPs and a few other variations, the third variation being a deletion of att.

Between the zoom lines at the top and the sequence graph at the bottom are the annotations on the reference sequence NC_045512.2. The uppermost annotation is a ruler indicating the position on the NC_045512.2 sequence. Below the ruler are some rectangles representing sequence features. These features are primarily genes (green, with arrows indicating coding sequences and read direction, and name above). The purple rectangles represent other sequence features, which can include subunits of a gene (ORF1ab codes for multiple non-structural proteins).

Why is the ruler nonlinear?

The horizontal distance between nucleotides is not even. This is in order to show the variations in the sequence graph in a readable manner. Some portion of the horizontal space is assigned to the actual sequence, and some portion is assigned to the variations. This apportionment of the horizontal space can be adjusted using the control "Percentage of space to represent actual sequence length".
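One way to picture this apportionment is the sketch below. It is a guess at the general idea rather than the layout code used here: the layout function, the "seq"/"var" segment kinds and the rule of sharing the leftover space equally between variations are all assumptions made for illustration.

```python
from typing import List, Tuple

def layout(segments: List[Tuple[str, int]], width: float,
           sequence_percent: float) -> List[float]:
    """Return the horizontal width given to each segment.

    segments: ("seq", bases) for conserved runs, ("var", bases) for variations.
    sequence_percent: share of `width` spread in proportion to sequence length;
    the remainder is shared equally between the variations (an assumption).
    """
    total_bases = sum(length for _, length in segments)
    n_variations = sum(1 for kind, _ in segments if kind == "var")
    # With no variations there is nothing else to spend space on.
    seq_share = width if n_variations == 0 else width * sequence_percent / 100.0
    var_share = width - seq_share
    widths = []
    for kind, length in segments:
        w = seq_share * length / total_bases
        if kind == "var":
            w += var_share / n_variations
        widths.append(w)
    return widths

# The extract from the earlier example: runs of 235, 17, 91, 39 and three variations.
segments = [("seq", 235), ("var", 1), ("seq", 17), ("var", 15),
            ("seq", 91), ("var", 1), ("seq", 39)]
widths = layout(segments, width=1000.0, sequence_percent=50.0)
print([round(w, 1) for w in widths])
```

With these settings a one-base SNP gets far more room than the 17-base conserved run next to it, which is why positions on the ruler are not evenly spaced.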

If one zooms in a long way the nucleotides are visible. The image below shows a region where a stem loop (middle blue rectangle) in the RNA sometimes causes a frameshift in translation, reading a c twice, leading to two alternate interpretations of the sequence, one of which has a STOP codon shortly after the skip.

Sequence graph zoomed in to the sequence level

The sequence graph now has the nucleotides written on it (somewhat difficult to read; it's on my list of defects). Also, the genes have the same nucleotides, grouped into codons, with the amino acid that each codon encodes shown below its three nucleotides.

Zooming and panning the Genome Browser

To zoom in, hold down the shift key, then click and drag the mouse horizontally over the region to zoom in upon. Shading will partially obscure the unselected area. Release the mouse to perform the zoom. Alternatively, clicking on the zoom in icon will zoom in by a factor of two. Click on the zoom out icon to zoom out by a factor of two. In each case, the zoom will be around the centre of the image, unless the end of the genome prevents this.

To pan, click on the image and drag in the desired direction. This is most intuitive if you click on the sequence graph/annotation region, and will function in the way that most mobile phone apps scroll. You may instead click on the zoom lines and rectangle to manipulate them directly; this functions like the scroll bar on most desktop applications, and scrolls in the opposite direction.

Entering a zoom by text

There is a text box between the zoom in icon and the zoom out icon that contains two numbers separated by "..". These numbers range from 0 (the start of the genome) to 1 (the end of the genome), and represent the left and right zoom boundaries. These numbers are reflected in the URL. The only time you are likely to want to edit them is to remember a particular zoom region or share it with someone else, although the easiest way to do that is to bookmark the URL or send the whole URL.
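For anyone scripting around these values, the sketch below parses the zoom text and mimics zooming by a factor of two around the centre, kept inside the genome. The function names parse_zoom and rescale are made up for this example, and the clamping rule is my reading of the behaviour described above rather than the browser's actual code.

```python
from typing import Tuple

def parse_zoom(text: str) -> Tuple[float, float]:
    """Parse zoom text such as "0.25..0.5" into (left, right) fractions."""
    left_text, right_text = text.split("..")
    left, right = float(left_text), float(right_text)
    if not 0.0 <= left < right <= 1.0:
        raise ValueError(f"expected 0 <= left < right <= 1, got {text!r}")
    return left, right

def rescale(left: float, right: float, factor: float) -> Tuple[float, float]:
    """Scale the window width by `factor` around its centre, kept inside 0..1."""
    width = min(1.0, (right - left) * factor)
    centre = (left + right) / 2
    new_left = centre - width / 2
    # Shift the window back inside the genome if it would run past either end.
    new_left = max(0.0, min(new_left, 1.0 - width))
    return new_left, new_left + width

print(rescale(*parse_zoom("0.25..0.5"), 0.5))   # zoom in:  (0.3125, 0.4375)
print(rescale(*parse_zoom("0..0.5"), 2.0))      # zoom out: (0.0, 1.0)
```

The second call shows the end-of-genome case: zooming out from the left half cannot stay centred, so the window is shifted to start at 0.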

Hover information

Many of the features have too much information about them to show simultaneously. Instead, the information is made available by hovering the mouse over the element in question. A light grey rectangle is drawn over the feature being hovered over, and a dark grey rectangle then shows the extra information.

Mouse over showing information about the stem loop

Above is a small amount of information shown by hovering the mouse above the middle purple rectangle (the frameshift-causing loop). Hovering slightly higher would show information on the ORF1ab gene.

You can also hover over an edge. For curved edges, hover over the mostly-horizontal section. This will give information about the effect of that variation, and the sequences that contained that variation. The image below shows the effect of hovering over the well known D614G mutation [Plante, J.A., Liu, Y., Liu, J. et al. Spike mutation D614G alters SARS-CoV-2 fitness. Nature (2020).]

D614G refers to a mutation in the S gene that converts amino acid 614 from D (aspartic acid) to G (glycine). On the RefSeq sequence NC_045512.2, the S gene starts at base 21563 (which can be established by hovering the mouse over the S gene). Adding 613 codons of 3 nucleotides gives 21563 + 3 x 613 = 23402. So codon 614 goes from 23402..23404 on NC_045512.2. The picture below comes from hovering the mouse over the "g" at this position. The line Off NC_045512.2 : 23403 says that the "g" is not on the reference, but replaces the base at 23403 on the reference. Below this it shows the effect on the codons: the reference is gat encoding Asp/aspartic acid while the variant is ggt encoding Gly/glycine. The middle nucleotide of the codon is bold as it is the varying nucleotide.

After the variant line are four lines showing the four reported sequences at this position. Two of these (n and r) represent ambiguous sequence measurements. The other two are the main ones; there were 19739 instances of g, and 2895 instances of a. Most of them contain metadata describing what month the samples were taken and what country they were taken in. This is summarized in the table below. The g line is in bold as it is the variant being hovered over, and described in the table below.

mouse over showing D614G details

The metadata table shows that, throughout the world, early infections had the original a allele, and later infections overwhelmingly had the g allele. In particular, each entry shows a fraction with the numerator the number of g alleles, and the denominator the total number of sequences for that country and month.
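The position arithmetic used for D614G above is easy to check in a few lines. The helper name codon_range is made up for this sketch; the S gene start (21563) and the gat/ggt codons are the values quoted in the text.

```python
def codon_range(gene_start: int, codon_number: int) -> range:
    """1-based reference positions of a codon (codons numbered from 1)."""
    first = gene_start + 3 * (codon_number - 1)
    return range(first, first + 3)

S_GENE_START = 21563    # start of the S gene on NC_045512.2, from the text above
positions = codon_range(S_GENE_START, 614)
print(list(positions))  # [23402, 23403, 23404]

# D614G: the reference codon gat (Asp) becomes ggt (Gly).
reference_codon, variant_codon = "gat", "ggt"
changed = [pos for pos, ref, var in zip(positions, reference_codon, variant_codon)
           if ref != var]
print(changed)          # [23403], the middle base of the codon
```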

Saving images

Pressing the "Save Image" icon will save the genome browser as an SVG image file, which is a vector format suitable for high quality images. You can also do a normal screenshot, which will create a pixelated image.

Top controls

At the top of the page are two groups of controls, "What to show" and "How to display it". These respectively let you choose what graph to display, and some details of the layout.

Choosing what to display

The most important choice is the "Show Graph" control in the upper left. It lets you choose from some precomputed graphs based on the data from NCBI. A brief description of each graph is given.

The SARS-CoV-2 graphs are all made from the first n sequences listed at NCBI for various values of n. The smaller numbers can be useful for exploring the system: there is less going on, and so the display is less confusing.

The untrimmed graphs are graphs such that each input sequence can be represented perfectly by a path through the graph. The trimmed graphs are simplified subsets of these graphs. There are two types of distracting, probably low value graph structure that are removed:

How the graphs were computed

The graph starts off with the reference sequence. Then extra sequences are aligned to that graph, one at a time, and the graph is extended to represent the new genome diversity. After this is finished, the alignment is repeated, with no structure added to the graph, to get new use counts for each edge, and unused edges are removed. Then trimming is performed, and the alignment is redone once more to get new use counts for the edges.

Only sequences that are at least 29000 bases long are used.

Multiple sequence alignment experience has shown that ordering is important for such iterative alignments. All sequences containing no ambiguous nucleotides are processed before sequences containing ambiguous nucleotides. Within each of these groups, sequences are aligned to the reference sequence and the ones that align the most closely are processed first.
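The ordering rule above (together with the minimum length filter) can be sketched as a filter followed by a sort. This is an illustration only: the function names are invented, and the distance function is a crude stand-in (mismatch count plus length difference) for however closely a real alignment matches the reference.

```python
from typing import List

UNAMBIGUOUS = set("acgt")

def has_ambiguity(seq: str) -> bool:
    """True if the sequence contains any symbol other than a, c, g, t."""
    return any(base not in UNAMBIGUOUS for base in seq.lower())

def distance(seq: str, reference: str) -> int:
    """Stand-in for alignment closeness: mismatches plus length difference."""
    mismatches = sum(a != b for a, b in zip(seq.lower(), reference.lower()))
    return mismatches + abs(len(seq) - len(reference))

def processing_order(seqs: List[str], reference: str,
                     min_length: int = 29000) -> List[str]:
    """Drop short sequences, then put unambiguous and closest sequences first."""
    kept = [s for s in seqs if len(s) >= min_length]
    # False sorts before True, so unambiguous sequences come first.
    return sorted(kept, key=lambda s: (has_ambiguity(s), distance(s, reference)))

# Toy data, so the length threshold is lowered from 29000 to 8.
reference = "acgtacgtac"
samples = ["acgtacgtan", "acgtacctac", "acgtacgtac", "acg"]
ordered = processing_order(samples, reference, min_length=8)
print(ordered)   # ['acgtacgtac', 'acgtacctac', 'acgtacgtan']
```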

To reduce the chance of undesirable artifacts from the iterative addition, sequences are first aligned to the reference sequence, and long segments of perfect match are mapped to the same portion of the graph as the reference sequence. This also makes the process very fast. Then the segments that do not match the reference are aligned to the graph using a modified Smith-Waterman type algorithm.

For the large datasets, it is probably better to use the trimmed graphs, and indeed probably wise to trim them further with the next control, "Minimum number of usages of an edge to show if better available". If this number is greater than 1, the graph is trimmed further by removing any edge used fewer times than the number specified, as long as there is an alternative path between the same two nodes (places on the graph) that more sequences take.
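A simplified sketch of this extra trimming is below. It only considers the easy case where the alternative is a single parallel edge between the same two nodes (as in a SNP), whereas the real rule allows any alternative path. The Edge type and trim function are hypothetical names; the g and a counts are the D614G figures quoted earlier, and the n and ttt counts are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

class Edge(NamedTuple):
    source: int
    target: int
    label: str
    uses: int   # number of input sequences taking this edge

def trim(edges: List[Edge], min_uses: int) -> List[Edge]:
    """Drop edges used fewer than `min_uses` times if a better-used parallel edge exists."""
    best: Dict[Tuple[int, int], int] = defaultdict(int)
    for e in edges:
        key = (e.source, e.target)
        best[key] = max(best[key], e.uses)
    # Keep an edge if it is used often enough, or if nothing between
    # the same two nodes is used more than it is.
    return [e for e in edges
            if e.uses >= min_uses or best[(e.source, e.target)] <= e.uses]

edges = [
    Edge(0, 1, "g", 19739),
    Edge(0, 1, "a", 2895),
    Edge(0, 1, "n", 3),      # invented count for an ambiguous call
    Edge(1, 2, "ttt", 2),    # invented rare edge with no alternative
]
kept_labels = [e.label for e in trim(edges, min_uses=5)]
print(kept_labels)           # ['g', 'a', 'ttt']
```

The rare "ttt" edge survives because nothing better is available between its two nodes, which is the point of the "if better available" part of the control's name.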

Choosing how to display it

There is a text control indicating the proportion zoomed in on, surrounded by the zoom in and out buttons. You will rarely want to directly enter this.

How exactly is the line thickness chosen for each edge?

The line thickness is proportional to the cube root of the number of sequences taking that edge. This makes more used edges visibly thicker, without causing the problem of one line being twenty thousand times thicker than another, which would mean either one would be impractically thin, or one would be impractically thick. The exact function used may be an option in the future.
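To get a feel for how much the cube root compresses the range, here is a small sketch. The function name line_thickness and the scale parameter are assumptions for illustration; only the cube root relationship comes from the description above.

```python
def line_thickness(uses: int, scale: float = 1.0) -> float:
    """Thickness proportional to the cube root of the number of sequences on an edge."""
    return scale * uses ** (1.0 / 3.0)

# An edge used 20000 times versus an edge used once.
ratio = line_thickness(20000) / line_thickness(1)
print(round(ratio, 1))   # 27.1
```

So a twenty-thousand-fold difference in usage becomes roughly a 27-fold difference in thickness, which is still clearly visible but fits on a screen.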

Bookmarking and URLs

When you zoom, pan, or select options, your choice is reflected in the URL. This means you can bookmark your current position and come back to it, or send a URL to someone else.

Caveat: this is still a prototype under active development. There are a few bugs to fix, and future enhancements may change things slightly, which can affect the position of zooms. So URLs may change with time, although the aim is to keep them as stable as practical.

This website is created as a hobby; care but no responsibility is taken.

Frequently Asked Questions

What is on the TODO list

High priority items are:

Does the vertical position of a gene in the display mean anything?

No. The vertical position is just chosen so that genes do not overlap. In the case of multiple splice variants, there are multiple arrows for each gene stacked vertically, each arrow representing a different splice variant. These are not present in the reference sequence, unless you count the frame skip in ORF1ab.

How do I look at this on a phone?

With a magnifying glass. Sorry. I recommend a large screen - there is a lot of information to convey. I am considering Virtual Reality, which I consider likely to become a useful approach for scientific visualization in the coming decade.

Do you want suggestions, constructive criticism, and bug reports?

Yes please! Email username andrew at the domain andrewconway.org

Did you find what you were looking for?

No. I was looking for more complex genetic variation than one could easily characterize via standard multiple sequence alignment, which many people are doing better than I am. I have not yet found any present more than a trivial number of times. This is possibly disappointing from a scientific point of view, but it is probably a very good thing from a treatment point of view.

I did get some useful ideas for human variation representation, which is what I was originally working on before getting sidetracked into this.

Why is the server connection slow/unreliable?

Er. It's currently just running on a PC in our home basement. I've just put it up for a few people to look at and have not arranged more serious hosting yet.