Power of Round
Circular data tracks naturally support display of information at various resolutions.
Compared to a track at a radius r, a pixel in a track at r/4 will span a region 4x larger. Tracks in the interior of the figure are therefore useful to display low-resolution or summary information.
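The relationship between radius and resolution can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the full circumference is devoted to a 3 Gb genome with no gaps between ideograms, and the pixel radii (1000 and 250) are hypothetical values chosen for the example.

```python
import math

def bases_per_pixel(genome_size_bp, radius_px):
    """Bases covered by one pixel of arc length on a circle of the given radius."""
    circumference_px = 2 * math.pi * radius_px
    return genome_size_bp / circumference_px

GENOME_SIZE_BP = 3_000_000_000  # roughly 3 Gb
OUTER_RADIUS_PX = 1000          # hypothetical outer track radius r
INNER_RADIUS_PX = 250           # an inner track at r/4

outer = bases_per_pixel(GENOME_SIZE_BP, OUTER_RADIUS_PX)
inner = bases_per_pixel(GENOME_SIZE_BP, INNER_RADIUS_PX)

# Ratio of genomic span per pixel: inner track versus outer track.
ratio = inner / outer
print(f"outer: {outer:.0f} bp/pixel, inner: {inner:.0f} bp/pixel, ratio: {ratio:.1f}")
```

Because the circumference scales linearly with the radius, the ratio depends only on the two radii and not on the genome size.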
Dog vs Human Synteny Panel
The completion of the draft version of the dog genome revealed large overlaps between dog and human genomes. Working with American Scientist, Martin Krzywinski designed the cover image for the magazine's Sept/Oct 2007 issue, to accompany the article "Genetics and the Shape of Dogs" by Elaine Ostrander.
The panel shown here reveals the details in the structure of sequence similarity between each dog chromosome and the human genome (top) and each human chromosome and the dog genome (bottom).
Circos Sniffs out Dog Genetics in American Scientist
The completion of the draft version of the dog genome revealed large overlaps between dog and human genomes. Working with American Scientist, Martin Krzywinski generated an illustration showing blocks of similarity between the two genomes.
The illustration accompanies the article "Genetics and the Shape of Dogs" by Elaine Ostrander.
NYT Article - Mapping the Epigenome
In collaboration with Jonathan Corum from the NYT, Martin Krzywinski created an illustration of data showing methylation on chromosome 22 in a variety of tissues.
The illustration accompanies the article "Now: The Rest of the Genome" by Carl Zimmer.
Circos is used for the identification and analysis of similarities and differences arising from comparisons of genomes. Circos is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies.
Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line and histogram plots, heat maps, tiles, connectors and text.
Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.
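As a rough illustration of how an automated tool might emit such inputs, the sketch below formats pairs of genomic intervals as plain-text link records. The six-column, whitespace-separated layout (chromosome, start, end for each end of the link) and the `hs1`/`hs2` chromosome names are assumptions made for this example; the exact format should be checked against the Circos documentation. The function names are hypothetical.

```python
def format_link(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """Format one link between two genomic intervals as a text record.

    Assumed layout: chrA startA endA chrB startB endB
    """
    return f"{chrom_a} {start_a} {end_a} {chrom_b} {start_b} {end_b}"

def format_links(pairs):
    """Format an iterable of (interval_a, interval_b) pairs, one record per line."""
    return "\n".join(format_link(*a, *b) for a, b in pairs)

# Placeholder intervals, not real data.
pairs = [
    (("hs1", 100, 200), ("hs2", 250, 300)),
    (("hs1", 400, 550), ("hs3", 10, 160)),
]

link_text = format_links(pairs)
print(link_text)
```

In a pipeline, the resulting text would be written to a file that the configuration refers to.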
An interactive online version of Circos designed to visualize tabular data is available. Circos is licensed under GPL.
Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645.
The creation of Circos was motivated by a need to visualize structural variation within a genome. Initially, this variation was detected using BAC clones derived from tumor genomes: clones with alignments to distant regions of the genome captured a rearrangement in the cancer genome. The positions of these alignments were drawn circularly, and the density of the alignments (clones sampled the genome redundantly) was taken as confirmation of the rearrangement.
Subsequently, we began using Circos to show relationships between the sequences of multiple genomes, thus visualizing sequence synteny and conservation. Typically a genome is characterized in several ways, each at a different resolution, and Circos was used to show the relationships between corresponding positions within these representations (e.g. sequence assembly and fingerprint map).
Specific features are included to help view data on the genome. The genome is a large structure with localized regions of interest, frequently separated by large oceans of uninteresting sequence. To help visualize data in this context, Circos can create images with variable axis scaling, which lets you magnify selected genomic regions without cropping the rest. Scale smoothing ensures that the magnification level changes smoothly. In combination with axis breaks and custom ideogram order, the final image can be easily tuned to offer the clearest illustration of your data.
Let's look at an image which typifies one kind of genomic data illustration: one with a large number of links and several high-resolution tracks placed on the outside. This image appeared in Condé Nast Portfolio as part of an article about 23andMe.
The human genome comprises 22 pairs of autosomes (chromosomes 1-22) and one pair of sex chromosomes (X, Y). Individual chromosomes range from about 50 Mb (chr 21) to about 250 Mb (chr 1) and together make up the 3 Gb human genome.
This graphic shows the chromosomes arranged in a circle, drawn as wedges and marked with a length scale. Data placed outside the chromosome ring represent the degree of small- and large-scale variation found between different populations at a given position in the genome.
Data placed on top of the chromosome ring highlight positions of genes implicated in disease, such as cancer, diabetes, and glaucoma. Data placed inside the ring link disease-related genes found in the same biochemical pathway (grey) and show the degree of similarity for a subset of the genome (colored).
The graphic shows the human genome annotated with data related to genes implicated in disease, regions of variation found in various populations, and regions of similarity between chromosomes.
The 24 individual chromosomes (1..22 [each present in pairs in the genome], X, Y) are arranged circularly (C), and represented by labeled (C3) ideograms on which the distance scale is displayed (C1).
Some chromosomes are shown at different physical scales to illustrate the rich pattern of the data (chr2 3x; chrs 18,19,20,21,22 2x; chrs 3,7,17 10x). Within each ideogram, cytogenetic bands are shown (C2). These are large-scale features used in cytogenetics to locate and reference gross changes.
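The effect of per-chromosome scale factors on the layout can be sketched as follows. This is a simplified model and not Circos's implementation: it applies a single factor to a whole chromosome, ignores the spacing between ideograms, and does not attempt local magnification or scale smoothing. The function name is hypothetical, and the two lengths are the rounded values quoted earlier (about 250 Mb for chr1 and about 50 Mb for chr21).

```python
def angular_extents(lengths_bp, scale=None, total_degrees=360.0):
    """Share the circle among chromosomes in proportion to length x scale factor.

    lengths_bp: dict mapping chromosome name to length in bases
    scale:      dict mapping chromosome name to magnification (default 1.0)
    """
    scale = scale or {}
    weighted = {c: n * scale.get(c, 1.0) for c, n in lengths_bp.items()}
    total = sum(weighted.values())
    return {c: total_degrees * w / total for c, w in weighted.items()}

# Rounded, illustrative lengths; only two chromosomes to keep the example small.
lengths = {"chr1": 250_000_000, "chr21": 50_000_000}

uniform = angular_extents(lengths)                 # every chromosome at 1x
magnified = angular_extents(lengths, {"chr21": 2.0})  # chr21 drawn at 2x

for name in lengths:
    print(f"{name}: {uniform[name]:.1f} deg at 1x, {magnified[name]:.1f} deg with chr21 at 2x")
```

Magnifying one chromosome necessarily shrinks the angular share of the others, since the total remains a full circle.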
On the outside of the ideograms, genomic variation between individuals and populations is represented by tracks (A) and (B). The number of catalogued locations at which single base pair changes have been observed within populations is shown as a histogram (A). Large regions which have been seen to vary in size and copy number between individuals are marked in (B).
Locations of genes associated with disease are superimposed on the ideograms (D). (D3) shows the location of genes implicated in cancer (very dark red), other disease (dark red) and all other genes (red). (D2) shows locations of genes implicated in lung, ovarian, breast, prostate, pancreatic, and colon cancer, colored in progressively darker shades of red. (D1) marks positions of genes implicated in other diseases such as ataxia, epilepsy, glaucoma, heart disease, and neuropathy, colored in progressively darker shades of red, as well as diabetes (orange), deafness (green), and Alzheimer's disease (blue).
Grey lines (E) connect positions on ideograms associated with genes that participate in the same biochemical pathways. The shade of the link reflects the character of the gene: dark grey indicates that the gene is implicated in cancer, grey that it is implicated in disease, and light grey marks all other genes. Colored links (F) connect a subset of genomic region pairs that are highly similar and illustrate the deep level of similarity between genomic regions (about 50% of the genome is in so-called repeat regions, which appear in the genome multiple times and in a variety of locations).
Many of the data sets used in the figure are available through the genome browser at the University of California, Santa Cruz (UCSC). The data used in the figure were downloaded from the table browser for the human genome assembly (hg18, May 2006).
The data used (group/track) for figure elements is as follows
Gene-to-chromosome location mappings were done using the following data tables from UCSC
For example, genes implicated in diabetes were found by scanning for all gene and gene aliases that have the keyword "diabetes" in the OMIM entry. Subsequently, the list of gene names was cross-referenced with positional information from UCSC to obtain a final list of genomic positions.
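The two-step lookup described here (keyword scan, then cross-reference with positions) could be sketched as below. The record layout, the function name, and the sample genes and coordinates are all hypothetical placeholders; the actual OMIM and UCSC table formats differ and would need their own parsing.

```python
def find_gene_positions(omim_records, positions, keyword):
    """Return (name, chrom, start, end) for records whose text mentions keyword.

    omim_records: list of dicts with keys "symbol", "aliases", "text"
    positions:    dict mapping a gene name to (chrom, start, end)
    """
    keyword = keyword.lower()
    hits = []
    for record in omim_records:
        # Step 1: keep only entries whose text contains the keyword.
        if keyword not in record["text"].lower():
            continue
        # Step 2: try the gene symbol first, then each alias, against the
        # positional table; report at most one position per record.
        for name in [record["symbol"], *record["aliases"]]:
            if name in positions:
                chrom, start, end = positions[name]
                hits.append((name, chrom, start, end))
                break
    return hits

# Placeholder records and coordinates, not real data.
omim_records = [
    {"symbol": "GENE_A", "aliases": ["GA1"], "text": "Associated with type 2 diabetes."},
    {"symbol": "GENE_B", "aliases": [], "text": "Associated with glaucoma."},
    {"symbol": "GENE_C", "aliases": ["GC_OLD"], "text": "Diabetes mellitus susceptibility."},
]
positions = {"GENE_A": ("chr1", 1000, 2000), "GC_OLD": ("chr7", 500, 900)}

diabetes_genes = find_gene_positions(omim_records, positions, "diabetes")
print(diabetes_genes)
```

Checking aliases matters because the positional table may list a gene under a different name than the disease catalogue does, as with the placeholder `GC_OLD` above.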
Track (D3) shows all genes (red), OMIM genes (dark red) and genes from the Cancer Gene Census (very dark red), a manually curated subset of genes with strong evidence linking them to cancer. Tracks (D1) and (D2) show locations of genes associated with specific types of cancer, as well as with other diseases.
Links shown in (E) connect genes found in metabolic pathways catalogued by the KEGG database. For a given set of genes g1, g2, g3, ... , gn found in the same pathway, links are drawn between g1-g2, g2-g3, and so on.
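The chaining rule can be expressed in a few lines. This is a minimal sketch with a hypothetical function name; it pairs each gene with its successor in the list, so a pathway of n genes yields n-1 links rather than a link for every possible pair.

```python
def chain_links(genes):
    """Pair each gene with the next one in the list: g1-g2, g2-g3, and so on."""
    return list(zip(genes, genes[1:]))

pathway = ["g1", "g2", "g3", "g4"]
links = chain_links(pathway)
print(links)
```

Chaining keeps the number of links linear in the size of the pathway, which helps keep the interior of the figure legible when pathways contain many genes.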