Defining the complete repertoire of mutations driving cancer development and progression

Next-generation sequencing technology has opened new opportunities for cancer genomic research. It is now feasible to survey the entire sequence content of an individual tumour and define its accumulated somatic mutations and structural variations. We are systematically surveying transcriptome complexity, genome sequence content and structure, and epigenomic signatures in a large cohort of individual Pancreatic Cancers (in collaboration with A. Biankin, Garvan Institute) and Ovarian Cancers (in collaboration with D. Bowtell, PeterMac Cancer Institute) as part of the International Cancer Genome Consortium.

Over the last 18 months we have established multi-gigabase-scale next-generation sequencing technology and demonstrated its utility for studying gene activity, identifying which transcripts are made from each locus and surveying the sequence content of ES cell and HeLa transcriptomes. We are comprehensively surveying RNA abundance, sequence content (both mRNA and miRNA) and complexity in tumours. We have also developed the computational pipelines and experimental methods to create mate-pair libraries for high-resolution genome scanning for structural variations (SVs): insertions, homozygous and heterozygous deletions, translocations and inversions. The central concept for these studies is the creation of genomic libraries from the terminal sequences of genomic fragments of uniform length (i.e. clones of the 25-50 bp terminal sequences of 3 kb genomic fragments). When both tags of a "mate-pair" are independently mapped, they should lie the expected uniform distance apart (i.e. 3 kb). Structural variations in the genome perturb the observed distance between mate-pairs that span an altered region:

Mate pair library strategy. A) Sheared DNA is size selected and then circularized with inclusion of a biotinylated linker. The majority of the insert is then removed, and the linker plus terminal sequences are purified using streptavidin. Adaptors allow single molecules to be captured on beads. B) Sequencing gives two reads that should be a defined distance apart. C) Tags are mapped, and deviations from the expected mapping distance, the direction of the tag sequences, and the combination of both are used to identify structural variants (SVs) and define breakpoints. The distribution of normally mapping pairs is used to assess CNVs.

Large Insertions: can be identified by mate-pairs mapping closer together than expected.
Large Deletions: can be identified by mate-pairs mapping further apart than expected.
Tandem Duplications: can be identified by mate-pairs mapping in the reverse order.
Inversions: can be identified by mate-pairs mapping in the incorrect orientation.
Translocations: can be identified by each mate-pair mapping to a different chromosome.
Chromosomal Copy Number Variations: can be identified by changes in tag coverage and depth.
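
The rules in this list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline described here: the `MatePair` record, the `orientation_ok` and `order_ok` flags, the 500 bp tolerance and the 10 kb coverage window are all hypothetical choices made for the example. The expected orientation of a mate-pair depends on the library chemistry, so it is reduced to a precomputed flag. The first matching rule wins, whereas a real caller would weigh combinations of signals and require many concordant pairs before reporting an event.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class MatePair:
    """Mapped positions of the two tags of one mate-pair (hypothetical layout)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    orientation_ok: bool = True  # tags lie on the strands the library design predicts
    order_ok: bool = True        # tags appear in the expected order on the reference


def classify_mate_pair(pair, expected=3000, tolerance=500):
    """Return the kind of event a single mate-pair is consistent with."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if not pair.orientation_ok:
        return "inversion"
    if not pair.order_ok:
        return "tandem_duplication"
    distance = abs(pair.pos2 - pair.pos1)
    if distance < expected - tolerance:
        return "insertion"   # extra sequence in the sample: tags map closer together
    if distance > expected + tolerance:
        return "deletion"    # sequence missing from the sample: tags map further apart
    return "normal"


def window_copy_ratio(tag_starts, window=10000):
    """Tag count per window divided by the median count over occupied windows.

    Windows that contain no tags are omitted from the result.
    """
    counts = Counter(pos // window for pos in tag_starts)
    if not counts:
        return {}
    typical = median(counts.values())
    return {index: count / typical for index, count in sorted(counts.items())}
```

For example, `classify_mate_pair(MatePair("chr1", 1000, "chr1", 9000))` reports a pair consistent with a deletion, and a window whose ratio from `window_copy_ratio` sits well above or below 1 is a candidate copy number change.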

In addition to structural variant analysis, genome and transcriptome sequences can be screened at single-nucleotide resolution. Overlapping tags can be used to discern sequence variations such as SNPs, substitution mutations, insertions and deletions. We are refining pipelines to identify these events, determine whether they are known SNPs in the population and summarize the likely pathogenicity of novel events: synonymous vs non-synonymous changes, splice junction mutations, and the likelihood that a variant drives a cancer phenotype (using tools like Canpredict).
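
As an illustration of the synonymous vs non-synonymous distinction, the sketch below classifies a single-codon substitution using the standard genetic code. It assumes the reference and variant codons have already been extracted in frame from an annotated transcript; the function name and the three labels are hypothetical, and splice junction and driver-likelihood assessment are not covered.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_coding_substitution(ref_codon, alt_codon):
    """Label an in-frame codon change by its effect on the encoded protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "truncating"  # a new stop codon shortens the protein
    return "non-synonymous"
```

Under this scheme GAA to GAG leaves glutamate unchanged and is synonymous, GAA to GTA replaces it with valine and is non-synonymous, and TGG to TGA introduces a stop codon and is reported as truncating.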

Detecting mutations and expressed SNPs via RNAseq: High-quality sequence substitutions are recognised by examining QC calls and overlapping independent tags. Variations are then screened against dbSNP and, if novel and non-synonymous, against Canpredict. Truncating and splice site variations are also compiled separately. Similar clustering approaches can be used to identify indels.
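
One way to picture the substitution screen is sketched below. It is a hedged illustration rather than the pipeline itself: the `Observation` record, the quality cut-off of 20 and the requirement for three independent tags are assumed values, "independent" is taken to mean tags with distinct start positions, and `known_snps` is an in-memory set standing in for a dbSNP lookup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One tag's evidence for a substitution at one site (hypothetical layout)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: int    # base quality of the mismatching call
    tag_start: int  # mapped start of the tag, used to judge independence


def call_substitutions(observations, min_quality=20, min_independent_tags=3):
    """Keep substitutions seen at high quality in enough independent tags."""
    support = defaultdict(set)
    for obs in observations:
        if obs.quality >= min_quality:
            support[(obs.chrom, obs.pos, obs.ref, obs.alt)].add(obs.tag_start)
    return {
        site for site, starts in support.items()
        if len(starts) >= min_independent_tags
    }


def split_known_novel(calls, known_snps):
    """Separate calls already catalogued as SNPs from novel ones."""
    known = calls & known_snps
    return known, calls - known
```

Novel calls returned by `split_known_novel` would then go on to the non-synonymous, truncating and splice site checks described above.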