Sequence-Based Analysis of Human Cancer

Ryan Morin

Abstract

Our group produces and analyzes next-generation sequencing data to understand the role of mutations in cancer, focusing on various types of non-Hodgkin lymphomas (NHLs). We use high-throughput DNA sequencing methods and state-of-the-art algorithms to perform a large-scale meta-analysis of in-house and published data sets from NHL patients. Our analysis will determine the position of mutations in tumour cell populations and, by considering the mutation patterns across many patients, infer the genes and mutations under positive selective pressure that contribute to tumour cell fitness (i.e. drivers). Knowledge of such driver mutations in cancer will facilitate refined diagnostic methods and may lead to novel therapeutic strategies.

Project Description

In 2014, non-Hodgkin lymphomas (NHLs) collectively represented the seventh-highest cause of cancer-related deaths in Canada. B-cell NHLs (B-NHLs) account for the majority of NHLs and comprise several histologic types, each with its own clinical features. These include diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), mantle cell lymphoma (MCL) and Burkitt lymphoma (BL).

Because cancer is a disease of the genome, the discovery of improved therapies hinges on a better understanding of the genetic aetiology of B-NHLs. Furthermore, it is now appreciated that tumours continuously evolve, and delineating this process can give us insight into preventing or better managing treatment resistance. Next-generation sequencing (NGS) has proven highly valuable for achieving these goals, enabling unprecedented genome-wide studies of B-NHL tumour genomes. Since 2012, several published studies have explored the genomic landscape of each B-NHL type, shedding light on the genes and pathways implicated in tumour formation and progression. However, these studies are limited in statistical power by their small cohort sizes.
In addition, each of these studies focused on a single NHL type, forgoing the power that could be gained by leveraging oncogenic pathways known or suspected to be shared between disease types. For instance, FL, BL and many DLBCL tumours arise from the germinal centre and thus share many genetic characteristics. This highlights the potential gain in power from integrating the data from multiple cohorts and different types of NHL.

Owing to their timing and a lack of consistency in data analysis methodology, these studies used different versions of the human reference genome (hg18 and hg19, neither of which is the most recent) and various combinations of read aligners and variant calling algorithms, many of which are now considered outdated. These discrepancies make the published datasets not directly comparable for our needs. Accordingly, these data would greatly benefit from being re-analyzed by a unified bioinformatic pipeline composed of state-of-the-art computational tools. This process allows the human reference genome to be harmonized across all cohorts. The latest version (GRCh38) includes many improvements that increase the accuracy of downstream analyses such as variant calling. In addition, copy number variations could not readily be detected in exome sequencing data at the time of the earlier studies, whereas several methods can now accomplish this. Altogether, the re-analysis of these published datasets offers the potential to identify novel recurrent genetic lesions that previously went undetected because of inadequate computational methods.

The objective of our project is to perform a comprehensive meta-analysis of all published B-NHL genomic datasets. The scale of this analysis is unprecedented, amounting to over 300 tumour-normal pairs.
Our hypothesis is that this large integrative cohort will grant us increased statistical power for detecting lower-frequency events and enable us to better describe the commonalities and differences between the various B-NHL types. To achieve this, we will first re-align all sequencing reads to a common human reference genome, GRCh38, using the BWA-MEM algorithm. We will then perform somatic variant calling on the resulting BAM alignment files using matched tumour-normal pairs. We will use MuTect to detect single nucleotide variants and Strelka to call small insertions and deletions (indels). For copy number variations, we will use the TITAN algorithm, which is compatible with both genome and exome data. We will then predict the effect of each variant on the affected gene.

Finally, we will identify recurrently altered genes and pathways and determine their prevalence in the various B-NHL types using statistical approaches that identify patterns of positive Darwinian selection acting on cancer cell populations. This will facilitate the discovery of genes whose mutation contributes to the individual forms of NHL (i.e. drivers), which, in turn, may lead to improved diagnostic methods and novel therapeutic strategies.
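
The expected gain in power from pooling cohorts can be illustrated with a simple binomial calculation. The Python sketch below is an illustration only and is not part of the analysis pipeline described above. It assumes that a gene is mutated independently in each patient at a fixed frequency and asks how likely it is that the gene is seen mutated in at least a given number of patients. The 3% mutation frequency, the five-patient threshold and the 50-patient comparison cohort are hypothetical values chosen for the example; only the 300-pair cohort size comes from the project description.

```python
from math import comb


def prob_at_least(n_patients: int, frequency: float, min_cases: int) -> float:
    """Return P(X >= min_cases) for X ~ Binomial(n_patients, frequency).

    This is the chance of observing a gene mutated in at least `min_cases`
    patients, assuming independent patients and a fixed mutation frequency.
    """
    # Probability of observing fewer than `min_cases` mutated patients.
    below = sum(
        comb(n_patients, k) * frequency**k * (1 - frequency) ** (n_patients - k)
        for k in range(min_cases)
    )
    return 1.0 - below


if __name__ == "__main__":
    FREQUENCY = 0.03  # hypothetical mutation frequency of a low-frequency driver
    MIN_CASES = 5     # hypothetical number of mutated patients needed to flag a gene

    # 50 is a hypothetical single-study cohort; 300 is the pooled cohort size.
    for cohort_size in (50, 300):
        p = prob_at_least(cohort_size, FREQUENCY, MIN_CASES)
        print(f"n = {cohort_size:3d}: P(at least {MIN_CASES} mutated patients) = {p:.3f}")
```

Under these assumptions, the probability of observing the gene in at least five patients is substantially higher in the pooled cohort than in the smaller one, which is the intuition behind integrating the published datasets. A real driver analysis would also need to model background mutation rates, which this sketch ignores.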