Guidance for submitting whole genome sequencing (WGS) data to support the pre-market assessment of novel foods, novel feeds, and plants with novel traits

Purpose

The purpose of this document is to provide guidance to industry on the use of whole genome sequencing (WGS) to generate data for pre-market submissions for genetically modified plants. Commercial platforms for high-throughput sequencing were launched in the mid-2000s and continue to undergo rapid development. These platforms are now increasingly affordable, and with over a decade of experience, they are also more reliable and accessible to developers with different levels of resources. Adoption of WGS has been widespread in biological, medical and agricultural research, and more recently in clinical diagnostics and epidemiology. Canadian regulatory agencies and our international counterparts have received, and continue to receive, pre-market submission packages that include WGS data. Given the complexity of WGS data, industry has requested guidance that will enable them to compile pre-market submission packages that facilitate the regulatory review process. The use of WGS technology is optional, and data generated using traditional molecular biology methods are still acceptable.

On May 31, 2017, Health Canada and the Canadian Food Inspection Agency (CFIA) published a draft guidance document on the Health Canada website, requesting comments on this guidance from the larger stakeholder community. Comments were accepted until 12:00 a.m. EST on July 30, 2017. This final document includes minor editorial changes incorporated as a result of the comments received.

Early stages of the guidance document were developed following discussions of the Canada, United States, and Mexico Trilateral Technical Working Group (TTWG), and the perspectives of Canada's regulatory counterparts aided in developing this document. To the knowledge of Health Canada and the CFIA, this guidance document is the first to address the submission of WGS data for the pre-market assessment of genetically modified plants.

Preface

Novel Foods, Novel Feeds, and Plants with Novel Traits (PNTs) are required to undergo a mandatory pre-market assessment. Published guidance documents^Footnote 1 for developers of these products list the information and data that are required in a pre-market submission, typically including a full molecular characterization. The aim of the molecular analysis is to (i) show the changes introduced into the event genome, (ii) ascertain their stability, and (iii) assist in predicting the molecular or biochemical mode of action, or in other words, the mechanisms by which the genetic changes give rise to the novel phenotypes or traits.

Data generated using classical molecular techniques, such as Southern blotting, Sanger sequencing and polymerase chain reaction (PCR)-based assays, are routinely submitted by petitioners in support of their pre-market applications. These data inform the molecular characterization end-points that regulators consider in completing their assessments, namely characterization of:

the DNA that was inserted, deleted or modified
the number of complete or partial copies of the inserted DNA
the organization of any inserted or altered genetic elements, including coding, regulatory and other non-coding regions; this may include sequence data of the inserted DNA and surrounding regions, where appropriate (for example, to characterize a partial insertion or rearrangement)
the mode of inheritance and stability of the genetic change(s)

Taken together, the molecular data presented in a submission contribute evidence for the unambiguous interpretation of the nature and stability of the genetic changes contained in the event.

In recent years, a technological leap has been made with the development of massively parallel sequencing, also commonly referred to as high-throughput sequencing, next generation sequencing (NGS) or whole genome sequencing (WGS). In this guidance document the term WGS will be used. Several platforms have been developed that enable the rapid generation of large quantities of DNA sequence data. This highly automated technology is becoming increasingly accessible and affordable, and petitioners may wish to use it to generate data in support of the molecular characterization of their products.

In essence, all WGS methods involve collecting large scale genome^Footnote 2 sequence data at single nucleotide resolution, usually as a collection of fragments that require curation and often sophisticated computational analysis to interpret. WGS technologies and analytical methods are improving rapidly, and petitioners are advised to consult the scientific literature and device manufacturers' websites for information on the latest developments. For regulators, it is not the raw data but rather the demonstration of overall sequence quality, description and validation of the in silico (that is, computational) analysis, and data presentation that are of principal value to inform the assessment. As with any scientific data submission prepared for pre-market review, regulators reserve the right to ask for the raw data should the need arise.

In light of rapid ongoing changes in the field of sequencing, there is an absence of standardized procedures for producing and analyzing WGS data that would apply universally across all platforms and all applications of these techniques. A need was identified to set forth in a guidance document the principles and good practices that petitioners should consider in organizing and presenting a WGS-based analysis as part of a pre-market submission. The aim is to ensure that the WGS data submitted to regulators are produced through a well-documented analysis and are demonstrably at least as robust as the molecular data obtained using traditional molecular biology methods. This guidance describes the expectations for information to be included in a submission with regard to the WGS study design and methodologies, data analysis and data presentation.

I. WGS vs traditional molecular biology techniques

Examples in the literature have shown how WGS data can be useful as an alternative to Southern blots in the characterization of DNA insertions (Kovalic et al., 2012; Zastrow-Hayes et al., 2015) and may be applied to transgenic or cisgenic events. For different molecular characterization end-points (for example, for products of mutagenesis and/or selection, and in general for studies of trait inheritance), the use of classical methods may be more appropriate. Data produced using classical molecular methods remain acceptable for use in pre-market submissions and can be presented alone or in combination with WGS data for molecular characterization, regardless of the method of development used to produce the event.

When petitioners present data produced using traditional molecular techniques, the descriptions of the methodology and analysis are generally uncomplicated because the techniques are in widespread use and the data interpretation is typically straightforward. With WGS, each analysis can be unique and the sequence reads require customized and often sophisticated handling in order to generate interpretable results. For this reason, all manipulations applied to the sequence data have to be explained in a submission and any sequence that is eliminated from analysis requires justification.

It is up to the petitioner to demonstrate that the presented sequencing data accurately represent the event genome. Appropriate metrics, quality analyses and/or controls should be included and explained in order to give confidence to the regulator that the WGS characterization has been performed rigorously and that the results capture the genome structure and modifications accurately and completely.

II. WGS study design and methodologies

The overall strategy of the WGS study and motivation for the choice of methodology should be clearly explained. It should also be stated in the submission what molecular characterization end-points are addressed using the chosen methodology.

There are several sequencing platforms available, each offering a suite of models that are frequently updated. In addition to the instrumentation, DNA preparation kits and on-board software are optimized regularly. WGS technologies in general are powerful and versatile; however, each setup has its strengths and limitations, and some are better suited than others to different sequencing challenges. The submission should state the instrument make and model, as well as the version of the on-board software.

A description of how the DNA sample was prepared should include the distribution of the fragment sizes. If a commercial kit is used, this can be stated as well, with a mention of any known performance limitations and any steps that were taken to account for these.

In the context of WGS, bias can occur where the target sequence (or sequence of interest) contains any regions (for example, GC or AT rich, low complexity, or repetitive sequences) that give rise to sequencing artifacts with the result that they are over- or underrepresented in the data. Petitioners should mention and explain if any steps were taken prior to sequencing or afterwards during the analysis to account for such biases.
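
As an illustration of how regions prone to such bias might be located before or after sequencing, the following minimal Python sketch scans a sequence of interest in sliding windows and lists those with extreme GC content. The function names, window size, step and thresholds are assumptions chosen for the example; they are not values prescribed by this guidance.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flag_extreme_gc_windows(seq, window=100, step=50, low=0.30, high=0.70):
    """Return (start, end, gc) for windows whose GC fraction falls outside [low, high].

    The window, step and threshold values are illustrative assumptions only.
    Coordinates are 0-based, with the end position exclusive.
    """
    flagged = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc < low or gc > high:
            flagged.append((start, start + len(chunk), round(gc, 3)))
    return flagged
```

Windows flagged in this way are only candidates for over- or underrepresentation; whether they actually gave rise to artifacts would still need to be checked against the read data.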

Similarly, if the WGS experimental design calls for the use of controls, these should be explained. One example might be the sequencing of a reference genome spiked with target sequence, which is analogous to a positive control used in Southern blot analysis to show probe specificity.

Overall, the submission should include a clear description of the WGS study's intent and rationale. Laboratory protocols may be provided as supporting material (for example, in an appendix) and referenced in the overview of the methodology.

III. WGS data analysis

Depending on the molecular characterization end-point(s) being addressed, WGS sequence reads can be processed in different ways. The ultimate purpose of the data analysis is to generate tables and figures to present the key information distilled from the sequencing data that clearly supports the petitioner's conclusions regarding the molecular characterization end-points. Submissions should include a stepwise description of the data analysis pipeline, organized in order to facilitate the interpretation of the presented results.
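
One possible way to keep a stepwise description consistent with what was actually run is to record each step in a small structured form and generate the narrative outline from it. The sketch below is an assumption about how a petitioner might organize this; the class name, field names and the tool named in the usage note are hypothetical and are not required by this guidance.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineStep:
    """One step of an analysis pipeline (hypothetical record layout)."""
    name: str
    purpose: str
    tool: str                      # program name and version, as cited in the submission
    parameters: dict = field(default_factory=dict)
    outcome: str = ""


def describe(steps):
    """Render the steps, in order, as a plain-text outline."""
    lines = []
    for number, step in enumerate(steps, start=1):
        params = ", ".join(
            f"{key}={value}" for key, value in sorted(step.parameters.items())
        ) or "none recorded"
        lines.append(f"Step {number}: {step.name} [{step.tool}]")
        lines.append(f"  purpose: {step.purpose}")
        lines.append(f"  parameters: {params}")
        lines.append(f"  outcome: {step.outcome}")
    return "\n".join(lines)
```

For example, `describe([PipelineStep("Adapter trimming", "remove adapter sequence from reads", "hypothetical-trimmer 1.0", {"min_length": 50}, "reads shorter than 50 bases discarded")])` yields a four-line outline for that single step.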

The use of schematics to accompany the narrative description of the data analysis pipeline is encouraged. As appropriate, the following aspects should be included:

an explanation of any data cleaning and/or error correction applied to the read output, with disclosure of any eliminated outliers
a data quality report (for example, FASTQC^Footnote 3). These reports present basic statistical data such as the range of read lengths, number of reads, GC content, etc., as well as charts that present quantitative measures of the overall data quality
literature citations for any programs or algorithms used. If new computational tools are developed by the petitioner, validation studies should be included.
the purpose of each step in the pipeline, for example, searching, parsing, aligning, mapping, assembly, etc. The choice of parameters, including defaults, at each computational step should be justified or explained
the outcome of each step in the pipeline
for cases in which a reference sequence is used to map reads generated from the event genome, the petitioner should identify the reference strain or variety and present a rationale for the choice of reference
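
To illustrate the kind of basic statistics that a data quality report summarizes (number of reads, range of read lengths, GC content), the following Python sketch computes them directly from FASTQ text. It is not a substitute for an established quality-control tool. It assumes simple four-line FASTQ records and Phred+33 quality encoding, and the function name and output keys are hypothetical.

```python
def fastq_summary(handle, phred_offset=33):
    """Summarize reads from a text handle containing four-line FASTQ records.

    Assumptions: one sequence line and one quality line per record,
    and Phred+33 quality encoding unless phred_offset is changed.
    """
    n_reads = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    total_bases = 0
    quality_sum = 0
    while True:
        header = handle.readline()
        if not header:              # end of input
            break
        if not header.strip():      # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()           # '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
        n_reads += 1
        length = len(seq)
        min_len = length if min_len is None else min(min_len, length)
        max_len = max(max_len, length)
        upper = seq.upper()
        gc_bases += upper.count("G") + upper.count("C")
        total_bases += length
        quality_sum += sum(ord(char) - phred_offset for char in qual)
    return {
        "reads": n_reads,
        "min_length": min_len,
        "max_length": max_len,
        "gc_percent": 100.0 * gc_bases / total_bases if total_bases else 0.0,
        "mean_quality": quality_sum / total_bases if total_bases else 0.0,
    }
```

The function accepts any text file object, for example the result of `open("reads.fastq")`, where the file name is a placeholder.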

Coverage depth, breadth and uniformity are key considerations for data analysis and interpretation. There is no set threshold for coverage, as this will depend on the specific case. By way of example, a relatively low average coverage may be sufficient to show that a sequence of interest is present in the event genome. Any conclusion that hinges on having sampled the entire genome (that is, breadth of coverage approaching 100 percent) should be supported empirically using controls or other metrics. The factors that contribute to achieving a breadth of coverage that is appropriate for different applications are reviewed by Sims et al. (2014). In any WGS study, the petitioner needs to justify why the genome coverage is adequate for their conclusions. Any gaps in coverage or regions that have either shallower or deeper coverage compared to the average may require explanation or further characterization.
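
The three coverage metrics can be made concrete with a short calculation. The Python sketch below assumes that per-position read depths are already available as a list of integers, for instance exported from a mapping tool. The function names, the `min_depth` cut-off and the use of the coefficient of variation as a summary of uniformity are assumptions made for illustration. The first function uses the commonly used approximation that expected mean depth equals the number of reads multiplied by read length, divided by genome size.

```python
from statistics import mean, pstdev


def expected_mean_depth(n_reads, read_length, genome_size):
    """Approximate expected mean depth: reads x read length / genome size."""
    return n_reads * read_length / genome_size


def coverage_metrics(depths, min_depth=1):
    """Summarize depth, breadth and uniformity from per-position depths.

    breadth is the fraction of positions with depth >= min_depth
    (min_depth is an illustrative cut-off, not a prescribed threshold).
    Uniformity is reported as the observed range (min, max) and as the
    coefficient of variation of depth.
    """
    if not depths:
        raise ValueError("no depth values supplied")
    average = mean(depths)
    covered = sum(1 for depth in depths if depth >= min_depth)
    return {
        "mean_depth": average,
        "breadth": covered / len(depths),
        "min_depth": min(depths),
        "max_depth": max(depths),
        "cv": pstdev(depths) / average if average else float("nan"),
    }
```

Under these assumptions, one million reads of 150 bases over a 5,000,000 base genome give an expected mean depth of 30. The observed breadth and uniformity still have to be measured from the mapped reads, since a high mean depth can coexist with uncovered positions.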

IV. Presentation of the WGS data

The choice of how to present WGS data in tables and figures depends on the molecular characterization end-points addressed. Some examples can be seen in Kovalic et al. (2012) and Zastrow-Hayes et al. (2015), but petitioners are by no means limited to using these as models. The narrative text in a submission should explain and interpret the data and rationales that support the petitioner's conclusions. Information that can be presented, as relevant, includes:

charts from the FASTQC report (Section III) or similar analyses that show the quality of the read output data
coverage maps showing the variation in read coverage over the loci of interest
if unexpected sequence variants, substitutions, insertions, or deletions are observed in the event genome, these should likewise be explained and/or further characterized
if traditional molecular biology techniques are used to complement or clarify any ambiguity in interpreting the WGS data, the combined weight of evidence should be clearly explained
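
A coverage map over a locus of interest can be accompanied by a simple tabular summary. The following Python sketch bins per-position depths into fixed windows and labels each window relative to the average depth across the locus. The function name, window size, ratio thresholds and label names are assumptions for illustration, not criteria set by this guidance.

```python
def coverage_windows(depths, window=100, low_ratio=0.5, high_ratio=2.0):
    """Bin per-position depths and label each window relative to the overall mean.

    Returns (start, end, mean_depth, label) tuples with 0-based, end-exclusive
    coordinates. Labels: 'gap' (no reads), 'shallow', 'deep' or 'typical'.
    The ratio thresholds are illustrative assumptions.
    """
    if not depths:
        raise ValueError("no depth values supplied")
    overall = sum(depths) / len(depths)
    rows = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        average = sum(chunk) / len(chunk)
        if average == 0:
            label = "gap"
        elif average < low_ratio * overall:
            label = "shallow"
        elif average > high_ratio * overall:
            label = "deep"
        else:
            label = "typical"
        rows.append((start, start + len(chunk), average, label))
    return rows
```

Windows labelled other than 'typical' point to the regions that the narrative text would need to explain or characterize further.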

Glossary of terms

Algorithm: A process of set rules to be followed in calculations or other problem solving operations, especially by a computer.
Coverage: The number of times that a given nucleotide is captured by WGS reads. The coverage depth, breadth, and uniformity serve as metrics for the quality of the sequence data.
Coverage depth: The number of times that a given base position is read by a sequencing run.
Coverage breadth: The fraction of the genome that is captured by reads in a sequencing run.
Coverage uniformity: The range of coverage depth across a genome for a sequencing run.
De novo assembly: An approach to assembling WGS reads through alignment with one another, without guidance from a reference genome.
Gap: When aligning sequences, spaces are introduced to represent sites where one sequence contains more nucleotides compared to the other. It is interpreted that an insertion or a deletion ("indel") occurred at some point when the sequences diverged.
Junction: In the genome of a transgenic event, junctions are the sites of cassette insertion. Fragments of DNA which capture the junctions, i.e., sequence containing both the insert cassette and the endogenous host sequence, can be detected using Southern blot or WGS-based analysis.
Parameters: Variables in a computer program that can be changed by the user to influence the data output.
Pipeline: A collection of programs and scripts that allow the data to flow in a controlled direction into each program until completion. A pipeline can be built to fully automate execution of a series of programs and scripts to obtain an answer or complete an analysis.
Reads: DNA sequence fragments that are produced by a WGS experiment. The size of the reads depends on the initial library preparation and the sequencing platform. The collection of output sequences is typically of high quality and requires analysis and parsing to enable interpretation.
Reference: A database of nucleotide sequence data for the genome of an organism of interest. Reference genomes are curated and typically of very high quality and mostly assembled. Gaps in assembly are often highly repetitive regions. Reference genomes are also typically a hybrid of several donors and as such do not represent a single strain or individual.

References

Kovalic, D., et al. (2012) "The Use of Next Generation Sequencing and Junction Sequence Analysis Bioinformatics to Achieve Molecular Characterization of Crops Improved Through Modern Biotechnology." The Plant Genome 5(3):149-163. doi:10.3835/plantgenome2012.10.0026

Sims, D., et al. (2014) "Sequencing depth of coverage: key considerations in genomic analysis." Nature Reviews Genetics 15:121-132. doi:10.1038/nrg3642

Zastrow-Hayes, G.M., et al. (2015) "Southern-by-Sequencing: A Robust Screening Approach for Molecular Characterization of Genetically Modified Crops." The Plant Genome 8(1). doi:10.3835/plantgenome2014.08.0037