Abstract

We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome.

We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.

Main

Since its initial publication1,2, the human genome sequence has undergone continual improvements aimed at filling gaps and correcting errors. The latest release, GRCh38, spans 3.1 gigabases (Gb), with just 875 remaining gaps3. The ongoing effort to improve the human reference genome, led by the Genome Reference Consortium, has in recent years added alternate loci for genomic regions where variation cannot be captured by SNPs or small insertions and deletions (indels). These alternate loci, which comprise 261 scaffolds in GRCh38, capture a small amount of population variation and improve read mapping for some data sets.

Despite these efforts, the current human reference genome derives primarily from a single individual4, thus limiting its usefulness for genetic studies, especially among admixed populations, such as those representing the African diaspora. In recent years, a growing number of researchers have emphasized the importance of capturing and representing sequencing data from diverse populations and incorporating these data into the reference genome and genomics studies in general5,6,7. The alternate loci in GRCh38 offer one possible way to add such diversity, although it is unclear whether such a solution is sustainable as more populations are sequenced. Among other problems, the addition of alternate loci as separate contigs can mislead sequence alignment programs, which were designed under the assumptions that each read has a single true point of origin and that the genome is represented as a linear haploid sequence8.

The lack of diversity in the reference genome poses many challenges when analyzing individuals whose genetic background does not match the reference. This problem may be addressed by using large databases of known SNPs (for example, dbSNP9), but this solution only addresses single-base differences and small indels and is not adequate for larger variants. Findings from the 1000 Genomes Project indicate that differences between populations are quite large; examination of 26 populations across five continents revealed that 86% of discovered variants were present in only one continental group. In that study, the five African populations examined had the highest number of variant sites compared with the remaining 21 populations10.

One way to address the limitations of a single reference genome is to sequence and assemble reference genomes for other human subpopulations. The 1000 Genomes Project, Genome in a Bottle, and other projects have assembled draft genomes from various populations, including Chinese, Korean, and Ashkenazi individuals11,12,13,14,15. Other groups have used highly homogenous populations (for example, Danish, Dutch, or Icelandic individuals) together with assembly-based approaches to discover SNPs and structural variants (SVs), including up to several megabases of non-reference sequence common to these populations16,17,18,19. Although these variant analyses are a step in the right direction, to date, none have produced a reference-quality genome that can replace GRCh38 (ref. 3); however, this is an explicit goal of the Danish Genome Project (URLs).

While efforts to produce new reference genomes are worthwhile, attempts to create a pan-genome of a human population, a collection of sequences representing all of the DNA in that population, are rare. Although multiple pan-genomes have been created for bacterial species20,21,22, as of yet, there are no pan-genomes for any other animal or plant species. The lack of pan-genomes is due in part to the technical challenges of assembling many deeply sequenced genomes de novo and combining them into one genome. Whereas the Danish Genome Project focused on 50 trios of non-admixed individuals (removing admixed samples from their study17), our study focuses on a highly heterogeneous group of admixed individuals. Because the human reference genome is largely complete (the sequence has very few gaps), our strategy for creating a pan-genome focused on finding large insertions. This approach, although computationally demanding, made the African pan-genome assembly process described here feasible.

A 2010 study that sequenced one Asian and one African individual used the novel sequences identified to estimate that a full human pan-genome would contain an additional 19–40 megabases (Mb) that are not in the current reference genome11. Recent efforts to sequence a Dutch population and a set of 10,000 individuals have supported this estimate, reporting 4.3 and 3.3 Mb of non-reference sequences, respectively18,23; however, neither study was designed with the primary goal of discovering long, non-reference sequences. A 2017 study in which two haploid human genomes (hydatidiform moles) were sequenced using long reads estimated that a single diploid genome may differ by as much as 16 Mb from the reference genome24. As we describe here, our analysis of 910 deeply sequenced individuals, all from the Consortium on Asthma among African-Ancestry Populations in the Americas (CAAPA)25, produced a much larger amount of novel sequence (sequence absent from GRCh38) in the African pan-genome, spanning 296.5 Mb. We describe the methods used to identify and validate these sequences along with comparisons to other human sequences. The African pan-genome (APG) contigs have been deposited at NCBI under accession PDBU01000000 to provide a better foundation for future analyses of individuals of African ancestry.

In total, we discovered 296.5 Mb of novel DNA distributed across 125,715 sequences assembled from 910 individuals of African descent (Table 1 and Supplementary Fig. 1). We took steps to ensure contaminants and redundant contigs were removed, resulting in a non-redundant set of human contigs representative of the entire study group (Fig. 1). After discovery, we called presence/absence for all APG sequences in each CAAPA sample. A total of 33,599 contigs with a combined length of 81,096,662 bases represented sequences present in at least two individuals in the CAAPA cohort. When alignments above 80% coverage and 90% identity to Chinese and Korean genome assemblies were also considered shared, the number of non-private insertions increased to 61,410, totaling 160,475,353 bases and leaving 64,305 singleton contigs, a ~51% singleton rate. Of the 125,715 APG sequences, 1,548 (total length 4.4 Mb) were anchored to a specific location in the primary GRCh38 assembly. On average, each individual contained 859 of these inserted sequences, with a single sequence being shared among six individuals (Table 2). Placed contigs were shared among more individuals, 196 on average, as shared sequences were more likely to meet the placement criterion in at least one individual.
Table 1 Novel sequences in the African pan-genome
Fig. 1: Overview of methods.
Fig. 1
Raw reads are aligned to GRCh38, and unaligned reads are assembled with MaSuRCA. Assembled contigs are then filtered for contaminants with Centrifuge, and contigs shorter than 1 kb are removed (blue box). Assembled contigs are placed based on their mate’s alignment locations when possible by checking whether >95% of mates align to the same location. If such a placement is found, the exact breakpoint is determined via a nucmer alignment to the region for each end of the contig (yellow box). Contig placement locations are then compared between all individuals, nearby placements are clustered, and a representative is chosen. All contigs are then aligned to the representatives to determine which samples contain a given placed insertion. Contigs in or aligning to placed clusters are removed from the unplaced set, and the remaining unplaced contigs are aligned to one another with nucmer to remove redundancy and result in a final nonredundant unplaced set of contigs (purple box). EP, end placed; 1EP, one end placed; 2EP, two end placed; L, left; R, right.
Table 2 African pan-genome contig presence/absence statistics
We fully resolved the location for 302 of these sequences and resolved the breakpoint of one end of the insertion for the remaining 1,246 (Supplementary Table 1). Placement locations were determined by complementing our methods with results from the PopIns program16, which corroborated many placements and resolved placements for some insertions for which our method was ambiguous (Supplementary Note 1). The remaining sequences (Supplementary Table 2) could not be fully localized; however, mate-linking information pointed to a consistent location for at least one end for an additional 57,655 sequences (Supplementary Table 3). The longest placed sequence was 79,938 bp and appeared in 197 samples, and the longest unplaced sequence was 152,806 bp, which appeared in 11 samples (Table 1). Among all placed sequences, 387 intersected known genes, with placements within exons in 48 distinct genes and placements within introns in an additional 267 genes (some genes contained more than one insertion). Of the 315 genes containing insertions, 292 were named (had names other than ‘hypothetical’ or a non-meaningful identifier). An additional 133 placed insertions and 46 that already intersected a protein-coding gene intersected 142 distinct lncRNAs, 21 of which were named (Supplementary Table 4). A translated BLAST26 search on unplaced sequences against NCBI’s nr database yielded an additional 10,667 contigs hitting a chordate protein with ≥70% identity and an e value less than 1 × 10–10. Placement locations and gene intersections were dispersed throughout the genome, and placed pan-genome elements were found on every chromosome (Fig. 2), in addition to 115 insertions in chromosome-specific ‘random’ sequences and 103 more in ‘unlocalized’ sequences included in the primary assembly of GRCh38.
Fig. 2: African pan-genome contig locations.
Fig. 2
Map of the human genome showing the locations of all African pan-genome contigs, for those that could be placed accurately along one of the chromosomes. Yellow lines represent an intergenic location; blue lines represent insertion points with RNA but not exonic annotations, and red lines indicate intersections within exons. All exon-intersecting insertions are labeled with the gene name. mRNA and lncRNA gene names are reported in Supplementary Table 4. In some cases, insertions are too close together for lines to be resolved; when this occurs within exons, gene names are listed in order by chromosome position. Line width is not to scale.
Of our APG contigs, 31,354,079 bases aligned to a GRCh38 ‘patch’ or alternate (ALT) locus as part of an alignment with an identity of ≥80%. An additional 60,202,871 bases aligned to the primary assembly at ≥80% identity; however, most of these alignments covered a small portion of an APG contig and can be explained by the presence of extra copies of small repetitive elements. Data in Supplementary Tables 1 and 2 report alignments to ALT, patch, or primary assembly sequences covering at least 50% of the contig length with ≥80% identity. Requiring that at least 50% of a contig be aligned to any single location in GRCh38 produced a much smaller subset: of the 125,715 contigs, only 17,140 aligned to any part of GRCh38.p10 with a single alignment at ≥80% identity covering ≥50% of the contig length. These 17,140 contigs contain 22,420,979 aligned bases, with 13,770,950 bases being alignments to a reference chromosome. Although very few ALT loci in GRCh38.p10 are tagged with population-specific information, alignments of the CAAPA-specific sequences to these loci suggest an African source for some of these ALT sequences.
In addition to calling presence/absence of our APG insertions in the CAAPA individuals, we performed a similar analysis of 12 European and 12 African individuals from the Simons Genome Diversity Project (SGDP)27. The SGDP individuals varied in the number of APG sequences they contained (Supplementary Table 5), though analyzing the European- versus African-only contigs demonstrated that the APG insertions tend to be more representative of African than European assemblies, despite the admixed nature of the data (Supplementary Note 2).
We additionally aligned all 125,715 pan-genome contigs to recent human assemblies of Chinese (HX1)14 and Korean (KOREF1.0)15 individuals using bwa-mem28. We detected 42,207 contigs totaling 120.7 Mb aligning to either the Korean or Chinese assembly’s with ≥90% identity and ≥80% contig coverage, and matching the Chinese or Korean assembly better than GRCh38. A vast majority of these contigs (32,955) had no alignment at ≥80% identity and ≥50% coverage to GRCh38.p10, indicating that these sequences were not simply divergent from GRCh38, but rather were not present at all (Table 3); an example of such a sequence and its alignments to GRCh38 and HX1 are shown in Fig. 3. This finding suggests these sequences have been lost in the small number of individuals used to create GRCh38, although some of them may reside in the few remaining gaps in the genome.
Table 3 Comparison of African pan-genome contigs to the Chinese and Korean genomes
Fig. 3: An example of an alignment that does not meet the 50% coverage, 80% identity threshold for a ‘reasonably good’ alignment to GRCh38.
Fig. 3
The APG contig is shown at the top, with the best consistent alignments to GRCh38 in the middle. The three constituent alignments (blue, red, and yellow segments) cover 801 bases, just under 25% of the contig, with a cumulative weighted identity of 87.9%. CAAPA_113686 has a single near-perfect alignment to a Chinese HX1 contig (delineated by dotted lines) covering over 80% of CAAPA_113686 at over 90% identity. The APG contig also aligns well to the Korean assembly (not shown).
While Shi et al. reported 12.8 Mb of novel DNA in the HX1 genome14, we found a total of 68.1 Mb shared by HX1 and the unique sequences in the APG contigs (Table 3). This discrepancy is methodological: the Chinese genome assembly has relatively large scaffolds that were considered unique only if a large proportion of the scaffold failed to align to GRCh38 (Supplementary Note 3).
As an additional check to ensure the APG sequences were not contaminants, we examined what portion of contigs had some match, even just a partial match, to the GRCh38, Korean, or Chinese assemblies. After filtering to retain only query-consistent alignments, 98% of the contigs (123,600) had some portion aligning to either the Chinese, Korean, or GRCh38 assemblies. The Korean assembly had the most alignment, with 123,585 contigs containing an alignment totaling 247.2 Mb of aligned length, or 83% of the total APG sequence, although only 31,033 contigs, totaling 80.9 Mb of alignment, aligned with over ≥90% identity and ≥80% coverage.
Our findings here demonstrate that the standard human reference genome lacks a substantial amount of DNA sequence compared with other human populations. The APG sequences contain 296.5 Mb, equal to 10% of the genome, regions that will necessarily be missed by any efforts relying only on GRCh38 to study human variation, as nearly all studies do at present. Of these 296.5 Mb, 120.7 Mb were shared by the Korean or Chinese populations, suggesting those regions may have been lost more recently or may be rare in the specific populations represented in GRCh38. Overall these results suggest that a single reference genome is not adequate for population-based studies of human genetics. Instead, a better approach may be to create reference genomes for all distinct human populations, which over time will eventually yield a comprehensive pan-genome capturing all of the DNA present in humans.

URL

http://www.genomedenmark.dk/english/about/referencegenome/.

Methods

We used whole-genome shotgun sequence data from 910 individuals whose genomes were sequenced as part of the CAAPA project, available from dbGaP as accession phs001123.v1.p1. The total data set contains 1.19 trillion (1.19 × 1012) 100-bp paired end reads, representing an average of 30–40× coverage for each individual’s genome. Sequencing was performed on an Illumina HiSeq 2000. The subjects in the study were all of African ancestry and were selected from 19 populations across the Americas, the Caribbean, and continental Africa25 (Supplementary Table 6).

Assembly of novel contigs

For each sample, we aligned all reads to GRCh38.p0 using Bowtie2 (ref. 29) and extracted unaligned reads and their mates using Samtools30 (Fig. 1). GRCh38 alternate loci were excluded from the reference index, but were considered later in the process. We then assembled all unaligned reads with the MaSuRCA assembler31; if neither mate in a pair aligned to GRCh38, MaSuRCA treated the reads as paired ends with a fragment size of 300 bp, and if only one mate was unaligned, MaSuRCA treated it as an unpaired read.
We filtered the resulting assemblies to exclude contigs shorter than 1000 bp (Fig. 1) and evaluated all remaining contigs with the Centrifuge metagenomics program32, scanning against the comprehensive NCBI nucleotide database to obtain a taxonomic classification of each contig. We considered any contigs labeled by Centrifuge as non-chordates (for example, bacterial or viral contigs) to be contaminants and removed them from further consideration.

Positioning contigs within GRCh38

We attempted to place the assembled contigs in a precise location in the human genome using mapping information from paired reads (“mates”). We masked contigs with RepeatMasker33 with the low-complexity option off (–nolow) and used Bowtie2 to realign all unaligned reads from read pairs in which only one mate had aligned originally. For each read R aligning within 500 bases of the end of a contig, we examined the alignment of R’s mate to GRCh38 to determine whether the contig had a unique placement in the reference genome. The fragment length for all paired-end libraries was 300 bp; by considering reads within 500 bp of the end of a contig, we reduced the likelihood that one or both of the alignments was a spurious match. Additional details of the sequencing protocols for the CAAPA genomes are described elsewhere25. This process resulted in a pool of linking mates corresponding to the beginning and end of each contig.
We then separated contigs into several groups based on their linking information:
  1. 1. No linking mates existed on either end of the contig; the reads mates did not align to GRCh38.
  2. 2. Placement was unambiguous (or unique) for at least one end of the contig. We define ‘chromosome unambiguous’ to mean >95% of the linking mates linked to the same chromosome. We define ‘region unambiguous’ to mean that of the >95% of mates aligned to the same chromosome, all mates aligned within 2 kb of each other. When both conditions hold, we say placement is unambiguous. These contigs were further divided into two subgroups:
    1. a. Both ends of the contig were placed unambiguously, or
    2. b. Only one end was placed unambiguously.
  3. 3. At least one end of the contig was chromosome unambiguous, but neither end was region unambiguous.
  4. 4. Neither end was chromosome unambiguous.
For all contigs in the second group, we used NUCmer34 to align them to the region determined by the linking mates (Fig. 1). If a contig end had one or more consistent exact matches of at least 15 bases (and no inconsistent alignments), we then determined the contig end’s exact insertion location based on alignment coordinates (Supplementary Methods). We permitted an exact two-ended placement only if both ends aligned to the same reference region with the same orientation. The insertion position was either a single breakpoint, if both ends of the contig were placed identically, or a range, if the insertion location of the two ends was not identical. For contigs with only a single end exactly placed, we recorded their exact single-end insertion position and the number of overlapping bases (bases to be trimmed off the end of the contig).

Insertion discovery with PopIns

To supplement the list of placed contigs determined by the procedure above, we ran the PopIns program16, which was used previously for a set of genomes from Icelandic individuals, and was designed to find insertions from a relatively genetically homogenous population. We ran PopIns beginning with the popins merge step, using the cleaned MaSuRCA contig assemblies described above. We ran subsequent PopIns steps as recommended in the PopIns documentation, through the popins place-finish step. PopIns output was converted into a comparable format, and verifiable placements were added to our sets of insertions (Supplementary Methods).

Clustering of placed contigs

Once contig locations were determined for each individual sample, we aligned all insertions to one another and clustered them to determine which contigs represented the same insertion across individuals (Fig. 1).

Clustering two-ended placements

For contigs with both ends placed, we ran BEDtools merge35 to group contigs placed at approximately the same location. We used the -d option with a distance of 10 to allow placements within 10 bases of each other to be combined. We also ran the merge using -d 100, which produced identical results. For each resulting region and contig cluster, we chose the longest contig in the cluster as the cluster’s representative (R), and these representatives formed the initial set of two-end placed contigs, 2EP. Two-ended placement clusters from PopIns were then added to 2EP. We verified clusters by aligning all contigs in each cluster to its representative, R, with default nucmer parameters and removing from the cluster any contigs that did not have any alignments to R. To find the complete set of samples containing each insertion, we then aligned all remaining contigs (including unplaced contigs) to the contigs in the clusters. Any contig aligning with >99% identity that was fully contained within a contig in a cluster C and covered ≥80% of the contig in C was included in C as part of the final set. Contained, 99–100% identical contigs aligning with <80 25="" 5="" a="" african="" also="" and="" as="" at="" cluster="" collection="" contig="" coverage="" data-track-action="supplementary material anchor" data-track-label="link" data-track="click" each="" final="" five="" for="" had="" href="https://www.nature.com/articles/s41588-018-0273-y#MOESM3" if="" in="" included="" insertion="" kb="" least="" linked="" linking="" location.="" longest="" mates="" of="" pan-genome="" placement="" representative="" sequence="" tables="" the="" they="" those="" to="" upplementary="" used="" was="" were="" within="">1