Multiple Deeply Divergent Denisovan Ancestries in Papuans
Highlights
- •A new dataset of 161 genomes covering the understudied Indonesia-New Guinea region
- •Introgressing Denisovans comprise at least three genetically divergent groups
- •Papuans carry haplotypes from two Denisovan groups, with one unique to Oceania
- •Some Denisovan introgression was recent and likely occurred in New Guinea or Wallacea
Summary
Genome
sequences are known for two archaic hominins—Neanderthals and
Denisovans—which interbred with anatomically modern humans as they
dispersed out of Africa. We identified high-confidence archaic
haplotypes in 161 new genomes spanning 14 island groups in Island
Southeast Asia and New Guinea and found large stretches of DNA that are
inconsistent with a single introgressing Denisovan origin. Instead,
modern Papuans carry hundreds of gene variants from two deeply divergent
Denisovan lineages that separated over 350 thousand years ago. Spatial
and temporal structure among these lineages suggest that introgression
from one of these Denisovan groups predominantly took place east of the
Wallace line and continued until near the end of the Pleistocene. A
third Denisovan lineage occurs in modern East Asians. This regional
mosaic suggests considerable complexity in archaic contact, with modern
humans interbreeding with multiple Denisovan groups that were
geographically isolated from each other over deep evolutionary time.
Graphical Abstract
Keywords
Introduction
Contact
between modern humans and archaic hominins in the distant past has left
a distinctive genetic signature in all human populations alive today.
Modern humans interbred with multiple hominin species in different
places around the world, including Neanderthals (
), Denisovans (
), and possibly others (
,
).
Examining genome sequences to identify regions that introgressed from
these archaic species has revealed evolutionarily adaptive variants and
extended deserts of introgression (
,
).
Recently, analysis of Denisovan ancestry in populations across Eurasia
uncovered introgression from an extra branch on the Denisovan hominin
clade in East Asia (
).
However, the center of gravity of Denisovan admixture today lies
>8,000 km south of Denisova Cave in the Papuan populations of
tropical eastern Indonesia and New Guinea, where the composition of
Denisovan introgression remains poorly understood.
We
therefore analyzed archaic introgression in a new dataset covering
Island Southeast Asia (ISEA) and Papua, a maritime zone of densely
inhabited archipelagos larger than Europe. This culturally and
linguistically diverse region remains strikingly underrepresented in
modern genetic surveys, despite its extraordinary human diversity and is
a major missing link for medical and evolutionary studies (
). Notably, the area has some of the first traces of anatomically modern humans in Eurasia (
), archaic H. floresiensis likely coexisted with modern humans here (
),
and eastern Indonesians, Papuans, and Philippine “negritos,” together
with Siberians and South and East Asians, are among the few living
groups with substantial traces of archaic introgression from Denisovans (
,
,
).
Results and Discussion
The Indonesian Genome Diversity Project Fills a Gap in Regional Coverage
The
Indonesian Genome Diversity Project (IGDP) has been run by the Eijkman
Institute of Molecular Biology in Jakarta for over a decade, with the
goal of capturing a representative sample of genomic diversity across
this understudied region. Spanning a transect of communities across the
Indonesian archipelago and neighboring regions of ISEA, populations
chosen for whole-genome sequencing were selected to reflect the main
axes of genomic variation observed in an earlier population genetic
study (
). We sequenced complete genomes to >30× coverage for 161 individuals, from Sumatra in the west to New Britain in the east (Figure 1).
We combined this new dataset with 317 additional high-coverage human
genomes sampled world-wide, including those few genomes previously
available for ISEA and Oceania (
,
), and three complete archaic hominin sequences, the Altai and Vindija Neanderthals (
,
) and the Altai Denisovan (
) (Figure S1; Table S1; STAR Methods S1–S5).
To
confirm that the dataset captures expected genomic patterns, we
calculated principal components and determined local ancestry along the
genome (
) (Figure 1; Table S2; STAR Methods
S2 and S6). We observed key features of population diversity in the
region, notably a strong cline in Asian to Papuan ancestry across the
archipelago with an abrupt transition within the island zone of Wallacea
(
,
).
These signals primarily reflect recent events of regional history,
particularly the agricultural expansion of Austronesian-speaking
populations from ca. 4500 ya. This cline serves, however, as an
important backdrop to facilitate understanding of regional signals of
genetic contact between anatomically modern humans and archaic hominins.
Combining Methods Identifies High-Confidence Denisovan Introgression
Because
individuals in ISEA carry ancestry from both Neanderthals and
Denisovans, these archaic signals must be disentangled. Assigning clear
ancestry is a major challenge, especially for single variants or small
ancestry blocks with few informative variants, because of extensive
shared polymorphisms between the two archaic groups as well as
incomplete lineage sorting due to the shared early history of
Neanderthals and Denisovans. One way to overcome this problem is with
haplotype methods to detect longer introgressing blocks, which have more
easily assigned ancestry and are less likely to result from incomplete
lineage sorting. Examining introgressing haplotypes of Denisovan
ancestry offers further advantages over site-by-site methods such as D
statistics (
),
because haplotypes provide additional information on introgression
dates and better resolution of detailed relationships between
introgressing and archaic genomes.
To
obtain a set of high confidence blocks, rather than all possible
stretches of Denisovan introgression, we developed a protocol to extract
archaic regions using the intersection of three different statistical
methods (Table S3; STAR Methods S7 and S8). Denisovan blocks were classified based on the overlap of (1) ChromoPainter (CP) (
),
which was used to identify haplotypes that are more similar to the
Denisovan genome than to a panel of sub-Saharan Africans; (2) an updated
Hidden Markov Model (HMM) (
,
) detecting the same signature; and (3) S∗ (
), which identifies clusters of linked non-African variation. These haplotypes were then filtered, using a range of protocols (Figures 2A–2C; STAR Methods S9a), to remove blocks that were also similar to the Altai Neanderthal (
) and optionally the Vindija Neanderthal (
), as measured by CP, leaving a dataset of high-confidence introgressed Denisovan regions.
A
clear correlation emerged between loci identified by the three methods.
Most archaic introgression was detected in Papua and East ISEA, with
the least in West Eurasia. Our multi-step filtering approach enriched
detectable Denisovan introgression in Papuans relative to West Eurasians
(Figures 2A–2C; STAR Methods S9c and S9d), who are thought to harbor little Denisovan introgression (
). This enrichment rose from 6.4-fold when using CP alone to nearly 50-fold when combining CP, HMM, and S∗.
The result is approximately 32.3 Mb of high-confidence Denisovan
introgressed blocks per genome copy for each Papuan individual. For
comparison, just 688 kb of Denisovan blocks were identified in West
Eurasians, which is consistent with earlier low estimates (
).
As
a further check, we compared the total amount of our high-confidence
haplotypes with Denisovan ancestry proportions calculated with counting
statistics as reported for the Simons Genome Diversity Project (SGDP)
samples (
) (STAR Methods
S9b). The total introgression in West Eurasia estimated by CP alone did
not correlate with genome-wide estimates of Denisovan introgression
estimated by D statistics but instead correlated strongly with estimates
of Neanderthal introgression (extracting 20% of the signal).
Strikingly, however, this signal drops to 2% for our high-confidence
Denisovan blocks, showing that Neanderthal spillover has been almost
entirely removed from the high-confidence Denisovan block set by our
multi-step filtering approach.
A strong correlation between Denisovan and Papuan ancestry (r2 = 0.98, p = 2.6 × 10−20) confirms that these two components have interconnected histories (Figure 2D; STAR Methods
S9d). In this correlation, a gradient close to 1 and a correspondingly
low intercept of –0.01 is observed for the high-confidence Denisovan
blocks, consistent with the inference that Denisovan introgression is
largely confined to Papuans. Low levels of Papuan ancestry in West ISEA
(<5 a="" class="figure-link scroll-into-link" data-link="modal" data-locator="fig1" data-target="#image-S0092-8674(19)30218-1gr1" denisovan="" href="https://www.cell.com/cell/fulltext/S0092-8674(19)30218-1?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867419302181%3Fshowall%3Dtrue#fig1" id="back-gr1" in="" introgression="" limited="" match="" observed="" region="" the="">Figure 15>
Denisovan Populations Introgressed into Papuans Twice
To
determine whether there is structure within the Denisovan sequences, we
calculated mismatch distributions between our high-confidence Denisovan
blocks and the high-coverage Altai Denisovan genome (STAR Methods
S10). Small ancestry blocks can be a problem for mismatch analysis,
because the mismatches of individual haplotypes have an imprecise
correspondence to genetic divergence caused by the low resolution
offered by a small number of stochastic, discrete polymorphisms (STAR Methods
S10a) and because small blocks are more often affected by incomplete
lineage sorting. We therefore profiled mismatches across a range of
block lengths in Papuans, the population in our dataset with the highest
Denisovan introgression. Intriguingly, we observe two clearly separate
mismatch peaks in Papuans (Figure 3A; STAR Methods
S10a). This suggests that Papuans carry lineages from two genetically
different Denisovan populations that had been separated from each other
for a very long time. These peaks are also observed when using just the
CP blocks or just the HMM blocks alone (STAR Methods
10b). Resolution of the two peaks (here called D1 and D2) improves with
block sizes greater than approximately 130 kb, with much less
resolution for blocks <50 a="" as="" block="" class="scroll-into-link" expected="" for="" href="https://www.cell.com/cell/fulltext/S0092-8674(19)30218-1?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867419302181%3Fshowall%3Dtrue#sec3" kb="" sizes="" small="">STAR Methods50>
We also confirm the signal previously reported in East Asians (
), but again only for longer block lengths (Figures 3B
and 3C). The mismatch peak (here called D0) is additionally seen not
just in East Asian populations, where it was originally detected, but
also in Siberians, indigenous Americans and at very low frequency
elsewhere across Asia (Figures 3B and 3C; STAR Methods
S10a). The blocks in this predominantly East Asian mismatch peak have
relatively low divergence to the Altai Denisovan, suggesting that modern
humans in Asia mixed with a Denisovan population that was closely
related to the Denisovan reference individual.
As
longer blocks are better able to capture demographic complexity from
mismatch distributions, we profiled mismatch patterns using the 2,000
longest blocks to maximize the signal in each population. Regional
patterns are apparent: D1 is restricted to Papuans, while D2 has a wider
geographical distribution spanning much of Asia and Oceania (Figure 3C; STAR Methods S10a; Figure S2).
Gaussian mixture model testing strongly supports the presence of two peaks in Papuans (Figure 4A, AIC = −5,809 versus unimodal −5,583, STAR Methods
S10d). In Papuans, blocks of length >180 kb were assigned to one or
the other peak based on >80% support from the mixture model. We
confirmed that D1 (less divergent from Altai Denisovan) and D2 (more
divergent) blocks do not differ in a wide range of molecular genetic and
bioinformatic parameters, including GC content, genotype call quality
of the archaic reference allele, alignability, recombination rate,
sequencing batch effects, and levels of background selection (B values;
) (STAR Methods
10e and 10f). We also checked whether variants within the D1 and D2
blocks have the same tree topologies. Both peaks show a strong signal of
an origin within the Denisovan clade rather than branching from deeper
points in the hominin tree (Table S4; STAR Methods S10g).
We
further verified that the observed bimodal mismatch distribution in our
high-confidence Denisovan blocks is not due to misclassification of
Neanderthal blocks. The polymorphic sites of both peaks predominantly
show the Denisovan-specific topology, with the Neanderthal-specific
topology observed only at low levels (STAR Methods
S10g). Further, if one of the peaks were caused by Neanderthal
introgression misclassified as Denisovan, that peak should be seen in
West Eurasians, who have substantial Neanderthal but no Denisovan
admixture. However, West Eurasians have neither of the two Denisovan
mismatch peaks. To additionally check whether some portion of the
Neanderthal introgression signal could have been missed by only using
the Altai Neanderthal reference, we repeated several analyses using CP
and the Vindija Neanderthal (
). This approach yields highly consistent results with the original analysis (Figure S3; STAR Methods
S10c). Finally, we identified Neanderthal-specific blocks in Papuans
using the same methodology as for the high-confidence Denisovan blocks.
These do not show a bimodal mismatch distribution to the Altai
Neanderthal (Figure S3; STAR Methods
S10c), suggesting that the history of Denisovan introgression in Papua
differed markedly from modern human interactions with Neanderthals.
Deep Divergence between Denisovan Populations
Next, we sought to retrieve dates of divergence between D1, D2, and the Altai Denisovan genome through coalescent modeling (Tables S5A and S5B; STAR Methods S10i). After extending an archaic demographic model (
)
to encompass two deeply divergent Denisovan-related components, our
best fitting model indicates that D1 and D2 split from the Altai
Denisovan approximately 283 kya (9,750 generations, 95% confidence
interval [CI] 261–297 kya) and 363 kya (12,500 generations, 95% CI
334–377 kya), respectively (Figure 4B).
While clearly branching off the Denisovan line, D2 diverged so closely
to the Neanderthal-Denisovan split that it is perhaps better considered
as a third sister group (STAR Methods
S10i). For context, even the youngest of these divergence times is
similar to the evolutionary age of anatomically modern humans (earliest
known fossils, with varied morphologies, date to 198 kya (
) and 315 kya (
)).
Our model implies substantial reproductive separation of multiple
Denisovan-like populations over a period of hundreds of thousands of
years.
The Two Denisovan Lineages Introgressed at Different Times
The
distribution of block lengths retains a signal of introgression time,
with longer blocks expected from more recent introgression events. In
general, block length is expected to decay over time approximately as an
exponential distribution (
).
We confirmed the accuracy of introgression dating by exponential
fitting of the block length distribution through extensive simulation,
incorporating different introgression times over the time period of
interest (0–2,000 generations), and considering the impact of using only
long blocks rather than the entire distribution of block lengths,
substantial block length estimation errors, and the consequences if
introgression occurred as an extended process rather than a single pulse
(Figure S4; STAR Methods
S10h). We observed a slight tendency to infer overly recent dates under
some of these conditions, but never by more than 10%–15%. Filtering to
longer block lengths and fitting an exponential with a larger location
parameter help to reduce even these biases in date estimates.
While
the median block lengths of D1 and D2 are similar in Papuans (238 and
236 kb), their distributions are significantly different
(Kolmogorov-Smirnov statistic = 0.15, p = 2.2 × 10−6).
Exponential fitting of D1 and D2 haplotype lengths yields introgression
dates of 29.8 kya (95% CI 14.4–50.4) and 45.7 kya (95% CI 31.9–60.7),
respectively, which are younger, though overlapping with, previously
suggested estimates for Denisovan introgression (Figure 4B; STAR Methods S10h) (
).
The maximum likelihood introgression date for D2 introgression is 50%
more ancient than the date for D1. Based on simulations, and given the
greater statistical challenge of identifying shorter introgression
blocks, we consider these dates to be probable lower bounds on
introgression times, but with true dates no more than 15% more ancient.
Geographical Patterns of Denisovan Admixture in Papua
D1
and D2 introgression times that overlap the timescale of modern human
arrival and their variable dispersal across Papua raise the possibility
that Denisovan introgression occurred after local populations of modern
humans had differentiated. We find geographic structure associated with
the D1 variation between mainland New Guinea and the Baining, a
population on the offshore island of New Britain. We observe slightly
less high-confidence Denisovan introgression in the Baining than in
mainland Papuans (31.5 Mb versus 33.1 Mb per haploid genome, Welch’s t
test T = −3.4, p = 0.001), despite extremely similar population
histories (
), including similar levels of Asian ancestry (Figure 1A).
However, there is less D1 sequence in the Baining than in mainland
Papuans (1.33 Mb versus 1.82 Mb per haploid genome, Welch’s t test T =
−3.9, p < 0.01), although both carry similar levels of D2 sequence
(1.28 Mb versus 1.37 Mb, T = −0.8, p = 0.41) (Figures 5A and 5B; STAR Methods S10h).
To
determine whether this difference in D1 sequence could be due to random
drift in the two populations or to different Denisovan introgression
histories, we extended the simulation model (
)
to incorporate population structure representing both New Guinea
mainlanders and Baining, in addition to the two introgressing Denisovan
populations (D1 and D2) (Figure S5; STAR Methods
S10j). To test a conservative model offering maximum opportunity for
isolation and drift, we did not include any migration between Papuans
and Baining after their population split. Archeological evidence
suggests that New Britain was settled by at least 35 kya (
), and from the genomic data, SMC++ (
) infers a genetic split time between mainland Papuans and Baining of 15.7 kya (Figure 5C).
We therefore implemented three alternative demographic models: using
the SMC++ genetic split times and population sizes (M1); using the SMC++
split time and more conservative (smaller) population sizes, thus
generating more drift (M2); and a model with a more conservative (older)
genetic split time of 23.2 ky (800 generations), also generating more
drift (M3) (Figure S6; STAR Methods
S10j). As expected, the observed difference in rates of D2
introgression between Baining and mainland Papuans are within the
distributions predicted by the simulations. However, in all three cases,
the observed ratio of D1 in mainland Papuans to Baining lies outside
simulated values (Figure 5D).
Together,
these coalescent simulations suggest that the reduced frequency of D1
blocks among the Baining is unlikely to result from shared D1
introgression into a common ancestral Papuan population, followed by
drift as each population subsequently diverged into the modern Baining
and mainland groups. Instead, the difference in D1 levels more likely
reflects different amounts of introgression from Denisovan populations
into mainland New Guinea and the islands to the northeast, which
occurred after the separation of the two Papuan populations (Figure 5E). The overall genetic similarity and relatively recent divergence of these Papuan groups (Figures 1 and 5C; STAR Methods S10h, S10j) have implications for the past distribution of D1 Denisovan populations and the process of archaic introgression.
First,
our data suggest that the D1 Denisovans, in contrast to D2, contributed
additional DNA to the mainland New Guinea population after the mainland
and Baining populations diverged from their common Papuan ancestor (Figure 5E).
This, together with the nearly complete absence of D1 in continental
Asia, is most consistent with the scenario that D1 Denisovans were
present in New Guinea or East ISEA (e.g., Wallacea). In turn, this would
imply that at least some Denisovan populations had the ability to cross
large bodies of water, such as the one represented by the Wallace Line.
This idea does not seem implausible given archaeological evidence of
archaic hominin dispersals—notably, the discovery of stone tools in the
Philippines dating to 700 kya (
) and the related finding of H. floresiensis on the island of Flores (
),
both across substantial water boundaries that persisted throughout the
Pleistocene. Such geographical barriers would limit gene flow and might
help to explain the extent of divergence between the D1 Denisovan
population and other Denisovan groups.
Second,
the late date for the D1 introgression and geographic structure in
modern populations suggests that Denisovans survived until 30 kya, and
perhaps as recently as 14.5 kya. This is longer than Neanderthals, who
died out around 40 kya (
), or H. floresiensis, which recent dating suggests did not persist on Flores beyond 50–60 kya (
).
The implication is that Denisovans living in ISEA may have been among
the last of all the archaic hominins to survive. This provides an
argument to screen for Denisovan remains possibly misclassified as other
hominins in existing archaeological collections and encourages more
archaeological research in the poorly accessible and hence incredibly
understudied New Guinea region.
Third,
the combined evidence of geographic structure and a recent D1
introgression date suggest that Denisovan introgression did not occur
immediately following the first modern human settlement in the region
(45–50 kya) (
).
This implies that introgression with archaic hominins may not be an
inevitable and immediate result of joint occupation of the same
territory.
High-Frequency Denisovan Blocks Include Many Archaic Gene Variants
We also investigated whether the Denisovan DNA that entered modern Papuans could have included regions with adaptive benefits (Table S6; STAR Methods
S11). We initially focused on genes introgressed from D1 and D2. As we
could only assign long blocks to D1 or D2 ancestry, we can only partly
describe diversity contributed by specific Denisovan groups. However,
we did identify 412 unique genes in introgressed blocks assigned to D1
and D2, including high-frequency blocks. The haplotypes with highest
frequency in either lineage included the linked genes FAM178B/FAHD2B/ANKRD36 (65% frequency), ZNF280D (38%), and FBXL20/MED1/CDK12 (28%) from D1, and ANKRD28 (30%), a region 15 kb downstream of CENPW (29%) and NFAT5/NQO1 (22%) from D2 (STAR Methods S11d).
To
explore adaptive introgression from Denisovans more broadly, we
profiled the frequency of all >20 kb introgressing haplotypes in East
ISEA and Papua (Figure 6; STAR Methods
S11a), an approach that considers the actual introgressing haplotypes
rather than being window based and thus offers greater precision in
identifying genes that may have contributed to adaptation. We first
searched for evidence of ontology enrichment (
) in genes found in the top 1% most frequent Denisovan haplotypes (STAR Methods
S11c). Enrichment was observed in categories related to smooth muscle
cell proliferation, immunity, and adipogenesis in both Papuans and East
ISEA.
Focusing on the 10 highest-frequency introgressed haplotypes (STAR Methods S11a), we replicated several previously known signals—WARS2 in East ISEA but not Papua (
), introgression in TNFAIP3 in both East ISEA and Papua (
), and FAM178B (
)
but seen here more in Papuans than in East ISEA. We additionally
observe two previously unknown high-frequency introgression signals in
both regions, centered around the TMPO, IKBIP, and APAF1 genes, as well as in a single gene, WDFY2. The latter has been identified as a focus of accelerated evolution in humans since the Neanderthal-Denisovan split (
) and is involved in endocytosis (
) and adipogenesis through regulation of PPARG (
), which is also a high frequency Denisovan introgressed gene in Papua and East ISEA. Depletion of WDFY2 in 3T3-L1 adipocytes is associated with reduced insulin-stimulated glucose uptake (
), indicating a role in both the differentiation and functioning of adipoctyes.
To
determine whether Denisovan gene variants in modern humans may have
experienced recent positive selection, we calculated nSL (
)
in 200 kb windows across the genome for mainland Papuan samples, the
Baining, and East ISEA separately. Several top 1% high-frequency
Denisovan introgressed genes were in the top 5% of nSL windows (STAR Methods S11b). Overlapping hits in the Baining included TNFAIP3 (nSL window percentile 3.6%) and WDFY2 (2.2%). The possibility of adaptive introgression at TNFAIP3 has been raised previously in the context of selection on immunity (
). The function of WDFY2 has been discussed above. We also note the top 1% nSL signals in genes with important roles in both lipid metabolism (FASN in Baining, mainland Papua, and East ISEA, and a window containing both FADS1 and FADS2 in Baining only) and carbohydrate metabolism (most notably AGL
in both Baining and mainland Papua). Taken together, it appears that
Denisovan introgression may have been an important source of diversity
for recent adaptation, both in the context of immunity and, potentially,
dietary adaptation.
Limited Evidence of Further Introgression Complexity in East ISEA and Papua
Given the recent presence of Homo floresiensis in our study area (
,
), and the possibility that late Homo erectus was contemporary with the earliest anatomically modern humans in ISEA (
),
we investigated whether there might be any hints of archaic hominin
ancestry, other than Denisovan or Neanderthal, in the dataset. We
attempted to detect such signals by analyzing S∗ windows that exhibit minimal overlap with Denisovan or Neanderthal blocks as identified by CP and HMM (residual S∗, STAR Methods S12).
We first note a pronounced excess in total S∗
signal in our Papuan samples (97.2 Mb) compared to East Asians (50.9
Mb), South Asians (48.3 Mb), and West Eurasians (40.8 Mb). After
confirming that this excess was primarily driven by introgressing
Denisovan ancestry, we estimate that any additional introgression from
outside the Human-Neanderthal-Denisovan clade was limited with an upper
bound of about 1% (STAR Methods S12a). Next, by profiling residual S∗
among different continental groups, we detect a slight excess of unique
variation that is not shared with other humans, the Altai Denisovan or
the Altai Neanderthal in East ISEA and Papua (Figure S7; STAR Methods S12b). The signal is not strong, and the difference in total residual S∗
between different global populations is small, suggesting at most
little introgression from outside the Human-Neanderthal-Denisovan
lineage in these two populations. This could hint at a more complex
introgression history involving unknown archaic hominins in ISEA and
Papua, such as H. erectus, as has been recently suggested for other Asian populations (
). For instance, the Altai Denisovan is also thought to have some H. erectus ancestry (
,
,
,
,
),
although it is not yet clear whether this is also true for
introgressing Denisovan populations. Equally, however, these genomic
signals could arise without further introgression events, notably
through balancing selection or incomplete lineage sorting, and so
warrant careful further study.
Finally, our dataset includes Rampasasa, a village on Flores that is close to the cave site where the H. floresiensis bones were found (
), and whose inhabitants were the subject of a recent genetic study (
) The proportion of Neanderthal and Denisovan introgression, and the amount of residual S∗ in this village is comparable to neighboring populations (Figure 7; STAR Methods S13), suggesting the absence of unusual archaic admixture in Rampasasa villagers relative to other people in East ISEA.
Conclusions
The discovery and characterization of archaic hominins has typically begun with the analysis of fossil remains (
,
,
,
).
However, as Denisovan admixture has its center of gravity in ISEA and
Papua where DNA rarely survives more than a few thousand years in the
humid tropical environment (
,
),
studying the genetic record from modern humans remains the sole way to
shed light on the substructure and phylogeography of archaic hominins in
this important but understudied region.
Here,
we use a statistical approach on new genomes from ISEA and Papua to
identify two new Denisovan groups (D1 and D2) and describe the
relationships between these archaic hominins long before they first
interacted with anatomically modern humans. Both groups branched off
early from the Altai Denisovan clade at 283 and 363 kya and were
reproductively isolated from the individuals at Denisova cave in Siberia
and from each other. Yet both groups bred with modern humans,
contributing around 4% of the genomes of Papuans, including over 400
gene variants enriched for traits involving immunity and diet. Some of
this introgression is restricted to modern New Guinea and its
surrounding islands and may have occurred as late as the very end of the
Pleistocene, making the admixing D1 Denisovan population among the last
surviving archaic hominins in the world.
The
genetic diversity within the Denisovan clade is consistent with their
deep divergence and separation into at least three geographically
disparate branches, with one contributing an introgression signal in
Oceania and to a lesser extent across Asia (D2), another apparently
restricted to New Guinea and nearby islands (D1), and a third in East
Asia and Siberia (D0). This suggests that Denisovans were capable of
crossing major geographical barriers, including the persistent sea lanes
that separated Asia from Wallacea and New Guinea. They therefore
spanned an incredible diversity of environments, from temperate
continental steppes to tropical equatorial islands. The emerging picture
suggests that far from moving into sparsely inhabited country, modern
humans experienced repeated and persistent interactions as they expanded
out of Africa into this highly structured archaic landscape across
Eurasia. This genetic contact yielded a rich legacy, including hundreds
of gene variants that continue to contribute to the adaptive success of
anatomically modern humans today.
STAR★Methods
Key Resources Table
Contact for Reagent and Resource Sharing
Further
information and requests for resources and reagents should be directed
to and will be fulfilled by the Lead Contact, Murray Cox (m.p.cox@massey.ac.nz).
For
consistency of process, all requests to access the sequences presented
in this work are managed through the Data Access Committee of the
official data repository (accession EGAS00001003054) at the European
Genome-phenome Archive (EGA; https://www.ebi.ac.uk/ega/home).
Experimental Model and Subject Details
Human subjects
All
samples were obtained from adult human subjects. For full information
about the new and published samples used in this study, refer to Table S1 – Sample and combined dataset list.
Ethical approvals and dataset description
Here
we report a new genomic dataset, the Indonesian Genome Diversity
Project (IGDP), which includes 161 high coverage genomes from 14 islands
and a wide variety of ethnic groups from Island Southeast Asia (ISEA),
spanning the Indonesian archipelago and stretching into Island Melanesia
(Table S1).
The samples used in this study were collected by J. Stephen Lansing,
Herawati Sudoyo and an Indonesian team from the Eijkman Institute for
Molecular Biology, with the assistance of Indonesian Public Health
clinic staff, and also by Mark Stoneking (see
) and Jonathan Friedlander (see
),
in collaboration with the Institute for Medical Research of Papua New
Guinea. A subset of the samples have also been incorporated into the
Genome Asia 100K Project (http://www.genomeasia100k.com).
All
collections followed protocols for the protection of human subjects
established by institutional review boards at the Eijkman Institute,
Nanyang Technological University and the University of Arizona. All
individuals gave their informed consent for participation in the study.
The study as a whole, including the generation of whole genome
sequencing data for the samples, was approved by the institutional
review board at the Eijkman Institute (EIREC#90). Full ethical approval
and oversight of consenting processes was also granted by the Nanyang
Technological University institutional review board (IRB-2014-12-011).
Permission to conduct research in Indonesia was granted by the State Ministry of Research and Technology (RISTEK).
Method Details
A schematic overview of the analytical pipeline presented here is shown in Figure S1A (STAR Methods S1–5) and Figure S1B (STAR Methods S6–13). Datasets used are shaded in green; analyses and inferences in yellow; and key steps are outlined in bold.
S1 - Sequencing and SNP calling
Sequencing
libraries were prepared using TruSeq DNA PCR-Free and TruSeq Nano DNA
HT kits depending on DNA quantity. 150 bp paired-end sequencing was
performed on the Illumina HiSeq X sequencer.
Individuals were sequenced to expected mean depth of 30x, with an achieved median depth of raw reads across samples of 43x.
These newly generated whole genome sequences were combined with the following published genomes (raw reads):
- a)292 genomes from the Simons Genome Diversity Project (SGDP) ( )
- b)25 Papuan genomes from the study
SNP
calling was performed on the combined dataset, with published genomes
analyzed from raw reads exactly as for the new sequence data.
Trimmomatic v. 0.38 (
)
was used to cut adapters and low-quality sequences from the reads.
After trimming, the vast majority of reads were longer than 145 bp;
those below 60 bp were excluded. We aligned the reads to the ‘decoy’
version of the GRCh37 human reference sequence (hs37d5) using BWA MEM (
). We removed duplicate reads with picard-tools v. 2.12.0 (http://broadinstitute.github.io/picard) and performed local realignment around indels with GATK v. 3.5 (
).
After
alignment, and keeping only properly paired reads that mapped to the
same chromosome, the sequencing depth across the samples used in
downstream analyses was as follows: min = 18x, Q1 = 35x, median = 38x,
Q3 = 43x, max = 48x. Only three samples had median coverage rates below
30x: CBL34, RAM005 and RAM067.
Base
calling was undertaken with GATK v. 3.5 following GATK best practices.
Per-sample gVCF files were generated using GATK HaplotypeCaller (using
only reads with mapping quality ≥ 20). Single sample gVCFs were combined
into multisample files using CombineGVCFs, and joint genotyping was
performed using GATK GenotypeGVCFs, outputting all sites to a
multisample VCF. Exactly the same base calling steps were
applied to new and published samples, and the joint genotyping included
all samples in this study.
Using BCFtools v. 1.4 (
),
the following filters were applied to each genotype call: base depth
(DP) ≥ 8x and ≤ 400x, and genotype quality (GQ) ≥ 30. We then kept only
biallelic SNPs and invariable reference sites. For the majority of our
analyses, we kept only sites that had high quality variant calls in at
least 99% of samples. (Specifically, all analyses in STAR Methods
S5-S9 and S11, and all analyses in S10, apart from two result
robustness checks that assessed phasing and archaic haplotype
topologies. Additionally, we did not apply the call rate filter in the
motif-counting analysis in STAR Methods
S12). Applying this 99% call-rate filter yielded a total of 36,462,963
SNPs in the combined dataset. We removed sites within segmental
duplications, repeats and low complexity regions. These masks were
downloaded from the UCSC and Broad Institute genome resources:
In
the filtered and masked VCF files, we examined several statistics
across the samples: the percentages of no-calls and singletons; the
average depth; transition/transversion ratio; the number of variants;
and heterozygosity. One highly heterozygous sample from the SGDP
(LP6005441-DNA_A09, Naxi-2) was excluded based on these metrics, as well
as on the basis that the original authors determined that this sample
had been contaminated (
).
S2 - Kinship and outlier analysis
We performed sample kinship analysis using KING v. 2.1 (
).
Of the 161 new genomes, 6 were excluded due to the presence of a
first-degree relative in the dataset, leaving a total of 155 genomes for
downstream analysis. This relatedness is a consequence of the
village-scale sampling strategy employed in this study. In addition, 7
sample pairs display second-degree relatedness (BNA05 / BNA12-F, BNA21-F
/ BNA26-F, CBL018 / CBL019, RAM045-F / RAM067, RAM022 / RAM039-F,
NIAS08 / NIAS12, NIAS01 / NIAS10). These samples were kept for further
analyses, and the final dataset comprised 471 genomes: 155 newly
generated complete genomes, 25 genomes from the
study and 291 genomes from SGDP.
Principal
component analysis (PC) was used to detect sample outliers and
characterize regional diversity. First, we applied LD pruning of our SNP
set using PLINK v. 1.9 (
). Pruning was performed in 1 Kb sliding windows with a step size of 100 bp, and SNPs with R2 > 0.1 were removed. Next, PCA was performed in EIGENSOFT v. 7.2.0 (
,
) without the outlier removal step. The results of a PCA without African samples (N = 429) is shown in Main Text Figure 1B.
S3 - Adding archaic data and ancestral information
The combined primary dataset of 471 modern human genomes was merged with two high-coverage archaic hominin genomes:
- a)Denisovan ( ). Downloaded from
- b)Neanderthal ( ). Downloaded from http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/
Positions
with missing or low-quality calls (marked as ‘LowQual’ in the original
VCF files) in one of the archaic samples were excluded during the
merging procedure. In addition, heterozygous archaic SNPs were also
removed to improve phasing quality (but see later analyses). Both types
of SNPs were masked for both archaic and modern individuals. These
additional filters resulted in a dataset comprising 35,395,615 SNPs.
A
second merged dataset was produced by excluding missing and low-quality
archaic calls but retaining SNPs that were heterozygous in archaic
individuals. This dataset was used to cross validate our results and
assess potential bias (if any) introduced by trimming archaic
heterozygous positions.
A third
dataset was produced by merging the primary dataset containing 471
modern and 2 archaic genomes without heterozygous archaic SNPs with the
Vindija Neanderthal genome data (Vindija33.19, downloaded from http://cdna.eva.mpg.de/neandertal/Vindija/VCF/Vindija33.19/).
As with the main dataset, positions that were missing, low-quality or
heterozygous in the Vindija individual were masked for all archaic and
modern individuals.
We added
ancestral allele information to our dataset using the Ensembl Compara 71
database (ensembl_compara_71@ens-staging2:3306, MethodLinkSpeciesSet: 6
primates EPO (548)).
S4 - Phasing and phasing assessment
The combined dataset comprising both modern and archaic samples was phased using read aware phasing with SHAPEIT v. 2.r837 (
).
Phase informative reads (PIRs) were extracted using software guidelines
from both modern and archaic BAM files. The HapMap phase II b37
recombination map (
) was used and the phasing was performed using the following arguments: –states 400 –window 0.5 –states-random 200.
We
assessed the performance of our phasing approach, which incorporated
phase informative reads, by comparing our phased haplotypes to two
experimentally phased individuals (
)
that overlap with the Simons Genome Diversity Project genomes included
in our dataset – WON,M (corresponding to sample B_Australian-4 in the
Simons dataset, Illumina ID SS6004477) and BUR,E (corresponding to
sample B_Australian-3 in the Simons dataset, Illumina ID SS6004478).
These two samples come from Aboriginal Australians with unknown
geographic origin, but containing no genetic signals of recent admixture
with Papuan or European populations.
For each of the two individuals, we applied the following procedure:
- •Divide the genome into 8 Kb windows that are non-overlapping and at least 50 bp from any chunk masked by the alignability mask
- •Retrieve the corresponding SNPs from the experimentally phased FASTA files and our computationally phased VCFs. The BWA algorithm (v0.7.16a) ( ) was used to confirm that the windows aligned correctly.
- •Mask sites that disagree or are missing in the 8 Kb chunks (due to different SNP calling procedures)
- •Discard chunks that contain ≤ 1 heterozygote site.
- •Count the number of heterozygous sites and the number of switch errors in the remaining chunks, and divide the totals to obtain a switch error rate.
This
approach accounts for the impact of solitary heterozygote sites and the
alignability mask, which can hide switch errors (if an even number of
switch errors occur within a masked region).
Based
on 11164 8 Kb windows for WON,M and 11245 windows for BUR,E, we
calculated switch error rates of 3.7% and 4% respectively. These are
expected rates (see supplementary text S9 in the
study). Note that singletons are randomly placed in our phasing
procedure, unless they are part of a phase informative read, and will
also act to disrupt haplotypes.
Quantification and Statistical Analysis
S5 - Dataset genetic diversity and SNP novelty
As an exploratory analysis of our phased dataset, we retrieved lists of SNPs in the 1000 Genomes Project (Phase 3) (
), dbSNP (Build 150), ESP (ESP6500SI-V2-SSA137) (
) and ExAc (ExAcRelease 1.0) (
).
We counted the number of SNPs defined when considering each population
group separately, and determined how many were observed in one or more
of these datasets, and how many were not observed. Across all of our
newly reported samples, we observed 11,859,578 SNPs, of which 21%
(2,525,213) were novel. Existing datasets are better able to capture
variation in West ISEA (9.3% SNPs novel) than East ISEA (14.3%) or Papua
(21.1%), likely because previously published datasets incorporate a
large number of mainland Asian samples.
S6 - Estimating Asian-Papuan admixture proportions
To estimate Asian-Papuan admixture proportions in ISEA and Papua, we used local ancestry inference implemented in LOTER (
). LOTER has been shown to outperform similar methods, such as HAPMIX (
) and RFMix (
), and its accuracy is greatest when admixture is more ancient than 150 generations. This time frame (ca
4500 ya) is directly relevant for Indonesian prehistory – linguistic,
archaeological and genetic evidence all point to the spread of
Austronesian speakers beginning 4000–4500 ya from Taiwan, reaching
eastern Indonesia 3500–3000 ya (
).
We specified two reference datasets: Papuans from the current study (N = 72; see Table S1)
and East Asians (N = 293). The East Asian reference dataset included
293 geographically diverse samples, specifically 43 samples from the
Simons Genome Diversity Project, and 50 random samples from each of the
five East Asian populations in the 1000 Genomes Project (CDX – Chinese
Dai, CHS – Southern Chinese Han, CHB – Northern Chinese Han, KHV –
Vietnamese Kihn and JPT – Japan).
Samples
from ISEA were analyzed separately using the Papuan and East Asian
reference datasets. To infer local ancestry tracts in Papuan samples, we
created an individual Papuan reference dataset for every single target
genome by excluding the individual sample from the full Papuan reference
set to avoid self-copying. Results are reported in Main Text Figure 1A and Table S2.
S7 - Archaic block identification
a. ChromoPainter
We used ChromoPainter v.2 (CP) (
)
to detect archaic ancestry in the genomes of our present-day
individuals. This method relies on phased haplotype data and describes
each individual recipient chromosome as a mixture of genetic blocks from
the set of predefined donor individuals. This process, known as
chromosome ‘painting’, generates a matrix of copying vectors, which can
be analyzed further. For each recipient haplotype, CP also outputs the
expected probability of copying from each donor population at each SNP.
CP uses an Expectation-Maximization (EM) algorithm to re-estimate the
proportion of genetic material copied from each donor by using the
previous estimates as a new prior under the model, and then iterating.
We
painted each of our modern non-African human genomes individually using
a set of 35 sub-Saharan African genomes from SGDP, which represents
modern non-Eurasian human ancestry, and each of the archaic samples
separately. We used a two-step approach:
- 1.The initial run was performed with 10 EM steps to estimate prior copying probabilities for each individual and chromosome separately using the following command line arguments: -i 10 -in -iM.
- 2.Next, estimated prior copying probabilities were averaged across the genome for each individual. The main run was performed with a recombination scaling constant, global mutation (emission) probability and genome-wide average prior copying probability from the first step using the following command line arguments: -n Ne -M mu -p prior_copying_probability.
Either
archaic or modern ancestry was then assigned to individual SNPs using a
probability threshold of 0.85. Unknown ancestry was assigned to SNPs
with intermediate copying probability.
b. Hidden Markov Model
To explore the behavior of the CP approach, we additionally implemented a hidden Markov model (HMM, https://github.com/guysjacobs/archHMM). Our approach is inspired by that of Racimo and colleagues (
,
), and we follow their notation where possible.
The
problem is to partition the genome into blocks of DNA that have
introgressed from an archaic hominin and non-introgressed DNA. To do
this, we consider a test haplotype vector of SNPs of length k, with each SNP indicated by either a 0 (if ancestral) or 1 (if derived). The SNPs have an associated genetic map vector where if . We assume that each SNP has one of two hidden states (human, 0, or archaic, 1), which form the ancestry vector . We also define an observation vector with
- 1.if and and
- 2.otherwise
where denotes the frequency of the derived state in a panel of comparative African populations and is
the frequency of the derived state in a panel of archaic hominins. In
our case, the panel of archaic hominins is the single Altai Denisovan
individual and our condition simply requires that the derived state is observed, whether in the homozygous or heterozygous state. We apply a non-0 of
such that the observed state 1 may still be assigned if a single
African individual is heterozygous at a site. This guards against
sequencing errors given the moderately large panel of sub-Saharan
Africans ,
and importantly, also provides some robustness against low levels of
recent back migration into Africa from populations with archaic
introgression.
We only consider the
subset of SNPs that have known ancestral state and are not missing in
the test haplotype or archaic panel. We also chose to exclude SNPs that
are variable only in the African panel as these are prone to bias the
density of observations depending on local African diversity patterns.
Local African diversity may be impacted by complex molecular (e.g.,
mutation rate variation) and evolutionary processes (e.g., purifying
selection).
We want to identify blocks of the test haplotype that have elevated observed state
corresponding to derived alleles that are shared with the archaic
hominin and are rare in Africa. To do this, we define a two-state,
time-homogeneous HMM, with four parameters – the emission probability of
the two observed states for each hidden state: and , which also define and ; and asymmetric transition rates between the two hidden states and , which define and in a similar manner. We seek to estimate these parameters and retrieve a most likely ancestry vector z given them, and use a Viterbi algorithm and expectation maximization procedure to do so (see also
).
We first assume transition probabilities that are linear with genetic distance with a maximum cut-off,
and
where and k1 and k2 are rates to be fitted. The maximum cut-off is to ensure realistic behavior when is large, although SNPs are rarely sufficiently separated for this to be required. Note that we do not constrain k2 to be a function of k1
as in the Racimo approach, which can be used to include an assumption
of a simple introgression model. Our intention is to identify the most
plausible archaic blocks, but we purposely make no assumptions about the
introgression history where possible.
Our block estimation and parameter estimation algorithm is as follows:
- 1.Initialize parameters to The probability of must also be initialised and is set to 0.01.
- 2.Apply the Viterbi algorithm to estimate z, given y and the model parameters at iteration t.
- 3.Re-estimate the model parameters based on the estimated ancestry vector z. , , and .
- 4.Repeat steps 2 and 3 until either the ancestry vectors at t and t – 1 are identical or t = 50, and output the ancestry vector and parameter estimates.
Note that parameter is
fixed during the estimation procedure. This is to ensure that the model
is inferring archaic blocks with similar properties when the model is
run on different chromosomes and different individuals, while still
offering flexibility in the timing of introgression (by re-estimating
transition parameters) and human diversity patterns (by re-estimating ). We fixed parameter by running a free estimation, where is re-estimated in a manner analogous to ,
for a set of 10 Papuans, who have a well-established history of
Denisovan introgression, and 10 West Eurasians, who do not. We found
that setting was successful in detecting Denisovan blocks in Papuans, while limiting the false positive rate in West Eurasians.
Our
approach differs from the Racimo method in its parameter estimation
procedure and the transition probabilities. It also purposely ignores
African-specific diversity. However, the principle of applying an HMM to
detect tracts enriched for non-African derived alleles shared with an
archaic hominin has been widely applied in several slight variations
(e.g.,
,
,
,
and others) and as such is a standard approach in archaic introgression block estimation.
c. S∗
We used the S∗ method (
,
,
),
which seeks to detect introgressed haplotypes from archaic hominins
without using an archaic reference genome, following the implementation
of
. S∗
is sensitive toward highly diverged sequences with high LD, and thus is
adequate to detect hominin introgressed haplotypes without having a
reference sequence of the hominin in question.
To calculate empirical S∗
in our non-African samples, we used 35 sub-Saharan Africans as a
reference population and analyzed one non-African sample at a time. We
removed any non-segregating sites from our data. We used the HapMap
phase II b37 recombination map (
), and ancestral information from the Ensembl Compara 6 primates EPO (548) database. Empirical S∗ values were estimated in 50 Kb genomic windows.
After calculating the S∗
value from the real dataset, we used simulations to assign statistical
significance to our empirical estimates. We performed simulations using
different combinations of the number of segregating sites and
recombination rates from the previous step. We simulated a previously
described demographic model (
) with the simulator ms (
). For non-West Eurasians samples, we performed simulations with the East Asian demographic model as presented in the
study. We obtained a neutral distribution of S∗
without hominin introgression for 50 Kb regions using 60000 simulations
for every combination of the number of segregating sites and
recombination rate. The following ms command line was used:
- ms
60000 -I 3 -s -r 38.7∗ 50001 -n 1 2.20 -n 2 4.47 -n 3 6.53 -g 2 101.69 -g 3 146.31 -m 1 2 1.49 -m 2 1 1.49 -m 1 3 0.46 -m 3 1 0.46 -m 2 3 1.85 -m 2 3 1.85 -ej 0.029 3 2 -en 0.029 2 0.29 -em 0.029 1 2 8.93 -em 0.029 2 1 8.93 -ej 0.087 2 1 -en 0.23 1 1 -p 5
We then used the computationally efficient strategy employed to estimate the S∗ value for every region as in
.
Finally, we assigned a p value for each 50 Kb genomic window in every
individual sample in our non-African dataset using the neutral
distribution of S∗ from the simulations without archaic
introgression. Genomic windows with a recombination rate less than 0.005
or more than 20.01 and/or a number of segregating sites less than 20 or
more than 511 were excluded. To assess the overall fit of the neutral
demographic simulations to the data, we built General Additive Models
(GAM) of the 95th and 99th percentiles of simulated neutral S∗ as a function of the number of segregating sites and recombination rate.
We established two thresholds of significance:
- 1.To estimate the overlap between CP, HMM and S∗, and define high confidence Denisovan blocks, we identified regions as introgressed if the S∗ in the real data is more than the 95th percentile of the simulated data distribution.
- 2.In addition, for the residual S∗ analysis, we identified a region as introgressed if the S∗ in the real data is more than the 99th percentile of the simulated data distribution. This yields a more conservative, high confidence S∗ set, which allows for more robust inferences of residual S∗ signal (see below).
S8 - Introgression results from the three methods
Details of the amount of introgression based on the different methods, and profiling of the methods, are shown in Table S3.
The total amount of introgression detected by the three methods, per phased haploid genome, is shown in Table S3A.
There is a clear correlation between the methods, with most archaic
introgression detected by all three methods in Papua and East ISEA, and
with least in West Eurasia. The total amount of archaic introgression
detected by CP was slightly less than that detected by the HMM, which
was in turn considerably less than that detected by S∗ with
95% confidence. We note that different methods are expected to extract
different amounts of introgression signal (see e.g.,
Table 1 for a recent comparison). The relative enrichment of Denisovan
signal in Papua over West Eurasia was greater for CP (6.4-fold) than the
HMM (5.0-fold), based on all pairwise Papuan/West Eurasian comparisons
(paired t test, T = 280.5, p ≈ 0.0; Cohen’s D = 1.49). As Denisovan
introgression is not expected in West Eurasia, the clear excess in
enrichment seen in CP over the widely applied HMM approach strongly
supports CP – when used in the manner described above – as an effective
method for detecting archaic hominin introgression. Nevertheless, as Table S3A
shows, a substantial amount of Denisovan introgression was detected by
both methods in West Eurasia, suggesting that the three methods are
individually detecting a number of false positive introgressed blocks.
We
hypothesize that this phenomenon is driven by ‘spillover’ signal from
Neanderthal introgression. Indeed, incomplete lineage sorting could lead
Neanderthal-introgressed blocks to have greater genetic similarity to
the Altai Denisovan than Altai Neanderthal; and even blocks that
coalesce with the Altai Neanderthal before the Denisovan/Neanderthal
common ancestor will often show greater similarity to the Altai
Denisovan than humans due to Denisovan/Neanderthal common ancestry. As
an initial check on this hypothesis, we categorized the genome into
regions that only had evidence for Denisovan or Neanderthal
introgression (from one or both of the HMM and CP methods); regions that
had conflicting evidence for both Denisovan and Neanderthal
introgression from the two methods; and regions with an unknown signal
arising from S∗ (> 95% confidence), thus identifying a region as introgressed but with no support from the HMM and CP methods (Table S3B).
As predicted, discounting ambiguous signal, there is now 12.5 times as
much Denisovan introgression in Papua compared to West Eurasia.
We
used these observations – that higher confidence Denisovan blocks are
supported by multiple methods, and that unexpected signal can be driven
by spillover from Neanderthal introgressed blocks – to refine our
Denisovan block set (STAR Methods S9 below).
S9 - Refining archaic block sets
a. Iterated filtering improves specificity
Both
CP and the HMM were run using the Denisovan and Neanderthal genomes
independently. However, Neanderthals and Denisovans are believed to
share common ancestry more recently than the human/Neanderthal/Denisovan
common ancestor (e.g., 495 versus 657 kya in
),
increasing the probability of introgressing archaic blocks showing
greater similarity to both Neanderthals and Denisovans than modern
humans. Table S3B
profiles the degree to which the archaic portion of genomes overlaps in
different continental groups. Importantly, we note that while only very
little of the West Eurasian genome is assessed as unambiguously
Denisovan (that is, identified by at least one of the CP or HMM methods
as Denisovan, but as Neanderthal by neither), a considerable amount is
ambiguous – that is, identified by different methods as both Neanderthal
and Denisovan.
This suggests that
looking at the overlap of methods, and actively removing ambiguous
blocks, may be a promising way to obtain a high confidence set of
Denisovan blocks. A logical approach to this is to iteratively discard
Denisovan CP blocks that are either i) not supported by the other
methods (Denisovan HMM and S∗) or ii) also identified as
Neanderthal by CP. We can do this by requiring more than a minimum
overlap between each Denisovan CP block and the supporting methods, or
less than a maximum overlap with a Neanderthal CP block. However, it is
not clear how much overlap is appropriate to ensure that ambiguous or
weakly supported Denisovan introgression blocks are removed, but that
sufficient signal remains for further analysis.
We
therefore sought parameters for an incremental filtering procedure that
maximizes the excess Denisovan signal in Papuans against West
Eurasians, the two samples expected to have highest and lowest Denisovan
introgression, respectively (Main Text Figures 2A–2C). For each Denisovan CP block on a chromosome copy of an individual, the procedure progresses in three steps:
- 1.Discard the block if the proportion of its length that is overlapped by Denisovan HMM blocks on that chromosome copy of the individual is under .
- 2.If the block survived (1), discard it if the proportion of its length overlapped by S∗ windows for the individual is under .
- 3.If the block survived (1) and (2), discard it if the proportion of its length overlapped by Neanderthal CP blocks on the chromosome copy of the individual is over .
Here,
indicates a requirement that the entire CP Denisovan block is entirely
covered by a single HMM Denisovan block, or by the union of multiple HMM
Denisovan blocks, while
indicates that only 0.1% of the block needs to be covered. We
specifically chose to use CP Neanderthal blocks to exclude ambiguous
signal, rather than HMM Neanderthal blocks, in order to additionally
exclude any regions that might be readily, and perhaps falsely,
identified as non-human specifically by the CP method; for instance, due
to local patterns of African variation. We used the bedtools suite
v.2.27.0 (
) with the subtract command and -N flag and -f flags to perform this procedure.
We
sought to retrieve high confidence Denisovan blocks while still
retaining enough Denisovan signal for further analysis. Noting that the
total genome inferred as Denisovan in Papua relative to West Eurasia
(Main Text Figure 2C) increases greatly even with minimal overlap requirements (, and ), we chose these to define our high confidence Denisovan block set. Thus, the refinement procedure is:
- 1.Starting with the Denisovan CP output, remove any block that is less than 0.1% overlapped by Denisovan HMM output. For example, for a 100 Kb Denisovan block identified by CP to be kept, we require at least 100 bp to be covered by Denisovan HMM output in the same chromosome copy of the individual as well.
- 2.Of the remaining blocks, remove any block that is less than 0.1% overlapped by S∗ output.
- 3.Of the remaining blocks, remove any block that is over 99.9% overlapped by Neanderthal CP output.
After
applying these filters, we are left with approximately 32.3 Mb of
high-confidence Denisovan introgressed blocks per genome copy for a
Papuan individual, from our original set of 59.2 Mb. For a West
Eurasian, we are left with just 688 Kb, down from 9.5 Mb. There is now
nearly 50 times as much Denisovan signal in Papuans, a profound
enrichment from the 6 times in the original method output. Some
genuinely Denisovan-like blocks may be found in West Eurasians due to a)
introgressing blocks from a Neanderthal population that randomly
coalesce with the sampled Denisovan before the sampled Neanderthal, b)
incomplete lineage sorting dating to the common ancestor of modern
humans and Denisovans, and c) limited migration between populations with
known Denisovan ancestry (e.g., East and South Asians) and West
Eurasians since introgression occurred.
b. Filtering supported by SGDP introgression values
The
substantial increase in the enrichment of Denisovan signal in Papuans
over West Eurasians following our iterated filtering approach is a
strong indication that we are successful in removing spillover signal
from Neanderthal introgression. This increased accuracy comes at the
cost of decreased power to detect true Denisovan introgression. While
the filtering method reduces the Denisovan signal in West Eurasia from
9.5 to 0.7 Mb (93%), the signal halves in Papua from 59.2 to 32.3 Mb,
which is a greater signal depletion in absolute terms despite broadly
similar proposed levels of Neanderthal introgression in both regions (
).
To
profile this behavior in real data, we turn to genome-wide estimates of
Denisovan and Neanderthal introgression reported for the SGDP samples
in our dataset, calculated using counting statistics (
).
Genome-wide estimates of introgression have a benefit over haplotype
inference in that they measure the average signal of introgression over
the entire genome rather than pulling out small chunks; as a result,
they are better able to estimate the overall level of introgression, and
necessarily yield genome-wide proportions that are greater than the sum
of excavated archaic sequence haplotypes. The SGDP samples represent
global populations, including groups thought to harbor just Neanderthal
introgression (West Eurasian), but also Asian and Oceanian groups with
both Neanderthal and Denisovan ancestry, and so are ideally suited for a
comparison of methods.
We performed multiple linear regression using the Python statsmodels v.0.8.0 package (
) on the SGDP genome-wide estimates of Denisovan (SGDPDeni) and Neanderthal (SGDPNean) introgression (Table S1 in
)
as independent variables, and either the total Denisovan sequence
identified for each sample by CP alone or our high confidence Denisovan
introgressed sequence as the dependent variable.
In
this analysis, our introgression estimates are expressed as a
proportion of introgressed sequence per individual genome, such that the
units are the same in the two datasets. Given the high correlation
coefficients, the factors in Table S3C
provide an indication of the amount of spillover signal (significant
correlations of our Denisovan introgression estimates with SGDPNean, or of our Neanderthal estimates with SGDPDeni)
and the relative power of our methods to detect overall levels of
introgression. We assume that the published introgression estimates from
the SGDP data are accurate, which is supported by the strong
consistency of their results with other published findings (e.g., the
Denisovan introgression peak in Papua; more Neanderthal introgression in
East Asians than West Eurasians as in
).
We find evidence of considerable ‘spillover’ signal in the raw CP results. On a worldwide scale, SGDPNean significantly predicts CP Denisovan signal (factor 0.15), while SGDPDeni predicts CP Neanderthal signal (factor 0.19). In West Eurasia, where SGDPDeni is minimal (highest 0.2%, in populations with some Asian ancestry like the Saami), levels of SGDPNean alone are sufficient to predict our CP Denisovan signal with high accuracy (factor 0.21, r2 = 0.99); the spillover effect accounts for the entire signal.
Studying
the Denisovan CP signal in Papuans only – a population with high levels
of both Denisovan and Neanderthal ancestry – is especially informative
about the amount of spillover signal, with a SGDPNean factor of 0.28 only marginally smaller than the SGDPDeni
factor of 0.36. Note, however, that the substantially greater amount of
Denisovan than Neanderthal introgression in Papuans (4%–6% compared to
∼2%) means that the CP Denisovan signal is still largely composed of
Denisovan introgression. The high spillover rate in the CP Denisovan
signal and generally lower power to detect Denisovan than Neanderthal
introgression reflect the more distant relationship between the Alai
Denisovan and the introgressing Denisovan population, compared to the
Altai Neanderthal and introgressing Neanderthal population.
Spillover largely disappears when using the high confidence blocks. Worldwide, while SGDPNean and SGDPDeni
remain significant predictors of our high confidence Denisovan and
Neanderthal introgression signals, their factors drop from 0.15 to −0.01
and 0.19 to 0.04, respectively. There is minimal spillover when
estimating our high confidence Denisovan signal in West Eurasia (SGDPNean factor falls from 0.21 to 0.03), and both SGDPNean and SGDPDeni are now only significant predictors of the high confidence signal for their corresponding archaic species.
Our
high confidence intersection method thus substantially reduces the
false positive rate as reflected in the spillover signal when estimating
genome-wide levels of introgression. There is also a substantial
decrease in relative power – at a worldwide scale, from about 41% to 28%
when searching for Denisovan introgression, and 56% to 39% when
searching for Neanderthal introgression. The low overall power of
methods that detect introgression blocks compared to genome-wide
statistics is expected, as is the higher power to detect Neanderthal
introgression due to the more recent ancestry between the Altai and
introgressing Neanderthal than the Altai and introgressing Denisovan.
The reduction in power when using the more conservative high-confidence
signal is also expected – it reflects the cost of higher specificity. We
note that in cases where Denisovan introgression is essentially absent
(West Eurasia), a reduction in relative power is still observed for our
high confidence Neanderthal signal due to spillover of Neanderthal
introgression signal into the CP Denisovan block set. Thus, when there
is strong evidence of only one introgression source, it may be better to
avoid stringent block filtering.
c. Distribution of high confidence introgression
Table S3D
shows the amount of remaining Denisovan signal after removing
Neanderthal spillover per haploid phased genome. The results are
particularly interesting because our trimming method leads to
substantial depletion of the signal in Asian populations, as well as in
West Eurasia. Previous work has suggested approximately 0.5% Denisovan
introgression into South and East Asians, as compared to a 4%–6%
contribution to Papuans (
,
,
).
The relatively small amount of Denisovan signal in non-Papuans strongly
implies either that previous work has overestimated the Denisovan
contribution outside of Papua, likely due to ambiguous Neanderthal
spillover, or that the amount of Denisovan introgression into Papua is
higher than previously thought. If the former is the case, a rough
calculation based on a linear additional contribution from the
introgression percentage, and assuming West Eurasia has 0% introgression
and Papua 4%–6%, implies continent average Denisovan introgression
levels of 0.17%–0.33% in mainland Asia.
Performing
a similar trimming procedure using Neanderthal, rather than Denisovan,
blocks similarly yields estimates of Neanderthal introgression levels (Table S3E).
We are also able to clearly detect the known excess Neanderthal signal
in East Asians over West Eurasians, with East Asians, Siberians,
Americans and Southeast Asians having about 45% more Neanderthal
introgressed sequence than West Eurasians. This is highly consistent
with previous estimates of a 40% increase (
), as is the placement of South Asians as intermediate between the groups.
The
standard deviation of the amount of archaic introgression within
continental groups reflects variation in introgression levels both
within and between continental subpopulations, which in turn reflect
patterns of introgression into ancestral populations and more recent
demography. We note that the variation in Neanderthal ancestry is
greatest, and that the coefficient of variation is greatest, among West
Eurasians compared to other groups. There is a higher coefficient of
variation for Denisovan ancestry among East Asians and Siberians
compared to Papuans. The highest coefficient of variation is in West
ISEA, where it is driven by the inclusion of the Sulawesi population,
which has relatively higher Papuan ancestry (13.4%, Table S2) and greater high-confidence Denisovan introgression (∼4.2 Mb) than more westerly groups.
d. Denisovan signal follows Papuan ancestry
We
assessed the correlation between Papuan ancestry and Denisovan
introgression in southeast Asia using the LOTER output and various
measures of archaic introgression. Models were fit using Ordinary
Least-squares and the Python package statsmodels v.0.8.0. We fitted the
gradient of a linear model with point [1,1] fixed to ensure that a 100%
Papuan individual has 100% the Denisovan introgression observed in
Papua. We fitted both the raw CP Denisovan output (Main Text Figure 2D, left pane) and the high-confidence Denisovan blocks (Main Text Figure 2D, right pane). In each case, the correlation was strong (r2 = 0.98) and the linear fitting highly significant (p < 1 × 10−30).
The linear fitting of the raw CP Denisovan blocks against Papuan
ancestry had a gradient of 0.8 [0.79-0.81] implying an intercept of 0.20
[0.19-0.21]. A gradient close to 1, 1.01 [1.0-1.03], and a
correspondingly low intercept of −0.01 [0.03-0.0] is observed for the
high-confidence Denisovan blocks, consistent with the signal of
Denisovan introgression being primarily related to Papuan ancestry in
ISEA. Note that LOTER tends to predict a small proportion (< 5%) of
Papuan ancestry in West ISEA, which matches the limited Denisovan
introgression in the region (Table S2).
This may imply that a considerable portion of Denisovan signal in East
Asia more broadly is due to very low levels of Papuan ancestry, despite
some additional introgression known to be specific to the region (see
and our Main Text Figure 3).
Comparing the linear regressions in Main Text Figure 2D,
the predicted amount of Denisovan ancestry for a fully Asian population
falls when using the high-confidence blocks in which the Neanderthal
spillover signal has been removed. This emphasizes that Neanderthal
introgression needs to be carefully controlled for when asking questions
about Denisovan-specific ancestry. The village of Rampasasa, near the
Liang Bua cave site at which Homo floresiensis (
)
remains were found, and the island of Flores on which the site is
located, do not emerge as regional anomalies; the Denisovan signal
observed is closely predicted by their level of Papuan ancestry. This
suggests that any introgression specific to this part of East ISEA would
not be contributed by an archaic hominin on the Denisovan clade; we
study the possibility of other local introgression signals from
alternative hominin species below (e.g., S∗, STAR Methods S12 and S13, and Main Text Figure 7).
S10 - Archaic mismatch analysis
a. Mismatch against the Altai Denisovan genome
An
informative way of assessing the relationship of our high confidence
Denisovan blocks to the Altai Denisovan genome, which was used to
extract them, is by mismatch analysis. As longer blocks are better able
to resolve a mismatch distribution (explored in Figure S2
and below), we extracted the 2000 longest blocks from each continental
population. For each block with > 50 Kb of total unmasked sequence
data (counts in Table S3F),
we calculated the number of differences compared to the Denisovan
reference (with Denisovan and Neanderthal heterozygous positions masked;
see STAR Methods
S3). We then calculated the effective block length by subtracting the
portion of each block covered by the alignability mask from the total
block length, and converted these values into a mismatch
(difference/bp). As the number of differences per bp will be impacted by
our masking of low quality and heterozygous archaic sites, quality
control decisions (e.g., call rate > 99% filter) and the SNP calling
protocol, it is not possible to directly translate mismatch distance
into times based on a standard human genome mutation rate. We therefore
chose to scale the observed mismatches by dividing the mismatches of
each block by the average genome-wide mismatch rate between the Altai
Denisovan and West Eurasian samples, mD. We chose
West Eurasians as our baseline mismatch rate because that population has
the lowest amount of Denisovan introgression. Note that the average
genome-wide mismatch rate between Altai Denisovans and West Eurasians,
assuming 0% Denisovan ancestry in West Eurasians, reflects both the
divergence time of humans and the Neanderthal/Denisovan clade, and the
ancestral population size of the common ancestor of humans, Neanderthals
and Denisovans. We calculated the scale factor by summing the total
number of mismatches between each of the 75 samples × 2 = 150 West
Eurasian haploid genomes and the Altai Denisovan, and dividing this by
150 times the total length of unmasked sequence in the dataset.
Plotting these mismatches by continent yields the distribution pictured in Main Text Figure 3C.
The variation in the number of blocks > 50 Kb between continental
groups reflects varying sample size, as well as the total amount of
Denisovan introgression and the time since Denisovan introgression in
different groups.
Two primary patterns emerge. First, we replicate the results of
in identifying a mainland Asian-specific peak with relatively low
divergence to the Altai Denisovan. This peak is not limited to East
Asian populations (in whom it was originally detected), but also extends
into Siberia and the Americas. Interestingly, our American populations
show some reduction in the extent of this peak, potentially suggesting
that introgression was ongoing more recently than the divergence of
American and East Asian populations.
Second,
there is a clear dual mismatch peak in the Papuan population that is
not apparent in other groups, which instead show a single divergent
mismatch peak with some fluctuation in its exact placement. Given these
fluctuations, we sought to confirm that the twin Papuan peaks were not a
result of a very small number of common blocks being overrepresented
due to high frequency in the sample, which could emphasize stochasticity
in the coalescent and mutation processes. For example, in an extreme
case, if a set of 2000 blocks consists of just 14 blocks near fixation
then only a few independent coalescent histories might be represented in
the mismatch distribution.
Focusing
on the full set of 2000 largest Papuan blocks, we instead observed 226
entirely disjoint regions using the bedtools ‘merge’ function. Of these,
151 were > 0.5 Mb from any other region. Although the top ten most
introgressed regions contributed 515 blocks, this leaves 216 disjoint
regions with an average frequency of 4.8% in Papuans. The two highest
frequency blocks, at 94/144 and 57/144 in the 72 (diploid) Papuan
samples, lie outside both of the Papuan-specific mismatch peaks, the
former having an intermediate mismatch and the latter an unusually large
mismatch versus the Altai Denisovan.
To
assess the impact of using a different minimum block length cut-off, we
determined the mismatch using thresholds from 0 to 260 Kb (Main Text Figure 3A).
To ensure mismatch accuracy is not being impacted by the alignability
mask, we additionally required that blocks contained total called
sequence over the minimum cut-off. We did not calculate the mismatch
when less than 50 entirely non-overlapping blocks are involved in the
analysis. The twin peaks obtain resolution at approximately 130 Kb
(based on 3556 blocks). As expected for small block sizes, the overall
resolution is poor below 50 Kb. We also explored the role of the minimum
block length cut-off for our combined Siberia and East Asia sample
(Main Text Figure 3B) and East ISEA continental groups (Figure S2).
Again, the East Asian and Siberian specific peak only resolves when a
minimum block length cut off ≥ 50 Kb is applied. The two peaks observed
in Papua do not clearly resolve for East ISEA, due to the reduced sample
size and small Papuan ancestry leading to fewer observed blocks in this
region.
We hypothesize that the
challenge in resolving mismatch peaks using short blocks is related to
two factors. First, mutations are discrete. In our analysis, it might be
typical to observe 0 or 1 mismatching base pairs in a small 1 Kb block,
corresponding, respectively, to extremely low and extremely high
mismatch compared to the average mismatch of larger, ‘higher resolution’
blocks. Equivalently, it is not possible to observe a theoretically
expected mismatch of, for example, 0.0005 in a 1 Kb block as 1/1000 =
0.001 and 0/1000 = 0.0; a better approximation of the expected value is
possible with larger blocks. Second, the ratio of stochasticity in the
mutation process along the branches of a coalescent tree versus the
difference in coalescent times driving two mismatch peaks is also an
important variable.
To explore these
phenomena, we performed a simple simulation of expected mismatch for
two populations diverging from a source population at two fixed times, t1 and t2. The distribution of mismatches corresponds to a Poisson with mean where = 1.45 × 10−8 and is the mutation rate per base pair per generation, l is the block length and c is a coalescent time. The values of c are repeatedly drawn from the distribution of coalescence times – either an exponential distribution with location parameter t1 and mean t1 + 2Ne, or with location parameter t2 and mean t2 + 2Ne . In this way, we are able to incorporate random coalescence and mutation. Using the values of t1 = 9750, t2 = 12500 and Ne = 100 from our simulation fitting (see below) for the two Denisovan-like populations contributing to Papua, and t1 = 5000, t2 = 12500 and Ne =
100 for the two Denisovan-like populations contributing to East Asia,
we confirm that longer blocks are required to detect two signals of
divergence (Figure S2C).
This is particularly true when the divergence times are less widely
separated, as is the case for Papua. The value used here for the
divergence time of t1 = 5000 generations for the
Asian-specific Denisovan introgression signal is an approximation based
on the location of that mismatch peak and not based on explicit
modeling.
This phenomenon also
explains why we are able to detect the multiple peaks in East Asia and
Siberia, despite having considerably shorter blocks in these
populations. For example, the 2000 longest blocks in these populations
have an average length of just 43 Kb and 35 Kb respectively, yet we are
still able to detect two Denisovan mismatch peaks due to the probable
difference in population divergence driving these of 7500 generations
(5000 versus 12500). In contrast, the longest 2000 blocks have an
average length of 263 Kb in Papuans, and such long blocks appear
necessary to capture two populations diverging from the Altai Denisovan
in a narrower time frame, approximately 2750 generations apart (9750
versus 12500). Among East ISEA, the longest 2000 blocks have an average
length of 152 Kb; Figure S2C implies that this lack of resolution may be hiding the two peaks that we detect in Papua.
We
further assessed the hypothesis that the lack of a clearly bimodal East
ISEA mismatch distribution could be related to our power to resolve the
true mismatch distribution. To address this, we re-sampled Papuan
blocks to mimic the distribution of block lengths in East ISEA based on
rejection sampling (with replacement). We sampled 1000 sets of 2000
blocks, and generated mismatches against the Altai Denisovan. Plotting
these mismatch distributions against the observed Papuan and East ISEA
mismatch distributions (Figure S2B)
demonstrates that the signal of a dual mismatch peak would be expected
to be weak in our East ISEA sample based on the smaller number of long
blocks available.
b. Confirming the dual Denisovan mismatch signal
Before
proceeding to analyze possible causes of the dual Denisovan mismatch
signal in Papuans, we sought to confirm that it is observed in block
sets retrieved using a variety of methods. We first confirmed that the
signal is apparent both in the raw CP Denisovan and HMM Denisovan output
(Figures S3A
and S3B) to verify that our iterative trimming approach is not causing
the signal. The HMM tends to detect longer archaic blocks than CP, which
incorporates sequence with greater divergence from the Altai Denisovan,
consistent with its lower specificity (higher West Eurasian Denisovan
signal in Table S3A). The signal remains (Figure S3B) when performing a similar trimming procedure to that described above on our HMM block set (see STAR Methods
S9a), this time removing blocks that were i) less than 50% overlapped
by CP Denisovan blocks, ii) less than 0.1% overlapped by S∗
windows, or iii) over 99.9% covered by Neanderthal HMM output. These
criteria were chosen to obtain a high level of enrichment of Papuan
Denisovan signal over West Eurasian Denisovan signal.
We
next considered the possibility that our masking of heterozygous sites
in the archaic genomes to simplify phasing was causing the double peak.
Briefly, in a tree of a single introgressing haplotype X and two
haplotypes of the Altai Denisovan, Da and Db,
there are two possible coalescent topologies – either the first
coalescent event is between two Altai Denisovan haplotypes, (X,(Da,Db)), or between an Altai Denisovan haplotype and the introgressing haplotype, (Da,(Db,X)) or (Db,(Da,X)).
As more homozygote mismatches are expected in the former case, our
masking of sites that are heterozygous in one of the archaic genomes
could generate a complex mismatch signal. To confirm that the masking of
heterozygotes was not causing the multiple peaks, we re-phased the data
retaining loci with archaic heterozygotes, and performed the CP and
mismatch analysis on this data. This time, we trimmed the CP Denisovan
set by simply removing blocks that were more than 99.9% overlapped by CP
Neanderthal blocks. As before, the twin peaks are clearly visible (Figure S3A).
c. No dual mismatch signal in Neanderthal blocks
To
better understand the potential demographic implications of the dual
mismatch peaks observed in introgressed Denisovan blocks among Papuans,
we generated similar mismatch distributions based on our 2000 longest
high-confidence Neanderthal-specific introgressed blocks for each
continental group. These were generated using the same trimming protocol
described above (STAR Methods S9a), but starting with CP Neanderthal blocks and requiring overlap with both the Neanderthal HMM output and S∗, and only allowing limited overlap with CP Denisovan blocks (see Table S3E
for continental distributions of these blocks). As before, we only used
blocks with > 50 Kb of unmasked sequence data. A large number of
blocks remained for all continental groups (1978–1999 blocks), and the
average block length ranged from 168 Kb (America, with 27 samples) to
287 Kb (Papua, with 72 samples). As with the Denisovan introgressed
chunks, the average length of the 2000 longest blocks reflects both
sample size and introgression history; the higher levels of Neanderthal
introgression in the continental groups translates to longer chunks
being successfully extracted. The mismatch for each continental group is
shown in Figure S3D.
For ease of comparison, the curves are again scaled to the genome-wide
average mismatch between West Eurasians and the Altai Denisovan.
Interestingly,
despite slight fluctuations, a single dominant Neanderthal peak is
observed for all populations. This strongly suggests a) that any
demographic cause of the dual mismatch peak relates to events occurring
in the Denisovan population, rather than among Neanderthals or the
Denisovan/Neanderthal common ancestor, and b) that the dual peak is not
caused by a bioinformatic error or property of our methods (see further
discussion below) that might give rise to a bimodal mismatch
distribution against any archaic hominin.
We
confirmed the single mismatch peak using the HMM Neanderthal block set,
removing blocks that were i) less than 50% overlapped by CP Neanderthal
blocks, ii) less than 0.1% overlapped by S∗ windows, or iii)
over 99.9% covered by Denisovan HMM output. This confirmed the unimodal
mismatch distribution of Neanderthal introgressing blocks against the
Altai Neanderthal (Figure S3E).
As a final consistency check, we sought to make use of a second ancient Neanderthal genome, the Vindija Neanderthal (
).
This sequence is known to be more closely related to the Neanderthal
that introgressed into modern humans than the Altai Neanderthal, and
thus may be better suited for extracting introgressed Neanderthal blocks
(a 10%–20% increase is reported by
).
We used CP to extract introgressing blocks from Papuan and East ISEA
individuals, this time with either the Vindija Neanderthal genome or
both Neanderthal genomes as our Neanderthal group. The mismatch of these
blocks against the Altai Neanderthal again shows a unimodal
distribution (Figure S3F).
We additionally repeated our trimming procedure of the CP Denisovan
blocks to create high-confidence block sets, now with CP Vindija blocks
or blocks identified by CP as affiliated with either Vindija or the
Altai Neanderthal removed. Again, the bimodal mismatch distribution is
observed (Figure S3C).
If the more divergent peak were Neanderthal spillover signal, then we
would expect it to be reduced by removing more Neanderthal introgressed
sequence; such behavior is not apparent.
d. Assigning blocks to Denisovan ancestries
To
statistically confirm the bimodal distribution, we fitted a Gaussian
and Gaussian mixture to the mismatch distribution of long (> 180 Kb;
Main Text Figure 3A) Denisovan introgressed blocks identified in Papuans for 0.1 < MD < 0.23 using the Python package sklearn v. 0.19.1 (function sklearn.mixture.GaussianMixture;
). We used blocks > 180 Kb because i) large blocks are needed to resolve a complex mismatch distribution (Figure S2C); ii) the mismatch distribution is relatively stable with a minimum block length of 180 Kb or more (Figure 3A);
and iii) sufficient Papuan blocks (n = 1683) remain for downstream
analysis using this minimum block length. The bimodal distribution was
strongly supported (AIC = –5808.85, versus unimodal –5582.72). Note that
the negative AIC values occur due to the high probability density in
the range of mismatch values observed, with the probability density
function of the Gaussian mixture model concentrated over a mismatch
range less than 1; and that the difference between AIC scores rather
than their values are relevant here. The Gaussians are distributed
according to N(0.146, 0.018) and N(0.199, 0.015), and are weighted [58%, 42%] respectively, where N(μ,s) indicates a Normal distribution with mean μ and standard deviation s (Main Text Figure 4A).
We
additionally statistically confirmed that a bimodal mismatch
distribution was supported for unique chunks > 180kb in Papuans.
Here, we recorded the mismatch of sets of exactly overlapping chunks as
their average mismatch, to limit the impact of high frequency chunks on
the distribution. The bimodal distribution remains strongly supported
(AIC = −2158.53, versus unimodal −2076.80). This confirms that the
bimodal mismatch distribution is unlikely to be an artifact due to
selection (i.e., a small number of introgressed regions with high
frequency) or sampling effects.
We
used the Gaussian mixture model to assign blocks to either the
less-divergent Denisovan component (D1) or the more divergent Denisovan
component (D2), classifying as ambiguous any block with less than 80%
support for one model over the other (i.e., 0.2 < p(D1) < 0.8). We
were able to classify 1538 of 1683 blocks in this manner, 718 to D1 and
508 to D2; we did not classify blocks outside the 0.1 < mD <
0.23 bounds. These block sets were used in most subsequent analyses.
When studying the demographic implications of the bimodal mismatch
distribution, we studied either the full mismatch distribution (when
retrieving split dates of Denisovan populations), or the entire set of
high-confidence Denisovan blocks, classifying blocks with mD < 0.1 to D1 and blocks mD
> 0.23 to D2. This allows us to maximize the information in those
analyses. Spillover from D0 or Neanderthal – into D1 or D2
respectively – will be minimal as those introgression sources are
extremely limited in our high confidence Denisovan block set.
In
total, only 6 blocks were assigned to both D1 and D2 in 6 different
genomic regions. We suspected that these cases reflect occasional
misclassification due to the stochasticity of the mutation process; most
blocks in a region are consistent with the D1 or D2 classification, but
some spill over to the other peak. We expect a negative correlation
between block frequencies in this case. Calculating the frequency as in
described in STAR Methods
S12, we compared the correlation between D1 and D2 frequencies in our
small sample of 6 overlapping blocks with 1000 randomly selected sets of
6 non-overlapping block pairs. We observed a weak negative correlation
trend in the set of overlapping blocks (gradient = −1.15, lower than 92%
of resampled sets), consistent with the probability of occasional block
misclassification.
A simple way to
achieve a bimodality in a mismatch distribution is through a demographic
model involving multiple sources of archaic introgression.
recently inferred that two Denisovan-like ancestries introgressed into
East Asians on the basis of a bimodal mismatch distribution. Before
proceeding to analyze the mismatch distribution of Denisovan
introgressed blocks in Papuans in this context, we sought to exclude
other factors, including bioinformatic bias, and to confirm that the
driving signal is consistent with introgression from two source
populations on the Denisovan clade and not potential confounders.
e. D1/D2 signal is not a bioinformatic bias
The set of Papuan samples that we study come from three different sources – 30 newly generated sequences, 25 samples from
and 17 samples from
.
We used exactly the same pipeline for mapping and base calling on all
three datasets, and while we did not observe any obvious differences
between data from the three sources during our quality control steps, we
thought it possible that the bimodal distribution could conceivably be
caused by this batch effect. However, blocks assigned to D1 and D2 were
common over all three sample sources, and the number of blocks assigned
to D1 and D2 in sequences from the three sources showed no significant
deviation from random expectations (χ2 = 0.0997; p = 0.95).
While
the high coverage of the Altai Denisovan genome argues for high
confidence in SNP calls, a bimodal distribution of mismatch distances
could be generated by certain quality biases. To rule this out, we
performed two checks. First, a well-documented form of ancient DNA
damage is cytosine deamination leading to C-to-T substitution (
).
If there were a strong bias in GC content in the genomic regions of the
two block sets, with higher GC content in the D2 regions, then
deamination of the ancient Altai Denisovan genome could increase the D2
mismatch between the modern (introgressing blocks) and ancient
sequences. We therefore assessed GC content in the two Denisovan block
sets using the UCSC Table Browser (GRCh37/hg19, Mapping and Sequencing,
GC percent, gc5Base). The average GC content is very similar in both
sets (39.75% for D1 and 38.75% for D2) and the distributions clearly
overlap.
Second, we confirmed that
the number of Denisovan low-quality alleles does not differ greatly
between the D1 and D2 blocks sets (D1: 0.122%, D2: 0.124%; of the 23.1
Mb and 19.2 Mb of D1 and D2 sequence identified, respectively). While
low-quality SNP calls were masked in the dataset used to identify the
two block sets (see STAR Methods
S3), a strong bias in the amount of low-quality calls in the genomic
regions covered by one block set could imply relevant biases in region
quality. We did not observe this.
These
checks, along with the lack of a bimodal mismatch distribution when
comparing introgressing blocks retrieved using the Altai Neanderthal
genome against the Altai Neanderthal (Figures S3C–S3E) and the fact that our filtering process to retrieve high confidence Denisovan blocks requires that blocks overlap S∗
output (which does not use a reference archaic genome), suggest that
genetic properties of the Altai Denisovan genome do not substantially
contribute to the mismatch difference between D1 and D2 blocks.
We
then sought to determine whether the mismatch seen in the two block
sets might reflect challenges related to alignability. The overall
proportion of coverage by the alignability mask was extremely similar in
the genomic regions covered by the two block sets (median: D1 = 6.55%,
D2 = 6.52%), indicating that alignability challenges are unlikely to be
leading to biases in sequence diversity.
We
then considered whether phasing errors might be higher in one block set
than the other. For example, if a considerable amount of human sequence
were erroneously incorporated into D2 blocks but not D1 blocks, this
could drive up the D2 mismatch versus the Altai Denisovan. A simple
calculation based on the
model of hominin evolution, with a split between a single introgressing Denisovan clade and the Altai Denisovan tD,DI = 11998 generations ago (ND = 13249) and the Denisovan/Human split tH,D = 20225 generations ago (NHA =
32671), suggests a considerable human input would be required for D2
blocks to be approximately 30% more divergent from the Altai Denisovan
than D1 blocks. The approximate human component required is h where:
such that h
≈0.25. This can be considered a conservative lower bound, in that many
introgressed Denisovan haplotypes will survive into the larger Ne regime and hence have greater mismatch versus the Altai Denisovan.
Such
a pronounced phasing error is unexpected, and would be easily observed.
As a check, we asked whether the average recombination rate differs
between D1 and D2 genomic regions. A higher recombination rate in one
set of blocks would be expected to lead to faster recombination between
human and introgressing haplotypes and could complicate phasing.
Additionally, recombination rate is expected to generally impact local
genetic variation as it determines the linkage of neutral regions to any
nearby selected variation. We compressed the sets of blocks in the two
components to their respective overlapping subsets using the bedtools
function ‘merge’, leaving 86 D1 regions and 68 D2 regions. For each
region we calculated the average recombination rate using the same
combined HapMap phase II b37 genetic map as used for phasing (
);
the distribution of recombination rate between the two chunk sets was
not significantly different (standardized 2-sample Anderson-Darling test
statistic –0.60, asymptotic p = 0.71).
Ideally,
we would directly profile switch point errors by comparing our
Denisovan introgressed blocks to those retrieved using experimentally
phased haplotypes. However, experimentally phased data are only
available for two Australian samples in our dataset, and given the low
Australian sample size, we only called D1 and D2 blocks in Papuans. We
therefore searched for local variation in the mismatch within D1 and D2
blocks. For CP to call a block as Denisovan that has human sequence data
incorporated into it, we expect that the human sequence data would need
to be closely surrounded by real introgressing Denisovan sequence. If
this were not the case, then CP is expected to simply bisect the
Denisovan block. A Denisovan block containing human sequence is
therefore expected to have a sharp increase in the mismatch against the
Altai Denisovan and a sharp dip in the mismatch against a human genome
toward the center of the block. We assessed this signal by dividing D1
and D2 blocks into thirds, and calculating the mismatch of each third
against the Altai Denisovan and a West Eurasian sample. For this
analysis, we used a minimally masked version of the dataset where the
main phased data was combined with unphased data in which archaic
heterozygous/low quality sites were not masked, and the call rate filter
was not applied; heterozygote/homozygote mismatches in unphased regions
were half-weighted to reflect the average 50% probability of applying
to the block, and heterozygote-heterozygote mismatches were not counted.
The West Eurasian was chosen to be sample LP6005441-DNA_G10, a Russian
with approximately average high confidence Denisovan-specific
introgression as compared to other West Eurasian samples.
We
second counted the number of blocks showing a pattern indicative of
potential incorporation of human sequence – lower mismatch against the
human sample and higher mismatch against the Altai Denisovan in the
middle third. 3.2% of blocks were consistent with this pattern in D1,
compared to 5.0% of blocks in D2 (non-significant Chi square statistic
1.87, p = 0.17). The patterns of mismatches over individual blocks
suggest that incorporation of human sequence due to phasing errors is
rare, and similarly rare in D1 and D2.
We
additionally confirmed that the tendency for Denisovan blocks to be
called on both chromosome copies of an individual – the Denisovan
haplotype homozygosity within that individual – was similar between D1
and D2 blocks (10.9% and 10.2%, corrected χ2 = 0.066, p = 0.80).
Based
on these consistency checks, we do not interpret the mismatch
difference between D1 and D2 as being bioinformatic in origin.
f. D1/D2 signal is not caused by selection
We
second considered the possibility that the two peaks represented
differential selection. Under this explanation, we might expect D1
blocks to be under strong purifying selection, reducing variation and
mismatch, with D2 blocks evolving under neutrality.
Purifying
selection is expected to be focused on genic regions. We therefore
assessed whether D1 blocks have more overlap with genes than D2 blocks.
As when assessing recombination rate differences, we compressed the sets
of blocks in the two components to their respective overlapping subsets
using the bedtools function ‘merge’, leaving 86 D1 regions and 68 D2
regions. 79 and 60 regions overlapped genes (Ensemble 91 GRCh37)
respectively (non-significant χ2, p = 0.63).
As
an additional test for differing levels of purifying selection, we
asked whether D1 and D2 genomic regions differed in their B values. B
values are measures of background selection over the genome based on
observed diversity in an alignment of human, chimp, gorilla, orangutan
and macaque (
).
After converting original B values to GRCh37/hg19 genome coordinates,
we calculated the average B value over each D1 and D2 region. The
distributions of average B values in D1 and D2 regions were not
significantly different (standardized 2-sample Anderson-Darling test
statistic −0.87, asymptotic p = 0.91). The total average and standard
deviation for all D1 and D2 regions were 0.48 ± 0.24 and 0.50 ± 0.23,
respectively, and hence statistically overlapping.
As
D1 bocks have lower mismatch estimates compared to D2, they could have
been under strong purifying selection since the time of introgression
and might show a more pronounced skew toward rare blocks. We therefore
performed a Mann-Whitney U test to assess whether the frequency
distribution of blocks in D1 and D2 are different in our Papuan samples.
There was a significant statistical difference in the frequency
distributions (U = 3304, two-tailed p = 0.031), but summary statistics
describing the distributions were similar (mean D1: 0.048, D2: 0.049;
median D1: 0.028, D2: 0.021; standard deviation D1: 0.062, D2: 0.069).
Importantly, the proportion of rare blocks with frequency < 5% was in
fact lower in D1 (70%) than D2 (76%), which is consistent with
neutrality.
Based on the lack of a
clear genic/non-genic division between D1 and D2 blocks, their similar B
values, and no pronounced frequency skew toward rarer D1 blocks, we do
not interpret the mismatch difference between D1 and D2 as being driven
by selection.
g. The topology of D1/D2 blocks
The
network of interacting hominin populations in the Middle Pleistocene is
becoming increasingly complex. One phenomenon that is included in some
models (
,
,
,
,
), but not others (such as the main model in the
study) is a usually small Homo erectus component in the Altai Denisovan. Our approach does not allow us to identify genomic regions derived from H. erectus that may have introgressed into modern humans only, but not into the Altai Denisovan (but see STAR Methods S12). However, if the genetic contact between H. erectus
and Denisovans occurred before the divergence of the Altai Denisovan
and the Denisovan population that introgressed into humans, our D1
and/or D2 blocks could include regions with H. erectus ancestry, which were introduced into modern humans by a Denisovan population that was already pre-admixed with H. erectus.
If so, there would be two categories of blocks identified as
introgressed in the modern human – those derived from the Denisovan
clade and those with an H. erectus origin – which could have
different mismatch distributions and create a bimodal signal. While this
phenomenon is likely to be rare if the proportion of H. erectus in Altai Denisovan is small (2–14%), some models incorporate a surprisingly large contribution (e.g., 66% in Figure 3 of
).
We
therefore attempted to assign D1 and D2 blocks to specific coalescent
topologies by counting mutation sharing patterns and assessing
consistency with the fifteen possible topologies implied by the four
leaves of the tree (the block, Altai Denisovan, Altai Neanderthal, and
human). For this analysis, we followed the phasing assessment (see STAR Methods
S10e above) in using the Russian LP6005441-DNA_G10 to represent humans
and a minimally masked dataset. We counted the number of mutations in
the 16 possible sharing categories. For example, with notation 0
indicating the ancestral allele and 1 indicating the derived allele and
in order [Human, Neanderthal, Denisovan, Block], a derived mutation that
is unique to the block would contribute to mutation motif 0001, a
derived mutation shared between the block and the Altai Denisovan would
contribute to motif 0011, and a fixed derived mutation would contribute
to mutation motif 1111. When counting, we downweighed the contribution
of variants in unphased regions such that all possible mutation motifs
represented were counted, in proportion to their probability given an
equal chance of each variant being on each haplotype. The approach we
take – studying the frequency of ancestral and derived variants in a set
of samples of interest, and assessing the consistency of these with
different phylogenetic tree topologies – is allied to that taken in
recent work investigating the different demographic models of modern
human history (
).
We
note that relating topologies to demographic histories is complex; for
example, given a sufficiently large ancestral population size, each
coalescent topology would have an approximately equal probability even
if all blocks are Denisovan-introgressed. Nevertheless, studying the difference in topology proportions in D1 and D2 to clarify the history of these block sets can be useful.
We categorized the topologies of the blocks based on four heuristic criteria (see inset in Table S4):
- I.Mismatch order. For a block to be consistent with a given topology in terms of Mismatch order, the order of the mismatches between its leaves must be consistent with that topology. For example, given topology (H,(N,(D,X))) and with mij indicating the mismatch between leaves i and j, the mismatch order is required to be . For topology ((H,N),(D,X)), there are two mismatch order conditions, and . The specific cause of inconsistency in the Table S4 inset is .
- II.Branch length order. The Mismatch order requirement above does not consider ancestral or derived status. This makes it robust to multiple hit mutations and incorrect ancestral state inference, but may reduce accuracy if such phenomena are rare. For a block to be consistent with a given topology in terms of Branch length order, the order of the average branch lengths, measured in shared derived alleles, between its leaves must be consistent with that topology. For example, given topology (H,(N,(D,X))), and keeping mutation motif notation order [H,N,D,X], the branch length order constraints are and , where e.g., denotes the average of mutation motif counts 0001 and 0010. For topology ((H,N),(D,X)), there are two mismatch order conditions, and . The specific cause of inconsistency in the Table S4 inset is .
- III.Consistency threshold. For a block to be consistent with a given topology in terms of Consistency threshold, the ratio of the number of topology-inconsistent mutation motifs (ignoring multiple hits and ancestry errors) to the number of topology-consistent mutation motifs should be under some threshold, Tc. For example, given topology (H,(N,(D,X))), and keeping mutation motif notation order [H,N,D,X], . For topology ((H,N),(D,X)), . The specific cause of inconsistency in the Table S4 inset is .
- IV.Subtree balance threshold. For a block to be consistent with a given topology in terms of the Subtree balance threshold, the absolute log-ratio of two tree branches that are predicted to be equal under that topology should be under some threshold, Ts. For example, given topology (H,(N,(D,X))), and keeping mutation motif notation order [H,N,D,X], we have six conditions , , , ,, . For topology ((H,N),(D,X)) there are six conditions ,, , , , . The specific cause of inconsistency in the Table S4 inset is .
Assessing support for different topologies using each of these constraints independently with Tc = 0.10 and Ts = ln(1.5) (Table S4)
reveals support for only five topologies – (H,(N,(D,X))),
(H,(D,(N,X))), (H,(X,(D,N))), (N,(H,(D,X))) and ((H,N),(D,X)). These
topologies reveal a strong signal of human as an outgroup to
Neanderthal, Denisovan and the block sequence, and/or the Denisovan and
block sequence forming a clade. Support is substantially greater for the
(H,(N,(D,X))) topology, which follows the proposed population tree and
suggests that incomplete lineage sorting is not driving the signal of
the long blocks we have identified. Importantly, topology (H,(D,(N,X)))
is similarly infrequent in both D1 and D2 blocks. This topology is
consistent with the introgressing block being placed as most similar to
the Altai Neanderthal and is expected to be more common in the D2 block
set if the excess divergence were driven by spillover signal from
Neanderthal introgression. We further note that we do not observe a D2
peak in all non-African populations, and especially that we do not see
this peak in West Eurasians, despite each having considerable
Neanderthal introgression – again arguing against this explanation of
the D2 mismatch.
Topologies with either the Altai Denisovan or D1/D2 block acting as an outgroup to other genomes – which are consistent with H. erectus
introgression into blocks or into the Denisovan – are extremely
uncommon. While it is helpful to confirm the absence of such topologies,
they are not expected given our block identification methodology as
they do not allow for excess similarity between blocks and the Denisovan
genome. In this context, the ((H,N),(D,X)) topology is interesting.
This topology is consistent with, but certainly not unambiguous evidence
for, shared H. erectus introgression into both an
introgressing block and the Altai Denisovan genome. While the topology
is common, importantly we do not see clear evidence that the frequency
of this topology is substantially different in D1 and D2. This may imply
either that the H. erectus introgression identified in the
Altai Denisovan genome in other studies (see above) occurred earlier
than the divergence of D2 from Altai Denisovan or later than the
divergence of D1 such that it is Altai Denisovan-specific.
Other
frequency differences in topologies are observed and can be informative
regarding the demographic drivers of the D1 and D2 block sets. In
particular, three topologies show prevalence differences – an increase
in the prevalence of (H,(N,(D,X))) in D1 versus D2, and decreases in D1
versus D2 in both (H,(X,(D,N))) and (N,(H,(D,X))) (Table S4,
column < D2 > / < D1 > ). The increased frequency of
(H,(N,(D,X))) in D1 is consistent with D1 originating from a less
divergent Denisovan clade, as the D1 blocks join the Altai Denisovan
ancestor more recently and hence will be more likely to coalesce with
the Altai Denisovan genome before other events. A similar argument
applies to the (H,(X,(D,N))) topology – here, proportionally more of the
D1 blocks have coalesced with the Altai Denisovan already by the time
the Altai Denisovan and Altai Neanderthal share common ancestry, such
that (H,(X,(D,N))) should be under-represented in D1 versus D2. Again,
this is observed. The increased prevalence of (N,(H,(D,X))) in D2 is not
clearly expected and may be associated with detection bias; the greater
distance of D2 chunks from the Altai Denisovan render them harder to
differentiate from Neanderthal introgression, such that there is a
tendency to retrieve topologies in which the Altai Neanderthal is placed
deeper in the tree.
The
(H,(N,(D,X))) topology is consistent with a series of coalescent events
that mirror the proposed population model – with an introgressing
population showing greatest affinity to the Altai Denisovan, then the
Altai Neanderthal, and then humans. As such, the hypothesis that D1 and
D2 represent two divergent populations on the Denisovan clade makes
clear predictions about the frequency of different mutation motifs,
conditioned on a block showing this topology. Specifically, recalling
the mutation motif notation [H,N,D,X], the counts should substantially
reflect the branch lengths, as the block- and Altai Denisovan-specific
branches are shorter for D1, the motifs 0001 and 0010 are expected to be
rarer in D1 than in D2. Given an identical position of the
Denisovan/Neanderthal common ancestor for D1 and D2, any shortening of
these branches will lead to a corresponding lengthening of the common
ancestral branch of D1 and the Altai Denisovan, such that the 0011 motif
is expected to be more common in D1. We sought to confirm this pattern.
Without conditioning on topologies, the Denisovan-specific motif 0010
was 23% more common in D2 than D1, and the introgressing block-specific
motif 0001 was 33% more common. Conditioning on the (H,(N,(D,X)))
topology using the mismatch order criteria, for example, places
these motifs as 28% and 33% more common in D2 than D1, respectively.
Providing further evidence of consistency, the Denisovan/block shared
motif 0011 is appropriately shortened when conditioning on the
(H,(N,(D,X))), such that it is 20% less common in D2. The frequency of
this motif is approximately equal when we do not condition on the
topology (3% more frequent in D2).
To
summarize, the topology proportions in D1 and D2 blocks do not support
the idea that D1/D2 mismatch differences are driven by H. erectus
introgression into the Altai Denisovan, into D1 or D2 blocks
independently, or both. The prevalence of the (H,(N,(D,X))) topology,
and the topology differences between D1 and D2, are consistent with
Denisovan-like introgression in Papuans originating from two populations
on the Denisovan clade.
The
analysis above is useful in teasing apart the proportions and qualities
of coalescent topologies represented in D1 and D2, and can help to rule
out specific causes of their mismatch distributions. However, especially
given the complexity of block identification, we emphasize that the
topologies show consistency with a demographic scenario of interest
rather than discounting all other scenarios. The topology differences
between D1 and D2 could be consistent with other scenarios, including
introgression from a Neanderthal/Denisovan sister-clade or extremely
complex bottlenecks on the Denisovan clade.
Having
established the likely cause of D1 and D2 mismatches as complexity in
archaic hominin introgression, we sought to further explore differences
between the two block sets. We proceed by assessing evidence for
different introgression dates based on block lengths, and by asking
whether a model with introgression from two deeply divergent
Denisovan-clade populations is supported by simulations.
h. D1/D2 Denisovan lineages introgressed at different dates
Given
the likely demographic origin of the D1 and D2 haplotypes, the question
of whether D1 and D2 have different introgression dates is of
particular interest. We therefore sought to estimate introgression dates
based on haplotype lengths, which are expected to follow an exponential
distribution with its decay parameter depending on the introgression
date and the amount of introgression (Equation 1 in
).
The accuracy of this approach depends on the power of our methods to
detect haplotypes of different lengths. Two factors are relevant. First,
shorter haplotypes are expected to be harder to detect as signals of
introgression may be mistaken for noise. Second, haplotypes are expected
to be broken up by phasing errors; while the incidence of switch point
errors is low in our data (see STAR Methods
S4), and while our use of archaic genomes in phasing is expected to
substantially reduce errors in introgressed regions, some errors are
probable. The haplotypes that we assign to D1 and D2 are extremely long
(> 180 Kb), such that the signal of introgression is clear, and the
power of our methods is expected to be consistently high. Nevertheless,
we caution that date estimates derived from this method may be better
considered as relative rather than absolute dates.
We
sought to use simulations to profile the potential of date estimation
through the distribution of block lengths using a set of long
introgressing blocks, and to confirm that fitting dates using longer
blocks only does not lead to substantial biases. We first note that
deviations from the exponential distribution are known to occur under
certain combinations of introgression parameters (especially smaller
population sizes and larger admixture proportions; see Figure 3 in
).
We therefore assessed the correspondence between simulations and the
exponential expectation for our parameter range of interest.
The results of 200 replicates of a Wright-Fisher forward-time simulation of 5 Mb chunks, with recombination rate 1.27 × 10−8
(average rate from the HapMap combined genetic map) and the chromosome
discretized into 1 Kb segments for computational simplicity, using an
introgression proportion of 0.04, haploid Ne of 8334 (Australian Ne in
),
and introgression times from 50 to 2950 generations in steps of 50
generations, show that the exponential fit is close but not exact (Figures S4A
and S4B), even when all individuals in the population are sampled. We
attempted to fit each simulation both by maximum likelihood (using the
scipy.stats.expon.fit function), using either all blocks, those with
minimum length 50 Kb, or those with minimum length 180 Kb. We were able
to retrieve accurate introgression dates (Figure S4C),
although there is a tendency to underestimate the introgression
date for more ancient introgression times. In the regime of interest
(introgression times < 1800 generations), the deviation is at most
10%–15%. These fittings confirm that it is possible to fit dates using
longer block lengths only.
Additional
challenges in inferring introgression dates arise from errors in blocks
length estimation. We profiled these using the forward-time simulations
and fittings as above, but now modified the introgressing block lengths
to Berror = B + Laplace(μ, b) on sampling, where B is the error-free simulated block length and the Laplace distribution location parameter μ = 0, while scale parameter b = 20000.
Choosing the Laplace distribution as our error model assumes equal
probability to over and under-estimate block length with a constant
probability of error per base pair. If Berror < 0, the
block was discarded, capturing the difficulty in correctly identifying
short blocks. The Laplace(0,20000) distribution has a cumulative density
function at 2.5% and 97.5% of −59915 and 59915 bp, respectively,
capturing very substantial errors in block sizes. Under these models, we
are still able to achieve accurate fitting of dates (Figures S4B
and S4D), although bigger biases arise when including smaller blocks in
the fitting. Again, there is a slight tendency to underestimate
introgression dates by 10%–15% when using longer blocks.
Introgression
may well have occurred over many generations rather than as a single
event, and we considered it probable that fitting using block lengths
would emphasize the most recent introgression time. For example, if very
weak introgression were (hypothetically) to occur up to the present and
even a small number of very long introgressing blocks were sampled,
this could lead to a very recent inferred introgression date, likely
depending on the fitting procedure. We therefore repeated the
forward-time simulations, this time simulating introgression over 520
generations (15080 years with a generation time of 29 years) at a rate
0.04/520 = 7.69 × 10−5. Note that the effective introgression
rate is only marginally reduced compared to the single-event
introgression model due to replacement of haplotypes that are already
partially introgressed under this model, and the expected correction
when there is relatively low total introgression (as in our case) is
minimal. Measuring the simulated introgression date as the mid-point in
the introgression process (e.g., weak introgression from 1090 to 1610
generations ago corresponds to a mid-point of 1350 generations), there
is again a slight bias to infer more recent introgression dates (Figure S4E). For the introgression times of interest, this bias is again no more than 10%–15%.
Finally, we fitted the output of coalescent simulations generated in STAR Methods S10i, which builds on the
model with Denisovan introgression at 1353 generations ago. A slight
bias toward inferring more recent introgression dates was observed, of
approximately 5%. Adding errors to the sampled block lengths as above
did not change the inference when using long blocks > 180 Kb.
The
simulations above suggest that introgression date fitting based on
block lengths is effective given our demographic parameters and our use
of larger block lengths, even under a strong block length estimation
error model or introgression occurring over many generations rather than
as a single event. Nevertheless, in each case there is a slight bias
toward recent dates, that is greatest when introgression is more
ancient. The downward bias in date estimation is limited to 10%–15% for
the times most likely corresponding to archaic introgression in humans,
and is likely closer to 5%. Based on these simulations, we consider the
dates we report to be probable lower bounds on introgression dates, with
true dates up to 15% more ancient than our fitting suggest.
We proceeded to perform exponential fittings on the lengths of blocks in the D1 and D2 sets (Figures S4F and S4G) using the Python package statsmodels v.0.8.0 (
) and maximum likelihood fitting, assuming either a constant recombination rate or the combined HapMap genetic map (
).
We confirmed 95% CIs through a block-bootstrap procedure, whereby the
genome was divided into 2 Mb chunks, consecutive chunks were combined if
blocks spanned boundaries, and artificial samples were generated by
sampling the same chunks over all individuals with probability
proportional to chunk length (usually 2 Mb, but sometimes 4 Mb or more)
until the total observed number of blocks corresponded to that expected
from the data. When calculating the date of introgression, we assumed an
introgression proportion of 2% for each of D1 and D2, such that half of
a proposed total of 4% Denisovan introgression (
) entered from each ancestry into Papuans, and a generation time of 29 years.
The
results of the block length fittings are consistent with relatively
recent introgression times. Under a constant recombination rate, we
estimate dates of D1 introgression as 17.9 kya (95%CI 8.7–29.4) and D2
as 32.9 kya (95%CI 22.9–44.2). While there is no Papuan recombination
map, we also sought to incorporate local recombination rates into the
fitting by scaling block lengths by the average combined HapMap genetic
map recombination rate over all blocks in the D1 or D2 sets,
respectively. This maintains the approximately exponential distribution
of block lengths. Under this model, we retrieved date estimates of D1
introgression at 29.8 (95%CI 14.4–50.4) and D2 at 45.7 (95%CI
31.9–60.7). The weight of probability supports younger dates within this
range (Main Text Figures 4B and 5E).
On average, D2 introgression is relatively older than D1 introgression,
by 1.84 and 1.53 times for the two recombination models above,
respectively.
To assess the
robustness of this finding, we repeated the fitting after removing
replicates of haplotypes observed in multiple individuals. In this way,
we are seeking to observe recombination histories that are independent,
and to limit the impact of haplotypes at higher frequency due to
selection. Under a constant recombination rate and using unique
haplotypes (Figures S4F
and S4G), we estimate dates of D1 introgression as 20.2 kya (95%CI
10.4–33.5) and D2 as 29.5 kya (95%CI 23.2–36.1); under the HapMap scaled
recombination rate, we estimate dates of D1 introgression as 33.7 kya
(95%CI 16.8–57.7) and D2 as 44.3 kya (95%CI 34.4–55.4). The D2 average
introgression date is now estimated as 1.46 or 1.31 times as ancient as
D1, depending on the recombination model. The various block length
fittings all suggest that D1 introgression was relatively more recent
than D2 introgression.
Several other
lines of evidence are consistent with D1 having a more recent
introgression date. First, the variance in the frequencies of
non-overlapping D1 blocks is less than that observed for D2 blocks
(Fligner’s test for equal variance = 7.1, p = 0.008). After a pulse of
introgression, we expect the variance in haplotype frequencies to
increase as haplotypes drift away from their initial frequency. Second,
there is structure in the geographic distribution of D1 and D2
introgression (Main Text Figures 5A and 5B). The amount of identified sequence in the D1 block set (including blocks with mD <
0.1) is significantly lower in Baining samples as compared to mainland
Papuans (1.33 Mb per phased haploid genome versus 1.82 Mb on Papua;
Welch’s t test T = –3.9, p = 4 × 10−4), while there is no evidence of a different amount of D2 (including blocks with mD
> 0.23) sequence (1.28 Mb versus 1.37 Mb, T = –0.8, p = 0.41). To
account for sampling differences between New Guinea (N = 52) and Baining
(N = 16) and to estimate CIs around the observed mean, we resampled one
million times (with replacement) 16 samples from each of two islands.
While the average amounts of D2 per individual in both populations
overlap (NG: 2.75, 95%CI 2.36–3.12; Baining 2.37, 95%CI 1.96–2.79),
Baining has significantly fewer D1 chunks (NG: 3.64, 95%CI 3.10–4.19;
Baining 2.60, 95%CI 2.19–2.99). To additionally assess whether the ratio
of D1 between mainland Papuans and Baining is greater than the ratio of
D2 between mainland Papuans and Baining, we resampled sets of mainland
Papuan (N = 52) and Baining (N = 16) individuals with replacement one
million times, recording whether the median pairwise D1 ratio was
greater than the median pairwise D2 ratio. D1[Papuaresample]/D1[Bainingresample] was greater than D2[Papuaresample]/D2[Bainingresample] in 95.7% of resampling iterations.
Together,
these geographical patterns raise the intriguing possibility that the
D1 component is more typical of mainland Papua, and introgression may
even have been ongoing in the time-frame of the split between Baining
and Papuan populations. Similarly, there is a tendency for populations
outside ISEA to show a Denisovan signal more consistent with D2 (Main
Text Figure 3C),
although the limited amount of Denisovan introgression means that the
blocks contributing to the mismatches are shorter, and hence there is
less resolution in the mismatch distributions. While it is possible that
the reduced D1 signal in Baining samples is caused by weak Asian
admixture (given the lack of evidence for the D1 signal in mainland
Asia), this would additionally be expected to reduce the Baining D2
signal compared to mainland Papuans (not observed, see above) and
generate an excess signal of Asian ancestry in the Baining as compared
to mainland Papuans (not apparent in LOTER results, Main Text Figure 1A and Table S2; or in the PCA, Main Text Figure 1B,
where the Baining cluster toward Australians rather than with the
Asian-admixed Bougainville samples). We additionally note the detailed
demographic analyses in
,
which strongly place the Baining as a recently separated Papuan
population that does not harbor additional admixture signals from a wide
range of other regional populations.
i. Assessing the multiple-ancestry hypothesis
Three
lines of evidence support the probability that D1 and D2 represent
introgression from two different archaic populations, likely both on the
Denisovan clade. First, our approach to identifying D1 and D2 blocks,
and the coalescent topologies that they represent, are consistent with
both sets of blocks showing clear affinity to the Altai Denisovan over
the Altai Neanderthal genomes, and clear divergence from most modern
humans. Second, there is spatial variation in the prevalence of D1 in
populations with some Denisovan introgression (e.g., mismatches
consistent with D1 are underrepresented in Baining versus mainland
Papua, Main Text Figures 5A and 5B, and may also be rarer in East ISEA, West ISEA and mainland Asia compared to D2, Main Text Figure 3C,
although resolution is limited, Figure S2). This supports the
likelihood of D1 and D2 arising from different source populations rather
than a single population of composite ancestry. Third, and supporting
the same conclusion, there is some evidence that the introgression dates
of D1 and D2 blocks are different (STAR Methods
S10h) and that there is spatial heterogeneity in the frequency of D1
chunks within Near Oceania. We therefore sought to determine whether a
model with two pulses of introgression from archaic populations on the
Denisovan clade could generate the mismatch distribution observed in
modern Papuans using coalescent modeling.
We used the msprime v.0.6.1 program (
) to simulate a modified version of the highest-likelihood demographic model inferred by the
study, using Australians as a proxy for our Papuans. We first
translated the original fastsimcoal2 model (provided by the authors)
into msprime. We then modified the model as shown in Figures S4A
and S4B, allowing two pulses of introgression from populations on the
Denisovan clade. We did not incorporate reported inbreeding in the Altai
Neanderthal. Apart from this modification, the structure of the model
remains unchanged. As in that model, sampling times of the Altai
Denisovan and Altai Neanderthal are 2058 and 2612 generations
respectively. Parameters of the model are also unchanged and are given
in Table S5 and Figures S5A and S5B, except:
- 1.The single time of divergence between the Altai Denisovan and the introgressing Denisovan is replaced by two times, t1 and t2 indicating the divergence between the Altai Denisovan and branch D1, and the divergence between the Altai Denisovan/D1 common ancestor and D2
- 2.The population size of all internal Denisovan-clade branches (Altai Denisovan/D1 common ancestor and Altai Denisovan/D1/D2 common ancestor) is set to NDeniAnc.
- 3.Instead of a single Denisovan introgression into Australians 1353 generations ago of 4%, there are two introgressions 1353 generations ago into Papuans, one from D1 and one from D2. These are in proportion p1 × 0.04 and (1.0 – p1) × 0.04.
To
determine whether our modified model can return the two observed peaks,
and to propose demographic parameters, we simulated genetic data using
the modified model and studied the mismatch distribution. Specifically,
we simulated 5 Mb of sequence data for a sample of 144 Papuan
haplotypes, two Altai Denisovan haplotypes and two Altai Neanderthal
haplotypes. As in the
study, the mutation rate was set to 1.4 × 10−8 per base pair per generation. The recombination rate was constant and set to 1 × 10−8
per base pair per year. To mimic our own data, we masked all sites that
were heterozygote in either the two Altai Denisovan haplotypes or the
two Altai Neanderthal haplotypes. We extracted introgressed blocks from
each Papuan haplotype in the simulation (using custom scripts and the
detailed migration tables recorded by msprime) and recorded the mismatch
between these blocks and the Altai Denisovan. The process was repeated
240 times for each parameter set, yielding a total of 1200 Mb simulated
data.
The simulated data output by
msprime is ‘perfect’ – there is no SNP calling process that might miss
variation, and no missing data that would be masked by QC filters. As we
want to avoid consequent biases in model inference, we sought to
express the mismatch between introgressed blocks (simulated and real)
and the Altai Denisovan genome as a proportion of the average mismatch
to the Altai Denisovan observed in a population lacking introgression.
We therefore converted the observed mismatch against Denisovan for each
block in the real data into a fraction of the genome-wide average
mismatch of our 75 West Eurasian samples. For the simulated data, we
generated 1200 Mb of data for 150 West Eurasian chromosomes sampled with
two chromosomes from the Altai Denisovan and Altai Neanderthal, using
the unmodified model from the
study. As before, we masked archaic heterozygote sites, but this time
calculated the mismatch over the entire dataset to obtain a
population-average mismatch between simulated humans and Denisovan. When
fitting the model, we expressed the mismatch distribution in
introgressing Denisovan blocks found in simulated Papuans as a fraction
of this simulated West Eurasian genome-wide average.
We first explored the parameters on a coarse grid with , , and .
For each parameter set, we retrieved the mismatch distribution based on
the 2000 longest introgressing blocks in the real and simulated data
and calculated the sum of squared difference. Based on the output of
this coarse fitting, we were able to localize the region of the
parameter space showing a close fit to the data, and fitted a finer grid
in this region with , , and . Assessing the mismatch fit using sum of squared differences yielded our final parameter values of , , and .
To
cross-validate our methodology, we simulated 5 Mb of sequence data
24000 times using our final parameter set. We resampled 400 of these 5
Mb simulations (representing the introgression ideally observed from
2000 Mb of total data) from this pool, with replacement, 100 times. We
then repeated our fitting procedure using the previously defined
parameter space grid, yielding time estimates of t1 [9000–10250], t2 [11500–13000], N < 350, p1 = [0.45–0.7]. The distribution of cross-validation mismatches is shown in Figure S5C.
Our
simulations support the probability that the D1 and D2 mismatch peaks
reflect archaic ancestry in modern Papuans that derives from two
populations on the Denisovan clade. Both of these populations were very
distantly related to the Altai Denisovan, and more divergent that than
the East Asian- (and Siberian-, in our data; Main Text Figure 3C) specific Denisovan introgression (D0 in Main Text Figure 4B). Based on a mutation rate of 1.4 × 10−8
and a generation time of 29 years, simulations suggests that the
population contributing D1 chunks split from the Altai Denisovan
approximately 261–297 kya, while the population contributing D2 chunks
split from the Altai Denisovan approximately 334–377 kya. These dates
are population split times, measured in years before present. In order
to fit the sharp peaks we observed in the data, a small population size
of the ancestral Denisovan population (< 350) is required. The model
we explore is extremely simple, and we do not consider our results as
proving that an ancestral Denisovan population of this size necessarily
persisted for > 100 ky; a low population size or population
bottlenecks, however, are strongly implied.
Our
results suggest that Denisovans were highly structured when modern
humans encountered them, consisting of multiple, highly diverged
populations that remained sufficiently separate for hundreds of
thousands of years to show distinct signatures when their genes are
identified in modern humans. This raises the intriguing possibility that
biogeographical barriers, and possibly islands, played an important
role in maintaining Denisovan population structure. While the absence of
the D1 signal in East ISEA and elsewhere may reflect a lack of
resolution (Figure S2), the different amounts of D1 in our mainland Papua and Baining samples (Main Text Figures 5A
and 5B) hints at geographic variation, potentially indicating different
introgression histories in these populations (Main Text Figure 5E).
We therefore sought to modify our simulation protocol to directly
assess the probability of drift and sampling error generating the
difference in D1 observed between mainland Papuans and Baining.
j. Less D1 signal in New Britain than New Guinea
We further modified the
simulation model with two introgressing Denisovan populations to
incorporate a population representing the Baining. We sought to
construct this model such that it incorporates an amount of drift
between the Baining and Papuan population on the high end of realistic
values, in order to generate a very conservative estimate of the model
distribution of ratios of D1 signal between the two groups. Our model
follows the demographic analyses presented above, which indicate that
the Baining have no excess Asian admixture compared to mainland Papuans
(LOTER; Figure 1A; Table S2) and are closely related to Papuans (PCA; Figure 1B), and additional analyses in
,
which show that the Baining cluster with Papuans and have no additional
admixture signals when analyzed together with an even broader set of
regional populations. The structure of the model involves the Baining
budding from the Papuan population at a time tB,
after Denisovan introgression into their common ancestor. The haploid
population size of the ancestral Papuan/Baining population before the
split is 8834 (see Figure S5B). After budding, the Baining have a population size of and the mainland Papuans have a population size of ;
these are either constant or functions of time (see discussion below).
The two populations are modeled as entirely isolated after budding, to
maximize drift and ensure that the model is conservative. Based on the
extremely similar levels of Asian ancestry in Baining and Papuans
(LOTER, Main Text Figure 1A; Table S2) and the similar placement of Baining and Papuans by PCA (Figure 1B),
the migration rate between Baining and Asians was set to the same rate
as between Papuans and Asians. The Baining and Papuan populations have
the same introgression history in the simulations, such that both D1 and
D2 introgress at 1353 generations ago into the Papuan/Baining ancestral
population. The sample sizes of the Baining and Papuans were set to 32
and 104 haploid chromosomes, respectively, corresponding to the 16 and
52 individuals in our Baining and mainland Papuan samples.
To determine the split time between the mainland Papuan and Baining population, we used the SMC++ v1.9.3 split
option, which analyses pairs of populations simultaneously to infer
genetic divergence times jointly with population size histories (
).
These split times are effective split times, based on a hard split
model without migration after populations diverge. These estimates do
not depend on phasing. We used unphased data with the 99% call-rate
filter applied, which yielded a split time between mainland Papuan and
Baining populations at 15680 years BP with split time diploid Ne = 4620 (Main Text Figure 5C). Using genomic data without the call-rate filter resulted in a very similar estimate (split time = 16280 years BP, Ne = 4940). The mutation rate was fixed to 1.45 × 10−8 (
) and generation time to 29 years; chromosome 6, which contains the hypervariable HLA region, was excluded from the analysis.
We used these SMC++ results to implement Model 1, with tB = 540 generations ago and the population sizes for Baining and mainland Papuans after tB
as inferred by SMC++ (incorporating a recent population bottleneck
among the Baining and recent population growth for mainland Papuans). We
simulated 40 Gb of data using the model as 8000 independent 5 Mb
simulations using msprime, recording the total amount of D1 and D2
introgression observed in the mainland Papuan and Baining samples using
the migration tables in msprime. We used this information to construct a
simulated null distribution of the ratio of D1 (and D2) sequence in
mainland Papuans relative to the Baining. We performed 5000 resampling
iterations whereby we drew 5 Mb simulations from the set of 8000
simulations until the average total amount of introgressing D1 and D2
sequence in a simulated individual was equal to or just greater than the
average amount observed in our Papuan and Baining samples (3.05 Mb). On
average, this led to us using just eighteen 5 Mb simulations totaling
90 Mb of simulated data per resampling iteration. The observed median
D1[Papua]/D1[Baining] pairwise ratio (1.36) is placed on the 98.5th percentile of the simulated distribution (Main Text Figure 5D),
indicating that the excess D1 found in mainland Papuans compared to
Baining is highly unlikely to be explained by drift alone. In contrast,
the observed median D2[Papua]/D2[Baining] pairwise ratio (1.06) is
placed on the 65.3th percentile of the simulated D2 ratio
distribution, indicating that drift alone is sufficient to explain the
excess D2 in mainland Papuans.
Given
that we use population sizes inferred based on two different methods
(SMC++ for recent times and the joint site frequency spectrum for times
before tB and non-Papuan/Baining populations), we
sought to confirm that the model correctly captures the drift observed
in mainland Papuans and the Baining, and the average divergence of these
populations. The observed average heterozygosity of the Papuan and
Baining samples was 6.43 × 10−4 and 6.00 × 10−4 respectively, and the weighted FST (Equation 10 in
)
calculated using all sites that were variable between the samples was
0.0934. We used the simulation model described above to generate null
distributions of these values by performing the same calculations on
5000 resampling iterations of four hundred 5 Mb simulations (totaling 2
Gb simulated data per iteration). The observed heterozygosity of the
mainland Papuans was at the 0.2th percentile of the simulated distribution and the heterozygosity of the Baining was 98th percentile of the distribution, while the observed FST was well over the range generated by the simulations (Figure S6A).
This suggests that the simulated Papuans have insufficient drift and
that the two populations are insufficiently diverged, such that the
SMC++ informed model may simulate Papuan and Baining samples that are
more similar in their D1 ratios than would be expected based on observed
drift and divergence.
We therefore
supplemented this model with two additional, more conservative models,
to check the robustness of the result that the difference in D1 between
mainland Papua and the Baining is unlikely to be generated by drift. We
first assumed that the relatively low population divergence in
simulations was caused by incorrect recent population sizes, and tried
to identify a fixed value of that leads to observed heterozygosity and weighted FST values, fixing as in our implementation of the
model above. We assessed the fit of constant values of
by, as before, simulating 40 Gb of data using the model as 8000
independent 5 Mb simulations and generating heterozygosity and weighted FST null distributions by resampling. We found that the new fixed value of fitted the Papuan heterozygosity extremely well. yielded a good fit between simulated and observed FST (39th percentile) and Papuan heterozygosity (43rd percentile); the observed Baining heterozygosity was marginally greater than the simulated Baining heterozygosity (99th percentile), indicating that the revised model is conservative with respect to incorporating more drift than is observed (Figure S6B). Thus, our Model 2 used parameters tB = 540 and .
We
second asked how an older population split time between mainland
Papuans and the Baining might impact results. There is early
archaeological evidence of early human occupation on New Britain from
35.5 kya (
),
although importantly genetic divergence is often expected to be more
recent due to migration between two populations after separation, which
can affect SMC++ split time estimates (
).
Transportation of obsidian tools occurs within the Bismarck Archipelago
during the Pleistocene and externally to mainland Papua by the
mid-Holocene (
). Evidence for post-settlement contact can also be found in the translocation of plant and animal species (
), though these appear to have occurred considerably after initial occupation (
),
suggesting limited contact over long periods. To assess the
implications of an earlier split date, we re-implemented the model
setting tB = 800 (23.3 ky). This is more ancient than the inferred split times between mainland Papuan populations (10–20 kya;
) and is of a similar order to the Papuan/aboriginal Australian split (11–27 kya;
).
We re-estimated using the same procedure as above, exploring parameter values . No parameter value could fit both heterozygosity and weighted FST. We therefore conservatively chose to explore , which yields FST values considerably higher than observed FST (observed 0.0934, 95% CI of simulated FST 0.112–0.115) and places the observed Baining heterozygosity in the 98.9th percentile of the simulated distribution (Figure S6C), such that our Model 3 used parameters tB = 800 and .
As with Model 2, these parameter values are expected to generate
simulated data with more drift in the Baining than is observed in our
dataset. Model 3 also simulates substantially greater divergence between
Papuans and the Baining than the observed FST, such that the model will tend to generate more variable D1 ratios and is extremely conservative.
The results from these two additional models are shown in Main Text Figure 5D.
As before, the observed D2 excess in mainland Papua is not outside the
distribution expected due to drift. Similarly, in both cases the
observed D1 excess in mainland Papua is unexpected given a model whereby
the difference is introgression followed by drift and sampling – for
Model 1 the observed D1 ratio is in the 97.8th percentile of the simulated distribution and for Model 2 the D1 ratio is in the 95.4th
percentile of the simulated distribution. Together, these simulations
suggest that the reduced frequency of D1 blocks among the Baining is
unlikely to result from drift, and instead is more likely to reflect a
different Denisovan introgression history among Baining compared to
mainland Papua.
S11 - Frequency distribution of archaic blocks
The frequency of archaic introgressed blocks are shown in Table S6.
a. High frequency Denisovan blocks
We
sought to assess evidence of adaptive introgression from the two
Denisovan ancestries, in our high confidence set of Denisovan
introgressing blocks, by calculating the frequency of Denisovan ancestry
over the genome. While archaic blocks may drift to high frequency, they
are more likely than low-frequency blocks to have been subject to
natural selection – either adaptive introgression during the initial
introgression process or subsequent selection on introgressed variation.
We first retrieved all introgressing blocks > 20 Kb from our data,
and filtered out the blocks having more mismatch with Denisovan compared
to mismatch with Neanderthal. We then used the bedtools v.2.27.0
‘multiinter’ command to obtain intersected Denisovan-introgressed
regions and frequencies among all Papuan individuals. We assigned genes
from the Ensembl 91 (GRCh37) database to each intersected block, and
report the top 1% frequency regions and the frequency of all
introgressed regions in Table S6A. We repeated this procedure for East ISEA individuals (Table S6B).
A genome-wide map of Denisovan introgression in the Papuan and East ISEA samples (Main Text Figure 6), based on Tables S6A and S6B, reveals several sharp peaks at known (e.g., WARS2,
; TNFAIP3,
; and FAM178B,
and
) and unreported (e.g., WDFY2, the TMPO/IKBIP/APAF1
gene cluster) loci. The replication of several loci that have
previously been proposed to be subject to adaptive introgression
strongly supports our approach to detecting Denisovan introgression and
potentially adaptively introgressed regions. Some of our higher
frequency blocks overlap previously identified deserts of introgression (
),
but this is not unexpected given the large size of the proposed deserts
and the small number of Papuans samples they were identified from (N =
35). Specifically, we see high frequency introgressing Denisovan blocks
at the ROBO2 gene in Papuans (chr3:76572330-76634485, 39.6%
frequency) overlapping a proposed 14 Mb desert (chr3:76500000-90500000).
Lower frequency Denisovan blocks (maximum 16%) also occur in a proposed
10.9 Mb desert (chr8:54500000-65400000).
b. Overlap with modern selection signals
High
frequency Denisovan introgressed blocks could arise due to two
different selective processes – either directly and immediately on the
introgressing haplotypes, leading to longer high frequency haplotypes,
or on introgressed diversity some time after the introgression event. We
can approximately predict that the former relate primarily to
biological differences between humans and the archaic species, while the
latter relate to interactions between humans and their environment
(e.g., disease, diet, etc.). In the latter model, introgression provides
a source of genetic variation that is non-random in the sense that it
has already been subject to evolutionary forces in the archaic
population. This genetic variation may provide novel opportunities for
adaptive selection in human groups with archaic introgression, even many
thousands of years after the introgression ended.
To detect signals of recent positive selection in genetic regions with high Denisovan introgression, we calculated nSL (
)
on all SNPs with ancestral information for the Baining of New Britain,
mainland Papuan population of New Guinea and East ISEA continental
group. We divided the genome into non-overlapping 200 Kb windows and
defined the nSL statistic score of a window as the proportion of SNPs
with |nSL| > 2.0. We discarded windows with fewer than 10 SNPs. We
then assessed overlap between top 5% nSL window scores and top 1%
frequency introgressed Denisovan blocks (see above), comparing
introgression signals in Papuans to nSL for the Baining and mainland New
Guinea groups, and introgression signals in East ISEA to nSL for the
East ISEA group.
We found that only
3/34 Denisovan introgressed haplotypes that were high frequency in Papua
were in nSL top 5% windows in the Baining group and these were not
significant. For completeness, these genes were TNFAIP3 (nSL percentile 3.6%), WDFY2 (2.2%) and SUMF1 (4.5%).
In mainland Papuans, 2/34 high-frequency Denisovan introgressed haplotypes were top 5% nSL hits – GLT8D2 (1.1%) and ZNF280D (2.9%). In East ISEA, just 1/39 top 1% high-frequency Denisovan introgressed haplotypes was a top 5% nSL hit – TMEM131 (nSL percentile 1.7%).
Given the suggested role of WDFY2
in lipid metabolism adaptation, we assessed the mainland Papuan and
Baining nSL top 1% gene lists for enrichment of fat metabolism pathways
(method described below). We did not observe enrichment, but did note
the presence of genes important in lipid metabolism and synthesis – most
notably windows including FASN in Baining, New Guinea and East ISEA, and FADS1 and FADS2 in Baining only. We also note the presence of the important carbohydrate metabolism gene AGL
in our top 1% nSL gene list in both the Baining and mainland New
Guinea. Further work is required to determine the precise role of
adaptation and the detailed evolutionary history of these genes in
Oceanian populations.
c. Gene ontology enrichment
We
tested whether specific sets of genes have significantly elevated
frequencies of Denisovan ancestry using the Ontologies tab of the
Enrichr web interface (
). Specifically, we retrieved all genes identified as introgressed at high frequency (top 1%, using Tables S6A
and S6b) and searched for enrichment in the Gene Ontology lists ‘GO
Cellular Component 2018’, ‘GO Biological Process 2018’ and ‘GO Molecular
Function 2018’, as well as the two phenotype lists ‘MGI Mammalian
Phenotype 2017’ and ‘Human Phenotype Ontology’ and one tissue expression
list ‘Jensen TISSUES’. We performed this analysis for both combined
Papuan and East ISEA groups; results that survive multiple hypothesis
test corrections and are not driven by clusters of co-located genes are
reported in the Main Text, and include enrichment associated with
expression in adipose and uterine tissue and fetus development. Full
results, including categories that either i) had uncorrected p
values < 0.005 (the corrected p values using Benjamini-Hochberg
method are also reported) and/or ii) were driven by multiple co-located
genes are given in Table S6C.
Enrichment was observed in categories related to smooth muscle cell
proliferation, immunity and adipogenesis in both Papuans and East ISEA,
e.g., ‘negative regulation of smooth muscle proliferation’ involving TNFAIP3/PPARG genes (NG: p value = 0.001, corrected p value = 0.1; EISEA: p = 0.0005, corr-p = 0.049); ‘negative regulation of inflammatory response’ involving TNFAIP3/SAMSN1 genes in Papuans (p = 0.0005, corr-p = 0.09) and TNFAIP3/PPARG genes in East ISEA (p = 0.003, corr-p = 0.07); and ‘positive regulation of fat cell differentiation’ involving PPARG and WDFY2 genes (NG: p = 0.002, corr-p = 0.1; EISEA: p = 0.0009, corr-p = 0.05).
d. High frequency long and D1/D2 blocks
As
longer high-frequency introgressing blocks are expected when a
Denisovan haplotype rises to high frequency early in the introgression
process, we repeated our introgressing block frequency analysis for
blocks > 180 Kb that we were able to assign to one of the two
Denisovan ancestries, D1 (Table S6D) and D2 (Table S6E). Analyzing blocks assigned to D1 introgression revealed two regions at high frequency (> 20% in Papua), containing FAM178B/FAHD2B/ANKRD36, ZNF280D and FBXL20/MED1/CDK12. Analyzing blocks assigned to D2 introgression revealed five regions at high frequency, containing ANKRD28, NFAT5/NQO1, COG7/GGA2/EARS2/UBFD1/NDUFAB1 and ARID4A/TOMM20L/TIMM9/KIAA0586. A gene-free region 15 Kb downstream of CENPW
was also highly introgressed based on D2 blocks. We observed an extreme
nSL signal (nSL percentile 0.2%) in the window containing CENPW in the Baining group, and note that the window containing ZNF280D (D1) also had a high nSL signal (see above).
As
we are only able to assign D1 and D2 ancestry to large blocks > 180
Kb, we additionally explored a concept whereby confident D1 or D2 blocks
might be used as local ‘flags’ for their respective ancestries. In this
way, we can adopt an assumption – that short, < 180 Kb,
introgressing Denisovan blocks overlapping a > 180 Kb D1 (or D2)
chunk are also from the D1 (or D2) population – to leverage off our high
confidence Denisovan ancestry dataset. We performed a bootstrapping
analysis whereby we repeatedly sampled two Papuan individuals from our
dataset and, using their > 20 Kb high confidence introgressing
Denisovan blocks, identified the overlap between them. We divided the
genome into 40 Kb non-overlapping windows and, for each window, recorded
whether the pair had overlapping introgression. We performed this
resampling 100000 times, counting the number of observations in each 40
Kb genomic window. Then, for each > 180 Kb D1 and D2 block, we
identified the most commonly observed 40 Kb window, and ranked D1 and
D2 blocks according to this frequency (Tables S6D
and S6E, column ‘BOOTSTRAP_RANK_20KB’). While the frequency of D1 and
D2 blocks themselves and their ranks according to the above analysis are
highly correlated, some rare low-frequency introgressed blocks assigned
to D1 and D2 cover regions that are highly introgressed based on
smaller blocks. This may reflect occasional misclassification (STAR Methods
S10d), such as when a small number of D2 blocks are erroneously
identified as D1 leading to an apparent low-frequency D1 introgression
event, or adaptive introgression of primarily smaller Denisovan blocks.
e. High frequency residual S∗ windows
We additionally determined the frequency of uncommon residual S∗ windows (see STAR Methods S12, below) found in Papua (Table S6F) and East ISEA (Table S6G). Residual S∗
is a signal that would be expected given non-Neanderthal, non-Denisovan
archaic introgression (e.g., it would be consistent with introgression
from H. erectus), but could equally be caused by other
processes including balancing selection or local properties of molecular
evolution (e.g., an accelerated mutation rate in non-African
populations). In our combined Papuan sample, top 1% residual S∗ frequency blocks include a cluster of genes around VN1R1 (Vomeronasal 1 Receptor 1), as well as PDE1C (Phosphodiesterase 1C), DPH6 (Diphthamine Biosynthesis 6) and PRKCH (Protein Kinase C Eta). In our East ISEA sample, top 1% residual S∗ frequency blocks include a cluster of genes around HLA-A (Major Histocompatibility Complex, Class I, A), and PDE1C and PRKCH. There is considerable correlation between the frequency of residual S∗ blocks in Papua and East ISEA, such that the HLA-A region is also found at high frequency in Papuan residual S∗ data and the VN1R1 region is at high frequency in East ISEA residual S∗. While these two genes are especially intriguing – VN1R1 may be involved in the species-specific pheromone system in other species (
) and sociosexual behavior in humans (
), and the hypervariable HLA-A
gene plays a critical role in immunity – further study is required to
determine the evolutionary history detected by the residual S∗ signal in each case. In particular, archaic introgression (
) and balancing selection masquerading as archaic introgression (
) have both been proposed for the HLA region, and may likewise play a role in the signal around VN1R1.
S12 - Residual S∗ signal
The S∗
method is designed to identify archaic introgression without requiring
the introgression to be derived from a population with similarity to a
known, sequenced archaic hominin. As such, studying this signal may
reveal otherwise cryptic evidence of introgression from hominins outside
the Neanderthal and Denisovan clades. Additionally, the signal that S∗
identifies – non-African variation in high linkage disequilibrium –
would be expected to occur due to structure in the Out of African
migration(s). Given the known presence of Homo floresiensis in our study area (
,
), the possibility that late Homo erectus was contemporary with the earliest anatomically modern humans in ISEA (
), and that a proposed early Out of Africa model may be required to explain genetic diversity patterns in Papuans (
), we sought to further profile the S∗ signal (Main Text Figure 7).
a. No more than 1% unexplained archaic introgression
The output of the S∗
analysis consists of non-overlapping 50 Kb windows reported for each
genome (rather than each chromosome copy). Global patterns of S∗ (> 99% confidence, i.e., higher confidence signal, see STAR Methods S7c) show a sharp peak in Papuans (Table S3A; Figure S7A). Calculating pairwise sharing of these S∗ windows (Figure S7B)
indicates that the signal is quite broadly shared, with Papuans again
unusual in sharing a lot of signal between each other and with East
ISEA. These patterns are consistent with known patterns of Denisovan
introgression, but could also be caused by other demographic or
introgression processes. We first sought to assess to what extent this
signal might be driven by Denisovan introgression. We used bedtools to
remove S∗ > 99% confidence windows that were inferred to
be caused by Neanderthal introgression, based on > 5% coverage of the
merged set of HMM and CP Neanderthal introgressed blocks over both
chromosome copies. We refer to this trimmed dataset as S∗NoNean (Figure S7A, right pane). We observe that S∗NoNean
retains its sharp peak in Papua, while causing a reduction in the
overall introgression signal of 75%–80% in all populations when compared
to S∗. Repeating this process but instead removing S∗ windows inferred to be caused by Denisovan introgression leads to a slight dip in S∗NoDeni signal in Papua (Figure S7A, right pane), consistent with Denisovan introgression explaining the majority of the excess S∗ Papuan introgression signal. While the overall introgression signal drops by 84% in Papua compared to S∗, there is still a fall of 54% in West Eurasia.
This
analysis raises two interesting points. First, it is possible to detect
the distinctive introgression signal in Papua using S∗. With
only a Neanderthal genome available, we would further be able to
classify the source of introgression as non-Neanderthal using S∗NoNean. Alternatively, with only a Denisovan genome available, we would be able to use S∗NoDeni
to identify the primary driver of this signal as ‘Denisovan’
introgression, as opposed to early out-of-Africa (OOA) processes
involving modern humans, or additional introgression from an unknown
archaic source. This suggests that S∗ is also well-suited to
discovering introgression from unknown hominins, by studying signal
behavior when masking introgression from known hominins.
Second,
studying the West Eurasian signal is particularly informative as West
Eurasians carry minimal known Denisovan introgression. Two statistical
patterns are important. Removing Denisovan introgression blocks from
West Eurasian S∗ might be expected to cause a minimal
reduction in introgression signal. Instead, there is a substantial 54%
reduction in introgression signal relative to S∗ when studying S∗NoDeni,
confirming that our CP and HMM Denisovan block sets contain
considerable spillover from outside the Denisovan clade. This spillover
is likely due to Neanderthal introgression (as explored in STAR Methods
S9a and S9b), but as we do not study this ambiguous signal in depth, we
cannot rule out introgression from Neanderthal/Denisovan sister clades.
Conversely, removing Neanderthal introgression blocks from West
Eurasian S∗ might be expected to remove virtually all the
introgression signal. Instead, a considerable 27% of the introgression
signal remains. This could be due to false positives in the S∗
signal; or limited power of other methods to detect Neanderthal
introgression; or a result of hitherto unknown introgression processes
detected by S∗ but not CP or the HMM. The overlap in the S∗ signal that is removed when trimming Neanderthal or Denisovan introgression confirms our observation in Table S3B
– that a great deal of introgressing blocks are ambiguous, showing
greater similarity to both the Neanderthal and Denisovan genomes than
human variation, likely due to the more recent common ancestry of the
archaic hominins.
Because the excess Papuan S∗ signal is so completely eliminated by filtering out Denisovan introgression (Figure S7A,
right pane), a very simple calculation puts a tentative upper bound on
the amount of introgression into humans from outside the
human/Neanderthal/Denisovan clade. Papuans have 97.2 Mb S∗
signal compared to the 40.8 Mb observed in Europeans. Assuming the 56.4
Mb excess corresponds to 4% Denisovan introgression, we might expect S∗ to detect 28.2 Mb from 2% Neanderthal introgression – given that the power of S∗
to detect introgression from Neanderthals and Denisovans in humans is
expected to be similar following their similar genetic distance from
humans and introgression times. This leaves 12.6 Mb of S∗
signal unaccounted for in West Eurasians and Papuans – a combination of
false positives, limited power of the CP and HMM, and, potentially,
unknown introgression signals – suggesting a maximum of ∼1%
introgression from outside the human/Neanderthal/Denisovan clade. As
such introgression would be expected to be easier to detect, given a
similar introgression date, than Neanderthal or Denisovan introgression
due to greater divergence from humans, this is only intended as an
approximate upper bound. The bound is on average additional
introgression in West Eurasia and Papua, but the absence of obvious
excess signal in other regions suggests it applies more broadly.
Nevertheless, individual populations within continents or isolated
groups not captured by our sampling may have greater amounts of highly
divergent introgression.
b. Profiling the residual S∗ signal
While
the calculation above suggests that if introgression from outside the
(Human, Denisovan, Neanderthal) clade occurred, it was limited; it
remains interesting to attempt to identify possible regional peaks in S∗
that are consistent with such introgression. We therefore attempted to
remove introgression signals from the (Neanderthal, Denisovan) clade by
filtering out both Neanderthal and Denisovan introgressing blocks, as
inferred by CP and the HMM. Starting with the S∗ > 99% confidence output, we now used bedtools to remove any S∗ windows with more than 5% cumulative overlap from the union of CP and HMM Neanderthal and Denisovan blocks (see Figure S7C schematic). We are interested in the remainder, which we call residual S∗ (RS∗).
On average, individuals had 172 residual S∗ windows. We observed that sharing of residual S∗
between continental groups is common. To quantitatively profile this
pattern while taking sample size into account, we randomly down sampled
each population to 20 individuals 1000 times and counted the number of
residual S∗ windows that were observed in all continental
groups (‘global’), 4 to 8 continental groups (‘widespread’), or 1–3
continental groups (‘uncommon’). The southeast Asian group was excluded
due to its small sample size, such that the analysis incorporated Papua,
East ISEA, West ISEA, South Asia, East Asia, Siberia, America and West
Eurasia. Table S3G shows the average amount of residual S∗ sequence per individual in each category (also see Figure S7D).
Papua has the lowest signal of residual S∗ (151 blocks/individual covering 8.8 Mb, 10.4% of the original 1265 S∗ blocks/individual), while South Asia has the highest residual S∗ signal (185 blocks/individual covering 10.6 Mb, 23.1% of the original 710 S∗ blocks/individual). The differences between groups are small (Figure S7A, right pane), and approximately 15%–20% of residual S∗ windows are found globally (Figure S7D),
and over half are widespread. This broad distribution may reflect
limitations of our African sample in capturing African variation,
demographic events such as pre-OOA genetic structure or shared drift
during the OOA bottleneck, evolutionary forces such as purifying
selection within Africa, or unusual genomically local patterns of
molecular evolution that are not captured by the simulation model.
Interestingly, Papua, West Eurasia and South Asia show the highest
proportion of uncommon residual S∗ signal (Table S3G; Figure S7D).
While this may partly reflect an Asian ancestry bias in our definitions
of continental groups, the pattern is consistent with local demographic
processes specifically impacting these populations.
A potential cause of excess uncommon residual S∗
is region-specific introgressive sequences that coalesce earlier than
the (H,N,D) group of known hominins. This is consistent with the
placement of Homo erectus on the hominin species tree (but also
many other causes; see below). Such sequence could be caused by direct
introgression from H. erectus; or introgression from Denisovans if the introgressing Denisovan population had, like the Altai Denisovan (
), mixed with H. erectus.
It could also be caused by incomplete lineage sorting within the
Neanderthal or Denisovan populations that are known to have mixed with
modern humans; by balancing selection; and by increases in local
mutation rate in non-Africans. The topologies of interest are
(X,(H,(D,N))), (X,(D,(H,N))) and (X,(N,(D,H))). While we cannot
accurately calculate topologies (see STAR Methods
S10 g) on genomic windows, which cover both chromosome copies and will
frequently be chimeras of different coalescent histories, we can make
simple predictions about the frequency of certain mutation motifs given H. erectus
introgression – following the [H,N,D,X] notation, a substantial
increase in the frequency of 0001 and an increase in 1110 that is
dependent on the split time of H. erectus and modern humans. To assess evidence for H. erectus introgression, we therefore retrieved all global and uncommon residual S∗
blocks in each continental group, and divided the sum of the 0001 and
1110 mutation motifs observed in these blocks by their total sequence
length.
While our initial
calculations identified a clear excess in average 1110 and especially
0001 mutation motifs/bp in East ISEA and Papua, further investigation
revealed that this was largely driven by high-frequency introgressed
windows around the HLA-A gene. HLA regions have been discussed in the context of archaic introgression (
), and balancing selection masquerading as archaic introgression (
).
Given the possible role of balancing selection or locally accelerated
evolution in the HLA region, we profiled the frequency of 0001 and 1110
motifs when excluding chromosome 6 from the analysis (Figure S7E).
There is a tendency toward higher 0001 and 1110 in East ISEA and Papua,
centered on the islands of Flores and Lembata. Uncommon residual S∗ windows in West Eurasia tend to have relatively high rates of the 1110 motif.
These
mutation motif patterns suggest a slight excess of unique variation
that is not shared with humans, the Altai Denisovan or the Altai
Neanderthal in East ISEA and Papua. However, the signal is not strong,
and the difference in total RS∗ between populations is small,
suggesting at most little introgression from outside the
Human/Neanderthal/Denisovan lineage in these populations. Still, the
question remains as to the cause of this mutation motif pattern. Homo erectus introgression has been suggested among Andaman populations (
), but debate is ongoing (
). The broader region is known to have harbored both H. floresiensis and H. erectus
in a time frame potentially overlapping occupation by modern humans.
However, Papua especially is the global center of gravity of Denisovan
introgression among modern human populations. The Altai Denisovan is
thought to have some H. erectus ancestry (
,
,
,
,
),
though it is not yet clear whether this is also true for introgressing
Denisovan populations. Alternatively, region-specific introgression from
either Neanderthals or Denisovans could introduce haplotypes coalescing
outside the (H,N,D) tree due to incomplete lineage sorting. Further
analysis of other statistics, or the specific haplotypes driving the
residual S∗ signal, coupled with complex simulations, would
be required to fully clarify this question, and are beyond the scope of
this work.
S13 – Rampasasa is not an introgression outlier
Our
dataset includes 19 samples from Rampasasa, Flores, a village that is
home to some individuals of unusually short stature and close to the
cave where Homo floresiensis (
)
bones were found. The dataset also includes two other villages on
Flores (Cibol and Bena) and samples representing many other islands in
East ISEA. This offers the opportunity to test for anomalous signals of
unusual archaic introgression into Rampasasa, as recently also assessed
by
,
with the benefit of being able to include samples from nearby and
regional populations. Compared to surrounding regions, we did not detect
any unusual signs of Neanderthal or Denisovan introgression in the
village – the total amount of the genome with evidence for Neanderthal
introgression only (62.9 Mb) was similar to neighboring villages (e.g.,
Cibol 64.4 Mb) and at the lower end of the East ISEA range
(62.9–67.2 Mb). Denisovan introgression is similarly low (23.6 Mb; Cibol
23.8 Mb; region 23.6–42.9 Mb). The levels of Denisovan and Neanderthal
introgression are exactly as expected based on the proportion of Papuan
ancestry in Rampasasa (Main Text Figure 7).
As with all other populations, the S∗
statistic detected a substantial archaic signal in Rampasasa that could
not be assigned by CP or the HMM to either Denisovan or Neanderthals.
We also studied residual S∗: S∗ windows that
explicitly exclude Neanderthal or Denisovan introgression and so may be
enriched for introgression signal contributed by genetically
uncharacterized hominins, such as H. erectus introgression if it occurred. Although the residual S∗ signal in Rampasasa is high, there was no clear evidence of excess residual S∗
signal compared to regional or global populations (per individual, 20
Mb in Rampasasa; Cibol 19.4 Mb; region 18.7–20 Mb; also see Main Text Figure 7). An analysis of the composition of the residual S∗ signal (STAR Methods S12) indicates that Rampasasa, and East ISEA and Papua more broadly, are relatively more consistent with limited H. erectus introgression (Figure S7E), but we emphasize again that the signal is not conclusive and that various other explanations exist (see STAR Methods S12 above). Our findings are consistent with those of
,
in that while they cannot rule out additional archaic introgression
into Rampasasa, they suggest that any such introgression must have been
extremely limited. By including highly local populations in our analysis
we are able to provide additional regional context, further emphasizing
that Rampasasa is not an outlier compared to nearby villages and
islands.
Data and Software Availability
FASTQ
files for each individual are available in the European Genome-phenome
Archive (EGA) with accession number EGAS00001003054 (https://www.ebi.ac.uk/ega/home). Variant files are available from the Estonian Biocenter data archive (http://evolbio.ut.ee).
New code has been uploaded to GitHub:
- The new HMM model: https://github.com/guysjacobs/archHMM
- Topology counting code: https://github.com/guysjacobs/archTopoCount
- Code to combine BED-format introgressed windows: https://github.com/guysjacobs/archBedCombine
Acknowledgments
We
thank Hie Lim Kim (Nanyang Technological University), Matthew Leavesley
(University of Papua New Guinea), Irene Gallego Romero (University of
Melbourne), Nicolas Brucato (Université de Toulouse Midi-Pyrénées), Jeff
Wall (University of California San Francisco), and Vitor Sousa
(University of Bern) for helpful comments. We also thank Anto Aasa
(University of Tartu) for help with preparing the geographical map of
the D1 distribution, and Jonathan Friedlaender (Temple University) for
providing New Britain samples as part of the GenomeAsia 100K project. We
especially thank all of our study participants. This study was
supported by National Science Foundation Grant SES 0725470 and Singapore
Ministry of Education Tier II Grant MOE2015-T2-1-127 to J.S.L., an NTU
Presidential Postdoctoral Fellowship to G.S.J., an NTU Complexity
Institute Individual Fellowship to P.K., a French ANR grant
ANR-14-CE31-0013-01 to F.-X.R., a European Union grant through the
European Regional Development Fund (Project No. 2014-2020.4.01.15-0012)
to M. Metspalu, and a Royal Society of New Zealand Marsden Grant
17-MAU-040 and a German Alexander von Humboldt Foundation fellowship to
M.P.C. Computational resources were provided by a Microsoft research
grant for Azure cloud computing and the High Performance Computing
Center, University of Tartu, Estonia.
Author Contributions
G.S.J.
and G.H. performed the primary analyses; L.S. performed SNP calling and
validation; P.K. analyzed the genes; C.C.D. provided laboratory
support; M. Mondal, D.J.L., and L.P. advised on ancestry detection;
F.-X.R., M.S., and M. Metspalu advised on analyses and interpretation;
G.S.J., G.H., H.S., J.S.L., and M.P.C. designed the project; G.S.J.,
G.H., and M.P.C. wrote the manuscript based on input from all the other
authors. The sequence data are available from the European
Genome-phenome Archive (accession EGAS00001003054). Variant files are
available from the Estonian Biocenter data archive (http://evolbio.ut.ee).
Declaration of Interests
The authors declare no competing interests.
Supplemental Information
-
Table S1. Sample and Combined Dataset List, Related to Figure 1 and STAR MethodsComplete list of 161 newly sequenced samples, and the 317 additional genomes incorporated into the analysis dataset.
-
Table S2. LOTER Results for ISEA and Papuan Groups, Related to Figures 1, 2, and 7 and STAR Methods
-
Table S3. Quantifying the Amount of Denisovan, Neanderthal, and Ambiguous or Unknown Signal in Different Populations Based on Different Detection Methods, Related to Figures 1 and 2 and STAR Methods
-
Table S4. Proportion of Blocks Consistent with Each of the Fifteen Possible Coalescent Topologies, under Four Different Definitions of Consistency, Related to STAR MethodsValues closer to 0 indicate that a topology is not observed in a set of blocks; only four topologies are observed at moderate frequency in the D1 and D2 sets. The total number of blocks consistent with at least one topology is indicated in the final row of the table; blocks can be consistent with more than one topology.
-
Table S5. Demographic Parameters of Our Modified Simulation Model (Malaspinas et al., 2016), Related to Figures 2 and 3 and STAR Methods
-
Table S6. Frequency of Denisovan-Introgressed (all Denisovan, and D1 and D2) Blocks, and Residual S∗ Windows, in East ISEA and Papuan Populations, Related to Figure 6 and STAR Methods
References
- A global reference for human genetic variation.Nature. 2015; 526: 68-74
- The shaping of modern human immune systems by multiregional admixture with archaic humans.Science. 2011; 334: 89-94
- The ‘human revolution’ in lowland tropical Southeast Asia: the antiquity and behavior of anatomically modern humans at Niah Cave (Sarawak, Borneo).J. Hum. Evol. 2007; 52: 243-261
- A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.Science. 2017; 357: 1160-1163
- Trimmomatic: a flexible trimmer for Illumina sequence data.Bioinformatics. 2014; 30: 2114-2120
- A new small-bodied hominin from the Late Pleistocene of Flores, Indonesia.Nature. 2004; 431: 1055-1061
- Analysis of human sequence data reveals two pulses of archaic Denisovan admixture.Cell. 2018; 173: 53-61
- Second-generation PLINK: rising to the challenge of larger and richer datasets.Gigascience. 2015; 4: 7
- Autosomal and X-linked single nucleotide polymorphisms reveal a steep Asian-Melanesian ancestry cline in eastern Indonesia and a sex bias in admixture rates.Proc. Biol. Sci. 2010; 277: 1589-1596
- Haplotype estimation using sequencing reads.Am. J. Hum. Genet. 2013; 93: 687-696
- Loter: a software package to infer local ancestry for a wide range of species.Mol. Biol. Evol. 2018; 35: 2318-2326
- On detecting incomplete soft or hard selective sweeps using haplotype structure.Mol. Biol. Evol. 2014; 31: 1275-1291
- A second generation human haplotype map of over 3.1 million SNPs.Nature. 2007; 449: 851-861
- The genetic structure of Pacific Islanders.PLoS Genet. 2008; 4: e19
- Akt- and Foxo1-interacting WD-repeat-FYVE protein promotes adipogenesis.EMBO J. 2008; 27: 1399-1410
- Archaic hominin admixture facilitated adaptation to Out-of-Africa environments.Curr. Biol. 2016; 26: 3375-3382
- Population genetics models of local ancestry.Genetics. 2012; 191: 607-619
- A draft sequence of the Neandertal genome.Science. 2010; 328: 710-722
- Genetic evidence for archaic admixture in Africa.Proc. Natl. Acad. Sci. USA. 2011; 108: 15123-15128
- The WD40 and FYVE domain containing protein 2 defines a class of early endosomes necessary for endocytosis.Proc. Natl. Acad. Sci. USA. 2006; 103: 11928-11933
- Estimating the mutation load in human genomes.Nat. Rev. Genet. 2015; 16: 333-343
- A missense polymorphism in the putative pheromone receptor gene VN1R1 is associated with sociosexual behavior.Transl. Psychiatry. 2017; 7: e1102
- The timing and spatiotemporal patterning of Neanderthal disappearance.Nature. 2014; 512: 306-309
- DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA.Nucleic Acids Res. 2001; 29: 4793-4799
- Indonesia – unravelling the mystery of a nation.Lancet. 2016; 387: 830
- New fossils from Jebel Irhoud, Morocco and the pan-African origin of Homo sapiens.Nature. 2017; 546: 289-292
- Complex patterns of admixture across the Indonesian archipelago.Mol. Biol. Evol. 2017; 34: 2439-2452
- Generating samples under a Wright-Fisher neutral model of genetic variation.Bioinformatics. 2002; 18: 337-338
- Physiological and genetic adaptations to diving in sea nomads.Cell. 2018; 173: 569-580.e15
- Earliest known hominin activity in the Philippines by 709 thousand years ago.Nature. 2018; 557: 233-237
- Discerning the origins of the Negritos, First Sundaland Peoples: deep divergence and archaic admixture.Genome Biol. Evol. 2017; 9: 2013-2022
- Coalescent simulation in continuous space.Bioinformatics. 2013; 29: 955-956
- Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.Nucleic Acids Res. 2016; 44 (W90-7)
- Inference of population structure using dense haplotype data.PLoS Genet. 2012; 8: e1002453
- Analysis of protein-coding genetic variation in 60,706 humans.Nature. 2016; 536: 285-291
- A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.Bioinformatics. 2011; 27: 2987-2993
- Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 https://arxiv.org/abs/1303.3997.
- The lengths of admixture tracts.Genetics. 2014; 197: 953-967
- A working model of the deep relationships of diverse modern human genetic lineages outside of Africa.Mol. Biol. Evol. 2017; 34: 889-902
- Ancient genomes document multiple waves of migration in Southeast Asian prehistory.Science. 2018; 361: 92-95
- Ancestral origins and genetic history of Tibetan highlanders.Am. J. Hum. Genet. 2016; 99: 580-594
- A genomic history of Aboriginal Australia.Nature. 2016; 538: 207-214
- The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.Nature. 2016; 538: 201-206
- Robust relationship inference in genome-wide association studies.Bioinformatics. 2010; 26: 2867-2873
- RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference.Am. J. Hum. Genet. 2013; 93: 278-288
- The prehistoric peopling of Southeast Asia.Science. 2018; 361: 88-92
- Stratigraphic placement and age of modern humans from Kibish, Ethiopia.Nature. 2005; 433: 733-736
- Widespread genomic signatures of natural selection in hominid evolution.PLoS Genet. 2009; 5: e1000471
- A high-coverage genome sequence from an archaic Denisovan individual.Science. 2012; 338: 222-226
- Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation.Nat. Genet. 2016; 48: 1066-1070
- Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes.Nat. Commun. 2017; 8: 303
- When did Homo sapiens first reach Southeast Asia and Sahul?.Proc. Natl. Acad. Sci. USA. 2018; 115: 8482-8490
- Pleistocene migration and colonization in the Indo-Pacific region.in: Anderson A. Barrett J.H. Boyle K.V. The Global Origins and Development of Seafaring. McDonald Institute for Archaeological Research, ; 2010: 41-55
- Genomic analyses inform on migration events during the peopling of Eurasia.Nature. 2016; 538: 238-242
- Population structure and eigenanalysis.PLoS Genet. 2006; 2: e190
- Ancient admixture in human history.Genetics. 2012; 192: 1065-1093
- 35,000 year old sites in the rainforests of west New Britain, Papua New Guinea.Antiquity. 1994; 68: 604-610
- Scikit-learn: Machine Learning in Python.J. Mach. Learn. Res. 2011; 12: 2825-2830
- Possible ancestral structure in human populations.PLoS Genet. 2006; 2: e105
- Scaling accurate genetic variant discovery to tens of thousands of samples.bioRxiv. 2017; 10.1101/201178
- Principal components analysis corrects for stratification in genome-wide association studies.Nat. Genet. 2006; 38: 904-909
- Sensitive detection of chromosomal segments of distinct ancestry in admixed populations.PLoS Genet. 2009; 5: e1000519
- The complete genome sequence of a Neanderthal from the Altai Mountains.Nature. 2014; 505: 43-49
- A high-coverage Neandertal genome from Vindija Cave in Croatia.Science. 2017; 358: 655-658
- BEDTools: a flexible suite of utilities for comparing genomic features.Bioinformatics. 2010; 26: 841-842
- A test for ancient selective sweeps and an application to candidate sites in modern humans.Mol. Biol. Evol. 2014; 31: 3344-3358
- Archaic adaptive introgression in TBX15/WARS2.Mol. Biol. Evol. 2017; 34: 509-524
- Genetic history of an archaic hominin group from Denisova Cave in Siberia.Nature. 2010; 468: 1053-1060
- Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania.Am. J. Hum. Genet. 2011; 89: 516-528
- Novel human vomeronasal receptor-like genes reveal species-specific families.Curr. Biol. 2002; 12: R409-R411
- The combined landscape of Denisovan and Neanderthal ancestry in present-day humans.Curr. Biol. 2016; 26: 1241-1247
- Statsmodels: econometric and statistical modeling with Python.in: van der Walt S. Millman J. Proceedings of the 9th Python in Science Conference. ; 2010: 57-61
- Paleogenomics. Genomic structure in Europeans dating back at least 36,200 years.Science. 2014; 346: 1113-1118
- Genomic insights into the peopling of the Southwest Pacific.Nature. 2016; 538: 510-513
- No evidence for unknown archaic ancestry in South Asia.Nat. Genet. 2018; 50: 632-633
- Detecting archaic introgression using an unadmixed outgroup.PLoS Genet. 2018; 14: e1007641
- The genome of the offspring of a Neanderthal mother and a Denisovan father.Nature. 2018; 561: 113-116
- Geographic variation in human mitochondrial DNA from Papua New Guinea.Genetics. 1990; 124: 717-733
- Revised stratigraphy and chronology for Homo floresiensis at Liang Bua in Indonesia.Nature. 2016; 532: 366-369
- Changing landscape and social interaction: looking at agricultural history from a Sepik-Ramu perspective.in: Pawley A. Attenborough R. Golson J. Hide R. Papuan Pasts: Cultural, Linguistic and Biological Histories of Papuan-Speaking Peoples. Pacific Linguistics, ; 2005: 289-327
- Evolution and functional impact of rare coding variation from deep sequencing of human exomes.Science. 2012; 337: 64-69
- Robust and scalable inference of population history from hundreds of unphased whole genomes.Nat. Genet. 2017; 49: 303-309
- Evolutionary history and adaptation of a human pygmy population of Flores Island, Indonesia.Science. 2018; 361: 511-516
- Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals.Science. 2016; 352: 235-239
- Inferring human demographic histories of non-African populations from patterns of allele sharing.Am. J. Hum. Genet. 2017; 100: 766-772
- Detecting ancient admixture and estimating demographic parameters in multiple human populations.Mol. Biol. Evol. 2009; 26: 1823-1827
- Higher levels of neanderthal ancestry in East Asians than in Europeans.Genetics. 2013; 194: 199-209
- Isoform-specific regulation of Akt signaling by the endosomal protein WDFY2.J. Biol. Chem. 2010; 285: 14101-14108
- Estimating F-statistics for the analysis of population structure.Evolution. 1984; 38: 1358-1370
- Elucidating the origin of HLA-B∗73 allelic lineage: Did modern humans benefit by archaic introgression?.Immunogenetics. 2017; 69: 63-67
- Gamma-ray spectrometric dating of late Homo erectus skulls from Ngandong and Sambungmacan, Central Java, Indonesia.J. Hum. Evol. 2008; 55: 274-277
Article Info
Publication History
Published: April 11, 2019
Accepted:
February 21,
2019
Received in revised form:
January 7,
2019
Received:
September 5,
2018
Publication stage
In Press Corrected ProofIdentification
DOI: https://doi.org/10.1016/j.cell.2019.02.035Copyright
© 2019 Elsevier Inc.
ScienceDirect
Access this article on ScienceDirectFigures
- Graphical Abstract
- Figure 1Modern and Archaic Ancestry
- Figure S1Flowcharts of Analyses, Related to STAR Methods
- Figure 2Effect of Block Filtering on the Denisovan Signal and the Correlation of the Denisovan Signal with Papuan Ancestry, Related to Tables S2 and S3 and STAR Methods
- Figure 3Mismatch Distributions by Block Size
- Figure S2Assessing the Mismatch Distribution in East ISEA and Resolution to Distinguish Mismatch Peaks Using Different Block Lengths, Related to Figure 3 and STAR Methods
- Figure 4Multiple Denisovan Ancestries in Papuans
- Figure S3Mismatch Distributions Using Different Block Sets, Related to Figure 3 and STAR Methods
- Figure S4Using Simulations to Assess the Accuracy of Introgression Time Inference Based on Exponential Fitting of Block Length Distributions, and Example Fits, Related to Figures 4 and 5 and STAR Methods
- Figure 5Geographic Patterns of D1 and D2 Ancestry
- Figure S5Simulation Model Schematic and Mismatch Results, Related to Figure 4 and STAR Methods
- Figure S6Using Heterozygosity and Fst to Inform Models of Mainland Papuan/Baining Drift, Related to Figure 5 and STAR Methods
- Figure 6Manhattan Plot of High-Frequency High-Confidence Denisovan Blocks (>20 kb) across Chromosomes
- Figure S7Regional Distribution of S∗ and Residual S∗, and Mutation Motif Characteristics of Residual S∗ Windows, Related to Figure 7 and STAR Methods
- Figure 7Correlations of Papuan Ancestry with Archaic and S∗ Components
Related Articles
- Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan AdmixtureBrowning et al.CellMarch 15, 2018Open Archive
- Population Turnover in Remote Oceania Shortly after Initial SettlementLipson et al.Current BiologyFebruary 28, 2018Open Archive
- 40,000-Year-Old Individual from Asia Provides Insight into Early Population Structure in EurasiaYang et al.Current BiologyOctober 12, 2017Open Archive
- Human Genetics: Busy Subway Networks in Remote Oceania?Bergström et al.Current BiologyMay 07, 2018
- Reconstructing the Deep Population History of Central and South AmericaPosth et al.CellNovember 08, 2018
Nenhum comentário:
Postar um comentário
Observação: somente um membro deste blog pode postar um comentário.