A detailed look at East & North Eurasian Gene Flow to South & West Asians

Here I perform a detailed analysis of East and North Eurasian gene flow to South, South-Central, and West Asians by doing a one-to-one comparison of genomes sampled at about 404,000 Single Nucleotide Polymorphism (SNP) positions, using the program BEAGLE. To do this I utilize many genomes from northern and eastern Asia publicly available from the Estonian Biocentre.

The results from my analysis are very surprising in that they indicate that E & N Asian admixture in W/SC/S Asians is considerably underestimated by allele-frequency-based programs such as ADMIXTURE, and also by genealogical testing companies such as 23andMe, AncestryDNA, and FTDNA. However, the results here do seem somewhat consistent with results from the company GPSOrigins, and with results from 23andMe’s Admixture Date Estimator utility.

Inferring minor gene flow with commonly used programs such as ADMIXTURE is a haphazard affair, because the output depends heavily on the number and types of populations used to create the calculator, as well as on the samples that the calculator creator declares ancestral. Additionally, the number of components, “K”, also affects the results.

Using the program ADMIXTURE, I have personally created my share of calculators, some of which are publicly available at Gedmatch.com, under the GedrosiaDNA project; however, I have always been cognizant of the fact that the results from such calculators can’t be used to accurately infer gene flow from various populations. Here are a few good reasons:

The calculator’s component references are themselves admixed

Consider the following scenario where a calculator based on ADMIXTURE is used to determine E Asian admixture in an Arab and in a Kurd. ADMIXTURE computes maximum likelihood estimates of each individual’s ancestry proportions and of each component’s allele frequencies at the various loci (the Bayesian Markov Chain Monte Carlo approach belongs to the related program STRUCTURE). So for this example, assume that the Kurd individual scores 20% Central Asian and 40% Caucasus, whereas the Arab individual scores 5% Central Asian and 20% Caucasus.

The first problem is that some of the Central Asian and Caucasus references are themselves E Asian admixed. This results in an underestimation of the Kurd individual’s E Asian score as compared with the Arab’s, because more of the Kurd’s E Asian admixture will be “hidden” under the Central Asian or Caucasus components, of which the Kurd scores more than the Arab individual.
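To put rough numbers on this, here is a minimal sketch of the arithmetic. The Kurd and Arab component scores are the ones assumed in the scenario above; the E Asian fractions carried by the Central Asian and Caucasus references (30% and 5%) are purely hypothetical values chosen for illustration, not measurements.

```python
# Toy arithmetic: E Asian ancestry "hidden" inside admixed reference components.
# The reference fractions below are HYPOTHETICAL, chosen only for illustration.
REF_EAST_ASIAN_FRACTION = {"Central Asian": 0.30, "Caucasus": 0.05}


def hidden_east_asian(scores, ref_fraction):
    """Sum over components of (subject's score) x (E Asian fraction of that component's references)."""
    return sum(score * ref_fraction.get(component, 0.0)
               for component, score in scores.items())


# Component scores taken from the scenario in the text.
kurd = {"Central Asian": 0.20, "Caucasus": 0.40}
arab = {"Central Asian": 0.05, "Caucasus": 0.20}

kurd_hidden = hidden_east_asian(kurd, REF_EAST_ASIAN_FRACTION)
arab_hidden = hidden_east_asian(arab, REF_EAST_ASIAN_FRACTION)

print(f"Kurd: {kurd_hidden:.1%} E Asian hidden in other components")
print(f"Arab: {arab_hidden:.1%} E Asian hidden in other components")
```

Under these assumed fractions, the Kurd has roughly three times as much E Asian ancestry masked by other components as the Arab, even though neither amount shows up in the reported E Asian score.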

The component references are not representative of the specific E Asian populations that have genetically contributed to the test subject

Suppose a test subject has gene flow from Mongolians or Yakuts, but an ADMIXTURE-based calculator uses Han Chinese as E Asian references. In this situation, the test subject’s E Asian admixture will most likely be underestimated, because the minor allele frequencies of the Han references at various loci differ from those of the Mongolians or Yakuts.

Inadequate reference sample sizes

Assume the test subject has the “A” variant at the position rs1234, and there are 10 E Asian references in the run, 7 of which also have the “A” minor allele at rs1234. Also assume that 8 out of the 10 Central Asian references in the ADMIXTURE run coincidentally also have the “A” minor allele at that locus. In this situation, the test subject’s “A” allele at that position will be assigned C Asian, and not E Asian.

By contrast, had there been more E Asian references in the run, say 100, it can turn out that 85 of them have the “A” allele at rs1234, in which case rs1234 would be assigned E Asian, and not C Asian.
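The rs1234 example can be written out as a toy calculation. This is only a sketch of the simplified reasoning above: it assigns the allele to whichever component's references carry it at the highest frequency, which is a deliberate simplification and not how ADMIXTURE treats individual loci. The function name is made up for the illustration.

```python
def assign_component(allele_counts):
    """Assign an allele to the component whose references carry it most often.

    allele_counts maps component name -> (carriers, number_of_references).
    Returns (winning_component, per-component frequencies).
    """
    freqs = {comp: carriers / n for comp, (carriers, n) in allele_counts.items()}
    winner = max(freqs, key=freqs.get)
    return winner, freqs


# Small panel from the text: 7 of 10 E Asians vs 8 of 10 Central Asians carry "A".
small_winner, small_freqs = assign_component(
    {"E Asian": (7, 10), "C Asian": (8, 10)}
)

# Larger E Asian panel from the text: 85 of 100 carry "A".
large_winner, large_freqs = assign_component(
    {"E Asian": (85, 100), "C Asian": (8, 10)}
)

print("10 E Asian references: ", small_winner, small_freqs)
print("100 E Asian references:", large_winner, large_freqs)
```

With only 10 E Asian references the allele goes to the C Asian component (0.8 against 0.7); with 100 references the estimated E Asian frequency rises to 0.85 and the assignment flips.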

The commercial genealogical testing companies

To the best of my knowledge testing companies such as FTDNA use ADMIXTURE or STRUCTURE based programs, and thus the results obtained from them would be susceptible to the same issues described above.

23andMe on the other hand uses haplotype segment matching, which I have more fully described here

This method produces more accurate results than methods based on allele frequencies, because allele frequencies can get skewed for the reasons mentioned above. By contrast, haplotype segment matching for IBD is more reliable, because it is a one-to-one comparison of the test subject’s genome against population references. However, the problem with companies like 23andMe is that their algorithm minimizes minor admixture. I have described this in more detail in my post http://www.eurasiandna.com/2017/02/07/23andmes/

The other problem is that the test subject may not have gene flow from their specific population references. As mentioned above, a test subject may have gene flow from Mongolians, but their E Asian references may be Yakuts and Han. This leads to an underestimation of the subject’s E Asian admixture in this situation.

IBD comparisons between South & West Asian test subjects and East & North Asian references

Here I attempt something similar to what 23andMe does, but without minimizing minor ancestry. I also utilize many more E Asian and N Asian populations in the IBD comparisons, recognizing that not all W and S Asians have E Asian ancestry from the same 2 or 3 East Asian groups. I also include Lithuanians in the analysis to look at shared genetic drift between S & W Asians and NE Europeans due to gene flow from the Eurasian steppe. In the future, I will utilize many more NE European populations to better analyze this.

I utilized references with a minimum of about 70-75% East or North Eurasian admixture for all test subjects. Later, for some test subjects, I included East and North Asian references with a minimum of 55% East or North Eurasian admixture. The following N & E Asian populations were utilized in the IBD comparisons:

Fig 1 – A list of East & North Asian population references used in the analysis with percentages using ADMIXTURE.

I compare stretches of DNA for shared relatively RARE haplotypes between some S/SC/W Asian test subjects and various N & E Asians, and Lithuanians. However, to accomplish this, the sequenced genotypes, such as from 23andMe, or AncestryDNA, which contain unordered combinations of alleles for each position, have to be PHASED, so that we can compare haplotypes to determine which sequences of alleles were inherited together. In haplotype phasing we attempt to determine which allele belongs to which copy of the chromosome, or alternatively, which alleles appear together on the same chromosome.
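To see why phasing is needed, here is a small sketch that lists every pair of haplotypes consistent with a set of unordered genotypes. It is a brute-force illustration of the ambiguity only; programs like BEAGLE resolve it statistically rather than by enumeration, and the function name and the three-SNP genotype are made up for the example.

```python
from itertools import product


def possible_phasings(genotypes):
    """Enumerate distinct haplotype pairs consistent with unphased genotypes.

    genotypes: list of unordered allele pairs, one per SNP, e.g. [("A", "G"), ("C", "C")].
    Returns a set of frozensets, each holding the two haplotype strings of one phasing.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    phasings = set()
    # At each heterozygous site, either allele may sit on the first chromosome copy.
    for flips in product([False, True], repeat=len(het_sites)):
        flipped = dict(zip(het_sites, flips))
        hap1 = "".join(b if flipped.get(i, False) else a
                       for i, (a, b) in enumerate(genotypes))
        hap2 = "".join(a if flipped.get(i, False) else b
                       for i, (a, b) in enumerate(genotypes))
        phasings.add(frozenset([hap1, hap2]))
    return phasings


# Made-up genotype at three SNPs: heterozygous, homozygous, heterozygous.
example = [("A", "G"), ("C", "C"), ("T", "G")]
phasings = possible_phasings(example)
for pair in sorted(sorted(p) for p in phasings):
    print(" / ".join(pair))
```

Even with just two heterozygous positions there are two possible haplotype pairs, and the number doubles with every additional heterozygous SNP, which is why the genotype file alone cannot tell us which alleles were inherited together.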

Using PLINK, I managed to extract about 404,000 common denominator SNPs between the S/SC/W Asian test subjects, and the N & E Asian and Lithuanian genomes which are available from the Estonian Biocentre. I then used BEAGLE, which is a software program for imputing genotypes, inferring haplotype phase, and performing genetic association analysis. BEAGLE can phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios. BEAGLE can detect genetic regions that are shared identical-by-descent (IBD).
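Finding the common denominator SNPs amounts to intersecting the variant IDs of the datasets. The sketch below shows one way to do that, assuming each dataset has a PLINK .bim file (whose second column is the variant ID). The file names and the tiny demo contents are invented for the example, and this is not necessarily the exact procedure used for this analysis.

```python
import tempfile
from pathlib import Path


def read_bim_ids(bim_path):
    """Return the set of variant IDs (second column) in a PLINK .bim file."""
    ids = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                ids.add(fields[1])
    return ids


def common_snps(bim_paths):
    """Variant IDs present in every one of the given .bim files."""
    id_sets = [read_bim_ids(path) for path in bim_paths]
    return set.intersection(*id_sets) if id_sets else set()


# Demo with two tiny, made-up .bim files (chrom, ID, cM, bp, allele1, allele2).
with tempfile.TemporaryDirectory() as tmp:
    subjects = Path(tmp) / "test_subjects.bim"
    references = Path(tmp) / "references.bim"
    subjects.write_text("1\trs1\t0\t1000\tA\tG\n1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tG\tA\n")
    references.write_text("1\trs2\t0\t2000\tC\tT\n1\trs3\t0\t3000\tG\tA\n1\trs4\t0\t4000\tT\tC\n")

    shared = common_snps([subjects, references])

    # One ID per line, the layout PLINK expects for a SNP list file.
    extract_file = Path(tmp) / "common_snps.txt"
    extract_file.write_text("".join(f"{snp}\n" for snp in sorted(shared)))
    extract_lines = extract_file.read_text().split()

print(sorted(shared))
```

A list file like this could then be handed to PLINK's --extract option for each dataset, so that all of them are reduced to the same marker set before merging and phasing.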

Lines of the fastIBD output file report haplotypes shared by pairs of samples within the corresponding input file that have fastIBD score less than the threshold specified by the fastIBD threshold parameter. The fastIBD output file has five columns. The first two columns list the two sample identifiers for the shared haplotype described on each line. The next two columns list the starting (inclusive) and ending (exclusive) marker indices for the shared haplotype.

The first marker has index 0. The last column gives the fastIBD score for the shared haplotype. A fastIBD score < 10^-10 provides strong evidence that the shared haplotype is IBD if the length of the shared haplotype is ≥ 1 cM. I used a stricter threshold of a fastIBD score < 10^-12 in the analysis.
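As a rough sketch of how such an output file can be filtered, the code below parses the five columns described above and keeps segments with a fastIBD score below 10^-12 that span at least 200 markers (the minimum segment size I apply in the results). The sample names, marker indices, and scores in the demo are invented, and the class and function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    sample1: str
    sample2: str
    start: int    # starting marker index, inclusive (first marker is 0)
    end: int      # ending marker index, exclusive
    score: float  # fastIBD score; smaller means stronger evidence

    @property
    def n_markers(self) -> int:
        return self.end - self.start


def parse_fastibd(lines: Iterable[str]) -> List[Segment]:
    """Parse five-column fastIBD lines: id1, id2, start, end, score."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        s1, s2, start, end, score = fields
        segments.append(Segment(s1, s2, int(start), int(end), float(score)))
    return segments


def filter_segments(segments, max_score=1e-12, min_markers=200):
    """Keep segments with score < max_score spanning at least min_markers markers."""
    return [seg for seg in segments
            if seg.score < max_score and seg.n_markers >= min_markers]


# Invented example lines, for illustration only.
demo_lines = [
    "subjectA ref1 100 450 2.0e-14",  # 350 markers, strong score: kept
    "subjectA ref2 500 620 5.0e-15",  # only 120 markers: dropped
    "subjectB ref1 0 300 3.0e-11",    # score not below 1e-12: dropped
]
kept = filter_segments(parse_fastibd(demo_lines))
for seg in kept:
    print(seg.sample1, seg.sample2, seg.n_markers, seg.score)
```

Note that the length filter here counts markers rather than centimorgans, since the output file reports marker indices; converting to cM would need a genetic map.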

RESULTS

I first perform a sanity check using a Papuan group, recognizing that they should not have any IBD segment matches with any of my East or North Eurasian references, using my self-imposed thresholds of 200 SNPs and a fastIBD score < 10^-12. The results in fact indicate that, except for a couple of segment matches with the Bajo Indonesians, they don’t have any IBD matches with my references.

Fig 2- IBD results for test subject Papuan Koinanbe 2

An Assyrian and a Jordanian

No surprises here: the Jordanian test subject had the fewest IBD matches of all my test subjects. This is likely due to the very few SW Asian subjects in the run. He was also my only test subject who did not have any IBD matches with any East or North Asians.

Fig 3 – IBD matches for Jordanian 2

The Altaian IBD match for my Assyrian test subject Zephyrous likely represents gene flow from a Turkic population, perhaps Seljuks.

Fig 4 – IBD results for Zephyrous

Comparisons with more highly East & North Asian admixed samples

The following IBD comparisons are with samples which are more than 70% East & North Asian admixed. These populations were used to filter out some previous segment matches which may have been due to West Eurasian ancestry in East & North Asians.

Fig 5 – IBD segment matches with selected E & N Asians

Kurds

My Kurd test subjects included a couple of Kurmanji Kurds from North Iraq (Kurds C1 and C3) and a few Feyli Kurds from further south (Kurds F1, F4, F6, and F7).

Surprisingly, all of my Kurd subjects, especially the Feylis, had a large amount of shared IBD with East & North Asians. They generally had a larger number of total shared IBD segments with E & N Asians than Iranians, W Asians, and Indians did, except for a couple of Punjabis and Pashtuns. Also notable was that 4 of the 5 Kurd subjects had shared IBD segments with various Mongol samples. In fact, Kurd F4 had the largest shared segment of all my test subjects, a 951 SNP segment with Mongolian 3, and Kurd F7 shared a large 733 SNP segment with Mongolian 3! This suggests relatively intense historical mixing between Kurds and various Turkic tribes and descendants of Mongols.

From a historical perspective, the populations that contributed E & N Eurasian admixture to Kurds very likely include Scythians, Seljuks, Turkmens, and other Turkic groups from Central Asia and the NE Caucasus.

Also surprising is the large segment sharing between Lithuanians and some of the Kurd individuals (see below), especially in light of the fact that this sharing seemed absent between Lithuanians and some of my other W Asian samples, such as Iranians, Jordanians, Syrians, Armenians, and Georgians, but seemed present in some of my SC Asians, such as the Pakistani Pashtun, Sein. The Lithuanian-Kurd and Lithuanian-Pashtun IBD segment sharing can most likely be attributed to a gene pool from the Eurasian Steppe that contributed to Lithuanians, Kurds, and Pashtuns.

Another surprise is that, in the case of Kurds C1 and C3, the Lithuanian-Kurd IBD segments are larger than the IBD segments shared between these Kurds and most other W Asians, and in the case of the Sein sample, the Lithuanian-Pashtun IBD segment is larger than any of the IBD segments Sein shares with SC/W/S Asians. The large size of these segments indicates gene flow from the Eurasian Steppe to Kurds and Pashtuns much more recent than the late Bronze Age.

The fact that the IBD segment sharing between Lithuanians and Kurds is to the exclusion of segment sharing between Lithuanians and Armenians/Georgians is significant, because this indicates Eurasian Steppe gene flow to Kurds not via the Caucasus corridor, but rather via Central Asia.

Feyli Kurds (Iraq & Iran)

Fig 6 – IBD results for Feyli Kurd F4

Fig 7 – IBD results for Feyli Kurd F1

Kurmanji Kurds (Iraq)

Fig 8 – IBD results for Iraqi Kurd C3

Fig 9 – IBD results for Iraqi Kurd C1

Comparisons with more highly East & North Asian admixed samples

Fig 10 – IBD segment matches with selected E & N Asians

Pashtuns

Pashtun-Pakistan (Sein)

Overall, the Pakistani Pashtun, Sein, showed more E/N/S Eurasian IBD matches than most of my Afghan Pashtun samples. More surprisingly, his largest IBD segment was shared with a Lithuanian. I have discussed the implications of this in the previous section, titled “Kurds”.