Nine out of ten samples were mistakenly swapped by the Orang-utan Genome Consortium
We first mapped the resequencing reads of the 10 Locke et al. whole genomes, plus those previously published from 27 congeners3,4, to the latest version of the (female) orangutan reference genome (ponAbe3; ref. 5), to which we had concatenated a recent assembly of the orangutan Y chromosome6. Using the idxstats function in samtools 1.14 (ref. 7), we inferred sex by comparing the ratios at which sequence reads mapped to the X and Y chromosomes. After two rounds of bootstrap-based recalibration, we then jointly called genotypes with GATK 4.1.8.0 (ref. 8), all as previously described9. We randomly sampled 1,000,000 biallelic autosomal SNPs with no missing genotypes and a minor allele frequency (MAF) ≥5%, pruned linked loci in PLINK10 (--indep-pairwise 50 10 0.1), and assigned populations in ADMIXTURE 1.39 (ref. 11) in supervised mode, using the reported provenance data for conspecifics3,4 (K = 3).
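The sex inference step can be illustrated with a minimal Python sketch. It assumes the standard four-column, tab-separated output of `samtools idxstats` (reference name, sequence length, mapped reads, unmapped reads). The contig names `chrX` and `chrY`, the length normalisation, and the decision threshold of 10 are illustrative assumptions, not values reported in this study.

```python
import csv
import io


def xy_ratio(idxstats_text, x_name="chrX", y_name="chrY"):
    """Return the length-normalised X:Y mapped-read ratio.

    `idxstats_text` is the raw text printed by `samtools idxstats`.
    The contig names are assumptions and depend on the reference used.
    """
    density = {}
    for row in csv.reader(io.StringIO(idxstats_text), delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        name, length, mapped = row[0], int(row[1]), int(row[2])
        if name in (x_name, y_name) and length > 0:
            density[name] = mapped / length  # mapped reads per base
    if x_name not in density or y_name not in density:
        raise ValueError("X or Y contig not found in idxstats output")
    if density[y_name] == 0:
        return float("inf")  # no reads on Y at all
    return density[x_name] / density[y_name]


def infer_sex(ratio, threshold=10.0):
    """Call 'female' when far more reads map to X than to Y.

    The threshold is a hypothetical cut-off; in practice it would be
    chosen by inspecting the ratio distribution across all samples.
    """
    return "female" if ratio > threshold else "male"
```

In practice the ratios of all samples would be plotted together, since males and females are expected to form two well-separated clusters and a fixed threshold is only a convenience.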
Additionally, we sampled and genotyped eight orangutans known to be first-, second-, or third-degree relatives of seven of those purportedly sequenced by Locke et al., using the Illumina iScan Multi-Ethnic Global Array, also as previously described12. The breeding of these seven, and therefore these known relationships, had been recorded at the time2 (Fig. 1). To convert microarray intensity data into variant calls, we mapped probe flank sequences to ponAbe3 (using --fasta-flank) and exported genotypes (--sam-flank) with the bcftools7 gtc2vcf plugin (https://github.com/freeseek/gtc2vcf), subject to the following filter settings: meanR_AB 0.3, meanTHETA_BB 0.7, devTHETA_AA > 0.025, devTHETA_AB ≥ 0.07, devTHETA_BB > 0.025 and GenTrain_Score 9. We merged the microarray genotypes with the whole-genome VCF, then LD-pruned and MAF-filtered the biallelic SNPs exactly as described above. To avoid the spurious inferences of relatedness that characterize highly structured data, we then ran ADMIXTURE's cross-validation procedure to infer the most appropriate K (testing values from 1 to 10) before estimating kinship coefficients (Φij) in REAP13.
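For orientation, the expected kinship coefficient for a relationship of degree d between non-inbred individuals is Φ = 2^-(d+1), i.e. 0.25, 0.125 and 0.0625 for first-, second- and third-degree relatives. The following Python sketch shows one way an observed Φij could be compared with a known relationship. The classification bounds are the geometric midpoints between successive expected values, a convention borrowed from the KING software; they are an assumption here and are not taken from this study or from REAP.

```python
def expected_phi(degree):
    """Expected kinship coefficient for a relationship of a given degree
    between non-inbred individuals: 0.25, 0.125, 0.0625 for degrees 1-3."""
    return 0.5 ** (degree + 1)


def classify_degree(phi):
    """Map an observed kinship coefficient to degree 1, 2 or 3.

    Bounds are 2**-(d + 1.5) <= phi < 2**-(d + 0.5), the KING-style
    midpoints (assumed for illustration). Returns None when phi falls
    outside all three intervals (unrelated, more distant, or duplicate).
    """
    for degree in (1, 2, 3):
        lower = 2 ** -(degree + 1.5)
        upper = 2 ** -(degree + 0.5)
        if lower <= phi < upper:
            return degree
    return None


def matches_known_relationship(phi, known_degree):
    """True when the observed coefficient falls in the interval expected
    for the recorded relationship."""
    return classify_degree(phi) == known_degree
```

For example, `matches_known_relationship(0.24, 1)` holds for a recorded parent-offspring pair, whereas a coefficient near zero for the same pair would suggest that one of the two samples is not who it is labelled as.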
We adopted a three-pronged approach to confirm the identity of each sample. First, identities were inferred by exclusion, comparing the inferred sex and species of each sample with those known and reported. Second, each identity was corroborated, where applicable, when the observed kinship coefficients matched those expected from known relationships. Third, we reviewed historical biomaterial records maintained by the Frozen Zoo, the original source of the samples, as well as laboratory information management system (LIMS) records maintained at Washington University in Saint Louis, where the samples were originally sequenced. An identity was assigned to a given sample only when all of these lines of evidence agreed.
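The decision logic of this approach can be sketched in Python as follows. This is one reading of the procedure, not the workflow actually used: the `Candidate` record, the function names and the animal names are hypothetical, and kinship evidence is treated as optional (`None`) because relatives were available for only some individuals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical record of an individual a sample could belong to."""
    name: str
    sex: str
    species: str


def exclude_by_sex_and_species(candidates, inferred_sex, inferred_species):
    """Step 1: keep only candidates whose recorded sex and species
    agree with those inferred from the sequence data."""
    return [c for c in candidates
            if c.sex == inferred_sex and c.species == inferred_species]


def assign_identity(candidates, inferred_sex, inferred_species,
                    kinship_match, records_match):
    """Return the single candidate supported by every line of evidence.

    kinship_match: name -> True / False, or absent when no relative
                   was genotyped (step 2 is then not applicable).
    records_match: name -> True when the archival and LIMS records
                   are consistent with this candidate (step 3).
    Returns None when zero or several candidates remain.
    """
    remaining = exclude_by_sex_and_species(
        candidates, inferred_sex, inferred_species)
    supported = [
        c.name for c in remaining
        if kinship_match.get(c.name) is not False
        and records_match.get(c.name, False)
    ]
    return supported[0] if len(supported) == 1 else None
```

Returning `None` whenever the evidence is ambiguous or contradictory mirrors the conservative rule stated above, under which an identity is assigned only if all factors agree.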