Genetic Substructure Analysis via PCA

Overview

This project investigates how principal component analysis (PCA) can reveal genetic substructure across global populations. Using genome-wide single-nucleotide polymorphisms (SNPs) from the 1000 Genomes Project, PCA separates 2,504 individuals from 26 populations into distinct ancestry clusters without any prior population labels.

The work began as a college bioinformatics project in 2021 and later grew into a public interactive explorer, which has since been archived.

Interactive Explorer (archived)

A Streamlit-based public explorer for this analysis was hosted at genomics.connorfaulkner.com through early 2026. The interactive deployment has since been retired; the underlying analysis code, datasets, and plots remain available in the project write-up below and in the associated GitHub repository.

Dataset

The analysis is built on the 1000 Genomes Project Phase 3 release, one of the most widely used reference panels in human genetics.

2,504 individuals drawn from 26 populations across 5 superpopulations:
AFR (African): YRI, LWK, GWD, MSL, ESN, ASW, ACB
AMR (Ad-Mixed American): MXL, PUR, CLM, PEL
EAS (East Asian): CHB, JPT, CHS, CDX, KHV
EUR (European): CEU, TSI, FIN, GBR, IBS
SAS (South Asian): GIH, PJL, BEB, STU, ITU
Genome-wide SNP genotypes provide the input matrix for dimensionality reduction
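The population lists above can be captured as a small lookup table, which is handy when colouring plots or checking cluster assignments. This is a minimal sketch; the names SUPERPOPULATIONS, POP_TO_SUPERPOP, and superpopulation_of are illustrative and not taken from the project code.

```python
# Superpopulation code -> population codes, as listed in the Dataset section.
SUPERPOPULATIONS = {
    "AFR": ["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"],
    "AMR": ["MXL", "PUR", "CLM", "PEL"],
    "EAS": ["CHB", "JPT", "CHS", "CDX", "KHV"],
    "EUR": ["CEU", "TSI", "FIN", "GBR", "IBS"],
    "SAS": ["GIH", "PJL", "BEB", "STU", "ITU"],
}

# Reverse lookup: population code -> superpopulation code.
POP_TO_SUPERPOP = {
    pop: superpop
    for superpop, pops in SUPERPOPULATIONS.items()
    for pop in pops
}


def superpopulation_of(pop_code: str) -> str:
    """Return the superpopulation for a population code (case-insensitive)."""
    return POP_TO_SUPERPOP[pop_code.strip().upper()]
```

The reverse lookup holds 26 entries, matching the population count stated above.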

Methodology

Quality Control with PLINK

Raw genotype data were processed with PLINK, a widely used command-line toolset for genome-wide association and population-genetics analyses:

SNP filtering removed variants with high missingness, low minor allele frequency, or deviation from Hardy-Weinberg equilibrium
Sample filtering excluded individuals with excess heterozygosity or high genotype missingness
LD pruning thinned SNPs in linkage disequilibrium so that PCA captures population structure rather than local LD patterns
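The QC and pruning steps above might be wired together roughly as follows. This sketch only builds PLINK 1.9 argument lists; it does not run them. The function name, the file prefixes, and every threshold (0.05 missingness, 0.01 MAF, 1e-6 HWE, a 50/5/0.2 pruning window) are assumptions chosen for illustration, since the write-up does not state the values actually used. The heterozygosity filter is left out because it needs a separate report-and-remove step.

```python
from typing import List


def plink_qc_commands(
    bfile: str,
    out_prefix: str,
    geno: float = 0.05,    # assumed max per-SNP missingness
    mind: float = 0.05,    # assumed max per-sample missingness
    maf: float = 0.01,     # assumed min minor allele frequency
    hwe: float = 1e-6,     # assumed Hardy-Weinberg p-value cutoff
    ld_window: int = 50,   # assumed pruning window (variant count)
    ld_step: int = 5,      # assumed window step (variant count)
    ld_r2: float = 0.2,    # assumed pairwise r^2 threshold
    n_pcs: int = 30,       # 30 components, as stated in this write-up
) -> List[List[str]]:
    """Build (but do not run) PLINK 1.9 commands for QC, LD pruning, and PCA."""
    qc = f"{out_prefix}_qc"
    ld = f"{out_prefix}_ld"
    pca = f"{out_prefix}_pca"

    # Step 1: SNP and sample filters, written to a new binary fileset.
    filter_cmd = [
        "plink", "--bfile", bfile,
        "--geno", str(geno), "--mind", str(mind),
        "--maf", str(maf), "--hwe", str(hwe),
        "--make-bed", "--out", qc,
    ]
    # Step 2: LD pruning; PLINK writes the kept variant IDs to <ld>.prune.in.
    prune_cmd = [
        "plink", "--bfile", qc,
        "--indep-pairwise", str(ld_window), str(ld_step), str(ld_r2),
        "--out", ld,
    ]
    # Step 3: PCA restricted to the pruned variant set.
    pca_cmd = [
        "plink", "--bfile", qc,
        "--extract", f"{ld}.prune.in",
        "--pca", str(n_pcs),
        "--out", pca,
    ]
    return [filter_cmd, prune_cmd, pca_cmd]


commands = plink_qc_commands("raw_genotypes", "work")
```

Each list could then be passed to subprocess.run(cmd, check=True) on a machine where the plink binary is on the PATH.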

PCA Computation

PLINK computed 30 principal components from the pruned genotype matrix. Each component captures a decreasing share of the total genetic variance:

PC1 and PC2 explain more variance than any later components and together separate the five continental superpopulations
Higher PCs (PC3 through PC5) resolve finer sub-continental structure, such as differentiation within Africa or South Asia
A scree plot of eigenvalues confirms that the first few PCs carry the meaningful population signal while later components approach noise
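The values behind a scree plot can be computed directly from the eigenvalues. The sketch below assumes the eigenvalues are available as a plain sequence (for PLINK output, something like numpy.loadtxt on the .eigenval file would supply them); the function names and the toy eigenvalues are made up for illustration. Because PLINK writes only the requested top eigenvalues, the shares here are relative to the retained components, not to the total genetic variance.

```python
import numpy as np


def variance_shares(eigenvalues):
    """Share of the retained variance carried by each component."""
    vals = np.asarray(eigenvalues, dtype=float)
    if vals.ndim != 1 or vals.size == 0:
        raise ValueError("expected a non-empty 1-D sequence of eigenvalues")
    if np.any(vals < 0):
        raise ValueError("eigenvalues must be non-negative")
    return vals / vals.sum()


def cumulative_shares(eigenvalues):
    """Running total of the per-component shares."""
    return np.cumsum(variance_shares(eigenvalues))


# Toy eigenvalues for illustration only; these are not project results.
toy_eigenvalues = [8.0, 4.0, 2.0, 1.0, 1.0]
shares = variance_shares(toy_eigenvalues)
running = cumulative_shares(toy_eigenvalues)
```

Plotting shares against component index gives the scree curve; the point where it flattens is where later components stop adding much signal.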

K-Means Clustering

K-Means clustering (k = 5, matching the five superpopulations) was applied to the top PCs after standard scaling. The elbow method supported the choice of k, and cluster assignments aligned closely with the known superpopulation labels.
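A minimal sketch of this clustering step, using the scikit-learn classes named under Tools and Technologies. It runs on synthetic points rather than real PCs, and the helper names, the number of PCs, the seed, and the use of the adjusted Rand index to measure agreement with labels are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def cluster_pcs(pcs, k=5, seed=0):
    """Standard-scale the PCs, fit K-Means, and return (labels, inertia)."""
    scaled = StandardScaler().fit_transform(pcs)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
    return model.labels_, model.inertia_


def elbow_inertias(pcs, k_values, seed=0):
    """Inertia for each candidate k, the raw material for an elbow plot."""
    return {k: cluster_pcs(pcs, k, seed)[1] for k in k_values}


def make_toy_pcs(n_per_group=40, n_groups=5, seed=0):
    """Synthetic stand-in for PCs: one well-separated blob per group."""
    rng = np.random.default_rng(seed)
    centers = 20.0 * np.eye(n_groups)
    truth = np.repeat(np.arange(n_groups), n_per_group)
    pcs = centers[truth] + rng.normal(size=(truth.size, n_groups))
    return pcs, truth


toy_pcs, toy_truth = make_toy_pcs()
labels, _ = cluster_pcs(toy_pcs, k=5)
agreement = adjusted_rand_score(toy_truth, labels)  # 1.0 means identical partitions
inertias = elbow_inertias(toy_pcs, range(2, 9))
```

The adjusted Rand index ignores how clusters are numbered, so it is a reasonable way to compare K-Means output against superpopulation labels without first matching cluster IDs to label names.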

Results

Continental Separation

The scatter of PC1 versus PC2 resolves the five superpopulations into distinguishable clusters. African populations (AFR) sit at one extreme of PC1 and East Asian populations (EAS) at the other, with European (EUR), South Asian (SAS), and Ad-Mixed American (AMR) groups distributed between them.
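One way to check this kind of separation numerically is to join the PCA output to the sample labels and compare per-superpopulation centroids. The sketch below assumes the default PLINK 1.9 .eigenvec layout (whitespace-delimited, no header, FID and IID followed by the PCs) and a hypothetical label table with IID and superpop columns. The sample IDs and coordinates are invented for illustration, and the sign of a PC is arbitrary in any case.

```python
import io

import pandas as pd


def read_eigenvec(handle, n_pcs):
    """Parse a headerless, whitespace-delimited eigenvec table."""
    cols = ["FID", "IID"] + [f"PC{i}" for i in range(1, n_pcs + 1)]
    return pd.read_csv(handle, sep=r"\s+", header=None, names=cols)


def superpop_centroids(pcs, labels, components=("PC1", "PC2")):
    """Mean position of each superpopulation on the chosen components."""
    merged = pcs.merge(labels, on="IID", how="inner", validate="one_to_one")
    return merged.groupby("superpop")[list(components)].mean()


# Invented toy rows; not real 1000 Genomes samples or coordinates.
TOY_EIGENVEC = """\
S1 S1 -0.030 0.010
S2 S2 -0.028 0.012
S3 S3 0.020 -0.015
S4 S4 0.022 -0.013
"""
TOY_LABELS = pd.DataFrame(
    {"IID": ["S1", "S2", "S3", "S4"], "superpop": ["AFR", "AFR", "EAS", "EAS"]}
)

toy_pcs = read_eigenvec(io.StringIO(TOY_EIGENVEC), n_pcs=2)
centroids = superpop_centroids(toy_pcs, TOY_LABELS)
```

Centroids that sit far apart relative to the within-group spread are what a scatter with clearly separated clusters would look like in table form.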

Sub-Continental Structure

PC3 differentiates sub-populations within Africa (e.g., West African vs. African-Caribbean groups)
PC4 separates South Asian populations such as Gujarati Indian (GIH) from Bengali (BEB) and Sri Lankan Tamil (STU)
PC5 reveals structure within East Asia, distinguishing Han Chinese sub-groups from Kinh Vietnamese

Admixture Patterns

Ad-Mixed American populations (AMR) do not form a single tight cluster. Instead, they spread along a cline between European and Indigenous American ancestry, reflecting varying admixture histories:

PEL (Peruvian) clusters closer to the Indigenous American end
PUR (Puerto Rican) and CLM (Colombian) extend toward the European and African clusters
ASW and ACB (African ancestry in the Southwest US and African Caribbean in Barbados), which belong to the AFR superpopulation rather than AMR, fall between the African and European clusters, consistent with documented admixture proportions

UMAP Confirmation

Non-linear dimensionality reduction via UMAP on the top 10 PCs reveals additional sub-structure not visible in linear PCA, including a Gujarati split, Mexican-Spanish clustering, and a Puerto Rican admixture bridge.

Tools and Technologies

PLINK 1.9 for genotype QC, LD pruning, and PCA computation
Python with pandas, scikit-learn (KMeans, StandardScaler), and NumPy
Plotly and Streamlit for the interactive explorer (scatter plots, choropleth maps, scree plots, violin plots)
umap-learn for non-linear dimensionality reduction
Previously deployed on Railway at genomics.connorfaulkner.com (archived)