

Genetic Substructure Analysis via PCA

Completed · May 2021

Analysis of genetic substructure across global populations using genome-wide SNPs and principal component analysis on the 1000 Genomes Project dataset.

Bioinformatics · Python · PCA · Data Science


Overview

This project analyses genetic substructure using genome-wide SNPs from the 1000 Genomes Project. Principal component analysis (PCA) was applied to identify population clusters across the project's five superpopulations: African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS).

Methods

  • PLINK for quality control (QC) and PCA computation (first 30 principal components)
  • Interactive visualisation built with Plotly and Dash
  • 3D scatter plots, choropleth maps, and scree plots
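
As a rough illustration of how the PLINK output could feed the scree and 3D scatter plots listed above, here is a minimal Python sketch. It is not the project's actual code. It assumes the PLINK 1.9 `.eigenvec` layout (whitespace-separated `FID IID PC1 PC2 ...`, no header row) and a plain list of eigenvalues as found in `.eigenval`. The function names and the `labels` mapping from sample ID to superpopulation are hypothetical.

```python
def parse_eigenvec(text):
    """Parse PLINK 1.9-style .eigenvec text (assumed: FID IID PC1 PC2 ..., no header)."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and '#'-prefixed header lines
        fid, iid = fields[0], fields[1]
        pcs = [float(value) for value in fields[2:]]
        samples.append({"fid": fid, "iid": iid, "pcs": pcs})
    return samples


def scree_proportions(eigenvalues):
    """Share of variance per PC, relative to the retained PCs only.

    With only the top PCs available, this is not the share of the total
    genome-wide variance, just each PC's share among those computed.
    """
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def build_3d_scatter(samples, labels):
    """Build a Plotly 3D scatter of PC1-PC3, coloured by a per-sample label.

    `labels` is a hypothetical dict mapping sample ID (IID) to a
    superpopulation code such as "AFR" or "EUR".
    """
    import plotly.express as px  # imported lazily; only needed for plotting

    fig = px.scatter_3d(
        x=[s["pcs"][0] for s in samples],
        y=[s["pcs"][1] for s in samples],
        z=[s["pcs"][2] for s in samples],
        color=[labels.get(s["iid"], "unknown") for s in samples],
        hover_name=[s["iid"] for s in samples],
    )
    fig.update_layout(
        scene=dict(xaxis_title="PC1", yaxis_title="PC2", zaxis_title="PC3")
    )
    return fig
```

In practice the text would come from the files PLINK writes, and the values returned by `scree_proportions` can be plotted as a bar chart to give the scree plot. Keeping the parsing separate from the plotting means the same parsed samples can back several views in a Dash layout.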

Results

The superpopulations separate into clear clusters in PC1/PC2 space, with notable substructure visible within continental groups.
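
One simple way to put a number on cluster separation in PC1/PC2 space is to compare group centroids. The sketch below is an illustration only, not the analysis performed in this project. It assumes each sample is reduced to a `(PC1, PC2)` pair with a group label, and the function names are hypothetical. Centroid distance is a crude summary that ignores the spread within each cluster.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def centroids_by_group(points, groups):
    """Mean (PC1, PC2) per group; points and groups are parallel sequences."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (pc1, pc2), group in zip(points, groups):
        acc = sums[group]
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {g: (sx / n, sy / n) for g, (sx, sy, n) in sums.items()}


def centroid_distances(centroids):
    """Euclidean distance between every pair of group centroids."""
    return {
        (a, b): dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }
```

Large distances between centroids relative to the scatter within each group would be consistent with visually distinct clusters, while substructure inside a continental group would need a finer-grained label than the superpopulation code.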

Current Explorer

Open Genomics Explorer →

This work has since been extended into a broader public explorer. The live app is currently hosted on Railway, and that is the version linked from this portfolio until a custom subdomain is ready.