Genetics & Genomics - Python Exercise 1
We have received genotyping data from ~50 human samples. We plan to perform a
GWAS analysis (even if it is low powered). To do so, we will need to remove
potential confounding factors from the data, such as sex, batch or ethnicity,
since they can hide the biological signal of interest. However, the clinical
data for these samples only contains information about sex, batch, etc. (data
not provided for this exercise) and no information about ethnicity. Your goal
is to use the 1000 Genomes Project dataset to infer the ethnicity of these
samples.
You are given a VCF file, "genotypes.vcf", containing a merged dataset of
~2500 samples from the 1000G project and the ~50 unknown samples from our
study. To facilitate your task, the VCF file is already filtered to contain a
restricted list of ~1500 SNPs that discriminate between the different ethnic
populations.
An additional file, "ethnicities.txt", downloaded from the 1000 Genomes
Project website, contains the list of 1000 Genomes samples and their known
ethnicity.
To infer ethnicities, we will simply perform a PCA of the genotyping data
(merged 1000G and our samples), hoping that all populations are sufficiently
separated so that our samples will be clearly distinguishable.
1. 1000G ethnicities
a. Start by loading the ethnicity file in Python. You can use the pd.read_csv() function from the Pandas package.
b. Plot the 3rd column as a bar plot to summarize the number of samples in
each subpopulation. Note that the bar plot needs a summary table giving the
number of observations (human samples) in each category (subpopulation). Sort
the bars by their values. Make sure that the category names are fully visible
(hint: look into ways of rotating the x-axis tick labels).
c. Also generate a PDF file containing the plot using the plt.savefig()
function. Specify the resolution (300 dpi should be good enough) and make the
figure background transparent.
d. Similarly, plot the 4th and 5th columns as bar plots and compare their
content. You can try playing with the plt.subplots() function parameters to
visualize the two plots side by side. Sort the bars by their values.
Note: For this exercise, use the matplotlib
library for visualization purposes.
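As a rough illustration of steps 1a to 1d, the sketch below builds the count
summary with value_counts(), draws sorted bars with rotated tick labels, and
saves the figure as a transparent PDF. It uses a tiny made-up table instead of
"ethnicities.txt": the column names, the tab separator, the sample IDs and the
output file name are all assumptions, not taken from the real file.

```python
import io
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for "ethnicities.txt" (columns and separator are assumptions).
toy_file = io.StringIO(
    "sample\tsex\tsubpopulation\tsuperpopulation\tsuperpopulation_name\n"
    "S01\tfemale\tGBR\tEUR\tEuropean\n"
    "S02\tmale\tGBR\tEUR\tEuropean\n"
    "S03\tfemale\tGBR\tEUR\tEuropean\n"
    "S04\tmale\tFIN\tEUR\tEuropean\n"
    "S05\tfemale\tYRI\tAFR\tAfrican\n"
    "S06\tmale\tYRI\tAFR\tAfrican\n"
)
ethnicities = pd.read_csv(toy_file, sep="\t")

# value_counts() returns the number of samples per category, largest first.
counts = ethnicities.iloc[:, 2].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index.tolist(), counts.values)
ax.set_ylabel("Number of samples")
ax.tick_params(axis="x", labelrotation=90)  # keep long category names readable
fig.tight_layout()

out_path = Path("subpopulations.pdf")  # hypothetical output name
fig.savefig(out_path, dpi=300, transparent=True)

# 4th and 5th columns side by side, each with its own sorted counts.
fig2, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax_i, col in zip(axes, ethnicities.columns[3:5]):
    col_counts = ethnicities[col].value_counts()
    ax_i.bar(col_counts.index.tolist(), col_counts.values)
    ax_i.set_title(col)
    ax_i.tick_params(axis="x", labelrotation=90)
fig2.tight_layout()
```

Selecting columns by position (iloc) mirrors the "3rd column" wording of the
exercise; with the real file, selecting by column name is usually easier to read.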
2. Reading the VCF file
Read the VCF file. Be careful: the first two lines are comments/header (use
the skiprows parameter) and the third line should be used to generate the
column names. Check that the output dataframe is properly generated. You can
use the pd.read_csv() function from the Pandas library.
Note: The rows correspond to the different variants (SNPs). The first 9
columns are VCF annotations. The other columns are the sample genotypes.
LA0# are the ~50 unknown human samples and HG# and NA# are the ~2500 1000G
samples. For each individual, the two alleles (one per chromosome copy) are
separated by "|". A value of 0 corresponds to the reference allele (REF) and
1 to the alternative allele (ALT).
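A minimal sketch of the reading step, assuming the file is tab-separated as
VCF files normally are. The in-memory text below is a made-up miniature with
the same layout (two comment lines, then the column header, then one row per
SNP); the sample names and genotypes are invented for illustration.

```python
import io

import pandas as pd

# Made-up miniature VCF: 2 comment lines, 1 header line, 2 SNP rows.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "##source=toy_example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG0001\tNA0001\tLA001\n"
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1\n"
    "1\t2000\trs2\tC\tT\t.\tPASS\t.\tGT\t1|0\t0|0\t0|1\n"
)

# skiprows=2 drops the two comment lines; the next line becomes the header.
vcf = pd.read_csv(toy_vcf, sep="\t", skiprows=2)

# Quick sanity checks on the result.
print(vcf.shape)                  # rows = SNPs, columns = 9 annotations + samples
print(vcf.columns[:9].tolist())   # the VCF annotation columns
print(vcf.columns[9:].tolist())   # the sample IDs
```

With the real file, replace toy_vcf by the path "genotypes.vcf" and check that
the number of sample columns is in the expected range.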
3. Merge the ethnicity and VCF datasets
a. Check which samples from the VCF are also
present in the ethnicity file, and which ones are not (the "Unknown" samples
that need to be predicted).
b. For the samples with missing annotation, set their ethnicities to
"Unknown". Merge the ethnicities dataframe with the dataframe of samples that
have missing annotations.
c. Generate a pie chart (pie() function) of the final ethnicities (including
the "Unknown" samples). As with the bar plot, the pie() function needs a
table indicating the number of observations per category.
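One possible way to approach steps 3a to 3c is sketched below on made-up
data. The column names "sample" and "superpopulation" and the sample IDs are
assumptions; in the exercise, the sample IDs would come from the VCF column
names after the first 9 columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up annotation table and VCF sample IDs (names are assumptions).
ethnicities = pd.DataFrame(
    {
        "sample": ["HG0001", "HG0002", "NA0001"],
        "superpopulation": ["EUR", "EUR", "AFR"],
    }
)
vcf_samples = pd.Index(["HG0001", "HG0002", "NA0001", "LA001", "LA002"])

# 3a. Which VCF samples have an annotation, and which do not?
is_annotated = vcf_samples.isin(ethnicities["sample"])
print("Annotated:", vcf_samples[is_annotated].tolist())
print("Missing:  ", vcf_samples[~is_annotated].tolist())

# 3b. Build a table for the missing samples and stack it under the known ones.
unknown = pd.DataFrame(
    {"sample": vcf_samples[~is_annotated], "superpopulation": "Unknown"}
)
annotated = ethnicities[ethnicities["sample"].isin(vcf_samples)]
all_samples = pd.concat([annotated, unknown], ignore_index=True)

# 3c. pie() needs one count per category, just like the bar plot.
counts = all_samples["superpopulation"].value_counts()
fig, ax = plt.subplots()
ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.0f%%")
ax.set_title("Samples per group")
```

pd.concat is used here because the two tables share the same columns; a
pd.merge on the sample ID with a later fillna("Unknown") would be another option.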
4. Prepare the data for the PCA
a. Create a matrix called df_for_pca,
from the VCF file, that contains only the genotyping data (without the first 9
columns).
b. Then, transform this matrix into a numerical matrix containing 0 for
homozygous reference values, 1 for heterozygous, and 2 for homozygous
alternative. Remember that for the PCA to be performed on the samples (and
not the SNPs), the rows of the matrix should be the samples (and thus the
columns are the SNPs). Be careful with the data type of the values: it needs
to be numerical so that the PCA can run without error.
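The encoding in step 4 can be sketched as follows, assuming that every
genotype is phased and biallelic (only "0|0", "0|1", "1|0" or "1|1") with no
missing calls. The small table is made up; only the name df_for_pca comes
from the exercise.

```python
import pandas as pd

annotation_cols = ["#CHROM", "POS", "ID", "REF", "ALT",
                   "QUAL", "FILTER", "INFO", "FORMAT"]

# Made-up VCF table: 2 SNPs (rows) and 3 samples (last 3 columns).
vcf = pd.DataFrame(
    [
        ["1", 1000, "rs1", "A", "G", ".", "PASS", ".", "GT", "0|0", "0|1", "1|1"],
        ["1", 2000, "rs2", "C", "T", ".", "PASS", ".", "GT", "1|0", "0|0", "0|1"],
    ],
    columns=annotation_cols + ["HG0001", "NA0001", "LA001"],
)

# 4a. Keep only the genotype columns (drop the first 9 annotation columns).
genotypes = vcf.iloc[:, 9:]

# 4b. Count ALT alleles per genotype: 0 = hom. ref, 1 = het, 2 = hom. alt.
allele_count = {"0|0": 0, "0|1": 1, "1|0": 1, "1|1": 2}
df_for_pca = genotypes.apply(lambda col: col.map(allele_count))

# Transpose so that rows are samples and columns are SNPs, and force integers.
# astype(int) raises an error if any genotype was not in the mapping.
df_for_pca = df_for_pca.T.astype(int)
df_for_pca.columns = vcf["ID"]

print(df_for_pca)
print(df_for_pca.dtypes)
```

An explicit mapping makes unexpected genotype strings fail loudly instead of
being converted silently, which is a useful check on real data.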
5. Running PCA
a. Perform the PCA using the PCA class from the scikit-learn library.
b. Visualize the first two components of the PCA
and color samples by ethnicity. Are the different groups clearly
visible/separated?
c. Visualize the first three components of the PCA in 3D using the matplotlib
library, color samples by ethnicity and save the final figure as a PDF.
d. Find a way to highlight the "Unknown" samples more clearly and save the
final figure as a PDF.
e. Highlight the "Unknown" samples with different markers and different
sizes.
f. Generate an interactive PCA plot in 3D using
the Plotly package and the Scatter3d() function.
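The sketch below illustrates steps 5a to 5e on simulated genotypes, not on
the exercise data. The two populations "POP_A" and "POP_B", their allele
frequencies, the sample counts and the output file names are all invented for
illustration; the "Unknown" samples are deliberately drawn from POP_A. The
interactive Plotly plot (5f) is not covered here.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Simulate a samples x SNPs matrix of 0/1/2 values (all values are made up).
rng = np.random.default_rng(0)
n_snps = 50
freqs = {pop: rng.uniform(0.05, 0.95, n_snps) for pop in ["POP_A", "POP_B"]}
blocks, labels = [], []
for pop, freq in freqs.items():
    blocks.append(rng.binomial(2, freq, size=(30, n_snps)))
    labels += [pop] * 30
blocks.append(rng.binomial(2, freqs["POP_A"], size=(5, n_snps)))
labels += ["Unknown"] * 5
df_for_pca = pd.DataFrame(np.vstack(blocks))
labels = pd.Series(labels)

# 5a. Fit the PCA and project the samples on the first three components.
pca = PCA(n_components=3)
components = pca.fit_transform(df_for_pca.to_numpy())

def style_for(group):
    """Bigger black stars for "Unknown", small default-colored dots otherwise."""
    if group == "Unknown":
        return {"marker": "*", "s": 150, "color": "black"}
    return {"marker": "o", "s": 20}

# 5b, 5d, 5e. 2D plot: one scatter call per group gives one color per group.
fig, ax = plt.subplots()
for group in labels.unique():
    mask = (labels == group).to_numpy()
    ax.scatter(components[mask, 0], components[mask, 1],
               label=group, **style_for(group))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
pdf_2d = Path("pca_2d.pdf")  # hypothetical output name
fig.savefig(pdf_2d, dpi=300)

# 5c. 3D plot of the first three components.
fig3d = plt.figure()
ax3d = fig3d.add_subplot(projection="3d")
for group in labels.unique():
    mask = (labels == group).to_numpy()
    ax3d.scatter(components[mask, 0], components[mask, 1], components[mask, 2],
                 label=group, **style_for(group))
ax3d.set_xlabel("PC1")
ax3d.set_ylabel("PC2")
ax3d.set_zlabel("PC3")
ax3d.legend()
pdf_3d = Path("pca_3d.pdf")  # hypothetical output name
fig3d.savefig(pdf_3d, dpi=300)
```

Plotting the "Unknown" group last, with a larger marker and a fixed color,
keeps those few points visible on top of the much larger reference groups.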
6. Conclusion
What can you conclude about the ethnicities of the "Unknown" samples?