Overview of the important Python packages and modules

A few important Python libraries used in the course:

Potentially useful packages and modules:

The general practice is to import packages in the first cell of the notebook. If you are using Anaconda, most of the packages should already be installed. If that is not the case, you can install a package by executing conda install numpy or pip install numpy in the command line. Alternatively, if you are using Jupyter Notebook, you can install a package by executing !pip install numpy in a code cell.

It is common to use abbreviations for some packages, for example: np for numpy, pd for pandas, sns for seaborn, etc.
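For example, the conventional aliases mentioned above are typically imported like this (assuming the packages are installed):

```python
# Conventional import aliases used throughout the course
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```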

Always bear in mind that explicit syntax is better than implicit! Try not to go overboard with abbreviations, since they may reduce code readability.

1. On new data types (NumPy array and Pandas dataframe)

1.1. NumPy arrays

The main object in the NumPy library is the multidimensional array (ndarray). Such objects are represented as matrices, often containing numerical elements. Similar to lists, arrays are indexed.

A few important attributes of an ndarray are: .shape, .size, .ndim, etc. (to read more about NumPy, check out this link).

Let's look into a few examples:
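A minimal sketch of these attributes, using a small array built here for illustration:

```python
import numpy as np

# A one-dimensional array built from a Python list
a = np.array([1, 2, 3, 4, 5, 6])
print(a.shape)  # (6,)
print(a.ndim)   # 1

# Reshape into a 2x3 matrix; indexing works like nested lists
m = a.reshape(2, 3)
print(m.shape)  # (2, 3)
print(m.size)   # 6 -- total number of elements
print(m[1, 2])  # 6 -- row 1, column 2
```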

NumPy has a lot of functionality which we will not cover in this course. However, below we provide a few examples that are representative enough:

NumPy is useful for various linear algebra tasks, e.g., matrix multiplication, eigenvalue decomposition, etc. To know more about NumPy functionality, please see this link.
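As a sketch of the linear algebra functionality, here is matrix multiplication and an eigenvalue decomposition on a small matrix chosen for illustration:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
B = np.array([[1.0, 1.0],
              [1.0, 1.0]])

# Matrix multiplication with the @ operator (equivalent to np.matmul)
C = A @ B
print(C)

# Eigenvalue decomposition of a square matrix
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # A is diagonal, so its eigenvalues are 2 and 3
```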

1.2. Pandas dataframes

A Pandas dataframe is a two-dimensional data structure for tabular data. It stores the data and allows performing various operations on it, including cleaning and processing.

To know more about Pandas functionality, check out this link.
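A small sketch of a dataframe with iris-like columns (the column names and values here are invented for illustration):

```python
import pandas as pd

# A small DataFrame with iris-like columns (names chosen for illustration)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width":  [3.5, 3.0, 3.3],
    "species":      ["Setosa", "Setosa", "Virginica"],
})

print(df.shape)                   # (rows, columns)
print(df.columns.tolist())        # column names
print(df["sepal_length"].mean())  # column-wise statistics
# Basic cleaning: drop rows with missing values
print(df.dropna().shape)
```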

2. Reading files

2.1. Optional: Reading files line by line and writing to files

You can read files in Python with a combination of the open(), readline(), and close() functions in the following way:
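A self-contained sketch (the toy file is created first, since the course data file is not reproduced here):

```python
# Create a small file first so the example is self-contained
with open("toy.txt", "w") as f:
    f.write("first line\nsecond line\n")

# Reading line by line with open() / readline() / close()
f = open("toy.txt")
line = f.readline()
while line:              # readline() returns "" at end of file
    print(line.strip())  # .strip() removes the trailing '\n'
    line = f.readline()
f.close()
```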

Alternatively, you can use the .readlines() function, which returns a list of all lines contained in the file, where each line is represented as a string ending with \n.

Now, let's try writing only the data for Setosa species into the new file named setosa_dataset_write.csv:
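One way this can be sketched with plain file I/O. Since the original iris CSV is not included here, a toy stand-in file is created first; the output filename follows the text:

```python
# A toy stand-in for the iris CSV (the real course file is not reproduced here)
rows = [
    "sepal_length,sepal_width,species\n",
    "5.1,3.5,Setosa\n",
    "4.9,3.0,Setosa\n",
    "6.3,3.3,Virginica\n",
]
with open("iris_toy.csv", "w") as f:
    f.writelines(rows)

# Keep the header plus only the Setosa rows
with open("iris_toy.csv") as fin, open("setosa_dataset_write.csv", "w") as fout:
    header = fin.readline()
    fout.write(header)
    for line in fin:
        if "Setosa" in line:
            fout.write(line)
```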

2.2. Reading and writing to files with Pandas

One of the most common ways of reading tabular data is to use the read_csv() function from Pandas.

Let's say we want to extract only Setosa species and save the respective dataframe to the new file 'setosa_dataset.csv'. This can be achieved with the .to_csv() function in the following way:
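A sketch of the read/filter/write round trip; a toy iris-like CSV is written first so the example runs standalone (column names are assumptions):

```python
import pandas as pd

# Write a toy iris-like CSV so the example is self-contained
pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["Setosa", "Setosa", "Virginica"],
}).to_csv("iris_toy.csv", index=False)

# Read the tabular data back into a DataFrame
iris_df = pd.read_csv("iris_toy.csv")

# Select the Setosa rows and save them to a new file
setosa_df = iris_df[iris_df["species"] == "Setosa"]
setosa_df.to_csv("setosa_dataset.csv", index=False)
```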

As you may notice, reading and writing files with Pandas is easier compared to the open/write/close version described in subsection 2.1. However, when dealing with large files or files with non-dataframe-like content, the latter option might be more convenient.

Side note: NumPy arrays can be efficiently stored in the .npy and .npz formats with the np.save() and np.savez() functions, and read back afterwards with the np.load() function. Arbitrary Python objects such as dictionaries can be serialized to .pickle files with the standard pickle module.
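A minimal sketch of the .npy/.npz round trip (filenames invented for illustration):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Save a single array to .npy and load it back
np.save("arr.npy", arr)
restored = np.load("arr.npy")

# Save several named arrays at once to a single .npz archive
np.savez("arrays.npz", first=arr, second=arr * 2)
archive = np.load("arrays.npz")
print(archive["second"])
```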

3. Analyzing and visualizing the data

In this section, we will do some exploratory data analysis (aka EDA) and visualization for the iris dataset.

First, let's start with summary statistics for features of the dataset:
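A sketch of .describe() on a toy iris-like frame (values invented; the real analysis uses the course's iris_df):

```python
import pandas as pd

# Toy iris-like numeric data (values invented for illustration)
iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_length": [1.4, 1.5, 6.0, 5.1],
})

# .describe() reports count, mean, std, min, quartiles and max per column
summary = iris_df.describe()
print(summary)
```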

Exercise 1. Look into the output of the .describe() function. What feature has the largest mean value? What about standard deviation?


Let's check if there is any dependency between the lengths and widths of petals and sepals by using the .scatter() plot function from the matplotlib library. First, we create the figure object by defining the figure (fig) and axes (ax). Axes can be either a single object, or an array of Axes objects if several subplots are created. Check out the documentation to know more about creating figures with matplotlib.
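A sketch of the fig/ax pattern with a scatter plot, using toy measurements in place of the real iris columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Toy petal measurements (the real values come from the iris dataset)
petal_length = [1.4, 1.5, 4.7, 5.1, 6.0]
petal_width  = [0.2, 0.2, 1.4, 1.8, 2.5]

fig, ax = plt.subplots()  # a single Axes object
ax.scatter(petal_length, petal_width)
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
fig.savefig("scatter.png")
```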

There is a clear linear dependency between the two features. Let's get some numerical value for this dependency by calculating the Pearson correlation with the .pearsonr() function from the scipy.stats package:
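A sketch with pearsonr() on toy measurements (strongly correlated by construction):

```python
from scipy.stats import pearsonr

# Toy measurements; the real course data comes from the iris dataset
petal_length = [1.4, 1.5, 4.7, 5.1, 6.0]
petal_width  = [0.2, 0.2, 1.4, 1.8, 2.5]

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = pearsonr(petal_length, petal_width)
print(r, p_value)
```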

Side note: do not forget to check function parameters before using it!

Since the number of features and samples is not huge, we can look into all possible pairwise relationships between the features in the dataset with the .pairplot() function from the seaborn library:

Exercise 2.

  1. What are the plots visualized on the diagonal of the output figure called? What do they represent?
  2. Look into the parameters of the .pairplot() function and find a way to color the data based on the iris species (the last column of iris_df).

Let's look into the sepal and petal length distributions and compare them side by side. To visualize figures side by side, we will use the plt.subplots() function with parameters nrows=1 and ncols=2, indicating the number of rows and columns for the subplot grid respectively.

Be careful with the Axes! Note that since we are creating more than one plot, ax will be an array of Axes objects (e.g., ax1, ax2 = ax in our case; alternatively, individual axes can be extracted by indexing: ax[0], ax[1], as in the example below).
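A sketch of the side-by-side layout with toy data (the real course data comes from the iris dataset; colors chosen for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Toy measurements standing in for the iris columns
sepal_length = [5.1, 4.9, 6.3, 5.8, 5.0]
petal_length = [1.4, 1.5, 6.0, 5.1, 1.6]

# One row, two columns of subplots; ax is a NumPy array of two Axes
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

ax[0].hist(sepal_length, color="steelblue")
ax[0].set_title("sepal length")
ax[1].hist(petal_length, color="darkorange")
ax[1].set_title("petal length")

fig.tight_layout()
fig.savefig("histograms.png")
```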

Side note: The list of named colors available in matplotlib can be found via this link. Custom colors can be defined in RGB, RGBA formats or with hex identifiers, see this link for more details.

Exercise 3. Let's plot the two density plots (with the .hist() function and the density=True parameter) in one figure. Complete the code in the cell below: color the histograms with the same colors as in the example above, set labels, add a figure legend outside the axes, and specify the x and y axis labels and the figure title.


Exercise 4. We have already tried calculating correlation and the respective p-value with the scipy.stats package. Now we want to know whether there is a statistically significant difference in the means of sepal lengths for the Setosa and Virginica species. To do that, we will use the t-test for independent samples, which is already implemented in the scipy.stats package. Find the respective function, use it on the sepal lengths for Setosa and Virginica species, and print the value of the respective statistic and the p-value. What is your conclusion?


Finally, we will see how to group data with Pandas .groupby() function. To learn more about its functionality, see this link.

For example, imagine we want to know the average values for all features per iris species. To do that, we will provide the column name for aggregation by specifying it as the first argument of the .groupby() function. This will yield a pandas.core.groupby.generic.DataFrameGroupBy object containing the iris variety and the respective dataframe. Then, we can use the .mean() function to aggregate across all samples for each feature and output the result into one dataframe:
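A sketch of groupby-then-mean on a toy iris-like frame (values invented for illustration):

```python
import pandas as pd

iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "species": ["Setosa", "Setosa", "Virginica", "Virginica"],
})

# Group rows by species, then average every remaining column per group
mean_per_species = iris_df.groupby("species").mean()
print(mean_per_species)
```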

To have a better understanding of the .groupby() function content, let's iterate over it:
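Iteration over a groupby object can be sketched as follows (toy frame again):

```python
import pandas as pd

iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["Setosa", "Setosa", "Virginica"],
})

# Each iteration yields the group key and the sub-DataFrame for that group
for species, group_df in iris_df.groupby("species"):
    print(species, group_df.shape)
```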

Now you are ready for the genetics and genomics exercises!