Basic Pandas Operations#

Why do we use Pandas?#

Pandas is a powerful Python library that supports creation, storage, and manipulation of data frames. Aa data-frame as a 2-dimensional data-structure – basically a spreadsheet – with rows and columns. It’s probably the most common way that we organize our data, whether experimental, survey, whatever… It’s basically the bread and butter of data-analysis.

Pandas has features such as handling missing data, cleaning up the data, and it also supports multiple file formats including CSV, Excel, SQL, etc.

In this activity, we will learn some of the basic oprations to get you started with Pandas.

Storing data frames in a CSV file#

First we will create a dictionary containing the data. Then we can convert the dictionary to a pandas data frame and store it as a csv file. In real life you will never be hard-coding data like this. This is just a demonstration.

import numpy as np
import pandas as pd

name_dict = {
    'Name':['Adam','Becky','Charlie','Daniel','Emily','Frank','Greta','Helen','Ian','Jack','Klaus','Lucy'],
    'Class ID':list(range(1,13)),
    'Age':[int(i) for i in (np.round(np.random.uniform(18,30,12),0))],
    'Score':[int(i) for i in (np.round(np.random.uniform(60,100,12),0))]
}

df = pd.DataFrame(name_dict)

df.to_csv('ClassList.csv',index=False)

Loading in data from a CSV file#

After loading in the data, we can do some quick checks. First, let’s check the column names:

df = pd.read_csv('ClassList.csv')
df.columns
pandas.core.indexes.base.Index

Pandas implements its own data types, but you can convert them to familiar things like lists (Basically, if it looks like it’s a list, you can convert it to a list. You’ll lose some functionality when you do it, but often it’s functionality you don’t need.).

print(type(df.columns))
col_names = list(df.columns)
print(col_names)
<class 'pandas.core.indexes.base.Index'>
['Name', 'Class ID', 'Age', 'Score']

Next, let’s look at the head (first few rows), and tail (the last few rows). This is always a good idea to make sure that everything is being read in as we expect.

df.head(10) #the first 5 rows by default. If you want more, insert the number as an argument to head.
Name Class ID Age Score
0 Adam 1 29 72
1 Becky 2 29 90
2 Charlie 3 20 81
3 Daniel 4 23 65
4 Emily 5 26 75
5 Frank 6 29 62
6 Greta 7 30 94
7 Helen 8 28 61
8 Ian 9 20 77
9 Jack 10 27 93
df.tail() # This will give you the last 5 rows of data
Name Class ID Age Score
7 Helen 8 28 61
8 Ian 9 20 77
9 Jack 10 27 93
10 Klaus 11 23 75
11 Lucy 12 20 73

Accessing a particular row, column, or cell#

In some cases, we want to access a particular row of the data.

df.loc[0,:] # Accessing the first row of data, the first value points to the row(s) and the second value points to the column(s)
Name        Adam
Class ID       1
Age           28
Score         63
Name: 0, dtype: object

We can select a particular column by its name

df.Score # Accessing the Score column
0     63
1     72
2     82
3     82
4     71
5     82
6     84
7     97
8     63
9     98
10    94
11    99
Name: Score, dtype: int64

Or you can select a column by its index

df.iloc[:,2] # This will also access the Score column
0     28
1     27
2     28
3     22
4     24
5     21
6     26
7     27
8     23
9     26
10    29
11    27
Name: Age, dtype: int64

It’s also possible to select a particular cell

df.iloc[0]['Score'] # This will access the first row ([0]) of the Score column
63

Iterate through rows and access values in that row#

In some cases, we need to iterate through all the rows, for example when we are using a pandas data-frame as a trial list and creating experimental trials based on that data.

# Printing everything of each row
for row in df.iterrows():
    print(row)
(0, Name        Adam
Class ID       1
Age           28
Score         63
Name: 0, dtype: object)
(1, Name        Becky
Class ID        2
Age            27
Score          72
Name: 1, dtype: object)
(2, Name        Charlie
Class ID          3
Age              28
Score            82
Name: 2, dtype: object)
(3, Name        Daniel
Class ID         4
Age             22
Score           82
Name: 3, dtype: object)
(4, Name        Emily
Class ID        5
Age            24
Score          71
Name: 4, dtype: object)
(5, Name        Frank
Class ID        6
Age            21
Score          82
Name: 5, dtype: object)
(6, Name        Greta
Class ID        7
Age            26
Score          84
Name: 6, dtype: object)
(7, Name        Helen
Class ID        8
Age            27
Score          97
Name: 7, dtype: object)
(8, Name        Ian
Class ID      9
Age          23
Score        63
Name: 8, dtype: object)
(9, Name        Jack
Class ID      10
Age           26
Score         98
Name: 9, dtype: object)
(10, Name        Klaus
Class ID       11
Age            29
Score          94
Name: 10, dtype: object)
(11, Name        Lucy
Class ID      12
Age           27
Score         99
Name: 11, dtype: object)
# Iterate through all rows and select only the name and Score columns
for index,row in df.iterrows():
    print(row['Name'], row['Score'])
Adam 72
Becky 90
Charlie 81
Daniel 65
Emily 75
Frank 62
Greta 94
Helen 61
Ian 77
Jack 93
Klaus 75
Lucy 73

Changing data in a data-frame#

If what you want to do is change data in a data-frame, the preferred way is to not iterate through the data-frame, but change it procedurally. For example, let’s say we want to add a column to our data-frame that converts scores to letter-grades. Rather than iterating through it and executing various conditionals, we do this:

def convert_score_to_grade(score):
    min_score_to_letter = {90:'A', 80:'B', 70:'C', 60:'D', 0:'F'}
    for min_score,letter in min_score_to_letter.items():
        if score>=min_score:
            return letter

df['Letter'] = list(map(convert_score_to_grade, df['Score'])) #create a new column called Letter

#now let's check it
for index,row in df.iterrows():
    print(row['Score'], row['Letter'])
72 C
90 A
81 B
65 D
75 C
62 D
94 A
61 D
77 C
93 A
75 C
73 C