Basic Pandas Operations
Contents
Basic Pandas Operations#
Why do we use Pandas?#
Pandas is a powerful Python library that supports creation, storage, and manipulation of data frames. Aa data-frame as a 2-dimensional data-structure – basically a spreadsheet – with rows and columns. It’s probably the most common way that we organize our data, whether experimental, survey, whatever… It’s basically the bread and butter of data-analysis.
Pandas has features such as handling missing data, cleaning up the data, and it also supports multiple file formats including CSV, Excel, SQL, etc.
In this activity, we will learn some of the basic oprations to get you started with Pandas.
Storing data frames in a CSV file#
First we will create a dictionary containing the data. Then we can convert the dictionary to a pandas data frame and store it as a csv file. In real life you will never be hard-coding data like this. This is just a demonstration.
import numpy as np
import pandas as pd
name_dict = {
'Name':['Adam','Becky','Charlie','Daniel','Emily','Frank','Greta','Helen','Ian','Jack','Klaus','Lucy'],
'Class ID':list(range(1,13)),
'Age':[int(i) for i in (np.round(np.random.uniform(18,30,12),0))],
'Score':[int(i) for i in (np.round(np.random.uniform(60,100,12),0))]
}
df = pd.DataFrame(name_dict)
df.to_csv('ClassList.csv',index=False)
Loading in data from a CSV file#
After loading in the data, we can do some quick checks. First, let’s check the column names:
df = pd.read_csv('ClassList.csv')
df.columns
pandas.core.indexes.base.Index
Pandas implements its own data types, but you can convert them to familiar things like lists (Basically, if it looks like it’s a list, you can convert it to a list. You’ll lose some functionality when you do it, but often it’s functionality you don’t need.).
print(type(df.columns))
col_names = list(df.columns)
print(col_names)
<class 'pandas.core.indexes.base.Index'>
['Name', 'Class ID', 'Age', 'Score']
Next, let’s look at the head
(first few rows), and tail
(the last few rows). This is always a good idea to make sure that everything is being read in as we expect.
df.head(10) #the first 5 rows by default. If you want more, insert the number as an argument to head.
Name | Class ID | Age | Score | |
---|---|---|---|---|
0 | Adam | 1 | 29 | 72 |
1 | Becky | 2 | 29 | 90 |
2 | Charlie | 3 | 20 | 81 |
3 | Daniel | 4 | 23 | 65 |
4 | Emily | 5 | 26 | 75 |
5 | Frank | 6 | 29 | 62 |
6 | Greta | 7 | 30 | 94 |
7 | Helen | 8 | 28 | 61 |
8 | Ian | 9 | 20 | 77 |
9 | Jack | 10 | 27 | 93 |
df.tail() # This will give you the last 5 rows of data
Name | Class ID | Age | Score | |
---|---|---|---|---|
7 | Helen | 8 | 28 | 61 |
8 | Ian | 9 | 20 | 77 |
9 | Jack | 10 | 27 | 93 |
10 | Klaus | 11 | 23 | 75 |
11 | Lucy | 12 | 20 | 73 |
Accessing a particular row, column, or cell#
In some cases, we want to access a particular row of the data.
df.loc[0,:] # Accessing the first row of data, the first value points to the row(s) and the second value points to the column(s)
Name Adam
Class ID 1
Age 28
Score 63
Name: 0, dtype: object
We can select a particular column by its name
df.Score # Accessing the Score column
0 63
1 72
2 82
3 82
4 71
5 82
6 84
7 97
8 63
9 98
10 94
11 99
Name: Score, dtype: int64
Or you can select a column by its index
df.iloc[:,2] # This will also access the Score column
0 28
1 27
2 28
3 22
4 24
5 21
6 26
7 27
8 23
9 26
10 29
11 27
Name: Age, dtype: int64
It’s also possible to select a particular cell
df.iloc[0]['Score'] # This will access the first row ([0]) of the Score column
63
Iterate through rows and access values in that row#
In some cases, we need to iterate through all the rows, for example when we are using a pandas data-frame as a trial list and creating experimental trials based on that data.
# Printing everything of each row
for row in df.iterrows():
print(row)
(0, Name Adam
Class ID 1
Age 28
Score 63
Name: 0, dtype: object)
(1, Name Becky
Class ID 2
Age 27
Score 72
Name: 1, dtype: object)
(2, Name Charlie
Class ID 3
Age 28
Score 82
Name: 2, dtype: object)
(3, Name Daniel
Class ID 4
Age 22
Score 82
Name: 3, dtype: object)
(4, Name Emily
Class ID 5
Age 24
Score 71
Name: 4, dtype: object)
(5, Name Frank
Class ID 6
Age 21
Score 82
Name: 5, dtype: object)
(6, Name Greta
Class ID 7
Age 26
Score 84
Name: 6, dtype: object)
(7, Name Helen
Class ID 8
Age 27
Score 97
Name: 7, dtype: object)
(8, Name Ian
Class ID 9
Age 23
Score 63
Name: 8, dtype: object)
(9, Name Jack
Class ID 10
Age 26
Score 98
Name: 9, dtype: object)
(10, Name Klaus
Class ID 11
Age 29
Score 94
Name: 10, dtype: object)
(11, Name Lucy
Class ID 12
Age 27
Score 99
Name: 11, dtype: object)
# Iterate through all rows and select only the name and Score columns
for index,row in df.iterrows():
print(row['Name'], row['Score'])
Adam 72
Becky 90
Charlie 81
Daniel 65
Emily 75
Frank 62
Greta 94
Helen 61
Ian 77
Jack 93
Klaus 75
Lucy 73
Changing data in a data-frame#
If what you want to do is change data in a data-frame, the preferred way is to not iterate through the data-frame, but change it procedurally. For example, let’s say we want to add a column to our data-frame that converts scores to letter-grades. Rather than iterating through it and executing various conditionals, we do this:
def convert_score_to_grade(score):
min_score_to_letter = {90:'A', 80:'B', 70:'C', 60:'D', 0:'F'}
for min_score,letter in min_score_to_letter.items():
if score>=min_score:
return letter
df['Letter'] = list(map(convert_score_to_grade, df['Score'])) #create a new column called Letter
#now let's check it
for index,row in df.iterrows():
print(row['Score'], row['Letter'])
72 C
90 A
81 B
65 D
75 C
62 D
94 A
61 D
77 C
93 A
75 C
73 C