{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Pandas Operations\n", "\n", "## Why do we use Pandas?\n", "\n", "Pandas is a powerful Python library that supports creation, storage, and manipulation of data frames. Aa data-frame as a 2-dimensional data-structure -- basically a spreadsheet -- with rows and columns. It's probably the most common way that we organize our data, whether experimental, survey, whatever... It's basically the bread and butter of data-analysis.\n", "\n", "Pandas has features such as handling missing data, cleaning up the data, and it also supports multiple file formats including CSV, Excel, SQL, etc.\n", "\n", "In this activity, we will learn some of the basic oprations to get you started with Pandas.\n", "\n", "## Storing data frames in a CSV file\n", "\n", "First we will create a dictionary containing the data. Then we can convert the dictionary to a pandas data frame and store it as a csv file. In real life you will never be hard-coding data like this. This is just a demonstration." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "name_dict = {\n", " 'Name':['Adam','Becky','Charlie','Daniel','Emily','Frank','Greta','Helen','Ian','Jack','Klaus','Lucy'],\n", " 'Class ID':list(range(1,13)),\n", " 'Age':[int(i) for i in (np.round(np.random.uniform(18,30,12),0))],\n", " 'Score':[int(i) for i in (np.round(np.random.uniform(60,100,12),0))]\n", "}\n", "\n", "df = pd.DataFrame(name_dict)\n", "\n", "df.to_csv('ClassList.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in data from a CSV file\n", "\n", "After loading in the data, we can do some quick checks. First, let's check the column names:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.indexes.base.Index" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('ClassList.csv')\n", "df.columns\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas implements its own data types, but you can convert them to familiar things like lists (Basically, if it looks like it's a list, you can convert it to a list. You'll lose some functionality when you do it, but often it's functionality you don't need.)." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "['Name', 'Class ID', 'Age', 'Score']\n" ] } ], "source": [ "print(type(df.columns))\n", "col_names = list(df.columns)\n", "print(col_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's look at the `head` (first few rows), and `tail` (the last few rows). This is always a good idea to make sure that everything is being read in as we expect." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameClass IDAgeScore
0Adam12972
1Becky22990
2Charlie32081
3Daniel42365
4Emily52675
5Frank62962
6Greta73094
7Helen82861
8Ian92077
9Jack102793
\n", "
" ], "text/plain": [ " Name Class ID Age Score\n", "0 Adam 1 29 72\n", "1 Becky 2 29 90\n", "2 Charlie 3 20 81\n", "3 Daniel 4 23 65\n", "4 Emily 5 26 75\n", "5 Frank 6 29 62\n", "6 Greta 7 30 94\n", "7 Helen 8 28 61\n", "8 Ian 9 20 77\n", "9 Jack 10 27 93" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10) #the first 5 rows by default. If you want more, insert the number as an argument to head." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameClass IDAgeScore
7Helen82861
8Ian92077
9Jack102793
10Klaus112375
11Lucy122073
\n", "
" ], "text/plain": [ " Name Class ID Age Score\n", "7 Helen 8 28 61\n", "8 Ian 9 20 77\n", "9 Jack 10 27 93\n", "10 Klaus 11 23 75\n", "11 Lucy 12 20 73" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail() # This will give you the last 5 rows of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing a particular row, column, or cell\n", "\n", "In some cases, we want to access a particular row of the data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name Adam\n", "Class ID 1\n", "Age 28\n", "Score 63\n", "Name: 0, dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[0,:] # Accessing the first row of data, the first value points to the row(s) and the second value points to the column(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can select a particular column by its name" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 63\n", "1 72\n", "2 82\n", "3 82\n", "4 71\n", "5 82\n", "6 84\n", "7 97\n", "8 63\n", "9 98\n", "10 94\n", "11 99\n", "Name: Score, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Score # Accessing the Score column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can select a column by its index" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 28\n", "1 27\n", "2 28\n", "3 22\n", "4 24\n", "5 21\n", "6 26\n", "7 27\n", "8 23\n", "9 26\n", "10 29\n", "11 27\n", "Name: Age, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:,2] # This will also access the Score column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also possible to select a particular cell" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "63" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]['Score'] # This will access the first row ([0]) of the Score column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterate through rows and access values in that row\n", "\n", "In some cases, we need to iterate through all the rows, for example when we are using a pandas data-frame as a trial list and creating experimental trials based on that data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, Name Adam\n", "Class ID 1\n", "Age 28\n", "Score 63\n", "Name: 0, dtype: object)\n", "(1, Name Becky\n", "Class ID 2\n", "Age 27\n", "Score 72\n", "Name: 1, dtype: object)\n", "(2, Name Charlie\n", "Class ID 3\n", "Age 28\n", "Score 82\n", "Name: 2, dtype: object)\n", "(3, Name Daniel\n", "Class ID 4\n", "Age 22\n", "Score 82\n", "Name: 3, dtype: object)\n", "(4, Name Emily\n", "Class ID 5\n", "Age 24\n", "Score 71\n", "Name: 4, dtype: object)\n", "(5, Name Frank\n", "Class ID 6\n", "Age 21\n", "Score 82\n", "Name: 5, dtype: object)\n", "(6, Name Greta\n", "Class ID 7\n", "Age 26\n", "Score 84\n", "Name: 6, dtype: object)\n", "(7, Name Helen\n", "Class ID 8\n", "Age 27\n", "Score 97\n", "Name: 7, dtype: object)\n", "(8, Name Ian\n", "Class ID 9\n", "Age 23\n", "Score 63\n", "Name: 8, dtype: object)\n", "(9, Name Jack\n", "Class ID 10\n", "Age 26\n", "Score 98\n", "Name: 9, dtype: object)\n", "(10, Name Klaus\n", "Class ID 11\n", "Age 29\n", "Score 94\n", "Name: 10, dtype: object)\n", "(11, Name Lucy\n", "Class ID 12\n", "Age 27\n", "Score 99\n", "Name: 11, dtype: object)\n" ] } ], "source": [ "# Printing everything of each row\n", "for row in df.iterrows():\n", " print(row)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adam 72\n", "Becky 90\n", "Charlie 81\n", "Daniel 65\n", "Emily 75\n", "Frank 62\n", "Greta 94\n", "Helen 61\n", "Ian 77\n", "Jack 93\n", "Klaus 75\n", "Lucy 73\n" ] } ], "source": [ "# Iterate through all rows and select only the name and Score columns\n", "for index,row in df.iterrows():\n", " print(row['Name'], row['Score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Changing data in a data-frame\n", "\n", "If what you want to do is **change** data in a data-frame, the preferred way is to **not** iterate through the data-frame, but change it procedurally. For example, let's say we want to add a column to our data-frame that converts scores to letter-grades. Rather than iterating through it and executing various conditionals, we do this:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "72 C\n", "90 A\n", "81 B\n", "65 D\n", "75 C\n", "62 D\n", "94 A\n", "61 D\n", "77 C\n", "93 A\n", "75 C\n", "73 C\n" ] } ], "source": [ "def convert_score_to_grade(score):\n", " min_score_to_letter = {90:'A', 80:'B', 70:'C', 60:'D', 0:'F'}\n", " for min_score,letter in min_score_to_letter.items():\n", " if score>=min_score:\n", " return letter\n", "\n", "df['Letter'] = list(map(convert_score_to_grade, df['Score'])) #create a new column called Letter\n", "\n", "#now let's check it\n", "for index,row in df.iterrows():\n", " print(row['Score'], row['Letter'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('psych750')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "57beecaf6908bae4f97de5e2dc8e8d0311fae5bc989593c172c307d13e31f6e4" } } }, "nbformat": 4, "nbformat_minor": 2 }