Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users' capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data. "determining the artist with the greatest attendance during a particular month"
    • Data poses challenges. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Notes and Observations #1

  • Imports the Pandas library and gives it the alias pd.

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned

df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75

Notes and Observations #2

  • Uses Pandas to read in data from a JSON file named "grade.json" located in the "files" directory.
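As a follow-up, here is a minimal cleaning sketch. It rebuilds a few of the rows shown above so it runs without grade.json, and the cleaning rules themselves (valid grade levels 9-12, numeric Student IDs, no duplicate rows) are assumptions about what "clean" should mean for this data set.

import pandas as pd

# A few rows copied from the output above, typed as raw strings
df = pd.DataFrame({
    "Student ID": ["123", "246", "324", "313", "nil", "469", "469"],
    "Year in School": ["12", "10", "Junior", "20", "9", "11", "11"],
    "GPA": [3.57, 4.00, 4.75, 3.33, 2.80, 3.45, 3.45],
})

# Coerce text to numbers; non-numeric values ("Junior", "nil") become NaN
df["Student ID"] = pd.to_numeric(df["Student ID"], errors="coerce")
df["Year in School"] = pd.to_numeric(df["Year in School"], errors="coerce")

# Keep plausible grade levels, drop rows with no valid ID, drop duplicates
df = df[df["Year in School"].between(9, 12)]
df = df.dropna(subset=["Student ID"]).drop_duplicates()
print(df)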

Extracting Info

Take a look at some Pandas features that extract info from the dataset.

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

Notes and Observations #3

  • Two separate print statements:

    • 1: Uses double brackets to extract a single column (GPA) of data from the DataFrame.

      • Output: GPA column only.
    • 2: Uses double brackets to extract two columns of data from the DataFrame: "Student ID" and "GPA".

      • Printed using the "to_string()" method.

      • The "index=False" argument ensures that the DataFrame is printed without row numbers.
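One small aside that may help here: single brackets return a Series, while double brackets return a DataFrame. A quick sketch (the two-row frame below is made up for illustration):

import pandas as pd

df = pd.DataFrame({"Student ID": [123, 246], "GPA": [3.57, 4.00]})

print(type(df["GPA"]))    # single brackets -> Series (one column of values)
print(type(df[["GPA"]]))  # double brackets -> DataFrame with one column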

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

Notes and Observations #4

  • Two separate print statements

    • 1: Uses the "sort_values()" method to sort the DataFrame by the "GPA" column from lowest to highest.

      • Output: DataFrame that displays the rows sorted by GPA from lowest to highest.
    • 2: Uses the same "sort_values()" method

      • The "ascending" argument is set to "False"

      • Output: new DataFrame that displays the rows sorted by GPA from highest to lowest.
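sort_values() can also break ties with a second column. A minimal sketch, using a made-up three-row frame:

import pandas as pd

df = pd.DataFrame({"Student ID": [469, 123, 246], "GPA": [3.45, 3.57, 3.45]})

# Sort by GPA from highest to lowest, then break ties by Student ID
print(df.sort_values(by=["GPA", "Student ID"], ascending=[False, True]))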

DataFrame Selection or Filter

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

Notes and Observations #5

  • Uses Pandas to filter data from a DataFrame based on the condition df.GPA > 3.00.
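Filters can also be combined. A minimal sketch with made-up rows, assuming the same column names as above; note that each condition needs its own parentheses:

import pandas as pd

df = pd.DataFrame({"Student ID": [123, 246, 456], "GPA": [3.57, 4.00, 2.75]})

print(df[(df.GPA > 3.00) & (df["Student ID"] < 300)])  # both conditions true
print(df[(df.GPA > 3.90) | (df.GPA < 2.80)])           # either condition true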

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75

Notes and Observations #6

  • Uses Pandas to filter data from a DataFrame based on the maximum and minimum values of the "GPA" column.

  • Two separate print statements:

    • 1: Uses boolean indexing to select only the rows where the "GPA" column is equal to the maximum value of the "GPA" column.

      • max(): Used to find the maximum value in the "GPA" column.

      • Compared to the entire "GPA" column using the == operator.

      • Output: New DataFrame that contains only the rows where the "GPA" column is equal to the maximum value of the "GPA" column.

    • 2: Uses boolean indexing to select only the rows where the "GPA" column is equal to the minimum value of the "GPA" column.

      • min(): Used to find the minimum value in the "GPA" column.

      • Compared to the entire "GPA" column using the == operator.

      • Output: New DataFrame that contains only the rows where the "GPA" column is equal to the minimum value of the "GPA" column.
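There are also built-in shortcuts for the same idea. A small sketch with made-up rows:

import pandas as pd

df = pd.DataFrame({"Student ID": [123, 246, 456], "GPA": [3.57, 4.00, 2.75]})

print(df.nlargest(1, "GPA"))   # row with the single highest GPA
print(df.nsmallest(2, "GPA"))  # two rows with the lowest GPAs
print(df["GPA"].idxmax())      # index label where the maximum GPA occurs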

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
dict = {
  "calories": [420, 380, 390, 500],
  "duration": [50, 40, 45, 55]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(dict)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["day1", "day2", "day3", "day4"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
3       500        55
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45
day4       500        55

Notes and Observations #7

  • Imports the pandas library using the shorthand pd.

  • Creates a dictionary named dict that contains two lists:

    • calories

    • duration

  • Converts the dictionary dict into a Pandas DataFrame.

    • The resulting DataFrame is printed to the console using the print() function.
  • Creates another DataFrame by passing the dictionary dict along with a list of row labels (the index argument) to the pd.DataFrame() method.

    • DataFrame is printed to the console using the print() function.
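The same table can also be built row by row from a list of dictionaries, which is often how data arrives from an API. A minimal sketch:

import pandas as pd

rows = [
    {"calories": 420, "duration": 50},
    {"calories": 380, "duration": 40},
]
df = pd.DataFrame(rows, index=["day1", "day2"])  # each dict becomes one row
print(df)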

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64

Notes and Observations #8

  • Prints the string -------Examine Selected Rows--------- to the console using the print() function.

  • Selects and prints specific rows of the DataFrame df using the loc[] method and a list of strings.

  • Prints the string --------Examine Single Row----------- to the console using the print() function.

  • Selects and prints a single row of the DataFrame df using the loc[] method and a single string.
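A related sketch: loc selects rows by label, while iloc selects rows by position. The frame below mirrors the one above.

import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

print(df.loc["day2"])          # row whose label is "day2"
print(df.iloc[1])              # second row by position (same row here)
print(df.loc["day1":"day2"])   # label slices include both endpoints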

Pandas DataFrame Information

print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, day1 to day4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  4 non-null      int64
 1   duration  4 non-null      int64
dtypes: int64(2)
memory usage: 268.0+ bytes
None

Notes and Observations #9

  • Calls the info() method of the DataFrame df and prints the result to the console using the print() function. Note that info() prints its summary directly and returns None, which is why None appears at the end of the output.
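describe() is a natural companion to info(); it summarizes the numeric columns instead of the structure. A quick sketch on the same frame:

import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390, 500], "duration": [50, 40, 45, 55]},
    index=["day1", "day2", "day3", "day4"],
)
print(df.describe())  # count, mean, std, min, quartiles, max per column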

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5

Notes and Observations #10

  • Imports the pandas library using the shorthand pd.

  • Reads in a CSV file named data.csv using the pd.read_csv() method and stores the resulting DataFrame in a variable named df.

  • Sorts the DataFrame df in descending order based on the values in the Duration column using the sort_values() method.

  • Prints the string --Duration Top 10--------- to the console using the print() function.

  • Selects and prints the first 10 rows of the sorted DataFrame df using the head() method.

  • Prints the string --Duration Bottom 10------ to the console using the print() function.

  • Selects and prints the last 10 rows of the sorted DataFrame df using the tail() method.
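Beyond head() and tail(), the same frame supports quick aggregates. A minimal sketch, using a tiny made-up sample with the same column names so it runs without data.csv:

import pandas as pd

df = pd.DataFrame({
    "Duration": [300, 20, 60],
    "Pulse": [108, 106, 110],
    "Maxpulse": [143, 136, 145],
    "Calories": [1500.2, 110.4, 409.1],
})

print(df["Calories"].mean())            # average calories burned
print(df[["Duration", "Pulse"]].max())  # column-wise maximums
print(df[df["Duration"] >= 60])         # workouts lasting an hour or more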

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas DataFrames.

  • Data can be fetched and the resulting JSON can be placed into a DataFrame
  • Observe the output; it looks very similar to a database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    fetch = requests.get(url)
    json = fetch.json()

    # filter data for requirement
    df = pd.DataFrame(json['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352

Notes and Observations #11

  • Imports the pandas library using the shorthand pd and the requests library.

  • Defines a function named fetch() that retrieves data from a remote server.

    • Creates a variable named url that contains the URL of the remote server endpoint.

    • Sends a GET request to the remote server using the requests.get() method and stores the response in a variable named fetch.

    • Retrieves the JSON content from the response using the fetch.json() method and stores the resulting dictionary in a variable named json.

    • Creates a new DataFrame named df by passing the countries_stat key from the json dictionary to the pd.DataFrame() method.

    • Selects and prints the first 6 rows and the columns from country_name to deaths using the loc[] method and the print() function.

  • The fetch() function is then called after its definition to execute the code.
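One thing worth noting from the output: the cases and deaths values arrive as strings containing commas, so they need converting before any math. A minimal offline sketch, reusing a couple of rows from the printed output rather than calling the API:

import pandas as pd

df = pd.DataFrame({
    "country_name": ["USA", "India"],
    "cases": ["82,649,779", "43,057,545"],
    "deaths": ["1,018,316", "522,193"],
})

for col in ["cases", "deaths"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

print(df.sort_values(by="deaths", ascending=False))  # now sorts numerically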

CB 2.3

The Extracting Information from Data and Using Programs with Data quizzes were harder than the topics discussed in 2.2, but I was able to complete both with ease and only missed one problem. However, some questions were trickier than others. There were also a lot of high-quality notes.

Extracting Information from Data

Challenge in analyzing data from many counties

A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

A: A computer program cannot combine data from different files.

B: Different counties may organize data in different ways.

C: The number of counties is too large for the program to process.

D: The total number of rows of data is too large for the program to process.

Answer: Different counties may organize data in different ways. It will be a challenge to clean the data from the different counties to make the data uniform. The way pollution data is captured and organized may vary significantly from county to county.
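To make the cleaning challenge concrete, here is a small hypothetical sketch: two county files that report the same measurement under different column names, which have to be reconciled before they can be combined.

import pandas as pd

# Hypothetical county data sets with mismatched schemas
county_a = pd.DataFrame({"county": ["Adams"], "pm25": [7.1]})
county_b = pd.DataFrame({"County Name": ["Boone"], "PM2.5 (ug/m3)": [9.4]})

# Rename to a shared schema first, then combine
county_b = county_b.rename(columns={"County Name": "county",
                                    "PM2.5 (ug/m3)": "pm25"})
print(pd.concat([county_a, county_b], ignore_index=True))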

Using Programs with Data Quiz

Bookstore spreadsheet

A large spreadsheet contains the following information about the books at a bookstore. A sample portion of the spreadsheet is shown below.

[Sample spreadsheet image: each row lists a book's genre, number of copies in stock, and cost]

An employee wants to count the number of books that meet all of the following criteria.

  • Is a mystery book

  • Costs less than $10.00

  • Has at least one copy in stock

For a given row in the spreadsheet, suppose genre contains the genre as a string, num contains the number of copies in stock as a number, and cost contains the cost as a number. Which of the following expressions will evaluate to true if the book should be counted and evaluates to false otherwise?

A: (genre = "mystery") AND ((1 ≤ num) AND (cost < 10.00))

B: (genre = "mystery") AND ((1 < num) AND (cost < 10.00))

C: (genre = "mystery") OR ((1 ≤ num) OR (cost < 10.00))

D: (genre = "mystery") OR ((1 < num) OR (cost < 10.00))

Answer: (genre = "mystery") AND ((1 ≤ num) AND (cost < 10.00)). For a book to be counted, the value of genre must be "mystery" so that only mystery books are counted. The value of num must be greater than or equal to 1 so that only books that have at least one copy in stock are counted. The value of cost must be less than 10.00 so that only books that cost less than $10 are counted. All three conditions must be true, so the AND operator is used between them.
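The same compound condition can be written as a pandas filter. A small sketch with made-up rows, using the column names from the question:

import pandas as pd

books = pd.DataFrame({
    "genre": ["mystery", "mystery", "romance"],
    "num": [3, 0, 5],
    "cost": [8.99, 7.50, 6.25],
})

# All three conditions joined with &, matching choice A
counted = books[(books.genre == "mystery") & (books.num >= 1) & (books.cost < 10.00)]
print(len(counted))  # number of books that meet every criterion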

Clothing store sales information

The owner of a clothing store records the following information for each transaction made at the store during a 7-day period.

  • The date of the transaction

  • The method of payment used in the transaction

  • The number of items purchased in the transaction

  • The total amount of the transaction, in dollars

Customers can pay for purchases using cash, check, a debit card, or a credit card.

Using only the data collected during the 7-day period, which of the following statements is true?

A: The average amount spent per day during the 7-day period can be determined by sorting the data by the total transaction amount, then adding the 7 greatest amounts, and then dividing the sum by 7.

B: The method of payment that was used in the greatest number of transactions during the 7-day period can be determined by sorting the data by the method of payment, then adding the number of items purchased for each type of payment method, and then finding the maximum sum.

C: The most expensive item purchased on a given date can be determined by searching the data for all items purchased on the given date and then sorting the matching items by item price.

D: The total number of items purchased on a given date can be determined by searching the data for all transactions that occurred on the given date and then adding the number of items purchased for each matching transaction.

Answer: The total number of items purchased on a given date can be determined by searching the data for all transactions that occurred on the given date and then adding the number of items purchased for each matching transaction. For each transaction, the data includes the date of the transaction and the number of items purchased in the transaction. By searching the data to find all transactions that occurred on the given date, and then adding the number of items purchased in each of those transactions, the total number of items purchased on a given date can be determined.
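As a sketch of this approach in pandas (the transaction rows below are made up):

import pandas as pd

sales = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "payment": ["cash", "credit", "debit"],
    "items": [2, 5, 1],
    "total": [24.50, 80.00, 15.25],
})

# Search for the given date, then add up the items in the matching rows
print(sales[sales.date == "2023-01-01"]["items"].sum())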

Use of databases to display animal information

A wildlife preserve is developing an interactive exhibit for its guests. The exhibit is intended to allow guests to select the name of an animal on a touch screen and display various facts about the selected animal.

For example, if a guest selects the animal name “wolf,” the exhibit is intended to display the following information.

  • Classification: mammal

  • Skin type: fur

  • Thermoregulation: warm-blooded

  • Lifestyle: pack

  • Average life span: 10–12 years

  • Top speed: 75 kilometers/hour

The preserve has two databases of information available to use for the exhibit. The first database contains information for each animal’s name, classification, skin type, and thermoregulation. The second database contains information for each animal’s name, lifestyle, average life span, and top speed.

Which of the following explains how the two databases can be used to develop the interactive exhibit?

A: Only the first database is needed. It can be searched by animal name to find all the information to be displayed.

B: Only the second database is needed. It can be searched by animal name to find all the information to be displayed.

C: Both databases are needed. Each database can be searched by animal name to find all information to be displayed.

D: The two databases are not sufficient to display all the necessary information because the intended display information does not include the animal name.

Answer: Both databases are needed. Each database can be searched by animal name to find all information to be displayed. The information to be displayed comes from both databases. The animal name can be used to search the first database to find the classification, skin type, and thermoregulation information. The animal name can be used to search the second database to find the lifestyle, average life span, and top speed information.
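In pandas terms this is a merge on the shared animal name. A hypothetical sketch standing in for the two databases:

import pandas as pd

db1 = pd.DataFrame({
    "name": ["wolf"], "classification": ["mammal"],
    "skin_type": ["fur"], "thermoregulation": ["warm-blooded"],
})
db2 = pd.DataFrame({
    "name": ["wolf"], "lifestyle": ["pack"],
    "life_span": ["10-12 years"], "top_speed": ["75 km/h"],
})

exhibit = db1.merge(db2, on="name")       # one row per animal, all facts
print(exhibit[exhibit.name == "wolf"].T)  # look up the selected animal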

Album Dataset Stuff

import json

with open('files/album.json', 'r') as f:
    albums_data = json.load(f)

def get_album_info(album_name):
    album_id = albums_data["Album ID"].get(album_name)
    if album_id is None:
        return "Album not found"

    album_creator = albums_data["Creator"][album_name]

    album_year = albums_data["Year"][album_name]

    return f"Album ID: {album_id}\nCreator: {album_creator}\nYear of release: {album_year}"

print(get_album_info("Kid A"))
print(get_album_info("In Rainbows"))
print(get_album_info("Dummy"))
print(get_album_info("Graduation"))
Album ID: 2
Creator: Radiohead
Year of release: 2000
Album ID: 7
Creator: Radiohead
Year of release: 2007
Album ID: 5
Creator: Portishead
Year of release: 1994
Album not found
import json

with open('files/album.json', 'r') as f:
    albums_data = json.load(f)

def get_album_stats(albums_data):
    total_albums = len(albums_data["Album ID"])

    creator_counts = {}
    for album in albums_data["Creator"]:
        creator = albums_data["Creator"][album]
        if creator in creator_counts:
            creator_counts[creator] += 1
        else:
            creator_counts[creator] = 1

    years = albums_data["Year"].values()
    avg_year = sum(years) / len(years)

    oldest_album = min(albums_data["Year"], key=albums_data["Year"].get)
    newest_album = max(albums_data["Year"], key=albums_data["Year"].get)

    stats_str = (
        f"Total albums: {total_albums}\n"
        f"Albums per creator: {creator_counts}\n"
        f"Average year of release: {avg_year:.2f}\n"
        f"Oldest album: {oldest_album}\n"
        f"Newest album: {newest_album}\n"
    )
    return stats_str

album_stats = get_album_stats(albums_data)
print(album_stats)
Total albums: 9
Albums per creator: {'UNKLE': 1, 'Radiohead': 3, 'Noah Hawley, Jeff Russo': 1, '21 Pilots': 1, 'Portishead': 1, 'Atoms for Peace': 1, 'The Police': 1}
Average year of release: 2003.44
Oldest album: Synchronicity
Newest album: It's Always Blue
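Tying this back to the lesson, the same album dictionary can be dropped straight into a DataFrame, since its layout (column name -> {album name: value}) is exactly what pd.DataFrame expects. A sketch using only the albums confirmed by the output above:

import pandas as pd

albums_data = {
    "Album ID": {"Kid A": 2, "Dummy": 5, "In Rainbows": 7},
    "Creator": {"Kid A": "Radiohead", "Dummy": "Portishead", "In Rainbows": "Radiohead"},
    "Year": {"Kid A": 2000, "Dummy": 1994, "In Rainbows": 2007},
}

df = pd.DataFrame(albums_data)       # album names become the row index
print(df.sort_values(by="Year"))     # oldest to newest
print(df["Creator"].value_counts())  # albums per creator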