Pandas¶

1. What is Pandas¶

A Python library is like a collection of pre-built tools or functions that we can use to perform specific tasks without writing code from scratch. In the case of pandas, it's a popular library that helps us work with data in a structured way, making it easier to analyze and manipulate data tables.

In other words Pandas is the "excel" of Python.

To import the pandas library in Python, you can use the import statement.

import pandas as pd

import is the Python keyword used to import external libraries.
pandas is the name of the library you want to import.
as pd is an alias or shorthand that you can use to refer to the pandas library in your code.

It's a common convention to use pd as the alias when working with pandas, but you can choose any alias you prefer.

2. Dataframe¶

A DataFrame in pandas is like a super spreadsheet where you can store and work with data in rows and columns, making it easy to analyze and manipulate information.

2.1 Create df from zero¶

# Dictionary with the data
data = {
    'Model': ['Boeing 747', 'Airbus A320', 'Cessna 172', 'Embraer E190'],
    'Manufacturer': ['Boeing', 'Airbus', 'Cessna', 'Embraer'],
    'Passenger Capacity': [660, 220, 4, 114],
    'Max Speed (mph)': [570, 511, 131, 541]
}

# Dataframe generation
df_planes = df = pd.Dataframe()

display(df_planes)

output

          Model Manufacturer  Passenger Capacity  Max Speed (mph)
0    Boeing 747       Boeing                 660               570
1  Airbus A320       Airbus                 220               511
2    Cessna 172       Cessna                   4               131
3  Embraer E190      Embraer                 114               541

2.2 Create df from a file¶

Reminder

If the file is in the same folder of the code, you don't need to write the path, just the name of the file to opern it from pandas.

# Path to the CSV file
file_path = 'airports.csv'

# Read the CSV file and create a DataFrame
airports_df = pd.read_csv(file_path)

# Display the DataFrame
display(airports_df)

3. Exploring operations¶

In pandas, you can use operations like head, tail, info, describe, shape, or size to explore and understand your DataFrame quickly.

3.1 Display the head and tail¶

head: displays the first few rows of a DataFrame.
tail: shows the last few rows, providing a quick overview of the data.

display(df_planes.head(3))
print("-----")
display(df_planes.tail(2))

output

         Model Manufacturer  Passenger Capacity  Max Speed (mph)
0   Boeing 747       Boeing                 660              570
1  Airbus A320       Airbus                 220              511
2   Cessna 172       Cessna                   4              131
-----
          Model Manufacturer  Passenger Capacity  Max Speed (mph)
2    Cessna 172       Cessna                   4              131
3  Embraer E190      Embraer                 114              541

3.2 Get the size and shape¶

shape: returns the dimensions (rows, columns) of a DataFrame as a tuple.
size: returns the total number of elements (cells) in a DataFrame.

size = df_planes.size
shape = df_planes.shape

print(f"Size: {size}")
print(f"Shape: {shape}")

output

Size: 16
Shape: (4, 4)

3.3 Data info & describe¶

info(): provides summary information about a DataFrame, including the data types, non-null counts, and memory usage of each column.
describe(): generates descriptive statistics, like count, mean, min, max, and quartiles, for numeric columns in a DataFrame.

info = df_planes.info()
describe = df_planes.describe()

print(f"Info: {info}")
print("----------")
print(f"Describe: {describe}")

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Model               4 non-null      object
 1   Manufacturer        4 non-null      object
 2   Passenger Capacity  4 non-null      int64 
 3   Max Speed (mph)     4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
Info: None
----------
Describe:        Passenger Capacity  Max Speed (mph)
count            4.000000         4.000000
mean           249.500000       438.250000
std            287.524492       206.244798
min              4.000000       131.000000
25%             86.500000       416.000000
50%            167.000000       526.000000
75%            330.000000       548.250000
max            660.000000       570.000000

4. Extracting operations¶

These operations allow the user to retrieve specific columns or rows from a DataFrame for analysis and manipulation.

4.1 Extract a column¶

# Extract the 'Age' column
planes = df_planes['Model']

print(planes)

output

0      Boeing 747
1     Airbus A320
2      Cessna 172
3    Embraer E190
Name: Model, dtype: object

4.3 Extract multiple columns¶

# Selecting the columns
selected_columns = df_planes[['Model', 'Max Speed (mph)']]

# Printing the selected columns
print(selected_columns)

output

          Model  Max Speed (mph)
0    Boeing 747              570
1   Airbus A320              511
2    Cessna 172              131
3  Embraer E190              541

4.4 Extract a row by index¶

# Extracting a row by index using iloc
row_at_index_2 = df_planes.iloc[2]

# Printing the extracted row
print(row_at_index_2)

output

Model                 Cessna 172
Manufacturer              Cessna
Passenger Capacity             4
Max Speed (mph)              131
Name: 2, dtype: object

4.5 Extract a row by condition¶

Suppose you want to extract all rows where the speed is greater than 200 mph.

# Extracting rows with speed > 200 mph
df_fast_planes = df_planes[df_planes['Max Speed (mph)'] > 200]

# Printing the extracted rows
print(df_fast_planes)

output

          Model Manufacturer  Passenger Capacity  Max Speed (mph)
0    Boeing 747       Boeing                 660              570
1   Airbus A320       Airbus                 220              511
3  Embraer E190      Embraer                 114              541

5. Edit columns & rows¶

5.1 Add a new column¶

To add a new column to a pandas DataFrame, you can simply assign a list or a Series to a new column name. Remember to add a column with the same amount of rows of the dataframe.

# List of lengths in meters
lengths_in_meters = [70.6, 37.6, 8.2, 36.2]

# Add a new column named "Length (m)" to df with the length values
df_planes['Length'] = lengths_in_meters

# Print the updated DataFrame
print(df_planes)

output

          Model Manufacturer  Passenger Capacity  Max Speed (mph)      Length
0    Boeing 747       Boeing                 660              570        70.6
1   Airbus A320       Airbus                 220              511        37.6
2    Cessna 172       Cessna                   4              131         8.2
3  Embraer E190      Embraer                 114              541        36.2

5.2 Rename a column¶

# Rename the "Lenght" column to "Length (m)"
df_planes.rename(columns={'Length': 'Length (m)'}, inplace=True)

# Print the updated DataFrame
print(df_planes)

output

          Model Manufacturer  Passenger Capacity  Max Speed (mph)  Length (m)
0    Boeing 747       Boeing                 660              570        70.6
1   Airbus A320       Airbus                 220              511        37.6
2    Cessna 172       Cessna                   4              131         8.2
3  Embraer E190      Embraer                 114              541        36.2

6. Aggregate operations¶

Imagine you have a database with hacker attacks.

datetime	ip_source	location	type_of_attack	severity	duration_minutes
2023-04-10 08:15:00	192.168.1.100	Madrid	DDoS	High	45
2023-04-10 12:45:00	203.54.32.17	Los Angeles	Phishing	Medium	30
2023-04-11 14:30:00	104.23.45.67	London	SQL Injection	Low	60
2023-04-12 20:10:00	45.76.12.34	Singapore	Malware	High	90
2023-04-10 16:03:00	203.54.32.17	Los Angeles	Phishing	Medium	30

6.1 Unique values¶

unique_attack_types = df_attacks['type_of_attack'].unique()
print(unique_attack_types)

output

['DDoS' 'Phishing' 'SQL Injection' 'Malware']

6.2 Count unique values¶

unique_attack_count = df_attacks['type_of_attack'].nunique()
print(unique_attack_count)

output

6.3 Count elements in a column¶

element_count = df_attacks['datetime'].count()
print(element_count)

output

6.4 Math operators¶

Operator	Statement
Mean	df["column"].mean()
Median	df["column"].median()
Standar deviation	df["column"].std()
Max value	df["column"].max()
Min value	df["column"].min()

element_count = df_attacks['duration_minutes'].mean()
print(element_count)

output

51.0

7. Groping Operations¶

Grouping operations in pandas involve splitting a DataFrame into groups based on one or more columns, applying a function to each group, and then combining the results into a new DataFrame.

.reset_index()

Using reset_index() after grouping operations is important to reset the index labels and make the resulting DataFrame more structured and easier to work with.

7.1 Group rows counting¶

# Grouping by 'type_of_attack' and counting elements in each group
df_types_of_attack = df_attacks.groupby('type_of_attack').count()

print(df_types_of_attack)

output

type_of_attack
DDoS             1
Malware          1
Phishing         2
SQL Injection    1
Name: ip_source, dtype: int64

7.2 Group rows operating¶

mean_by_severity = df_attacks.groupby('severity')['duration_minutes'].mean()

print(mean_by_severity)

output

severity
High      67.5
Low       60.0
Medium    30.0
Name: duration_minutes, dtype: float64

8. Other operations¶

8.1 Ordering rows of a dataframe¶

# Sorting the DataFrame by 'Location'
df_sorted = df_attacks.sort_values(by='location')

print(df_sorted[['ip_source', 'location']])

output

       ip_source     location
2   104.23.45.67       London
4   203.54.32.17  Los Angeles
1   203.54.32.17       Madrid
0  192.168.1.100     New York
3    45.76.12.34    Singapore

8.2 Joining two dataframes¶

data_attacks_update = {'datetime': ['2023-04-10 08:15:00', '2023-04-10 12:45:00', '2023-04-11 14:30:00'],
        'ip_source': ['192.168.1.100', '203.54.32.17', '104.23.45.67'],
        'location': ['New York', 'Madrid', 'London'],
        'type_of_attack': ['DDoS', 'Phishing', 'SQL Injection'],
        'severity': ['High', 'Medium', 'Low'],
        'duration_minutes': [45, 30, 60]}

df_update = pd.DataFrame(data_attacks_update)
df_total_atacks = pd.concat([df_attacks, df_update])
print(df_total_atacks)

output

    datetime            ip_source       location    type_of_attack  severity    duration_minutes
0   2023-04-10 08:15:00 192.168.1.100   New York    DDoS            High    45
1   2023-04-10 12:45:00 203.54.32.17    Madrid      Phishing        Medium  30
2   2023-04-11 14:30:00 104.23.45.67    London      SQL Injection   Low 60
3   2023-04-12 20:10:00 45.76.12.34     Singapore   Malware         High    90
4   2023-04-10 16:03:00 203.54.32.17    Los Angeles Phishing        Medium  30
0   2023-04-10 08:15:00 192.168.1.100   New York    DDoS            High    45
1   2023-04-10 12:45:00 203.54.32.17    Madrid      Phishing        Medium  30
2   2023-04-11 14:30:00 104.23.45.67    London      SQL Injection   Low 60

-> Start coding¶

You can do the exercises in Google Colab or in Jupyter Lab

Google Colab

Open the 12_introduction_to_pandas.ipynb notebook to start doing the exercises.

reminder

Remember to fork the exercise repository and save your solved notebooks in your forked repo.

Jupyter Lab

WIP