Skip to content

Pandas

1. What is Pandas

A Python library is like a collection of pre-built tools or functions that we can use to perform specific tasks without writing code from scratch. In the case of pandas, it's a popular library that helps us work with data in a structured way, making it easier to analyze and manipulate data tables.

In other words Pandas is the "excel" of Python.

To import the pandas library in Python, you can use the import statement.

import pandas as pd
  • import is the Python keyword used to import external libraries.
  • pandas is the name of the library you want to import.
  • as pd is an alias or shorthand that you can use to refer to the pandas library in your code.

It's a common convention to use pd as the alias when working with pandas, but you can choose any alias you prefer.

2. Dataframe

A DataFrame in pandas is like a super spreadsheet where you can store and work with data in rows and columns, making it easy to analyze and manipulate information.

2.1 Create df from zero

# Dictionary with the data
data = {
    'Model': ['Boeing 747', 'Airbus A320', 'Cessna 172', 'Embraer E190'],
    'Manufacturer': ['Boeing', 'Airbus', 'Cessna', 'Embraer'],
    'Passenger Capacity': [660, 220, 4, 114],
    'Max Speed (mph)': [570, 511, 131, 541]
}

# Dataframe generation
df_planes = df = pd.Dataframe()

display(df_planes)
output
          Model Manufacturer  Passenger Capacity  Max Speed (mph)
0    Boeing 747       Boeing                 660               570
1  Airbus A320       Airbus                 220               511
2    Cessna 172       Cessna                   4               131
3  Embraer E190      Embraer                 114               541

2.2 Create df from a file

Reminder

If the file is in the same folder of the code, you don't need to write the path, just the name of the file to opern it from pandas.

# Path to the CSV file
file_path = 'airports.csv'

# Read the CSV file and create a DataFrame
airports_df = pd.read_csv(file_path)

# Display the DataFrame
display(airports_df)

3. Exploring operations

In pandas, you can use operations like head, tail, info, describe, shape, or size to explore and understand your DataFrame quickly.

3.1 Display the head and tail

  • head: displays the first few rows of a DataFrame.
  • tail: shows the last few rows, providing a quick overview of the data.

display(df_planes.head(3))
print("-----")
display(df_planes.tail(2))
output
         Model Manufacturer  Passenger Capacity  Max Speed (mph)
0   Boeing 747       Boeing                 660              570
1  Airbus A320       Airbus                 220              511
2   Cessna 172       Cessna                   4              131
-----
          Model Manufacturer  Passenger Capacity  Max Speed (mph)
2    Cessna 172       Cessna                   4              131
3  Embraer E190      Embraer                 114              541

3.2 Get the size and shape

  • shape: returns the dimensions (rows, columns) of a DataFrame as a tuple.
  • size: returns the total number of elements (cells) in a DataFrame.

size = df_planes.size
shape = df_planes.shape

print(f"Size: {size}")
print(f"Shape: {shape}")
output
Size: 16
Shape: (4, 4)

3.3 Data info & describe

  • info(): provides summary information about a DataFrame, including the data types, non-null counts, and memory usage of each column.
  • describe(): generates descriptive statistics, like count, mean, min, max, and quartiles, for numeric columns in a DataFrame.

info = df_planes.info()
describe = df_planes.describe()

print(f"Info: {info}")
print("----------")
print(f"Describe: {describe}")
output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Model               4 non-null      object
 1   Manufacturer        4 non-null      object
 2   Passenger Capacity  4 non-null      int64 
 3   Max Speed (mph)     4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
Info: None
----------
Describe:        Passenger Capacity  Max Speed (mph)
count            4.000000         4.000000
mean           249.500000       438.250000
std            287.524492       206.244798
min              4.000000       131.000000
25%             86.500000       416.000000
50%            167.000000       526.000000
75%            330.000000       548.250000
max            660.000000       570.000000

4. Extracting operations

These operations allow the user to retrieve specific columns or rows from a DataFrame for analysis and manipulation.

4.1 Extract a column

# Extract the 'Age' column
planes = df_planes['Model']

print(planes)
output
0      Boeing 747
1     Airbus A320
2      Cessna 172
3    Embraer E190
Name: Model, dtype: object

4.3 Extract multiple columns

# Selecting the columns
selected_columns = df_planes[['Model', 'Max Speed (mph)']]

# Printing the selected columns
print(selected_columns)
output
          Model  Max Speed (mph)
0    Boeing 747              570
1   Airbus A320              511
2    Cessna 172              131
3  Embraer E190              541

4.4 Extract a row by index

# Extracting a row by index using iloc
row_at_index_2 = df_planes.iloc[2]

# Printing the extracted row
print(row_at_index_2)
output
Model                 Cessna 172
Manufacturer              Cessna
Passenger Capacity             4
Max Speed (mph)              131
Name: 2, dtype: object

4.5 Extract a row by condition

Suppose you want to extract all rows where the speed is greater than 200 mph.

# Extracting rows with speed > 200 mph
df_fast_planes = df_planes[df_planes['Max Speed (mph)'] > 200]

# Printing the extracted rows
print(df_fast_planes)
output
          Model Manufacturer  Passenger Capacity  Max Speed (mph)
0    Boeing 747       Boeing                 660              570
1   Airbus A320       Airbus                 220              511
3  Embraer E190      Embraer                 114              541

5. Edit columns & rows

5.1 Add a new column

To add a new column to a pandas DataFrame, you can simply assign a list or a Series to a new column name. Remember to add a column with the same amount of rows of the dataframe.

# List of lengths in meters
lengths_in_meters = [70.6, 37.6, 8.2, 36.2]

# Add a new column named "Length (m)" to df with the length values
df_planes['Length'] = lengths_in_meters

# Print the updated DataFrame
print(df_planes)
output
          Model Manufacturer  Passenger Capacity  Max Speed (mph)      Length
0    Boeing 747       Boeing                 660              570        70.6
1   Airbus A320       Airbus                 220              511        37.6
2    Cessna 172       Cessna                   4              131         8.2
3  Embraer E190      Embraer                 114              541        36.2

5.2 Rename a column

# Rename the "Lenght" column to "Length (m)"
df_planes.rename(columns={'Length': 'Length (m)'}, inplace=True)

# Print the updated DataFrame
print(df_planes)
output
          Model Manufacturer  Passenger Capacity  Max Speed (mph)  Length (m)
0    Boeing 747       Boeing                 660              570        70.6
1   Airbus A320       Airbus                 220              511        37.6
2    Cessna 172       Cessna                   4              131         8.2
3  Embraer E190      Embraer                 114              541        36.2

6. Aggregate operations

Imagine you have a database with hacker attacks.

datetime ip_source location type_of_attack severity duration_minutes
2023-04-10 08:15:00 192.168.1.100 Madrid DDoS High 45
2023-04-10 12:45:00 203.54.32.17 Los Angeles Phishing Medium 30
2023-04-11 14:30:00 104.23.45.67 London SQL Injection Low 60
2023-04-12 20:10:00 45.76.12.34 Singapore Malware High 90
2023-04-10 16:03:00 203.54.32.17 Los Angeles Phishing Medium 30

6.1 Unique values

unique_attack_types = df_attacks['type_of_attack'].unique()
print(unique_attack_types)
output
['DDoS' 'Phishing' 'SQL Injection' 'Malware']

6.2 Count unique values

unique_attack_count = df_attacks['type_of_attack'].nunique()
print(unique_attack_count)
output
4

6.3 Count elements in a column

element_count = df_attacks['datetime'].count()
print(element_count)
output
5

6.4 Math operators

Operator Statement
Mean df["column"].mean()
Median df["column"].median()
Standar deviation df["column"].std()
Max value df["column"].max()
Min value df["column"].min()
element_count = df_attacks['duration_minutes'].mean()
print(element_count)
output
51.0

7. Groping Operations

Grouping operations in pandas involve splitting a DataFrame into groups based on one or more columns, applying a function to each group, and then combining the results into a new DataFrame.

.reset_index()

Using reset_index() after grouping operations is important to reset the index labels and make the resulting DataFrame more structured and easier to work with.

7.1 Group rows counting

# Grouping by 'type_of_attack' and counting elements in each group
df_types_of_attack = df_attacks.groupby('type_of_attack').count()

print(df_types_of_attack)
output
type_of_attack
DDoS             1
Malware          1
Phishing         2
SQL Injection    1
Name: ip_source, dtype: int64

7.2 Group rows operating

mean_by_severity = df_attacks.groupby('severity')['duration_minutes'].mean()

print(mean_by_severity)
output
severity
High      67.5
Low       60.0
Medium    30.0
Name: duration_minutes, dtype: float64

8. Other operations

8.1 Ordering rows of a dataframe

# Sorting the DataFrame by 'Location'
df_sorted = df_attacks.sort_values(by='location')

print(df_sorted[['ip_source', 'location']])
output
       ip_source     location
2   104.23.45.67       London
4   203.54.32.17  Los Angeles
1   203.54.32.17       Madrid
0  192.168.1.100     New York
3    45.76.12.34    Singapore

8.2 Joining two dataframes

data_attacks_update = {'datetime': ['2023-04-10 08:15:00', '2023-04-10 12:45:00', '2023-04-11 14:30:00'],
        'ip_source': ['192.168.1.100', '203.54.32.17', '104.23.45.67'],
        'location': ['New York', 'Madrid', 'London'],
        'type_of_attack': ['DDoS', 'Phishing', 'SQL Injection'],
        'severity': ['High', 'Medium', 'Low'],
        'duration_minutes': [45, 30, 60]}

df_update = pd.DataFrame(data_attacks_update)
df_total_atacks = pd.concat([df_attacks, df_update])
print(df_total_atacks)
output
    datetime            ip_source       location    type_of_attack  severity    duration_minutes
0   2023-04-10 08:15:00 192.168.1.100   New York    DDoS            High    45
1   2023-04-10 12:45:00 203.54.32.17    Madrid      Phishing        Medium  30
2   2023-04-11 14:30:00 104.23.45.67    London      SQL Injection   Low 60
3   2023-04-12 20:10:00 45.76.12.34     Singapore   Malware         High    90
4   2023-04-10 16:03:00 203.54.32.17    Los Angeles Phishing        Medium  30
0   2023-04-10 08:15:00 192.168.1.100   New York    DDoS            High    45
1   2023-04-10 12:45:00 203.54.32.17    Madrid      Phishing        Medium  30
2   2023-04-11 14:30:00 104.23.45.67    London      SQL Injection   Low 60

-> Start coding

You can do the exercises in Google Colab or in Jupyter Lab

Google Colab

Open the 12_introduction_to_pandas.ipynb notebook to start doing the exercises.

reminder

Remember to fork the exercise repository and save your solved notebooks in your forked repo.

Jupyter Lab

WIP