29 Pandas module
Pandas is a powerful library primarily used for tabular data analysis. It offers data structures like DataFrames and Series, which make tasks such as data cleaning and aggregation easy. Pandas integrates well with other libraries like Numpy, and can read data saved in various text formats and by spreadsheet software like MS Excel. One of the hallmarks of Pandas DataFrames is that we can call data stored in rows and columns by name, similar to working with Python dictionaries.
# Import modules
import pandas as pd
import numpy as np
Create DataFrame from existing variable
After importing the module we have two possible directions: we can import data from a file, or we can convert an existing variable into a Pandas DataFrame. Here we will create a simple DataFrame to learn the basics. This way we will be able to display the result of our operations without worrying about extensive datasets.
Let's create a dictionary with some weather data and missing values (represented by -9999).
# Create dictionary with some weather data
data = {'timestamp': ['1/1/2000','2/1/2000','3/1/2000','4/1/2000','5/1/2000'],
        'wind_speed': [2.2, 3.2, -9999.0, 4.1, 2.9],
        'wind_direction': ['E', 'NW', 'NW', 'N', 'S'],
        'precipitation': [0, 18, 25, 2, 0]}
The next step consists of converting the dictionary into a Pandas DataFrame. This is straightforward using the DataFrame constructor of the Pandas module: pd.DataFrame()
# Convert dictionary into DataFrame
df = pd.DataFrame(data)
df.head()
| | timestamp | wind_speed | wind_direction | precipitation
---|---|---|---|---|
0 | 1/1/2000 | 2.2 | E | 0 |
1 | 2/1/2000 | 3.2 | NW | 18 |
2 | 3/1/2000 | -9999.0 | NW | 25 |
3 | 4/1/2000 | 4.1 | N | 2 |
4 | 5/1/2000 | 2.9 | S | 0 |
The above DataFrame has the following components:
- a header row containing the column names
- an index (the left-most column, with numbers from 0 to 4), which is equivalent to a row name
- columns, where each column contains data of the same type
# By default, values in Pandas series are Numpy arrays
print(df["wind_speed"].values)
print(type(df["wind_speed"].values))
[ 2.200e+00 3.200e+00 -9.999e+03 4.100e+00 2.900e+00]
<class 'numpy.ndarray'>
Basic methods and properties
Pandas DataFrames have dedicated methods to display a limited number of leading and trailing rows.
# First three rows
df.head(3)
| | timestamp | wind_speed | wind_direction | precipitation
---|---|---|---|---|
0 | 1/1/2000 | 2.2 | E | 0 |
1 | 2/1/2000 | 3.2 | NW | 18 |
2 | 3/1/2000 | -9999.0 | NW | 25 |
# Last three rows
df.tail(3)
| | timestamp | wind_speed | wind_direction | precipitation
---|---|---|---|---|
2 | 3/1/2000 | -9999.0 | NW | 25 |
3 | 4/1/2000 | 4.1 | N | 2 |
4 | 5/1/2000 | 2.9 | S | 0 |
To display the DataFrame content simply use the head() and tail() methods. As an alternative you can use the print() function, or type the name of the DataFrame and press Ctrl + Enter. Note that by default Jupyter Lab highlights rows when using the head() or tail() methods.
To start exploring and analyzing our dataset it is often handy to know the column names.
# Display column names
df.columns
Index(['timestamp', 'wind_speed', 'wind_direction', 'precipitation'], dtype='object')
# Total number of elements
df.size
20
# Number of rows and columns
df.shape
(5, 4)
# Data type of each column
df.dtypes
timestamp object
wind_speed float64
wind_direction object
precipitation int64
dtype: object
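A related method worth knowing is info(), which combines the column names, dtypes, and non-null counts shown above into a single summary. A minimal sketch (the exact output layout depends on your Pandas version):
# Concise summary of the DataFrame: index type, dtypes, and non-null counts
df.info()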
Convert strings to datetime
# Convert dates in string format to Pandas datetime format
# %d = zero-padded day (e.g. 01)
# %m = zero-padded month (e.g. 01)
# %Y = four-digit year
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y")
df.head()
| | timestamp | wind_speed | wind_direction | precipitation
---|---|---|---|---|
0 | 2000-01-01 | 2.2 | E | 0 |
1 | 2000-01-02 | 3.2 | NW | 18 |
2 | 2000-01-03 | -9999.0 | NW | 25 |
3 | 2000-01-04 | 4.1 | N | 2 |
4 | 2000-01-05 | 2.9 | S | 0 |
# The `timestamp` column has changed to datetime format
df.dtypes
timestamp datetime64[ns]
wind_speed float64
wind_direction object
precipitation int64
dtype: object
Extract information from the timestamp
Having specific information like day of the year, month, or week in a separate column can be useful to help us aggregate values. For instance, to compute the monthly mean air temperature we need to know in what month each temperature observation was recorded. For this we will use the dt submodule within Pandas.
# Get the day of the year
"doy"] = df["timestamp"].dt.dayofyear
df[ df.head()
| | timestamp | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|
0 | 2000-01-01 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | -9999.0 | NW | 25 | 3 |
3 | 2000-01-04 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 2.9 | S | 0 | 5 |
The new column was placed at the end of the DataFrame. This is the default when creating a new column.
In the next example we use the insert() method to add the new column in a specific location. Typically, for date components it helps to have the columns close to the datetime column.
# Get month from timestamp and create new column
# .insert(positionOfNewColumn, nameOfNewColumn, dataOfNewColumn)
df.insert(1, 'month', df["timestamp"].dt.month)
df.head()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | -9999.0 | NW | 25 | 3 |
3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
Re-running the previous cell will trigger an error since Pandas cannot have two columns with the same name.
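One way to make such a cell safe to re-run is to check for the column before inserting it. A minimal sketch (the guard is our addition, not part of the original example):
# Only insert the column if it does not exist yet
if 'month' not in df.columns:
    df.insert(1, 'month', df["timestamp"].dt.month)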
Missing values
One of the most common operations when working with data is handling missing values. Almost every dataset has missing data and there is no universal way of denoting missing values. The most common placeholders are: NaN, NA, -99, -9999, M, missing, etc. To find out more about how missing data is represented in your dataset, always read the associated metadata. Some of these placeholders are automatically identified by Pandas as missing values and are represented as NaN values.
Pandas methods can deal with missing data, meaning that it is not necessary to always replace missing values in order to make computations. We just need to ensure that missing values are represented as NaN values.
The fillna() and interpolate() methods can help us replace missing values with an approximate value using neighboring points; we will sketch this after checking our work below.
To replace missing values in our current dataset, we will follow these steps:
1. Identify the cells with -9999 values. The output will be a boolean DataFrame having the same dimensions as df.
2. Replace -9999 with NaN values from the Numpy module.
3. Check our work using the isna() method.
Missing values represented as np.nan are actually of type float.
# Print data type of NaN values from Numpy
type(np.nan)
float
# Step 1: find -9999 values across the entire dataframe
idx_missing = df.isin([-9999]) # or idx_missing = df == -9999
idx_missing
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | True | False | False | False |
3 | False | False | False | False | False | False |
4 | False | False | False | False | False | False |
# Find missing values in only one column
df["wind_speed"].isin([-9999]) # or df["wind_speed"] == -9999
0 False
1 False
2 True
3 False
4 False
Name: wind_speed, dtype: bool
Using the isin() method we can pass multiple placeholders denoting missing data at once, as opposed to a boolean statement that would require multiple or operators.
# Step 2: Replace missing values with NaN
df[idx_missing] = np.nan
df
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
# Step 3: Check our work
df.isna()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | True | False | False | False |
3 | False | False | False | False | False | False |
4 | False | False | False | False | False | False |
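As promised above, here is a minimal sketch of the fillna() and interpolate() methods applied to the NaN we just created. The results are printed without assigning them back, so df keeps its NaN for the examples that follow:
# Estimate the missing wind speed from neighboring points (linear by default)
print(df["wind_speed"].interpolate())

# Or replace NaN with a fixed value, e.g. the column mean
print(df["wind_speed"].fillna(df["wind_speed"].mean()))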
Quick statistics
DataFrames have a variety of methods to calculate simple statistics. To obtain an overall summary we can use the describe() method.
# Summary stats for all columns
print(df.describe())
timestamp month wind_speed precipitation doy
count 5 5.0 4.000000 5.0000 5.000000
mean 2000-01-03 00:00:00 1.0 3.100000 9.0000 3.000000
min 2000-01-01 00:00:00 1.0 2.200000 0.0000 1.000000
25% 2000-01-02 00:00:00 1.0 2.725000 0.0000 2.000000
50% 2000-01-03 00:00:00 1.0 3.050000 2.0000 3.000000
75% 2000-01-04 00:00:00 1.0 3.425000 18.0000 4.000000
max 2000-01-05 00:00:00 1.0 4.100000 25.0000 5.000000
std NaN 0.0 0.787401 11.7047 1.581139
# Metrics ignore NaN values by default
print(df["wind_speed"].max()) # Maximum value
print(df["wind_speed"].mean()) # Average value
print(df["wind_speed"].min()) # Minimum value
print(df["wind_speed"].std()) # Standard deviation
print(df["wind_speed"].var()) # Variance
print(df["wind_speed"].median()) # Median value
print(df["wind_speed"].quantile(0.95)) # 95th percentile
4.1
3.1
2.2
0.7874007874011809
0.6199999999999997
3.05
3.9649999999999994
# Cumulative sum. Useful to compute cumulative precipitation
print(df.precipitation.cumsum())
0 0
1 18
2 43
3 45
4 45
Name: precipitation, dtype: int64
# Unique values. Useful to compute unique wind directions
print(df.wind_direction.unique())
['E' 'NW' 'N' 'S']
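The month column we created earlier becomes handy here: combined with the groupby() method, it lets us aggregate variables by date components, as discussed at the beginning of this chapter. A minimal sketch (with this toy dataset all rows fall in January, so there is a single group):
# Mean wind speed by month (NaN values are ignored)
print(df.groupby('month')['wind_speed'].mean())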
Indexing and slicing
To start making computations we need to access the data inside the Pandas DataFrame. Indexing and slicing are useful operations to select portions of data by calling specific rows, columns, or a combination of both. The index operator [] is primarily intended to be used with column labels (e.g. df[columnName]), however, it can also handle row slices (e.g. df[rows]). Remember that Python slices follow the start:stop notation, where start is inclusive and stop is exclusive.
Select rows
# First three rows
df[0:3]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
Select columns
We can call individual columns using the bracket or dot notation. Note that in the bracket notation (option 1) there is no . between df and ['wind_speed'].
df['wind_speed'] # Option 1
# df.wind_speed # Option 2
0 2.2
1 3.2
2 NaN
3 4.1
4 2.9
Name: wind_speed, dtype: float64
To select more than one column you will need to group the column names in a list.
# Select multiple columns at once
df[['wind_speed','wind_direction']]
| | wind_speed | wind_direction
---|---|---|
0 | 2.2 | E |
1 | 3.2 | NW |
2 | NaN | NW |
3 | 4.1 | N |
4 | 2.9 | S |
A common mistake when slicing multiple columns is to forget grouping column names into a list, so the following command will not work:
df['wind_speed','wind_direction']
Using the iloc method
iloc: integer-location. iloc gets rows (or columns) at specific integer positions. It only takes integers as input, and slices are exclusive of their endpoint.
# Top 3 rows and columns 1 and 2
df.iloc[0:3, [1,2]]
| | month | wind_speed
---|---|---|
0 | 1 | 2.2 |
1 | 1 | 3.2 |
2 | 1 | NaN |
# Top 2 rows and all columns
df.iloc[0:2, :] # Same as: df.iloc[0:2]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Although a bit more verbose and perhaps less pythonic, sometimes it is better to specify all the columns using the colon : character. In my opinion this notation is more explicit and clearly states the rows and columns of the slicing operation. For instance, df.iloc[0:2,:] is more explicit than df.iloc[0:2].
Using the loc method
loc gets rows (or columns) with specific labels, and slices are inclusive of their endpoint.
# Select multiple rows and columns at once using the loc method
df.loc[0:2, ['wind_speed','wind_direction']]
| | wind_speed | wind_direction
---|---|---|
0 | 2.2 | E |
1 | 3.2 | NW |
2 | NaN | NW |
# Some rows and all columns
df.loc[0:1, :]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
# First three elements of a single column
df.loc[0:2, 'wind_speed']
0 2.2
1 3.2
2 NaN
Name: wind_speed, dtype: float64
# First three elements of multiple columns
df.loc[0:2, ['wind_speed','wind_direction']]
| | wind_speed | wind_direction
---|---|---|
0 | 2.2 | E |
1 | 3.2 | NW |
2 | NaN | NW |
These statements will not work with loc:
df.loc[0:2,0:1]
df.loc[[0:2],[0:1]]
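If integer-based slices for both rows and columns are what you need, iloc is the right tool. A minimal sketch equivalent in spirit to the statements above:
# With iloc, integer slices work for rows and columns alike
df.iloc[0:3, 0:2] # First three rows, first two columns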
Filter data using boolean indexing
Boolean indexing (a.k.a. logical indexing) consists of creating an array with True/False values as a result of one or more conditional statements, which can be used to select data that meet the specified condition.
Let's select the data across all columns for days that have wind speed greater than 3 meters per second. We will first select the rows of df.wind_speed that are greater than 3 m/s, and then we will use the resulting boolean to slice the DataFrame.
idx = df['wind_speed'] > 3 # Rows in which the wind speed is greater than 3 m/s
idx # Let's inspect the idx variable
0 False
1 True
2 False
3 True
4 False
Name: wind_speed, dtype: bool
# Now let's apply the boolean variable to the dataframe
df[idx]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
# We can also apply the boolean to specific columns
df.loc[idx, 'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
It’s also possible to write the previous command as a single line of code. This is fine, but sometimes nesting too many conditions can create commands that are hard to read and understand. To avoid this problem, storing the boolean in a new variable makes things a lot easier to read and re-use.
# Same in a single line of code
df.loc[df['wind_speed'] > 3, 'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
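Multiple conditions can also be combined with the & (and) and | (or) operators. A minimal sketch (the parentheses are required because & and | bind more tightly than the comparisons):
# Days that were both windy and rainy
idx_windy_rainy = (df['wind_speed'] > 3) & (df['precipitation'] > 0)
df.loc[idx_windy_rainy, :]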
Another popular way of filtering is to check whether an element or group of elements are within a set. Let’s check whether January 1 and January 2 are in the DataFrame.
idx_doy = df['doy'].isin([1,2])
idx_doy
0 True
1 True
2 False
3 False
4 False
Name: doy, dtype: bool
# Select all columns for the selected days of the year
df.loc[idx_doy,:]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Pandas custom date range
Most datasets collected over a period of time include timestamps, but in case the timestamps are missing or you need to create a range of dates during your analysis, Pandas has the date_range() method to create a sequence of timestamps.
subset_dates = pd.date_range('20000102', periods=2, freq='D')
subset_dates
DatetimeIndex(['2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq='D')
# The same to generate months
pd.date_range('20200101', periods=df.shape[0], freq='M') # df.shape[0] gives the number of rows; freq='M' sets a monthly frequency
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31'],
dtype='datetime64[ns]', freq='M')
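Instead of a number of periods, date_range() also accepts an explicit end date. A minimal sketch:
# Daily sequence between a start and an end date (both inclusive)
pd.date_range('2000-01-01', '2000-01-05', freq='D')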
Select range of dates with boolean indexing
Now that we covered both boolean indexing and pandas dates we can use these concepts to select data from a specific window of time. This is a pretty common operation when trying to select a subset of the entire DataFrame by a specific date range.
# Generate boolean for rows that match the subset of dates generated earlier
= df["timestamp"].isin(subset_dates)
idx_subset idx_subset
0 False
1 True
2 True
3 False
4 False
Name: timestamp, dtype: bool
# Generate a new DataFrame using only the rows with matching dates
df_subset = df.loc[idx_subset]
df_subset
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
# It isn't always necessary to generate a new DataFrame
# So you can access a specific column like this
"precipitation"] df.loc[idx_subset,
1 18
2 25
Name: precipitation, dtype: int64
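The same window of time can be selected with comparison operators, which is convenient when the dates of interest form a continuous range. A minimal sketch (Pandas parses the date strings when comparing them against a datetime column):
# Boolean for rows between two dates (inclusive)
idx_window = (df["timestamp"] >= "2000-01-02") & (df["timestamp"] <= "2000-01-03")
df.loc[idx_window, "precipitation"]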
Add and remove columns
The insert() and drop() methods allow us to add columns to and remove columns from the DataFrame. The most common use of these methods is as follows:
df.insert(index_of_new_column, name_of_new_column, data_of_new_column)
df.drop(columns=[name_of_column_to_be_removed])
# Add new column at a specific location
df.insert(2, 'air_temperature', [25.4, 26, 27.1, 28.9, 30.2]) # Unlike df['air_temperature'] = ..., insert() lets us choose the position
df
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Remove specific column
df.drop(columns=['air_temperature'])
| | timestamp | month | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
Reset DataFrame index
# Replace the index by a variable of our choice
df.set_index('timestamp')
timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---
2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Reset the index
df.reset_index(0, drop=True)
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
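Note that neither of the two previous cells modified df, because set_index() and reset_index() return a new DataFrame by default. A minimal sketch of the assignment pattern needed to make the change stick:
# Assign the result back to keep the change
df = df.set_index('timestamp')
df = df.reset_index() # Moves 'timestamp' back to a regular column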
Merge two DataFrames
# Create a new DataFrame (continues the date sequence)
# Dictionary
data2 = {'timestamp': ['6/1/2000','7/1/2000','8/1/2000','9/1/2000','10/1/2000'],
         'wind_speed': [4.3, 2.1, 0.5, 2.7, 1.9],
         'wind_direction': ['N', 'N', 'SW', 'E', 'NW'],
         'precipitation': [0, 0, 0, 25, 0]}

# Dictionary to DataFrame
df2 = pd.DataFrame(data2)

# Convert strings to pandas datetime
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%d/%m/%Y")

df2.head()
| | timestamp | wind_speed | wind_direction | precipitation
---|---|---|---|---|
0 | 2000-01-06 | 4.3 | N | 0 |
1 | 2000-01-07 | 2.1 | N | 0 |
2 | 2000-01-08 | 0.5 | SW | 0 |
3 | 2000-01-09 | 2.7 | E | 25 |
4 | 2000-01-10 | 1.9 | NW | 0 |
Not using format="%d/%m/%Y" in the previous cell results in the wrong datetime conversion. It is always recommended to specify the format.
# Merge both Dataframes by applying a union of keys from both frames (how='outer' option)
df_merged = pd.merge(df, df2, how='outer')
df_merged
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy
---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 |
1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 |
2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 |
3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 |
4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 |
5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN |
6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN |
7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN |
8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN |
9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN |
Note how NaN values were assigned to variables not present in the new DataFrame.
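If we want the date components populated for the new rows as well, one option is to recompute them from the timestamp column. A minimal sketch (printed without assigning back, so df_merged keeps its NaN values for the examples that follow):
# Recompute date components for all rows of the merged DataFrame
print(df_merged['timestamp'].dt.month)
print(df_merged['timestamp'].dt.dayofyear)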
# Create another DataFrame with more limited data. Values every other day
data3 = {'timestamp': ['1/1/2000','3/1/2000','5/1/2000','7/1/2000','9/1/2000'],
         'pressure': [980, 987, 985, 991, 990]} # Pressure in millibars
df3 = pd.DataFrame(data3)
df3["timestamp"] = pd.to_datetime(df3["timestamp"], format="%d/%m/%Y")
df3.head()
| | timestamp | pressure
---|---|---|
0 | 2000-01-01 | 980 |
1 | 2000-01-03 | 987 |
2 | 2000-01-05 | 985 |
3 | 2000-01-07 | 991 |
4 | 2000-01-09 | 990 |
# Only the matching rows will be merged
="timestamp") df_merged.merge(df3, on
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure
---|---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980 |
1 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987 |
2 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985 |
3 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991 |
4 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990 |
# Only add values from the new, more sporadic, variable where there is a match.
="left") df_merged.merge(df3, how
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure
---|---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980.0 |
1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 | NaN |
2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987.0 |
3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 | NaN |
4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985.0 |
5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN | NaN |
6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991.0 |
7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN | NaN |
8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990.0 |
9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN | NaN |
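Since pressure was observed every other day, the gaps in the left merge are natural candidates for the interpolate() method covered earlier. A minimal sketch (using a temporary variable so df_merged remains unchanged):
# Fill the every-other-day pressure gaps by linear interpolation
merged_left = df_merged.merge(df3, how="left")
print(merged_left['pressure'].interpolate())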
Operations with a real dataset
# Read CSV file
= "../datasets/ok_mesonet_8_apr_2019.csv"
data_url = pd.read_csv(data_url)
df 5) df.head(
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | ||||||||||
1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | S | 12 | 20 | 1011.13 | 78 | 48 | |||
2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | ||
3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | ||
4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | S | 11 | 20 | 1013.64 | 78 | 38 |
5 rows × 22 columns
Some columns seem to have empty cells. Ideally we would like to see these cells filled with NaN values. Something looks fishy. Let's inspect some of these cells.
# Print one of the cells to see what's in there
df.loc[0, 'RAIN']
' '
In the inspected cell we found a string with a single space in it. Now we can use the replace() method to substitute these strings with NaN values.
In Pandas, the inplace=True argument determines whether an operation on a DataFrame or Series modifies the original data. With inplace=True, the changes are applied to the original data directly, so there is no need to assign the result to a new variable. When omitting inplace=True, Pandas by default creates a new copy with the changes applied; the original data remains unchanged, and you need to assign the result to a variable if you want to keep the modified data.
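The two patterns side by side, as a minimal sketch:
# Returns a modified copy; df itself is unchanged until you assign the result
df_clean = df.replace(' ', np.nan)

# Modifies df directly and returns None (this is what the next cell does)
# df.replace(' ', np.nan, inplace=True)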
# Replace empty strings with NaN values
df.replace(' ', np.nan, inplace=True)
df.head(5)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
5 rows × 22 columns
The previous solution is not the best. We could have resolved the issue with the empty strings by simply adding the option na_values=' ' to the pd.read_csv() function, like this: df = pd.read_csv(data_url, na_values=' '). This will automatically populate all cells that contain ' ' with NaN values.
Match specific stations
idx_acme = df['STID'].str.match('ACME')
df[idx_acme]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 22 columns
idx_starts_with_A = df['STID'].str.match('A')
df[idx_starts_with_A]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
8 rows × 22 columns
idx_has_A = df['STID'].str.contains('A')
df[idx_has_A].head(15)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
8 | BEAV | Beaver | OK | 36.80 | -100.53 | 2019 | 4 | 15 | 15 | 20 | ... | 9 | NaN | 84 | SW | 17 | 26 | 1003.9 | 91 | 34 | NaN |
11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
20 | BYAR | Byars | OK | 34.85 | -97.00 | 2019 | 4 | 15 | 15 | 20 | ... | 43 | NaN | NaN | S | 22 | 32 | 1010.64 | 77 | 49 | NaN |
21 | CAMA | Camargo | OK | 36.03 | -99.35 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 23 | 29 | 1005.56 | NaN | NaN | NaN |
22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
24 | CHAN | Chandler | OK | 35.65 | -96.80 | 2019 | 4 | 15 | 15 | 20 | ... | 37 | NaN | NaN | SSW | 16 | 27 | 1009.35 | 80 | 48 | NaN |
28 | CLAY | Clayton | OK | 34.66 | -95.33 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | NaN | S | 9 | 24 | 1012.9 | 78 | 40 | NaN |
15 rows × 22 columns
idx = df['NAME'].str.contains('Blackwell') & df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
1 rows × 22 columns
The following line won’t work because the string matching is case sensitive:
idx = df['NAME'].str.contains('LAKE')
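If a case-insensitive match is what you want, str.contains() accepts a case argument. A minimal sketch:
# Case-insensitive match
idx = df['NAME'].str.contains('LAKE', case=False)
df[idx]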
idx = df['NAME'].str.contains('Blackwell') | df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
2 rows × 22 columns
idx = df['STID'].isin(['ACME','ALTU'])
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
2 rows × 22 columns