# Import modules
import pandas as pd
import numpy as np
Pandas module
Pandas is a powerful library primarily used for tabular data analysis. It offers data structures like DataFrames and Series, which make tasks such as data cleaning and aggregation easy. Pandas integrates well with other libraries like Numpy, and can read data saved in various text formats and spreadsheet software like MS Excel. One of the hallmarks of Pandas DataFrames is that we can call data stored in rows and columns by name, similar to working with Python dictionaries.
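For instance, reading a file into a DataFrame looks like this (the file names below are placeholders; reading a real CSV file is shown at the end of this section):
# Read tabular data from disk (hypothetical file names)
df_csv = pd.read_csv('weather_data.csv')
df_excel = pd.read_excel('weather_data.xlsx')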
Create DataFrame from existing variable
After importing the module we have two possible directions: we can import data from a file, or we can convert an existing variable into a Pandas DataFrame. Here we will create a simple DataFrame to learn the basics. This way we will be able to display the result of our operations without worrying about extensive datasets.
Let’s create a dictionary with some weather data and missing values (represented by -9999).
# Create dictionary with some weather data
data = {'timestamp': ['1/1/2000','2/1/2000','3/1/2000','4/1/2000','5/1/2000'],
'wind_speed': [2.2, 3.2, -9999.0, 4.1, 2.9],
'wind_direction': ['E', 'NW', 'NW', 'N', 'S'],
'precipitation': [0, 18, 25, 2, 0]}
The next step consists of converting the dictionary into a Pandas DataFrame. This is straightforward using the DataFrame method of the Pandas module: pd.DataFrame()
# Convert dictionary into DataFrame
df = pd.DataFrame(data)
df.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 1/1/2000 | 2.2 | E | 0 |
| 1 | 2/1/2000 | 3.2 | NW | 18 |
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
| 3 | 4/1/2000 | 4.1 | N | 2 |
| 4 | 5/1/2000 | 2.9 | S | 0 |
The above DataFrame has the following components:
- a header row containing the column names
- an index (the left-most column with numbers from 0 to 4), which is equivalent to a row name
- columns, where each column has data of the same type
# By default, values in Pandas series are Numpy arrays
print(df["wind_speed"].values)
print(type(df["wind_speed"].values))[ 2.200e+00 3.200e+00 -9.999e+03 4.100e+00 2.900e+00]
<class 'numpy.ndarray'>
Basic methods and properties
Pandas DataFrames have dedicated methods to display a limited number of leading and trailing rows.
df.head(3) # First three rows
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 1/1/2000 | 2.2 | E | 0 |
| 1 | 2/1/2000 | 3.2 | NW | 18 |
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
df.tail(3) # Last three rows
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
| 3 | 4/1/2000 | 4.1 | N | 2 |
| 4 | 5/1/2000 | 2.9 | S | 0 |
To display the DataFrame content simply use the head() and tail() methods. As an alternative you can use the print() function or type the name of the DataFrame and press ctrl + Enter. Note that by default Jupyter Lab highlights rows when using the head() or tail() methods.
To start exploring and analyzing our dataset it is often handy to know the column names.
# Display column names
df.columns
Index(['timestamp', 'wind_speed', 'wind_direction', 'precipitation'], dtype='object')
# Total number of elements
df.size
20
# Number of rows and columns
df.shape
(5, 4)
# Data type of each column
df.dtypes
timestamp object
wind_speed float64
wind_direction object
precipitation int64
dtype: object
Convert strings to datetime
# Convert dates in string format to Pandas datetime format
# %d = zero-padded day (01, 02, ...)
# %m = zero-padded month (01, 02, ...)
# %Y = four-digit year
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y")
df.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 2000-01-01 | 2.2 | E | 0 |
| 1 | 2000-01-02 | 3.2 | NW | 18 |
| 2 | 2000-01-03 | -9999.0 | NW | 25 |
| 3 | 2000-01-04 | 4.1 | N | 2 |
| 4 | 2000-01-05 | 2.9 | S | 0 |
# The `timestamp` column has changed to datetime format
df.dtypes
timestamp datetime64[ns]
wind_speed float64
wind_direction object
precipitation int64
dtype: object
Extract information from the timestamp
Having specific information like the day of the year, month, or week in a separate column can be useful to help us aggregate values. For instance, to compute the monthly mean air temperature we need to know in what month each temperature observation was recorded.
For this we will use the dt accessor within Pandas.
# Get the day of the year
df["doy"] = df["timestamp"].dt.dayofyear
df.head()
| | timestamp | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | -9999.0 | NW | 25 | 3 |
| 3 | 2000-01-04 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 2.9 | S | 0 | 5 |
The new column was placed at the end of the DataFrame. This is the default when creating a new column.
In the next example we use the insert() method to add the new column in a specific location. Typically, for date components it helps to have the columns close to the datetime column.
# Get month from timestamp and create new column
#.insert(positionOfNewColumn, nameOfNewColumn, dataOfNewColumn)
df.insert(1,'month',df["timestamp"].dt.month)
df.head()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | -9999.0 | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
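Columns like month make the aggregation motivated earlier a one-liner. Here is a minimal sketch (groupby() is not covered in detail in this section; we use precipitation because wind_speed still holds the -9999 placeholder at this point):
# Mean precipitation per month (our toy data spans only January, so one group)
df.groupby('month')['precipitation'].mean()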
Re-running the insert() cell above will trigger an error, since Pandas cannot have two columns with the same name.
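A simple guard, sketched here as a suggestion rather than part of the original example, avoids the duplicate-column error when re-running a cell:
# Insert the column only if it does not exist yet
if 'month' not in df.columns:
    df.insert(1, 'month', df["timestamp"].dt.month)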
Missing values
One of the most common operations when working with data is handling missing values. Almost every dataset has missing data, and there is no universal way of denoting it. The most common placeholders are NaN, NA, -99, -9999, M, missing, etc. To find out how missing data is represented in your dataset, always read the associated metadata. Some of these placeholders are automatically identified by Pandas as missing values and are represented as NaN values.
Pandas methods can deal with missing data, meaning that it is not necessary to always replace missing values in order to make computations. We just need to ensure that missing values are represented as NaN values.
The fillna() and interpolate() methods can help us replace missing values with an approximate value using neighboring points.
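Here is a minimal sketch of both methods on a toy Series (separate from our weather DataFrame, whose missing values are still encoded as -9999 at this point):
# Toy series with one missing value
s = pd.Series([2.2, np.nan, 4.2])
print(s.fillna(0))        # Replace NaN with a constant
print(s.fillna(s.mean())) # Replace NaN with the mean of the series
print(s.interpolate())    # Linear interpolation: NaN becomes 3.2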
To replace missing values in our current dataset, we will follow these steps:
1. Identify the cells with -9999 values. The output will be a boolean DataFrame with the same dimensions as df.
2. Replace -9999 with NaN values from the Numpy module.
3. Check our work using the isna() method.
Missing values represented as np.nan are actually of type float.
# Print data type of NaN values from Numpy
type(np.nan)
float
# Step 1: find -9999 values across the entire dataframe
idx_missing = df.isin([-9999]) # or idx_missing = df == -9999
idx_missing
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |
# Find missing values in only one column
df["wind_speed"].isin([-9999]) # or df["wind_speed"] == -9999
0 False
1 False
2 True
3 False
4 False
Name: wind_speed, dtype: bool
Using the isin() method we can pass multiple placeholders denoting missing data at once, as opposed to a boolean statement that would require multiple or conditions.
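For instance, if wind speed used both -9999 and -99 as placeholders (the second value is hypothetical here), a single isin() call replaces two chained comparisons:
# One call covers several placeholders at once
idx = df['wind_speed'].isin([-9999, -99])
# Equivalent chained boolean statement
idx = (df['wind_speed'] == -9999) | (df['wind_speed'] == -99)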
# Step 2: Replace missing values with NaN
df[idx_missing] = np.nan
df
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
# Step 3: Check our work
df.isna()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |
Quick statistics
DataFrames have a variety of methods to calculate simple statistics. To obtain an overall summary we can use the describe() method.
# Summary stats for all columns
print(df.describe())
timestamp month wind_speed precipitation doy
count 5 5.0 4.000000 5.0000 5.000000
mean 2000-01-03 00:00:00 1.0 3.100000 9.0000 3.000000
min 2000-01-01 00:00:00 1.0 2.200000 0.0000 1.000000
25% 2000-01-02 00:00:00 1.0 2.725000 0.0000 2.000000
50% 2000-01-03 00:00:00 1.0 3.050000 2.0000 3.000000
75% 2000-01-04 00:00:00 1.0 3.425000 18.0000 4.000000
max 2000-01-05 00:00:00 1.0 4.100000 25.0000 5.000000
std NaN 0.0 0.787401 11.7047 1.581139
# These metrics ignore NaN values
print(df["wind_speed"].max()) # Maximum value
print(df["wind_speed"].mean()) # Average value
print(df["wind_speed"].min()) # Minimum value
print(df["wind_speed"].std()) # Standard deviation
print(df["wind_speed"].var()) # Variance
print(df["wind_speed"].median()) # Median value
print(df["wind_speed"].quantile(0.95)) # 95th percentile
4.1
3.1
2.2
0.7874007874011809
0.6199999999999997
3.05
3.9649999999999994
# Cumulative sum. Useful to compute cumulative precipitation
print(df.precipitation.cumsum())
0 0
1 18
2 43
3 45
4 45
Name: precipitation, dtype: int64
# Unique values. Useful to compute unique wind directions
print(df.wind_direction.unique())
['E' 'NW' 'N' 'S']
Indexing and slicing
To start making computations we need to access the data inside the Pandas DataFrame. Indexing and slicing are useful operations to select portions of data by calling specific rows, columns, or a combination of both. The index operator [] is primarily intended to be used with column labels (e.g. df[columnName]); however, it can also handle row slices (e.g. df[rows]).
Select rows
# First three rows
df[0:3]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
Select columns
We can call individual columns using the bracket notation (option 1) or the dot notation (option 2). Note that in the bracket notation there is no . between df and ['wind_speed'].
df['wind_speed'] # Option 1
# df.wind_speed # Option 2
0 2.2
1 3.2
2 NaN
3 4.1
4 2.9
Name: wind_speed, dtype: float64
To select more than one column you will need to group the column names in a list.
# Select multiple columns at once
df[['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
| 3 | 4.1 | N |
| 4 | 2.9 | S |
A common mistake when slicing multiple columns is to forget grouping column names into a list, so the following command will not work:
df['wind_speed','wind_direction']
Using iloc method
iloc: integer-location. iloc gets rows (or columns) at specific positions in the index, takes only integers as input, and is exclusive of its endpoint.
# Top 3 rows and columns 1 and 2
df.iloc[0:3,[1,2]]
| | month | wind_speed |
|---|---|---|
| 0 | 1 | 2.2 |
| 1 | 1 | 3.2 |
| 2 | 1 | NaN |
# Top 2 rows and all columns
df.iloc[0:2,:] # Same as: df.iloc[0:2]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Although a bit more verbose and perhaps less pythonic, sometimes it is better to specify all the columns using the colon : character. In my opinion this notation is more explicit and clearly states the rows and columns of the slicing operation. For instance, df.iloc[0:2,:] is more explicit than df.iloc[0:2].
Using loc method
loc gets rows (or columns) with specific labels and is inclusive of its endpoint.
# Select multiple rows and columns at once using the loc method
df.loc[0:2,['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
# Some rows and all columns
df.loc[0:1,:]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
# First three elements of a single column
df.loc[0:2,'wind_speed']
0 2.2
1 3.2
2 NaN
Name: wind_speed, dtype: float64
# First three elements of multiple columns
df.loc[0:2,['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
These statements will not work with loc:
df.loc[0:2,0:1]
df.loc[[0:2],[0:1]]
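If you do need positional column ranges together with loc, one workaround (a sketch, not part of the original material) is to translate positions into labels first using df.columns:
# Translate column positions into labels, then slice with loc
df.loc[0:2, df.columns[0:2]]
# Or use iloc, which is positional by design
df.iloc[0:3, 0:2]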
Filter data using boolean indexing
Boolean indexing (a.k.a. logical indexing) consists of creating an array of True/False values as the result of one or more conditional statements, which can then be used to select the data that meet the specified condition.
Let’s select the data across all columns for days that have wind speed greater than 3 meters per second. We will first select the rows of df.wind_speed that are greater than 3 m/s, and then we will use the resulting boolean to slice the DataFrame.
idx = df['wind_speed'] > 3 # Rows in which the wind speed is greater than 3 m/s
idx # Let's inspect the idx variable
0 False
1 True
2 False
3 True
4 False
Name: wind_speed, dtype: bool
# Now let's apply the boolean variable to the dataframe
df[idx]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
# We can also apply the boolean to specific columns
df.loc[idx,'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
It’s also possible to write the previous command as a single line of code. This is fine, but sometimes nesting too many conditions can create commands that are hard to read and understand. To avoid this problem, storing the boolean in a new variable makes things a lot easier to read and re-use.
# Same in a single line of code
df.loc[df['wind_speed'] > 3,'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
Another popular way of filtering is to check whether an element or group of elements are within a set. Let’s check whether January 1 and January 2 are in the DataFrame.
idx_doy = df['doy'].isin([1,2])
idx_doy
0 True
1 True
2 False
3 False
4 False
Name: doy, dtype: bool
# Select all columns for the selected days of the year
df.loc[idx_doy,:]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Pandas custom date range
Most datasets collected over a period of time include timestamps, but in case the timestamps are missing or you need to create a range of dates during your analysis, Pandas has the date_range() method to create a sequence of timestamps.
subset_dates = pd.date_range('20000102', periods=2, freq='D') # Two consecutive daily timestamps starting 2 Jan 2000
subset_dates
DatetimeIndex(['2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq='D')
# The same idea to generate months
pd.date_range('20200101', periods=df.shape[0], freq='M') # df.shape[0] periods at month-end frequency
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31'],
dtype='datetime64[ns]', freq='M')
Select range of dates with boolean indexing
Now that we covered both boolean indexing and pandas dates we can use these concepts to select data from a specific window of time. This is a pretty common operation when trying to select a subset of the entire DataFrame by a specific date range.
# Generate boolean for rows that match the subset of dates generated earlier
idx_subset = df["timestamp"].isin(subset_dates)
idx_subset
0 False
1 True
2 True
3 False
4 False
Name: timestamp, dtype: bool
# Generate a new DataFrame using only the rows with matching dates
df_subset = df.loc[idx_subset]
df_subset
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
# It isn't always necessary to generate a new DataFrame
# So you can access a specific column like this
df.loc[idx_subset,"precipitation"]1 18
2 25
Name: precipitation, dtype: int64
Add and remove columns
The insert() and drop() methods allow us to add or remove columns to/from the DataFrame. The most common use of these methods is as follows:
df.insert(index_of_new_column, name_of_new_column, data_of_new_column)
df.drop(columns=[name_of_column_to_be_removed])
# Add new column at a specific location
df.insert(2, 'air_temperature', [25.4, 26, 27.1, 28.9, 30.2]) # Unlike df['air_temperature'] = values, insert() places the column at a chosen position
df
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Remove specific column (drop() returns a new DataFrame; df keeps the column unless reassigned)
df.drop(columns=['air_temperature'])
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
Reset DataFrame index
# Use a column of our choice as the index (set_index() returns a new DataFrame)
df.set_index('timestamp')
| | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| timestamp | | | | | | |
| 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Reset the index back to the default sequence of integers
df.reset_index(0, drop=True)
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
Merge two dataframes
# Create a new DataFrame (its dates follow those of df)
# Dictionary
data2 = {'timestamp': ['6/1/2000','7/1/2000','8/1/2000','9/1/2000','10/1/2000'],
'wind_speed': [4.3, 2.1, 0.5, 2.7, 1.9],
'wind_direction': ['N', 'N', 'SW', 'E', 'NW'],
'precipitation': [0, 0, 0, 25, 0]}
# Dictionary to DataFrame
df2 = pd.DataFrame(data2)
# Convert strings to pandas datetime
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%d/%m/%Y")
df2.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 2000-01-06 | 4.3 | N | 0 |
| 1 | 2000-01-07 | 2.1 | N | 0 |
| 2 | 2000-01-08 | 0.5 | SW | 0 |
| 3 | 2000-01-09 | 2.7 | E | 25 |
| 4 | 2000-01-10 | 1.9 | NW | 0 |
Not using format="%d/%m/%Y" in the previous cell results in the wrong datetime conversion. It is always recommended to specify the format.
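To see why, compare the two conversions below: by default Pandas assumes the month comes first, so '6/1/2000' is read as June 1 instead of January 6.
# Without an explicit format the date is parsed month-first
print(pd.to_datetime('6/1/2000'))                    # 2000-06-01
print(pd.to_datetime('6/1/2000', format="%d/%m/%Y")) # 2000-01-06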
# Merge both DataFrames by applying a union of keys from both frames (how='outer' option)
df_merged = pd.merge(df, df2, how='outer')
df_merged
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 |
| 1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 |
| 2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 |
| 3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 |
| 4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 |
| 5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN |
| 6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN |
| 7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN |
| 8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN |
| 9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN |
Note how NaN values were assigned to variables not present in the new DataFrame.
# Create another DataFrame with more limited data. Values every other day
data3 = {'timestamp': ['1/1/2000','3/1/2000','5/1/2000','7/1/2000','9/1/2000'],
'pressure': [980, 987, 985, 991, 990]} # Pressure in millibars
df3 = pd.DataFrame(data3)
df3["timestamp"] = pd.to_datetime(df3["timestamp"], format="%d/%m/%Y")
df3.head()
| | timestamp | pressure |
|---|---|---|
| 0 | 2000-01-01 | 980 |
| 1 | 2000-01-03 | 987 |
| 2 | 2000-01-05 | 985 |
| 3 | 2000-01-07 | 991 |
| 4 | 2000-01-09 | 990 |
# Only the matching rows will be merged (inner join is the default)
df_merged.merge(df3, on="timestamp")| timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980 |
| 1 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987 |
| 2 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985 |
| 3 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991 |
| 4 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990 |
# Only add values from the new, more sporadic, variable where there is a match.
df_merged.merge(df3, how="left")| timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980.0 |
| 1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 | NaN |
| 2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987.0 |
| 3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 | NaN |
| 4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985.0 |
| 5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN | NaN |
| 6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991.0 |
| 7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN | NaN |
| 8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990.0 |
| 9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN | NaN |
Operations with a real dataset
# Read CSV file
data_url = "../datasets/ok_mesonet_8_apr_2019.csv"
df = pd.read_csv(data_url)
df.head(5)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | | | | | | | | | | |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | | | S | 12 | 20 | 1011.13 | 78 | 48 | |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | | | S | 11 | 20 | 1013.64 | 78 | 38 | |
5 rows × 22 columns
Some columns seem to have empty cells. Ideally we would like to see these cells filled with NaN values. Something looks fishy. Let’s inspect some of these cells.
# Print one of the cells to see what's in there
df.loc[0,'RAIN']
' '
In the inspected cell we found a string with a single space in it. Now we can use the replace() method to replace these strings with NaN values.
In Pandas, the inplace=True argument controls whether an operation on a DataFrame or Series modifies the original data. With inplace=True, the changes are applied directly to the existing object, so there is no need to assign the result to a new variable. When inplace is omitted (the default is False), Pandas returns a modified copy and the original data remains unchanged, so you need to assign the result to a variable if you want to keep the changes.
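In short, these two idioms are equivalent (shown here for the replacement we are about to apply):
# Option 1: assign the returned copy back to the variable
df = df.replace(' ', np.nan)
# Option 2: modify the original DataFrame directly
df.replace(' ', np.nan, inplace=True)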
# Replace empty strings with NaN values
df.replace(' ', np.nan, inplace=True)
df.head(5)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
5 rows × 22 columns
The previous solution is not the best. We could have resolved the issue with the empty strings by simply adding the option na_values=' ' to the pd.read_csv() function, like this: df = pd.read_csv(data_url, na_values=' '). This will automatically populate all cells that contain ' ' with NaN values.
Match specific stations
idx_acme = df['STID'].str.match('ACME')
df[idx_acme]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 22 columns
idx_starts_with_A = df['STID'].str.match('A')
df[idx_starts_with_A]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
| 5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
| 6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
| 7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
8 rows × 22 columns
idx_has_A = df['STID'].str.contains('A')
df[idx_has_A].head(15)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
| 5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
| 6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
| 7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
| 8 | BEAV | Beaver | OK | 36.80 | -100.53 | 2019 | 4 | 15 | 15 | 20 | ... | 9 | NaN | 84 | SW | 17 | 26 | 1003.9 | 91 | 34 | NaN |
| 11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
| 20 | BYAR | Byars | OK | 34.85 | -97.00 | 2019 | 4 | 15 | 15 | 20 | ... | 43 | NaN | NaN | S | 22 | 32 | 1010.64 | 77 | 49 | NaN |
| 21 | CAMA | Camargo | OK | 36.03 | -99.35 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 23 | 29 | 1005.56 | NaN | NaN | NaN |
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
| 24 | CHAN | Chandler | OK | 35.65 | -96.80 | 2019 | 4 | 15 | 15 | 20 | ... | 37 | NaN | NaN | SSW | 16 | 27 | 1009.35 | 80 | 48 | NaN |
| 28 | CLAY | Clayton | OK | 34.66 | -95.33 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | NaN | S | 9 | 24 | 1012.9 | 78 | 40 | NaN |
15 rows × 22 columns
idx = df['NAME'].str.contains('Blackwell') & df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
1 rows × 22 columns
The following line won’t work because the string matching is case sensitive:
idx = df['NAME'].str.contains('LAKE')
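If a case-insensitive match is what you are after, str.contains() accepts a case argument:
# Case-insensitive search matches 'Lake Carl Blackwell'
idx = df['NAME'].str.contains('LAKE', case=False)
df[idx]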
idx = df['NAME'].str.contains('Blackwell') | df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
2 rows × 22 columns
idx = df['STID'].isin(['ACME','ALTU'])
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
2 rows × 22 columns