# Import modules
import pandas as pd
import numpy as np
Pandas module
Pandas is a powerful library primarily used for tabular data analysis. It offers data structures like DataFrames and Series, which make tasks such as data cleaning and aggregation easy. Pandas integrates well with other libraries like Numpy, and can read data saved in various text formats and spreadsheet software like MS Excel. One of the hallmarks of Pandas DataFrames is that we can call data stored in rows and columns by name, similar to working with Python dictionaries.
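For instance, reading a file into a DataFrame looks like this (the file names below are placeholders; reading a real CSV file is shown at the end of this section):
# Read tabular data from disk (hypothetical file names)
df_csv = pd.read_csv('weather_data.csv')
df_excel = pd.read_excel('weather_data.xlsx')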
Create DataFrame from existing variable
After importing the module we have two possible directions: we can import data from a file, or we can convert an existing variable into a Pandas DataFrame. Here we will create a simple DataFrame to learn the basics. This way we will be able to display the result of our operations without worrying about extensive datasets.
Let’s create a dictionary with some weather data and missing values (represented by -9999).
# Create dictionary with some weather data
data = {'timestamp': ['1/1/2000','2/1/2000','3/1/2000','4/1/2000','5/1/2000'],
'wind_speed': [2.2, 3.2, -9999.0, 4.1, 2.9],
'wind_direction': ['E', 'NW', 'NW', 'N', 'S'],
'precipitation': [0, 18, 25, 2, 0]}
The next step consists of converting the dictionary into a Pandas DataFrame. This is straightforward using the DataFrame method of the Pandas module: pd.DataFrame()
# Convert dictionary into DataFrame
df = pd.DataFrame(data)
df.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 1/1/2000 | 2.2 | E | 0 |
| 1 | 2/1/2000 | 3.2 | NW | 18 |
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
| 3 | 4/1/2000 | 4.1 | N | 2 |
| 4 | 5/1/2000 | 2.9 | S | 0 |
The above DataFrame has the following components:
- a header row containing the column names
- an index (the left-most column with numbers from 0 to 4), which is equivalent to a row name
- columns, where each column has data of the same type
# By default, values in Pandas series are Numpy arrays
print(df["wind_speed"].values)
print(type(df["wind_speed"].values))[ 2.200e+00 3.200e+00 -9.999e+03 4.100e+00 2.900e+00]
<class 'numpy.ndarray'>
Basic methods and properties
Pandas DataFrames have dedicated methods to display a limited number of leading and trailing rows.
df.head(3) # First three rows
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 1/1/2000 | 2.2 | E | 0 |
| 1 | 2/1/2000 | 3.2 | NW | 18 |
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
df.tail(3) # Last three rows
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 2 | 3/1/2000 | -9999.0 | NW | 25 |
| 3 | 4/1/2000 | 4.1 | N | 2 |
| 4 | 5/1/2000 | 2.9 | S | 0 |
To display the DataFrame content simply use the head() and tail() methods. As an alternative you can use the print() function or type the name of the DataFrame and press ctrl + Enter. Note that by default Jupyter Lab highlights rows when using the head() or tail() methods.
To start exploring and analyzing our dataset it is often handy to know the column names.
# Display column names
df.columns
Index(['timestamp', 'wind_speed', 'wind_direction', 'precipitation'], dtype='object')
# Total number of elements
df.size
20
# Number of rows and columns
df.shape
(5, 4)
# Data type of each column
df.dtypes
timestamp object
wind_speed float64
wind_direction object
precipitation int64
dtype: object
Convert strings to datetime
# Convert dates in string format to Pandas datetime format
# %d = zero-padded day (01, 02, ...)
# %m = zero-padded month (01, 02, ...)
# %Y = four-digit year
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y")
df.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 2000-01-01 | 2.2 | E | 0 |
| 1 | 2000-01-02 | 3.2 | NW | 18 |
| 2 | 2000-01-03 | -9999.0 | NW | 25 |
| 3 | 2000-01-04 | 4.1 | N | 2 |
| 4 | 2000-01-05 | 2.9 | S | 0 |
# The `timestamp` column has changed to datetime format
df.dtypes
timestamp datetime64[ns]
wind_speed float64
wind_direction object
precipitation int64
dtype: object
Extract information from the timestamp
Having specific information like the day of the year, month, or week in a separate column can be useful to help us aggregate values. For instance, to compute the monthly mean air temperature we need to know in what month each temperature observation was recorded.
For this we will use the dt accessor within Pandas.
# Get the day of the year
df["doy"] = df["timestamp"].dt.dayofyear
df.head()
| | timestamp | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | -9999.0 | NW | 25 | 3 |
| 3 | 2000-01-04 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 2.9 | S | 0 | 5 |
The new column was placed at the end of the DataFrame. This is the default when creating a new column.
In the next example we use the insert() method to add the new column in a specific location. Typically, for date components it helps to have the columns close to the datetime column.
# Get month from timestamp and create new column
#.insert(positionOfNewColumn, nameOfNewColumn, dataOfNewColumn)
df.insert(1,'month',df["timestamp"].dt.month)
df.head()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | -9999.0 | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
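Columns like month make the aggregation motivated earlier a one-liner. Here is a minimal sketch (groupby() is not covered in detail in this section; we use precipitation because wind_speed still holds the -9999 placeholder at this point):
# Mean precipitation per month (our toy data spans only January, so one group)
df.groupby('month')['precipitation'].mean()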
Re-running the insert() cell above will trigger an error, since Pandas cannot have two columns with the same name.
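A simple guard, sketched here as a suggestion rather than part of the original example, avoids the duplicate-column error when re-running a cell:
# Insert the column only if it does not exist yet
if 'month' not in df.columns:
    df.insert(1, 'month', df["timestamp"].dt.month)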
Missing values
One of the most common operations when working with data is handling missing values. Almost every dataset has missing data, and there is no universal way of denoting it. The most common placeholders are NaN, NA, -99, -9999, M, missing, etc. To find out how missing data is represented in your dataset, always read the associated metadata. Some of these placeholders are automatically identified by Pandas as missing values and are represented as NaN values.
Pandas methods can deal with missing data, meaning that it is not necessary to always replace missing values in order to make computations. We just need to ensure that missing values are represented as NaN values.
The fillna() and interpolate() methods can help us replace missing values with an approximate value using neighboring points.
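Here is a minimal sketch of both methods on a toy Series (separate from our weather DataFrame, whose missing values are still encoded as -9999 at this point):
# Toy series with one missing value
s = pd.Series([2.2, np.nan, 4.2])
print(s.fillna(0))        # Replace NaN with a constant
print(s.fillna(s.mean())) # Replace NaN with the mean of the series
print(s.interpolate())    # Linear interpolation: NaN becomes 3.2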
To replace missing values in our current dataset, we will follow these steps:
1. Identify the cells with -9999 values. The output will be a boolean DataFrame with the same dimensions as df.
2. Replace -9999 with NaN values from the Numpy module.
3. Check our work using the isna() method.
Missing values represented as np.nan are actually of type float.
# Print data type of NaN values from Numpy
type(np.nan)
float
# Step 1: find -9999 values across the entire dataframe
idx_missing = df.isin([-9999]) # or idx_missing = df == -9999
idx_missing
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |
# Find missing values in only one column
df["wind_speed"].isin([-9999]) # or df["wind_speed"] == -9999
0 False
1 False
2 True
3 False
4 False
Name: wind_speed, dtype: bool
Using the isin() method we can pass multiple placeholders denoting missing data at once, as opposed to a boolean statement that would require multiple or conditions.
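For instance, if wind speed used both -9999 and -99 as placeholders (the second value is hypothetical here), a single isin() call replaces two chained comparisons:
# One call covers several placeholders at once
idx = df['wind_speed'].isin([-9999, -99])
# Equivalent chained boolean statement
idx = (df['wind_speed'] == -9999) | (df['wind_speed'] == -99)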
# Step 2: Replace missing values with NaN
df[idx_missing] = np.nan
df
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
# Step 3: Check our work
df.isna()
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |
Quick statistics
DataFrames have a variety of methods to calculate simple statistics. To obtain an overall summary we can use the describe() method.
# Summary stats for all columns
print(df.describe())
timestamp month wind_speed precipitation doy
count 5 5.0 4.000000 5.0000 5.000000
mean 2000-01-03 00:00:00 1.0 3.100000 9.0000 3.000000
min 2000-01-01 00:00:00 1.0 2.200000 0.0000 1.000000
25% 2000-01-02 00:00:00 1.0 2.725000 0.0000 2.000000
50% 2000-01-03 00:00:00 1.0 3.050000 2.0000 3.000000
75% 2000-01-04 00:00:00 1.0 3.425000 18.0000 4.000000
max 2000-01-05 00:00:00 1.0 4.100000 25.0000 5.000000
std NaN 0.0 0.787401 11.7047 1.581139
# These metrics ignore NaN values
print(df["wind_speed"].max()) # Maximum value
print(df["wind_speed"].mean()) # Average value
print(df["wind_speed"].min()) # Minimum value
print(df["wind_speed"].std()) # Standard deviation
print(df["wind_speed"].var()) # Variance
print(df["wind_speed"].median()) # Median value
print(df["wind_speed"].quantile(0.95)) # 95th percentile
4.1
3.1
2.2
0.7874007874011809
0.6199999999999997
3.05
3.9649999999999994
# Cumulative sum. Useful to compute cumulative precipitation
print(df.precipitation.cumsum())
0 0
1 18
2 43
3 45
4 45
Name: precipitation, dtype: int64
# Unique values. Useful to compute unique wind directions
print(df.wind_direction.unique())
['E' 'NW' 'N' 'S']
Indexing and slicing
To start making computations we need to access the data inside the Pandas DataFrame. Indexing and slicing are useful operations to select portions of data by calling specific rows, columns, or a combination of both. The index operator [] is primarily intended to be used with column labels (e.g. df[columnName]); however, it can also handle row slices (e.g. df[rows]).
Select rows
# First three rows
df[0:3]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
Select columns
We can call individual columns using the bracket notation (option 1) or the dot notation (option 2). Note that in the bracket notation there is no . between df and ['wind_speed'].
df['wind_speed'] # Option 1
# df.wind_speed # Option 2
0 2.2
1 3.2
2 NaN
3 4.1
4 2.9
Name: wind_speed, dtype: float64
To select more than one column you will need to group the column names in a list.
# Select multiple columns at once
df[['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
| 3 | 4.1 | N |
| 4 | 2.9 | S |
A common mistake when slicing multiple columns is to forget grouping column names into a list, so the following command will not work:
df['wind_speed','wind_direction']
Using iloc method
iloc: integer-location. iloc gets rows (or columns) at specific positions in the index, takes only integers as input, and is exclusive of its endpoint.
# Top 3 rows and columns 1 and 2
df.iloc[0:3,[1,2]]
| | month | wind_speed |
|---|---|---|
| 0 | 1 | 2.2 |
| 1 | 1 | 3.2 |
| 2 | 1 | NaN |
# Top 2 rows and all columns
df.iloc[0:2,:] # Same as: df.iloc[0:2]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Although a bit more verbose and perhaps less pythonic, sometimes it is better to specify all the columns using the colon : character. In my opinion this notation is more explicit and clearly states the rows and columns of the slicing operation. For instance, df.iloc[0:2,:] is more explicit than df.iloc[0:2].
Using loc method
loc gets rows (or columns) with specific labels and is inclusive of its endpoint.
# Select multiple rows and columns at once using the loc method
df.loc[0:2,['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
# Some rows and all columns
df.loc[0:1,:]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
# First three elements of a single column
df.loc[0:2,'wind_speed']
0 2.2
1 3.2
2 NaN
Name: wind_speed, dtype: float64
# First three elements of multiple columns
df.loc[0:2,['wind_speed','wind_direction']]
| | wind_speed | wind_direction |
|---|---|---|
| 0 | 2.2 | E |
| 1 | 3.2 | NW |
| 2 | NaN | NW |
These statements will not work with loc:
df.loc[0:2,0:1]
df.loc[[0:2],[0:1]]
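If you do need positional column ranges together with loc, one workaround (a sketch, not part of the original material) is to translate positions into labels first using df.columns:
# Translate column positions into labels, then slice with loc
df.loc[0:2, df.columns[0:2]]
# Or use iloc, which is positional by design
df.iloc[0:3, 0:2]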
Filter data using boolean indexing
Boolean indexing (a.k.a. logical indexing) consists of creating an array of True/False values as the result of one or more conditional statements, which can then be used to select the data that meet the specified condition.
Let’s select the data across all columns for days that have wind speed greater than 3 meters per second. We will first select the rows of df.wind_speed that are greater than 3 m/s, and then we will use the resulting boolean to slice the DataFrame.
idx = df['wind_speed'] > 3 # Rows in which the wind speed is greater than 3 m/s
idx # Let's inspect the idx variable
0 False
1 True
2 False
3 True
4 False
Name: wind_speed, dtype: bool
# Now let's apply the boolean variable to the dataframe
df[idx]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
# We can also apply the boolean to specific columns
df.loc[idx,'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
It’s also possible to write the previous command as a single line of code. This is fine, but sometimes nesting too many conditions can create commands that are hard to read and understand. To avoid this problem, storing the boolean in a new variable makes things a lot easier to read and re-use.
# Same in a single line of code
df.loc[df['wind_speed'] > 3,'wind_direction']
1 NW
3 N
Name: wind_direction, dtype: object
Another popular way of filtering is to check whether an element or group of elements are within a set. Let’s check whether January 1 and January 2 are in the DataFrame.
idx_doy = df['doy'].isin([1,2])
idx_doy
0 True
1 True
2 False
3 False
4 False
Name: doy, dtype: bool
# Select all columns for the selected days of the year
df.loc[idx_doy,:]
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
Pandas custom date range
Most datasets collected over a period of time include timestamps, but in case the timestamps are missing or you need to create a range of dates during your analysis, Pandas has the date_range() method to create a sequence of timestamps.
subset_dates = pd.date_range('20000102', periods=2, freq='D') # Two consecutive daily timestamps starting 2 Jan 2000
subset_dates
DatetimeIndex(['2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq='D')
# The same idea to generate months
pd.date_range('20200101', periods=df.shape[0], freq='M') # df.shape[0] periods at month-end frequency
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31'],
dtype='datetime64[ns]', freq='M')
Select range of dates with boolean indexing
Now that we covered both boolean indexing and pandas dates we can use these concepts to select data from a specific window of time. This is a pretty common operation when trying to select a subset of the entire DataFrame by a specific date range.
# Generate boolean for rows that match the subset of dates generated earlier
idx_subset = df["timestamp"].isin(subset_dates)
idx_subset
0 False
1 True
2 True
3 False
4 False
Name: timestamp, dtype: bool
# Generate a new DataFrame using only the rows with matching dates
df_subset = df.loc[idx_subset]
df_subset
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
# It isn't always necessary to generate a new DataFrame
# So you can access a specific column like this
df.loc[idx_subset,"precipitation"]1 18
2 25
Name: precipitation, dtype: int64
Add and remove columns
The insert() and drop() methods allow us to add or remove columns to/from the DataFrame. The most common use of these methods is as follows:
df.insert(index_of_new_column, name_of_new_column, data_of_new_column)
df.drop(columns=[name_of_column_to_be_removed])
# Add new column at a specific location
df.insert(2, 'air_temperature', [25.4, 26, 27.1, 28.9, 30.2]) # Unlike df['air_temperature'] = values, insert() places the column at a chosen position
df
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Remove specific column (drop() returns a new DataFrame; df keeps the column unless reassigned)
df.drop(columns=['air_temperature'])
| | timestamp | month | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 2.9 | S | 0 | 5 |
Reset DataFrame index
# Use a column of our choice as the index (set_index() returns a new DataFrame)
df.set_index('timestamp')
| | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|
| timestamp | | | | | | |
| 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
# Reset the index back to the default sequence of integers
df.reset_index(0, drop=True)
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1 | 25.4 | 2.2 | E | 0 | 1 |
| 1 | 2000-01-02 | 1 | 26.0 | 3.2 | NW | 18 | 2 |
| 2 | 2000-01-03 | 1 | 27.1 | NaN | NW | 25 | 3 |
| 3 | 2000-01-04 | 1 | 28.9 | 4.1 | N | 2 | 4 |
| 4 | 2000-01-05 | 1 | 30.2 | 2.9 | S | 0 | 5 |
Merge two dataframes
# Create a new DataFrame (its dates follow those of df)
# Dictionary
data2 = {'timestamp': ['6/1/2000','7/1/2000','8/1/2000','9/1/2000','10/1/2000'],
'wind_speed': [4.3, 2.1, 0.5, 2.7, 1.9],
'wind_direction': ['N', 'N', 'SW', 'E', 'NW'],
'precipitation': [0, 0, 0, 25, 0]}
# Dictionary to DataFrame
df2 = pd.DataFrame(data2)
# Convert strings to pandas datetime
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%d/%m/%Y")
df2.head()
| | timestamp | wind_speed | wind_direction | precipitation |
|---|---|---|---|---|
| 0 | 2000-01-06 | 4.3 | N | 0 |
| 1 | 2000-01-07 | 2.1 | N | 0 |
| 2 | 2000-01-08 | 0.5 | SW | 0 |
| 3 | 2000-01-09 | 2.7 | E | 25 |
| 4 | 2000-01-10 | 1.9 | NW | 0 |
Not using format="%d/%m/%Y" in the previous cell results in the wrong datetime conversion. It is always recommended to specify the format.
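To see why, compare the two conversions below: by default Pandas assumes the month comes first, so '6/1/2000' is read as June 1 instead of January 6.
# Without an explicit format the date is parsed month-first
print(pd.to_datetime('6/1/2000'))                    # 2000-06-01
print(pd.to_datetime('6/1/2000', format="%d/%m/%Y")) # 2000-01-06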
# Merge both DataFrames by applying a union of keys from both frames (how='outer' option)
df_merged = pd.merge(df, df2, how='outer')
df_merged
| | timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy |
|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 |
| 1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 |
| 2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 |
| 3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 |
| 4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 |
| 5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN |
| 6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN |
| 7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN |
| 8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN |
| 9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN |
Note how NaN values were assigned to variables not present in the new DataFrame.
# Create another DataFrame with more limited data. Values every other day
data3 = {'timestamp': ['1/1/2000','3/1/2000','5/1/2000','7/1/2000','9/1/2000'],
'pressure': [980, 987, 985, 991, 990]} # Pressure in millibars
df3 = pd.DataFrame(data3)
df3["timestamp"] = pd.to_datetime(df3["timestamp"], format="%d/%m/%Y")
df3.head()
| | timestamp | pressure |
|---|---|---|
| 0 | 2000-01-01 | 980 |
| 1 | 2000-01-03 | 987 |
| 2 | 2000-01-05 | 985 |
| 3 | 2000-01-07 | 991 |
| 4 | 2000-01-09 | 990 |
# Only the matching rows will be merged (inner join is the default)
df_merged.merge(df3, on="timestamp")| timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980 |
| 1 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987 |
| 2 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985 |
| 3 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991 |
| 4 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990 |
# Only add values from the new, more sporadic, variable where there is a match.
df_merged.merge(df3, how="left")| timestamp | month | air_temperature | wind_speed | wind_direction | precipitation | doy | pressure | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1.0 | 25.4 | 2.2 | E | 0 | 1.0 | 980.0 |
| 1 | 2000-01-02 | 1.0 | 26.0 | 3.2 | NW | 18 | 2.0 | NaN |
| 2 | 2000-01-03 | 1.0 | 27.1 | NaN | NW | 25 | 3.0 | 987.0 |
| 3 | 2000-01-04 | 1.0 | 28.9 | 4.1 | N | 2 | 4.0 | NaN |
| 4 | 2000-01-05 | 1.0 | 30.2 | 2.9 | S | 0 | 5.0 | 985.0 |
| 5 | 2000-01-06 | NaN | NaN | 4.3 | N | 0 | NaN | NaN |
| 6 | 2000-01-07 | NaN | NaN | 2.1 | N | 0 | NaN | 991.0 |
| 7 | 2000-01-08 | NaN | NaN | 0.5 | SW | 0 | NaN | NaN |
| 8 | 2000-01-09 | NaN | NaN | 2.7 | E | 25 | NaN | 990.0 |
| 9 | 2000-01-10 | NaN | NaN | 1.9 | NW | 0 | NaN | NaN |
Operations with a real dataset
# Read CSV file
data_url = "../datasets/ok_mesonet_8_apr_2019.csv"
df = pd.read_csv(data_url)
df.head(5)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | | | | | | | | | | |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | | | S | 12 | 20 | 1011.13 | 78 | 48 | |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | | | S | 11 | 20 | 1013.64 | 78 | 38 | |
5 rows × 22 columns
Some columns seem to have empty cells. Ideally we would like to see these cells filled with NaN values. Something looks fishy. Let’s inspect some of these cells.
# Print one of the cells to see what's in there
df.loc[0,'RAIN']
' '
In the inspected cell we found a string with a single space in it. Now we can use the replace() method to replace these strings with NaN values.
In Pandas, the inplace=True argument controls whether an operation on a DataFrame or Series modifies the original data. With inplace=True, the changes are applied directly to the existing object, so there is no need to assign the result to a new variable. When inplace is omitted (the default is False), Pandas returns a modified copy and the original data remains unchanged, so you need to assign the result to a variable if you want to keep the changes.
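In short, these two idioms are equivalent (shown here for the replacement we are about to apply):
# Option 1: assign the returned copy back to the variable
df = df.replace(' ', np.nan)
# Option 2: modify the original DataFrame directly
df.replace(' ', np.nan, inplace=True)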
# Replace empty strings with NaN values
df.replace(' ', np.nan, inplace=True)
df.head(5)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
5 rows × 22 columns
The previous solution is not the best. We could have resolved the issue with the empty strings by simply adding the option na_values=' ' to the pd.read_csv() function, like this: df = pd.read_csv(data_url, na_values=' '). This will automatically populate all cells that contain ' ' with NaN values.
Match specific stations
idx_acme = df['STID'].str.match('ACME')
df[idx_acme]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 22 columns
idx_starts_with_A = df['STID'].str.match('A')
df[idx_starts_with_A]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
| 5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
| 6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
| 7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
8 rows × 22 columns
idx_has_A = df['STID'].str.contains('A')
df[idx_has_A].head(15)
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ADAX | Ada | OK | 34.80 | -96.67 | 2019 | 4 | 15 | 15 | 20 | ... | 40 | NaN | NaN | S | 12 | 20 | 1011.13 | 78 | 48 | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
| 3 | ALV2 | Alva | OK | 36.71 | -98.71 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 20 | 26 | 1004.65 | 84 | 40 | NaN |
| 4 | ANT2 | Antlers | OK | 34.25 | -95.67 | 2019 | 4 | 15 | 15 | 20 | ... | 35 | NaN | NaN | S | 11 | 20 | 1013.64 | 78 | 38 | NaN |
| 5 | APAC | Apache | OK | 34.91 | -98.29 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 23 | 29 | 1008.9 | 80 | 49 | NaN |
| 6 | ARD2 | Ardmore | OK | 34.19 | -97.09 | 2019 | 4 | 15 | 15 | 20 | ... | 41 | NaN | NaN | S | 18 | 26 | 1011.43 | 77 | 50 | NaN |
| 7 | ARNE | Arnett | OK | 36.07 | -99.90 | 2019 | 4 | 15 | 15 | 20 | ... | 10 | NaN | 85 | SW | 22 | 32 | 1005.13 | NaN | NaN | NaN |
| 8 | BEAV | Beaver | OK | 36.80 | -100.53 | 2019 | 4 | 15 | 15 | 20 | ... | 9 | NaN | 84 | SW | 17 | 26 | 1003.9 | 91 | 34 | NaN |
| 11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
| 20 | BYAR | Byars | OK | 34.85 | -97.00 | 2019 | 4 | 15 | 15 | 20 | ... | 43 | NaN | NaN | S | 22 | 32 | 1010.64 | 77 | 49 | NaN |
| 21 | CAMA | Camargo | OK | 36.03 | -99.35 | 2019 | 4 | 15 | 15 | 20 | ... | 32 | NaN | 82 | S | 23 | 29 | 1005.56 | NaN | NaN | NaN |
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
| 24 | CHAN | Chandler | OK | 35.65 | -96.80 | 2019 | 4 | 15 | 15 | 20 | ... | 37 | NaN | NaN | SSW | 16 | 27 | 1009.35 | 80 | 48 | NaN |
| 28 | CLAY | Clayton | OK | 34.66 | -95.33 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | NaN | S | 9 | 24 | 1012.9 | 78 | 40 | NaN |
15 rows × 22 columns
idx = df['NAME'].str.contains('Blackwell') & df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
1 rows × 22 columns
The following line won’t work because the string matching is case sensitive:
idx = df['NAME'].str.contains('LAKE')
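If a case-insensitive match is what you are after, str.contains() accepts a case argument:
# Case-insensitive search matches 'Lake Carl Blackwell'
idx = df['NAME'].str.contains('LAKE', case=False)
df[idx]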
idx = df['NAME'].str.contains('Blackwell') | df['NAME'].str.contains('Lake')
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | BLAC | Blackwell | OK | 36.75 | -97.25 | 2019 | 4 | 15 | 15 | 20 | ... | 38 | NaN | NaN | SSW | 15 | 23 | 1007.02 | 80 | 44 | NaN |
| 22 | CARL | Lake Carl Blackwell | OK | 36.15 | -97.29 | 2019 | 4 | 15 | 15 | 20 | ... | 36 | NaN | 80 | S | 17 | 25 | 1007.56 | 80 | 50 | NaN |
2 rows × 22 columns
idx = df['STID'].isin(['ACME','ALTU'])
df[idx]
| | STID | NAME | ST | LAT | LON | YR | MO | DA | HR | MI | ... | RELH | CHIL | HEAT | WDIR | WSPD | WMAX | PRES | TMAX | TMIN | RAIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACME | Acme | OK | 34.81 | -98.02 | 2019 | 4 | 15 | 15 | 20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | ALTU | Altus | OK | 34.59 | -99.34 | 2019 | 4 | 15 | 15 | 20 | ... | 39 | NaN | 82 | SSW | 19 | 26 | 1007.86 | 82 | 45 | NaN |
2 rows × 22 columns