29  Pandas module

Author

Andres Patrignani

Published

January 8, 2024

Pandas is a powerful library primarily used for tabular data analysis. It offers data structures like DataFrames and Series, which make tasks such as data cleaning and aggregation easy. Pandas integrates well with other libraries like Numpy, and can read data saved in various text formats and spreadsheet software like MS Excel. One of the hallmarks of Pandas DataFrames is that we can call data stored in rows and columns by name, similar to working with Python dictionaries.

# Import modules
import pandas as pd
import numpy as np

Create DataFrame from existing variable

After importing the module we have two possible directions: import data from a file, or convert an existing variable into a Pandas DataFrame. Here we will create a simple DataFrame to learn the basics. This way we will be able to display the result of our operations without worrying about extensive datasets.

Let’s create a dictionary with some weather data and missing values (represented by -9999).

# Create dictionary with some weather data
data = {'timestamp': ['1/1/2000','2/1/2000','3/1/2000','4/1/2000','5/1/2000'], 
        'wind_speed': [2.2, 3.2, -9999.0, 4.1, 2.9], 
        'wind_direction': ['E', 'NW', 'NW', 'N', 'S'],
        'precipitation': [0, 18, 25, 2, 0]}

The next step consists of converting the dictionary into a Pandas DataFrame. This is straightforward using the DataFrame constructor of the Pandas module: pd.DataFrame()

# Convert dictionary into DataFrame
df = pd.DataFrame(data)
df.head()
timestamp wind_speed wind_direction precipitation
0 1/1/2000 2.2 E 0
1 2/1/2000 3.2 NW 18
2 3/1/2000 -9999.0 NW 25
3 4/1/2000 4.1 N 2
4 5/1/2000 2.9 S 0

The above DataFrame has the following components:

  1. A header row containing the column names.

  2. An index (the left-most column, with numbers from 0 to 4), which is equivalent to a row name.

  3. Columns, each holding data of a single type.

# Under the hood, values in a Pandas Series are stored as Numpy arrays
print(df["wind_speed"].values)
print(type(df["wind_speed"].values))
[ 2.200e+00  3.200e+00 -9.999e+03  4.100e+00  2.900e+00]
<class 'numpy.ndarray'>

Basic methods and properties

Pandas DataFrames have dedicated methods to display a limited number of leading and trailing rows.

df.head(3) # First three rows
timestamp wind_speed wind_direction precipitation
0 1/1/2000 2.2 E 0
1 2/1/2000 3.2 NW 18
2 3/1/2000 -9999.0 NW 25
df.tail(3) # Last three rows
timestamp wind_speed wind_direction precipitation
2 3/1/2000 -9999.0 NW 25
3 4/1/2000 4.1 N 2
4 5/1/2000 2.9 S 0
Note

To display the DataFrame content simply use the head() and tail() methods. As an alternative, you can use the print() function or type the name of the DataFrame and run the cell (Ctrl + Enter). Note that by default JupyterLab highlights rows when using the head() or tail() methods.

To start exploring and analyzing our dataset it is often handy to know the column names.

# Display column names
df.columns
Index(['timestamp', 'wind_speed', 'wind_direction', 'precipitation'], dtype='object')
# Total number of elements
df.size
20
# Number of rows and columns
df.shape
(5, 4)
# Data type of each column
df.dtypes
timestamp          object
wind_speed        float64
wind_direction     object
precipitation       int64
dtype: object

Convert strings to datetime

# Convert dates in string format to Pandas datetime format
# %d = zero-padded day (01, 02, ...)
# %m = zero-padded month (01, 02, ...)
# %Y = four-digit year

df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y")
df.head()
timestamp wind_speed wind_direction precipitation
0 2000-01-01 2.2 E 0
1 2000-01-02 3.2 NW 18
2 2000-01-03 -9999.0 NW 25
3 2000-01-04 4.1 N 2
4 2000-01-05 2.9 S 0
# The `timestamp` column has changed to datetime format
df.dtypes
timestamp         datetime64[ns]
wind_speed               float64
wind_direction            object
precipitation              int64
dtype: object

Extract information from the timestamp

Having specific information like day of the year, month, or week in a separate column can be useful to help us aggregate values. For instance, to compute the monthly mean air temperature we need to know in what month each temperature observation was recorded.

For this we will use the dt accessor of Pandas datetime columns.

# Get the day of the year
df["doy"] = df["timestamp"].dt.dayofyear
df.head()
timestamp wind_speed wind_direction precipitation doy
0 2000-01-01 2.2 E 0 1
1 2000-01-02 3.2 NW 18 2
2 2000-01-03 -9999.0 NW 25 3
3 2000-01-04 4.1 N 2 4
4 2000-01-05 2.9 S 0 5
Note

The new column was placed at the end of the DataFrame. This is the default when creating a new column.

In the next example we use the insert() method to add the new column in a specific location. Typically, for date components it helps to have the columns close to the datetime column.

# Get month from timestamp and create new column

#.insert(positionOfNewColumn, nameOfNewColumn, dataOfNewColumn)

df.insert(1,'month',df["timestamp"].dt.month)
df.head()
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2
2 2000-01-03 1 -9999.0 NW 25 3
3 2000-01-04 1 4.1 N 2 4
4 2000-01-05 1 2.9 S 0 5
Warning

Re-running the previous cell will trigger an error since Pandas cannot have two columns with the same name.
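With a month column in place, aggregations like the monthly mean mentioned earlier become a one-line groupby() call. The following is a minimal sketch with a toy DataFrame (the groupby step is not part of the original example):

```python
import pandas as pd

# Toy data: three January days with precipitation values
data = {'timestamp': pd.to_datetime(['1/1/2000', '2/1/2000', '3/1/2000'],
                                    format='%d/%m/%Y'),
        'precipitation': [0, 18, 25]}
df = pd.DataFrame(data)
df['month'] = df['timestamp'].dt.month

# Mean precipitation per month (here all rows fall in January)
monthly_mean = df.groupby('month')['precipitation'].mean()
print(monthly_mean)
```

The same pattern works for any date component column (doy, week, year) and any aggregation method (sum(), max(), etc.).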

Missing values

One of the most common operations when working with data is handling missing values. Almost every dataset has missing data and there is no universal way of denoting missing values. The most common placeholders are NaN, NA, -99, -9999, M, missing, etc. To find out more about how missing data are represented in your dataset, always read the associated metadata. Some of these placeholders are automatically identified by Pandas as missing values and are represented as NaN values.

Pandas methods can deal with missing data, meaning that it is not always necessary to replace missing values in order to make computations. We just need to ensure that missing values are represented as NaN values.

The fillna() and interpolate() methods can help us replace missing values with an approximate value using neighboring points.
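As a quick illustration of those two methods, here is a sketch using a toy Series (not the df from this section):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.2, np.nan, 4.2])

# fillna() replaces NaN with a fixed value
print(s.fillna(0))        # NaN becomes 0.0

# interpolate() estimates NaN from neighboring points (linear by default)
print(s.interpolate())    # NaN becomes 3.2, the midpoint of 2.2 and 4.2
```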

To replace missing values in our current dataset, we will follow these steps:

  1. Identify the cells with -9999 values. The output will be a boolean DataFrame with the same dimensions as df.

  2. Replace -9999 with NaN values from the Numpy module.

  3. Check our work using the isna() method.

Note

Missing values represented as np.nan are actually of type float.

# Print data type of NaN values from Numpy
type(np.nan)
float
# Step 1: find -9999 values across the entire dataframe

idx_missing = df.isin([-9999]) # or idx_missing = df == -9999
idx_missing
timestamp month wind_speed wind_direction precipitation doy
0 False False False False False False
1 False False False False False False
2 False False True False False False
3 False False False False False False
4 False False False False False False
# Find missing values in only one column
df["wind_speed"].isin([-9999]) # or df["wind_speed"] == -9999
0    False
1    False
2     True
3    False
4    False
Name: wind_speed, dtype: bool

Using the isin() method we can pass multiple placeholders denoting missing data at once, as opposed to a boolean statement that would require chaining multiple or conditions.
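For example, a dataset that mixes -9999 and -99 as missing-value placeholders can be flagged in one call. A sketch with a toy Series (the placeholder values are assumptions):

```python
import pandas as pd

s = pd.Series([2.2, -9999.0, 4.1, -99.0])

# One isin() call covers both placeholders...
mask = s.isin([-9999, -99])

# ...instead of chaining boolean comparisons with |
mask_verbose = (s == -9999) | (s == -99)

print(mask.equals(mask_verbose))
```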

# Step 2: Replace missing values with NaN

df[idx_missing] = np.nan
df
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2
2 2000-01-03 1 NaN NW 25 3
3 2000-01-04 1 4.1 N 2 4
4 2000-01-05 1 2.9 S 0 5
# Step 3: Check our work
df.isna()
timestamp month wind_speed wind_direction precipitation doy
0 False False False False False False
1 False False False False False False
2 False False True False False False
3 False False False False False False
4 False False False False False False

Quick statistics

DataFrames have a variety of methods to calculate simple statistics. To obtain an overall summary we can use the describe() method.

# Summary stats for all columns
print(df.describe())
                 timestamp  month  wind_speed  precipitation       doy
count                    5    5.0    4.000000         5.0000  5.000000
mean   2000-01-03 00:00:00    1.0    3.100000         9.0000  3.000000
min    2000-01-01 00:00:00    1.0    2.200000         0.0000  1.000000
25%    2000-01-02 00:00:00    1.0    2.725000         0.0000  2.000000
50%    2000-01-03 00:00:00    1.0    3.050000         2.0000  3.000000
75%    2000-01-04 00:00:00    1.0    3.425000        18.0000  4.000000
max    2000-01-05 00:00:00    1.0    4.100000        25.0000  5.000000
std                    NaN    0.0    0.787401        11.7047  1.581139
# Metrics for a single column, ignoring NaN values
print(df["wind_speed"].max())         # Maximum value
print(df["wind_speed"].mean())        # Average value
print(df["wind_speed"].min())         # Minimum value
print(df["wind_speed"].std())         # Standard deviation
print(df["wind_speed"].var())         # Variance
print(df["wind_speed"].median())      # Median value
print(df["wind_speed"].quantile(0.95))
4.1
3.1
2.2
0.7874007874011809
0.6199999999999997
3.05
3.9649999999999994
# Cumulative sum. Useful to compute cumulative precipitation
print(df.precipitation.cumsum())
0     0
1    18
2    43
3    45
4    45
Name: precipitation, dtype: int64
# Unique values. Useful to compute unique wind directions
print(df.wind_direction.unique())
['E' 'NW' 'N' 'S']

Indexing and slicing

To start making computations we need to access the data inside the Pandas DataFrame. Indexing and slicing are useful operations to select portions of data by calling specific rows, columns, or a combination of both. The index operator [] is primarily intended to be used with column labels (e.g. df[column_name]); however, it can also handle row slices (e.g. df[rows]).

Select rows

# First three rows
df[0:3]
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2
2 2000-01-03 1 NaN NW 25 3

Select columns

We can call individual columns using bracket or dot notation. Note that in option 1 there is no . between df and ['wind_speed'].

df['wind_speed'] # Option 1
# df.wind_speed  # Option 2
0    2.2
1    3.2
2    NaN
3    4.1
4    2.9
Name: wind_speed, dtype: float64

To select more than one column you will need to group the column names in a list.

# Select multiple columns at once
df[['wind_speed','wind_direction']]
wind_speed wind_direction
0 2.2 E
1 3.2 NW
2 NaN NW
3 4.1 N
4 2.9 S

A common mistake when slicing multiple columns is to forget grouping column names into a list, so the following command will not work:

df['wind_speed','wind_direction']

Using the iloc method

iloc: integer-location. iloc selects rows (and columns) at specific integer positions. It only takes integers as input and is exclusive of its endpoint.

# Top 3 rows and columns 1 and 2
df.iloc[0:3,[1,2]]
month wind_speed
0 1 2.2
1 1 3.2
2 1 NaN
# Top 2 rows and all columns
df.iloc[0:2,:] # Same as: df.iloc[0:2]
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2

Although a bit more verbose and perhaps less pythonic, sometimes it is better to specify all the columns using the colon : character. In my opinion this notation is more explicit and clearly states the rows and columns of the slicing operation. For instance, df.iloc[0:2,:] is more explicit than df.iloc[0:2].

Using the loc method

loc selects rows (and columns) with specific labels. It is inclusive of its endpoint.

# Select multiple rows and columns at once using the loc method
df.loc[0:2,['wind_speed','wind_direction']]
wind_speed wind_direction
0 2.2 E
1 3.2 NW
2 NaN NW
# Some rows and all columns
df.loc[0:1,:]
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2
# First three elements of a single column
df.loc[0:2,'wind_speed']  
0    2.2
1    3.2
2    NaN
Name: wind_speed, dtype: float64
# First three elements of multiple columns
df.loc[0:2,['wind_speed','wind_direction']]
wind_speed wind_direction
0 2.2 E
1 3.2 NW
2 NaN NW

These statements will not work with loc:

df.loc[0:2,0:1]

df.loc[[0:2],[0:1]]
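If you need positional column slicing together with label-based row selection, one workaround is to translate the positions into labels via df.columns first. A sketch with a small toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# df.loc[0:2, 0:1] fails, but column positions can be
# converted to labels before calling loc
sel = df.loc[0:2, df.columns[0:2]]
print(sel)
```

Remember that the row slice 0:2 is inclusive with loc, so this returns three rows.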

Filter data using boolean indexing

Boolean indexing (a.k.a. logical indexing) consists of creating an array with True/False values as a result of one or more conditional statements that can be used to select the data that meet the specified condition.

Let’s select the data across all columns for days that have wind speed greater than 3 meters per second. We will first select the rows of df.wind_speed that are greater than 3 m/s, and then we will use the resulting boolean to slice the DataFrame.

idx = df['wind_speed'] > 3 # Rows in which the wind speed is greater than 3 m/s
idx  # Let's inspect the idx variable.
0    False
1     True
2    False
3     True
4    False
Name: wind_speed, dtype: bool
# Now let's apply the boolean variable to the dataframe
df[idx]
timestamp month wind_speed wind_direction precipitation doy
1 2000-01-02 1 3.2 NW 18 2
3 2000-01-04 1 4.1 N 2 4
# We can also apply the boolean to specific columns
df.loc[idx,'wind_direction']
1    NW
3     N
Name: wind_direction, dtype: object

It’s also possible to write the previous command as a single line of code. This is fine, but sometimes nesting too many conditions can create commands that are hard to read and understand. To avoid this problem, storing the boolean in a new variable makes things a lot easier to read and re-use.

# Same in a single line of code
df.loc[df['wind_speed'] > 3,'wind_direction']
1    NW
3     N
Name: wind_direction, dtype: object
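Multiple conditions can be combined with & (and) and | (or). Each comparison must be wrapped in parentheses because of Python's operator precedence. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'wind_speed': [2.2, 3.2, 4.1],
                   'precipitation': [0, 18, 2]})

# Parentheses are required: without them, & would bind before >
idx = (df['wind_speed'] > 3) & (df['precipitation'] > 10)
print(df[idx])
```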

Another popular way of filtering is to check whether an element or group of elements are within a set. Let’s check whether January 1 and January 2 are in the DataFrame.

idx_doy = df['doy'].isin([1,2])
idx_doy
0     True
1     True
2    False
3    False
4    False
Name: doy, dtype: bool
# Select all columns for the selected days of the year
df.loc[idx_doy,:]
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2

Pandas custom date range

Most datasets collected over a period of time include timestamps, but in case the timestamps are missing or you need to create a range of dates during your analysis, Pandas has the date_range() method to create a sequence of timestamps.

subset_dates = pd.date_range('20000102', periods=2, freq='D') # Two consecutive daily timestamps
subset_dates
DatetimeIndex(['2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq='D')
# The same method can generate monthly timestamps
pd.date_range('20200101', periods=df.shape[0], freq='M') # Month-end frequency
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31'],
              dtype='datetime64[ns]', freq='M')

Select range of dates with boolean indexing

Now that we covered both boolean indexing and pandas dates we can use these concepts to select data from a specific window of time. This is a pretty common operation when trying to select a subset of the entire DataFrame by a specific date range.

# Generate boolean for rows that match the subset of dates generated earlier
idx_subset = df["timestamp"].isin(subset_dates)
idx_subset
0    False
1     True
2     True
3    False
4    False
Name: timestamp, dtype: bool
# Generate a new DataFrame using only the rows with matching dates
df_subset = df.loc[idx_subset]
df_subset
timestamp month wind_speed wind_direction precipitation doy
1 2000-01-02 1 3.2 NW 18 2
2 2000-01-03 1 NaN NW 25 3
# It isn't always necessary to generate a new DataFrame
# So you can access a specific column like this
df.loc[idx_subset,"precipitation"]
1    18
2    25
Name: precipitation, dtype: int64
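An alternative to isin() for contiguous windows is to compare the timestamp column against the window endpoints directly. A sketch with toy data (the dates are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.date_range('2000-01-01', periods=5, freq='D'),
                   'precipitation': [0, 18, 25, 2, 0]})

# Inclusive window from Jan 2 to Jan 3; pandas parses the
# date strings before comparing against the datetime column
idx = (df['timestamp'] >= '2000-01-02') & (df['timestamp'] <= '2000-01-03')
print(df.loc[idx, 'precipitation'])
```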

Add and remove columns

The insert() and drop() methods allow us to add or remove columns to/from the DataFrame. The most common use of these methods is as follows:

df.insert(index_of_new_column, name_of_new_column, data_of_new_column)

df.drop(columns=[name_of_column_to_be_removed])

# Add new column at a specific location
df.insert(2, 'air_temperature', [25.4, 26, 27.1, 28.9, 30.2]) # Unlike df['air_temperature'] = [...], insert() controls the column position
df
timestamp month air_temperature wind_speed wind_direction precipitation doy
0 2000-01-01 1 25.4 2.2 E 0 1
1 2000-01-02 1 26.0 3.2 NW 18 2
2 2000-01-03 1 27.1 NaN NW 25 3
3 2000-01-04 1 28.9 4.1 N 2 4
4 2000-01-05 1 30.2 2.9 S 0 5
# Remove specific column
df.drop(columns=['air_temperature'])
timestamp month wind_speed wind_direction precipitation doy
0 2000-01-01 1 2.2 E 0 1
1 2000-01-02 1 3.2 NW 18 2
2 2000-01-03 1 NaN NW 25 3
3 2000-01-04 1 4.1 N 2 4
4 2000-01-05 1 2.9 S 0 5

Reset DataFrame index

# Replace the index with a variable of our choice (returns a new DataFrame)
df.set_index('timestamp')
month air_temperature wind_speed wind_direction precipitation doy
timestamp
2000-01-01 1 25.4 2.2 E 0 1
2000-01-02 1 26.0 3.2 NW 18 2
2000-01-03 1 27.1 NaN NW 25 3
2000-01-04 1 28.9 4.1 N 2 4
2000-01-05 1 30.2 2.9 S 0 5
# Reset the index back to the default integer index
df.reset_index(drop=True)
timestamp month air_temperature wind_speed wind_direction precipitation doy
0 2000-01-01 1 25.4 2.2 E 0 1
1 2000-01-02 1 26.0 3.2 NW 18 2
2 2000-01-03 1 27.1 NaN NW 25 3
3 2000-01-04 1 28.9 4.1 N 2 4
4 2000-01-05 1 30.2 2.9 S 0 5

Merge two dataframes

# Create a new DataFrame (continues the date sequence)

# Dictionary
data2 = {'timestamp': ['6/1/2000','7/1/2000','8/1/2000','9/1/2000','10/1/2000'], 
        'wind_speed': [4.3, 2.1, 0.5, 2.7, 1.9], 
        'wind_direction': ['N', 'N', 'SW', 'E', 'NW'],
        'precipitation': [0, 0, 0, 25, 0]}

# Dictionary to DataFrame
df2 = pd.DataFrame(data2)

# Convert strings to pandas datetime
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%d/%m/%Y") 

df2.head()
timestamp wind_speed wind_direction precipitation
0 2000-01-06 4.3 N 0
1 2000-01-07 2.1 N 0
2 2000-01-08 0.5 SW 0
3 2000-01-09 2.7 E 25
4 2000-01-10 1.9 NW 0
Warning

Not using format="%d/%m/%Y" in the previous cell results in the wrong datetime conversion. It is always recommended to specify the format.
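The ambiguity is easy to reproduce: without an explicit format, pandas assumes month-first for dates like 2/1/2000. A minimal sketch:

```python
import pandas as pd

# Month-first is the default interpretation
print(pd.to_datetime('2/1/2000'))                     # 2000-02-01
# An explicit format makes the day-first intent unambiguous
print(pd.to_datetime('2/1/2000', format='%d/%m/%Y'))  # 2000-01-02
```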

# Merge both Dataframes by applying a union of keys from both frames (how='outer' option)
df_merged = pd.merge(df, df2, how='outer')
df_merged
timestamp month air_temperature wind_speed wind_direction precipitation doy
0 2000-01-01 1.0 25.4 2.2 E 0 1.0
1 2000-01-02 1.0 26.0 3.2 NW 18 2.0
2 2000-01-03 1.0 27.1 NaN NW 25 3.0
3 2000-01-04 1.0 28.9 4.1 N 2 4.0
4 2000-01-05 1.0 30.2 2.9 S 0 5.0
5 2000-01-06 NaN NaN 4.3 N 0 NaN
6 2000-01-07 NaN NaN 2.1 N 0 NaN
7 2000-01-08 NaN NaN 0.5 SW 0 NaN
8 2000-01-09 NaN NaN 2.7 E 25 NaN
9 2000-01-10 NaN NaN 1.9 NW 0 NaN

Note how NaN values fill the columns that are not present in the new DataFrame for the appended rows

# Create another DataFrame with more limited data. Values every other day
data3 = {'timestamp': ['1/1/2000','3/1/2000','5/1/2000','7/1/2000','9/1/2000'], 
         'pressure': [980, 987, 985, 991, 990]}  # Pressure in millibars

df3 = pd.DataFrame(data3)
df3["timestamp"] = pd.to_datetime(df3["timestamp"], format="%d/%m/%Y")
df3.head()
timestamp pressure
0 2000-01-01 980
1 2000-01-03 987
2 2000-01-05 985
3 2000-01-07 991
4 2000-01-09 990
# Only the matching rows will be merged
df_merged.merge(df3, on="timestamp")
timestamp month air_temperature wind_speed wind_direction precipitation doy pressure
0 2000-01-01 1.0 25.4 2.2 E 0 1.0 980
1 2000-01-03 1.0 27.1 NaN NW 25 3.0 987
2 2000-01-05 1.0 30.2 2.9 S 0 5.0 985
3 2000-01-07 NaN NaN 2.1 N 0 NaN 991
4 2000-01-09 NaN NaN 2.7 E 25 NaN 990
# Only add values from the new, more sporadic, variable where there is a match.
df_merged.merge(df3, how="left")
timestamp month air_temperature wind_speed wind_direction precipitation doy pressure
0 2000-01-01 1.0 25.4 2.2 E 0 1.0 980.0
1 2000-01-02 1.0 26.0 3.2 NW 18 2.0 NaN
2 2000-01-03 1.0 27.1 NaN NW 25 3.0 987.0
3 2000-01-04 1.0 28.9 4.1 N 2 4.0 NaN
4 2000-01-05 1.0 30.2 2.9 S 0 5.0 985.0
5 2000-01-06 NaN NaN 4.3 N 0 NaN NaN
6 2000-01-07 NaN NaN 2.1 N 0 NaN 991.0
7 2000-01-08 NaN NaN 0.5 SW 0 NaN NaN
8 2000-01-09 NaN NaN 2.7 E 25 NaN 990.0
9 2000-01-10 NaN NaN 1.9 NW 0 NaN NaN

Operations with real dataset

# Read CSV file
data_url = "../datasets/ok_mesonet_8_apr_2019.csv"
df = pd.read_csv(data_url)
df.head(5)
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ...
1 ADAX Ada OK 34.80 -96.67 2019 4 15 15 20 ... 40 S 12 20 1011.13 78 48
2 ALTU Altus OK 34.59 -99.34 2019 4 15 15 20 ... 39 82 SSW 19 26 1007.86 82 45
3 ALV2 Alva OK 36.71 -98.71 2019 4 15 15 20 ... 32 82 S 20 26 1004.65 84 40
4 ANT2 Antlers OK 34.25 -95.67 2019 4 15 15 20 ... 35 S 11 20 1013.64 78 38

5 rows × 22 columns

Warning

Some columns seem to have empty cells. Ideally we would like to see these cells filled with NaN values. Something looks fishy. Let’s inspect some of these cells.

# Print one of the cells to see what's in there
df.loc[0,'RAIN']
' '

In the inspected cell we found a string with a single space in it. Now we can use the replace() method to substitute these strings with NaN values.

Note

In Pandas, the inplace=True argument determines whether an operation modifies the original DataFrame or Series. With inplace=True, the changes are applied directly to the existing data, so there is no need to assign the result to a new variable. When inplace is omitted (the default is inplace=False), Pandas returns a modified copy and leaves the original data unchanged, so you need to assign the result to a variable if you want to keep it.
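A minimal illustration of the difference, using a toy Series (the values are assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([' ', 'a'])

# Without inplace: the original is untouched and the result must be captured
s_new = s.replace(' ', np.nan)
print(s.iloc[0] == ' ')        # original still holds the space

# With inplace=True: the original is modified and nothing is returned
s.replace(' ', np.nan, inplace=True)
print(pd.isna(s.iloc[0]))      # original now holds NaN
```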

# Replace empty strings with NaN values
df.replace(' ', np.nan, inplace=True)
df.head(5)
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ADAX Ada OK 34.80 -96.67 2019 4 15 15 20 ... 40 NaN NaN S 12 20 1011.13 78 48 NaN
2 ALTU Altus OK 34.59 -99.34 2019 4 15 15 20 ... 39 NaN 82 SSW 19 26 1007.86 82 45 NaN
3 ALV2 Alva OK 36.71 -98.71 2019 4 15 15 20 ... 32 NaN 82 S 20 26 1004.65 84 40 NaN
4 ANT2 Antlers OK 34.25 -95.67 2019 4 15 15 20 ... 35 NaN NaN S 11 20 1013.64 78 38 NaN

5 rows × 22 columns

Tip

The previous solution is not the best. We could have resolved the issue with the empty strings by simply adding the option na_values=' ' to the pd.read_csv() function, like this: df = pd.read_csv(data_url, na_values=' '). This will automatically populate all cells that contain ' ' with NaN values.

Match specific stations

idx_acme = df['STID'].str.match('ACME')
df[idx_acme]
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 22 columns

idx_starts_with_A = df['STID'].str.match('A')
df[idx_starts_with_A]
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ADAX Ada OK 34.80 -96.67 2019 4 15 15 20 ... 40 NaN NaN S 12 20 1011.13 78 48 NaN
2 ALTU Altus OK 34.59 -99.34 2019 4 15 15 20 ... 39 NaN 82 SSW 19 26 1007.86 82 45 NaN
3 ALV2 Alva OK 36.71 -98.71 2019 4 15 15 20 ... 32 NaN 82 S 20 26 1004.65 84 40 NaN
4 ANT2 Antlers OK 34.25 -95.67 2019 4 15 15 20 ... 35 NaN NaN S 11 20 1013.64 78 38 NaN
5 APAC Apache OK 34.91 -98.29 2019 4 15 15 20 ... 41 NaN NaN S 23 29 1008.9 80 49 NaN
6 ARD2 Ardmore OK 34.19 -97.09 2019 4 15 15 20 ... 41 NaN NaN S 18 26 1011.43 77 50 NaN
7 ARNE Arnett OK 36.07 -99.90 2019 4 15 15 20 ... 10 NaN 85 SW 22 32 1005.13 NaN NaN NaN

8 rows × 22 columns

idx_has_A = df['STID'].str.contains('A')
df[idx_has_A].head(15)
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ADAX Ada OK 34.80 -96.67 2019 4 15 15 20 ... 40 NaN NaN S 12 20 1011.13 78 48 NaN
2 ALTU Altus OK 34.59 -99.34 2019 4 15 15 20 ... 39 NaN 82 SSW 19 26 1007.86 82 45 NaN
3 ALV2 Alva OK 36.71 -98.71 2019 4 15 15 20 ... 32 NaN 82 S 20 26 1004.65 84 40 NaN
4 ANT2 Antlers OK 34.25 -95.67 2019 4 15 15 20 ... 35 NaN NaN S 11 20 1013.64 78 38 NaN
5 APAC Apache OK 34.91 -98.29 2019 4 15 15 20 ... 41 NaN NaN S 23 29 1008.9 80 49 NaN
6 ARD2 Ardmore OK 34.19 -97.09 2019 4 15 15 20 ... 41 NaN NaN S 18 26 1011.43 77 50 NaN
7 ARNE Arnett OK 36.07 -99.90 2019 4 15 15 20 ... 10 NaN 85 SW 22 32 1005.13 NaN NaN NaN
8 BEAV Beaver OK 36.80 -100.53 2019 4 15 15 20 ... 9 NaN 84 SW 17 26 1003.9 91 34 NaN
11 BLAC Blackwell OK 36.75 -97.25 2019 4 15 15 20 ... 38 NaN NaN SSW 15 23 1007.02 80 44 NaN
20 BYAR Byars OK 34.85 -97.00 2019 4 15 15 20 ... 43 NaN NaN S 22 32 1010.64 77 49 NaN
21 CAMA Camargo OK 36.03 -99.35 2019 4 15 15 20 ... 32 NaN 82 S 23 29 1005.56 NaN NaN NaN
22 CARL Lake Carl Blackwell OK 36.15 -97.29 2019 4 15 15 20 ... 36 NaN 80 S 17 25 1007.56 80 50 NaN
24 CHAN Chandler OK 35.65 -96.80 2019 4 15 15 20 ... 37 NaN NaN SSW 16 27 1009.35 80 48 NaN
28 CLAY Clayton OK 34.66 -95.33 2019 4 15 15 20 ... 36 NaN NaN S 9 24 1012.9 78 40 NaN

15 rows × 22 columns

idx = df['NAME'].str.contains('Blackwell') & df['NAME'].str.contains('Lake')
df[idx]
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
22 CARL Lake Carl Blackwell OK 36.15 -97.29 2019 4 15 15 20 ... 36 NaN 80 S 17 25 1007.56 80 50 NaN

1 rows × 22 columns

The following line will not match any rows because string matching is case sensitive:

idx = df['NAME'].str.contains('LAKE')
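If a case-insensitive match is what you want, str.contains() accepts a case=False argument. A sketch with toy station names:

```python
import pandas as pd

names = pd.Series(['Lake Carl Blackwell', 'Apache'])

print(names.str.contains('LAKE'))              # case-sensitive: no match
print(names.str.contains('LAKE', case=False))  # matches the first entry
```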

idx = df['NAME'].str.contains('Blackwell') | df['NAME'].str.contains('Lake')
df[idx]
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
11 BLAC Blackwell OK 36.75 -97.25 2019 4 15 15 20 ... 38 NaN NaN SSW 15 23 1007.02 80 44 NaN
22 CARL Lake Carl Blackwell OK 36.15 -97.29 2019 4 15 15 20 ... 36 NaN 80 S 17 25 1007.56 80 50 NaN

2 rows × 22 columns

idx = df['STID'].isin(['ACME','ALTU'])
df[idx]
STID NAME ST LAT LON YR MO DA HR MI ... RELH CHIL HEAT WDIR WSPD WMAX PRES TMAX TMIN RAIN
0 ACME Acme OK 34.81 -98.02 2019 4 15 15 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ALTU Altus OK 34.59 -99.34 2019 4 15 15 20 ... 39 NaN 82 SSW 19 26 1007.86 82 45 NaN

2 rows × 22 columns