# Import modules
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.metrics import silhouette_score
Clustering raster layers
Cultivation of agricultural crops is highly dependent on the precipitation regime, temperature regime, and soil type of the region. However, these variables typically exhibit different spatial patterns, creating distinct zones or sub-regions with unique environmental attributes and yield potential.
In this exercise we will use three layers of information to divide the state of Kansas into homogeneous agrozones. The selected variables are:
30-year annual cumulative thermal units: Growing degree days or thermal units are highly correlated with crop growth and development. This variable exhibits a North-South gradient, with the southern part of the state exposed to higher temperatures and therefore greater cumulative thermal units. Source layer: Parameter-elevation Regressions on Independent Slopes Model (PRISM)
30-year annual precipitation: Precipitation is arguably one of the most important factors conditioning vegetation growth. The state of Kansas exhibits a strong East-West gradient, with eastern Kansas receiving about three times more precipitation per year (1200 mm per year) than western Kansas (400 mm per year). Source layer: Parameter-elevation Regressions on Independent Slopes Model (PRISM)
Sand content at 5 cm depth: Soil fertility and water holding capacity are major drivers of crop growth and yield. Agricultural crops are often planted in rich, fertile soils. Generally speaking, sandy soils have low fertility and low water holding capacity. So, in this exercise we will use the percentage of sand content in the top 5 cm as a proxy variable to represent different soil types. Source: SoilGrids, International Soil Reference and Information Centre.
Thermal units and annual precipitation are coarse regulators of potentially viable regions for agricultural crops, while sand content is a fine regulator.
# Read geotiff layers
gdd = xr.open_dataarray('../datasets/clustering/gdd_30_years.tif')
precip = xr.open_dataarray('../datasets/clustering/precip_30_years.tif')
sand = xr.open_dataarray('../datasets/clustering/sand_0_5_cm.tif')
# Create figure to inspect each layer
plt.figure(figsize=(6, 10))

plt.subplot(3,1,1)
gdd.plot(cmap='inferno', cbar_kwargs={'label':'GDD (Celsius-day/yr)'})
plt.title('Growing degree days')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
#plt.axis('equal')

plt.subplot(3,1,2)
precip.plot(cmap='Spectral', cbar_kwargs={'label':'Precipitation (mm/yr)'})
plt.title('Annual precipitation')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.subplot(3,1,3)
sand.plot(cmap='OrRd', cbar_kwargs={'label':'Sand 0-5 cm (%)'})
plt.title('Sand content')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.subplots_adjust(hspace=0.5)
plt.show()
# Find sand content for a single point
sand.sel(x=-100.8, y=37.9, method='nearest').data
array([54.24156397])
# Plot sand content around Garden City, KS
sand.sel(x=slice(-101.5, -100.5), y=slice(38.5, 37.0)).plot(cmap='OrRd', robust=True);
plt.title('Area around Garden City, KS')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
Text(0, 0.5, 'Latitude')
Preparing data for clustering
The clustering routine in Scikit-learn cannot handle missing values (i.e., NaN). Because our maps contain NaN values to indicate pixels outside of the state boundary or pixels over water bodies (like lakes and ponds), we need to assign a placeholder to these pixels before clustering. Rather than assigning zero, we will assign a negative value that is very different from other values in our dataset. When we are done with the clustering process, we will restore the mask by setting these pixels back to NaN.
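As a minimal illustration of this round trip, consider the toy sketch below (made-up values, not the actual layers):

# Toy example of the mask-placeholder-restore round trip (hypothetical values)
v = np.array([4.2, np.nan, 7.1, np.nan])
idx = np.isnan(v)  # Remember which entries were missing
v[idx] = -1.0      # Placeholder so the clustering routine can run
# ...clustering would happen here...
v[idx] = np.nan    # Restore the mask afterwards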
Clustering also requires normalizing the input data. The K-means approach treats all features equally when computing distances between data points, so if one variable has a larger scale than the others, it will dominate the clustering process. Normalizing the data ensures that all dimensions contribute equally to the clustering.
# Ensure all the layers have the same dimension.
print(gdd.shape)
print(precip.shape)
print(sand.shape)
(1, 362, 897)
(1, 362, 897)
(1, 362, 897)
Sometimes raster layers have different dimensions (e.g., a different spatial resolution). In that case, we first need to resample (upscale or downscale) the rasters so that all layers have the same shape. The code below shows how to do this for an arbitrary shape:
# Reproject the DataArray to the new resolution (requires the rioxarray extension)
from rasterio.enums import Resampling
reprojected_sand = sand.rio.reproject(sand.rio.crs,
                                      shape=(500, 1200),
                                      resampling=Resampling.cubic)
# Convert 2D arrays into 1D arrays
X_gdd = gdd.data.flatten()
X_precip = precip.data.flatten()
X_sand = sand.data.flatten()
# Find grid cells with NaN values in any layer
# Pixels with a NaN value in at least one feature will be masked in all features
idx_nan = np.isnan(X_gdd) | np.isnan(X_precip) | np.isnan(X_sand)
# Replace the NaN values with the placeholder -1.0
X_gdd[idx_nan] = -1.0
X_precip[idx_nan] = -1.0
X_sand[idx_nan] = -1.0
# Create array of inputs (n_samples by n_features)
X = np.column_stack((X_gdd, X_precip, X_sand))
# Inspect shape of resulting input array
print(X.shape)
(324714, 3)
Normalization
We will use the normalize function from Scikit-learn, which implements an L2 normalization, also known as Euclidean normalization. This technique divides each vector (our feature) by its Euclidean norm, which is the square root of the sum of the squares of its elements.
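As a quick sanity check, the sketch below (using a small random array for illustration) confirms that normalize(X, axis=0) is equivalent to dividing each column by its Euclidean norm:

# L2-normalizing each column equals dividing it by its Euclidean norm
A = np.random.rand(5, 3)                          # Small array for illustration
manual = A / np.sqrt(np.sum(A**2, axis=0))        # Divide each feature by its L2 norm
print(np.allclose(normalize(A, axis=0), manual))  # Expected output: True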
# Normalize (axis=0 normalizes each feature, axis=1 normalizes each sample)
X_normalized = normalize(X, axis=0)
# Alternatively you can compute your own normalizing metric (like the z-score)
#X_normalized = (X - X.mean(axis=0))/X.std(axis=0)
# Find best number of clusters
cluster_range = list(range(2, 10))
score = []
for k in cluster_range:
    clusters = KMeans(n_clusters=k, random_state=1, n_init='auto').fit(X_normalized)

    # Score a random subsample of 5,000 points; computing the silhouette
    # score on all ~325,000 pixels would be slow (it requires pairwise distances)
    rand_samples = np.random.randint(X_normalized.shape[0], size=(5_000))
    score.append(silhouette_score(X_normalized[rand_samples,:], clusters.labels_[rand_samples]))
# Visualize silhouette score
plt.figure(figsize=(5,4))
plt.plot(cluster_range, score, 'o-')
plt.xlabel('Cluster number')
plt.ylabel('Silhouette score')
plt.show()
From the Scikit-learn documentation: “The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.” It seems that the best number of clusters is 4. However, remember that one of these clusters is the mask (all the values set to -1), so the correct number of clusters defining the agro-zones is 3.
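The same conclusion can be reached programmatically by picking the k with the highest score (a small convenience step, not part of the original workflow):

# Select the number of clusters with the highest silhouette score
best_k = cluster_range[np.argmax(score)]
print(best_k)  # Expected: 4, in agreement with the figure above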
# Cluster the data
k = 3 + 1 # Add an extra cluster for the mask pixels
clusters = KMeans(n_clusters=k, random_state=1, n_init='auto').fit(X_normalized)
# Check type of a single value of the resulting cluster labels
print(type(clusters.labels_[0]))
<class 'numpy.int32'>
# Convert to float so that we can bring back the NaN values
# to mask the resulting image
cluster_labels = clusters.labels_.astype(np.float64)
# Restore mask
cluster_labels[idx_nan] = np.nan
print(cluster_labels.shape)
(324714,)
# Reshape cluster labels so that they match the shape of the original maps.
cluster_labels = np.reshape(cluster_labels, sand.data.shape)
print(cluster_labels.shape) # Check that we changed the shape of the array
(1, 362, 897)
# Add extra dimension
#cluster_labels = np.expand_dims(cluster_labels, axis=0)
agro_regions = xr.DataArray(cluster_labels, coords=sand.coords, dims=sand.dims)
# Plot map (use the x and y coordinates from another layer since the cluster labels don't have coordinates)
plt.figure(figsize=(6,3))
agro_regions.plot(cmap='Set3', add_colorbar=False)
plt.title('Kansas Agro-regions')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Create summary of properties for each region
Now that we have divided the state into regions of more homogeneous conditions, let’s generate a summary table with the median values of each property for each zone.
# Get the unique cluster labels in the map, ignoring NaN pixels
labels = np.unique(cluster_labels[~np.isnan(cluster_labels)])
print(labels)
[0. 1. 3.]
Note that label 2 is missing: that cluster captured the placeholder (mask) pixels, and it disappeared when we brought the NaN values back into the image.
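An optional check of which labels survived, and how many pixels each cluster contains, is the return_counts option of np.unique:

# Count the pixels assigned to each remaining cluster (ignoring NaN)
unique_labels, counts = np.unique(cluster_labels[~np.isnan(cluster_labels)], return_counts=True)
print(dict(zip(unique_labels, counts)))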
# Iterate over each label and print the median values for each layer
for label in labels:
    # Create boolean array for current cluster label
    idx = cluster_labels == label

    # Create separate strings to keep lines shorter
    txt_gdd = f"GDD:{round(np.median(gdd.data[idx]))} °C-day per year"
    txt_precip = f"PRECIP:{round(np.median(precip.data[idx]))} mm"
    txt_sand = f"SAND:{round(np.median(sand.data[idx]))} %"

    # Print all the strings together
    print(f"Cluster {label} => {txt_gdd}, {txt_precip}, {txt_sand}")
Cluster 0.0 => GDD:4705 °C-day per year, PRECIP:944 mm, SAND:16 %
Cluster 1.0 => GDD:4358 °C-day per year, PRECIP:560 mm, SAND:27 %
Cluster 3.0 => GDD:4830 °C-day per year, PRECIP:656 mm, SAND:46 %
Practice
Combine the inputs array X with the cluster labels and create a Pandas DataFrame to easily compute descriptive statistics using the groupby() method (see the sketch after this list).
Partition the state of Kansas using a different number of regions. What type of regions can be detected by using a higher number of clusters? What happens if you only divide the state into 3 regions?
Which of the three variables would dominate the clustering process if the input variables are not normalized?
Can you run a similar analysis for your own state?
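For the first practice item, a minimal sketch might look like the following (this assumes pandas is installed; the column names are illustrative):

import pandas as pd

# Combine the inputs with the flattened cluster labels (NaN labels mark the mask)
df = pd.DataFrame(X, columns=['gdd', 'precip', 'sand'])
df['cluster'] = cluster_labels.flatten()

# Rows with a NaN cluster label are dropped by groupby by default
print(df.groupby('cluster').median())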