15  Data structures

Author

Andres Patrignani

Published

January 4, 2024

In the realm of programming and data science, data structures act as containers where information is stored for later use. Python offers a variety of built-in data structures like lists, tuples, sets, and dictionaries, each with its own unique properties and use cases.

The choice of the right data structure often depends on factors like scalability, data format, data complexity, and the programmer’s preference. Consider devoting some time before starting a new script to test and select an appropriate data structure for your program.

Tip

Lists are created using [], dictionaries using {}, and tuples using (), but to access the content inside all of them we use [].
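As a quick illustration of the tip above, the following sketch creates one structure of each kind and then retrieves an item from each with square brackets. The variable names and values are made up for this example.

```python
# Hypothetical values used only to illustrate the bracket syntax
crops_list = ["Wheat", "Corn", "Soybean"]     # List: square brackets
crops_tuple = ("Wheat", "Corn", "Soybean")    # Tuple: parentheses
crop_yield = {"Wheat": 3.5, "Corn": 10.2}     # Dictionary: curly braces

# Regardless of how the structure was created, content is accessed with []
first_from_list = crops_list[0]      # Access by position
first_from_tuple = crops_tuple[0]    # Access by position
wheat_yield = crop_yield["Wheat"]    # Access by key

print(first_from_list, first_from_tuple, wheat_yield)
```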

Lists

Lists are versatile data structures defined by square brackets [ ] that are ideal for storing sequences of elements, such as strings, numbers, or a mix of different data types. Lists are mutable, meaning that you can modify their content after creation. Lists also support nesting, where a list can contain other lists. A key feature of lists is the ability to access elements through indexing (for a single item) or slicing (for multiple items). Although lists resemble arrays in other languages, such as MATLAB, Python lists do not natively support element-wise operations. That functionality is characteristic of NumPy arrays, which belong to a more advanced module that we will explore later.
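To make the points about nesting and element-wise operations concrete, here is a small sketch. The variable names and values are illustrative, and the list comprehension is only one of several ways to apply an operation to each item without NumPy.

```python
# Nested list: each inner list holds percent sand and percent clay (illustrative values)
sand_clay = [[92, 3], [40, 20], [5, 45]]
print(sand_clay[1])      # Second inner list
print(sand_clay[1][0])   # First item of the second inner list

# Multiplying a list repeats its content rather than multiplying each element
sand = [92, 40, 5]
repeated = sand * 2
print(repeated)

# A list comprehension applies the operation to each element
sand_fraction = [value / 100 for value in sand]
print(sand_fraction)
```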

# List with same data type
soil_texture = ["Sand", "Loam", "Silty clay", "Silt loam", "Silt"] # Strings (soil textural classes)
mean_sand = [92, 40, 5, 20, 5]  # Integers (percent sand for each soil textural class)

print(soil_texture)
print(type(soil_texture)) # Print type of data structure
    
# List with mixed data types (strings, floats, and an entire dictionary)
# Sample ID, soil texture, pH value, and multiple nutrient concentration in ppm
soil_sample = ["Sample_001", "Loam", 6.5, {"N": 20, "P": 15, "K": 5}]  
['Sand', 'Loam', 'Silty clay', 'Silt loam', 'Silt']
<class 'list'>
# Indexing a list
print(soil_texture[0]) # Accesses the first item
print(soil_sample[2])
Sand
6.5
# Slicing a list
print(soil_texture[2:4])
print(soil_texture[2:])
print(soil_texture[:3])
['Silty clay', 'Silt loam']
['Silty clay', 'Silt loam', 'Silt']
['Sand', 'Loam', 'Silty clay']
# Find the length of a list
print(len(soil_texture))  # Returns the number of items
5
Note

Can you guess how many items are in the soil_sample list? Use Python to check your answer!

# Append elements to a list
soil_texture.append("Clay")  # Adds 'Clay' to the end of the list 'soil_texture'
print(soil_texture)
['Sand', 'Loam', 'Silty clay', 'Silt loam', 'Silt', 'Clay']
# Append multiple elements
soil_texture.extend(["Loamy sand", "Sandy loam"])
print(soil_texture)
['Sand', 'Loam', 'Silty clay', 'Silt loam', 'Silt', 'Clay', 'Loamy sand', 'Sandy loam']
Note

Appending multiple items using the append() method will result in nested lists, while using the extend() method will result in merged lists. Give it a try and see if you can observe the difference.
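If you want to compare your result, the sketch below contrasts both methods on a separate, made-up list of crops so that soil_texture is left untouched.

```python
# append() adds the whole list as a single item (nested list)
nested_crops = ["Wheat", "Corn"]
nested_crops.append(["Soybean", "Sorghum"])
print(nested_crops)
print(len(nested_crops))

# extend() adds each item individually (merged list)
merged_crops = ["Wheat", "Corn"]
merged_crops.extend(["Soybean", "Sorghum"])
print(merged_crops)
print(len(merged_crops))
```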

# Remove list element
soil_texture.remove("Clay")
print(soil_texture)
['Sand', 'Loam', 'Silty clay', 'Silt loam', 'Silt', 'Loamy sand', 'Sandy loam']
# Insert an item at a specified position or index
soil_texture.insert(2, "Clay")  # Inserts 'Clay' back again, but at index 2
print(soil_texture)
['Sand', 'Loam', 'Clay', 'Silty clay', 'Silt loam', 'Silt', 'Loamy sand', 'Sandy loam']
# Remove element based on index
soil_texture.pop(4)
print(soil_texture)
['Sand', 'Loam', 'Clay', 'Silty clay', 'Silt', 'Loamy sand', 'Sandy loam']
# An alternative method to delete one or more elements of the list.
del soil_texture[1:3]
print(soil_texture)
['Sand', 'Silty clay', 'Silt', 'Loamy sand', 'Sandy loam']

Tuples

Tuples are an efficient data structure defined by parentheses ( ), and are especially useful for storing fixed sets of elements like coordinates in a two-dimensional plane (e.g., point(x, y)) or triplets of color values in the RGB color space (e.g., (r, g, b)). While tuples can be nested within lists and support operations similar to lists, like indexing and slicing, the main difference is that tuples are immutable. Once a tuple is created, its content cannot be changed. This makes tuples particularly valuable for storing critical information that must remain constant in your code.

# Geographic coordinates 
mauna_loa = (19.536111, -155.576111, 3397) # Mauna Loa Observatory in Hawaii, USA
konza_prairie = (39.106704, -96.608968, 320) # Konza Prairie in Kansas, USA

locations = [mauna_loa, konza_prairie]
print(locations)
[(19.536111, -155.576111, 3397), (39.106704, -96.608968, 320)]
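Since tuples support indexing and slicing just like lists, individual coordinates can be retrieved by position. The sketch below also shows tuple unpacking, a common idiom that is not covered in the text above. The variable names on the left-hand side are arbitrary.

```python
# Latitude, longitude, and elevation in meters (same values as konza_prairie above)
konza_prairie = (39.106704, -96.608968, 320)

# Indexing and slicing work as with lists
elevation = konza_prairie[2]
lat_lon = konza_prairie[0:2]
print(elevation)
print(lat_lon)

# Unpacking assigns each item of the tuple to its own variable
lat, lon, elev = konza_prairie
print(lat, lon, elev)
```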
# A list of tuples
colors = [(0,0,0), (255,255,255), (0,255,0)] # Each tuple refers to black, white, and green.
print(colors)
print(type(colors[0]))
[(0, 0, 0), (255, 255, 255), (0, 255, 0)]
<class 'tuple'>
Note

What happens if we want to change the first element of the third tuple from 0 to 255? Hint: colors[2][0] = 255

Dictionaries

Dictionaries are a highly versatile and popular data structure that stores and retrieves data using key-value pairs, which are defined within curly braces { } or using the dict() function. This means that you can access, add, or modify data using unique keys, making dictionaries incredibly efficient for organizing and handling data using named references.
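A minimal sketch of these operations follows. The station dictionary and its keys are invented for illustration.

```python
# Create a dictionary with curly braces (illustrative values)
station = {"name": "Konza Prairie", "state": "Kansas"}

# The dict() function with keyword arguments builds an equivalent dictionary
station_alt = dict(name="Konza Prairie", state="Kansas")
print(station == station_alt)

# Access a value by its key
print(station["name"])

# Add a new key-value pair
station["elevation_m"] = 320

# Modify an existing value
station["state"] = "KS"

print(station)
```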

Dictionaries are particularly useful in situations where data doesn’t fit neatly into a matrix or table format and has multiple attributes, such as weather data, where you might store various weather parameters (temperature, humidity, wind speed) using descriptive keys. Unlike lists or tuples, dictionaries are not indexed by position (although since Python 3.7 they do preserve the order in which items were inserted). Instead, they excel in scenarios where each piece of data needs to be associated with a specific identifier. This structure provides a straightforward and intuitive way to manage complex, unstructured data.

# Weather data is often stored in dictionary or dictionary-like data structures.
D = {'city':'Manhattan',
     'state':'Kansas',
     'coords': (39.208722, -96.592248, 350),
     'data': [{'date' : '20220101', 
              'precipitation' : {'value':12.5, 'units':'mm', 'instrument':'TE525'},
              'air_temperature' : {'value':5.6, 'units':'Celsius', 'instrument':'ATMOS14'}
              },
              {'date' : '20220102',
              'precipitation' : {'value':0, 'units':'mm', 'instrument':'TE525'},
              'air_temperature' : {'value':1.3, 'units':'Celsius', 'instrument':'ATMOS14'}
              }]
    }

print(D)
print(type(D))
{'city': 'Manhattan', 'state': 'Kansas', 'coords': (39.208722, -96.592248, 350), 'data': [{'date': '20220101', 'precipitation': {'value': 12.5, 'units': 'mm', 'instrument': 'TE525'}, 'air_temperature': {'value': 5.6, 'units': 'Celsius', 'instrument': 'ATMOS14'}}, {'date': '20220102', 'precipitation': {'value': 0, 'units': 'mm', 'instrument': 'TE525'}, 'air_temperature': {'value': 1.3, 'units': 'Celsius', 'instrument': 'ATMOS14'}}]}
<class 'dict'>

The example above has several interesting features:

- The city and state names are ordinary strings.
- The geographic coordinates (latitude, longitude, and elevation) are grouped using a tuple.
- Weather data for each day is a list of dictionaries.
- In a single dictionary we have observations for a given timestamp together with the associated metadata including units, sensors, and location.

Personally, I think that dictionaries are ideal data structures in the context of reproducible science.
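Values in a nested structure like this one are retrieved by chaining keys and indexes from the outside in. The sketch below uses a reduced copy of the dictionary, with a single day and a single variable, so that it can run on its own.

```python
# Reduced copy of the dictionary above, with a single day of data
D = {'city': 'Manhattan',
     'coords': (39.208722, -96.592248, 350),
     'data': [{'date': '20220101',
               'air_temperature': {'value': 5.6, 'units': 'Celsius', 'instrument': 'ATMOS14'}}]
    }

# A key returns the stored object, which can then be indexed in turn
latitude = D['coords'][0]                 # First item of the tuple
first_day = D['data'][0]                  # First dictionary in the list
air_temp = first_day['air_temperature']   # Nested dictionary

print(latitude)
print(first_day['date'], air_temp['value'], air_temp['units'])
```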

Note

The structure of the dictionary above depends on programmer preferences. For instance, rather than grouping all three coordinates into a tuple, a different programmer may prefer to store the values under individual name:value pairs, such as latitude : 39.208722, longitude : -96.592248, and altitude : 350.
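A sketch of that alternative layout is shown below. The variable names are arbitrary, and building the dictionary with zip() is just one possible way to convert from the tuple form.

```python
# Coordinates grouped in a tuple, as in the dictionary above
coords = (39.208722, -96.592248, 350)

# Alternative: one name:value pair per coordinate
location = {'latitude': 39.208722, 'longitude': -96.592248, 'altitude': 350}

# The same dictionary can be built from the tuple by pairing names with values
names = ('latitude', 'longitude', 'altitude')
location_from_tuple = dict(zip(names, coords))

print(location['latitude'])
print(location == location_from_tuple)
```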

Sets

Sets are a unique and somewhat less commonly used data structure compared to lists, tuples, and dictionaries. Sets are defined with curly braces { } (without defining key-value pairs) or the set() function and are similar to mathematical sets, meaning they store unordered collections of unique items. In other words, sets don’t allow duplicate items, existing items cannot be changed in place (although items can be added and removed), and items are not indexed. This makes sets ideal for operations like determining membership, eliminating duplicates, and performing mathematical set operations such as unions, intersections, and differences. In scenarios like database querying or data analysis where you need to compare different datasets, sets can be used to find common elements (intersection), all elements (union), or differences between datasets.
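Before turning to set operations, the following sketch illustrates duplicate removal, membership tests, and adding or removing items. The list of observations is invented for this example.

```python
# Duplicates are dropped when a list is converted to a set
observed = ["Dandelion", "Crabgrass", "Dandelion", "Thistle", "Crabgrass"]
unique_observed = set(observed)
print(len(observed), len(unique_observed))

# Membership tests use the 'in' keyword
has_thistle = "Thistle" in unique_observed
has_foxtail = "Foxtail" in unique_observed
print(has_thistle, has_foxtail)

# Items can be added and removed, but not accessed by index
unique_observed.add("Foxtail")
unique_observed.remove("Dandelion")
print(sorted(unique_observed))  # Sorted only to obtain a predictable print order

# Empty curly braces create a dictionary, so an empty set requires set()
empty_set = set()
print(type(empty_set), type({}))
```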

# Union operation
field1_weeds = set(["Dandelion", "Crabgrass", "Thistle", "Dandelion"])
field2_weeds = set(["Thistle", "Crabgrass", "Foxtail"])
unique_weeds = field1_weeds.union(field2_weeds)
print(unique_weeds)
{'Crabgrass', 'Foxtail', 'Thistle', 'Dandelion'}
# Intersection operation
common_weeds = field1_weeds.intersection(field2_weeds)
print(common_weeds)
{'Crabgrass', 'Thistle'}
# Difference operation
different_weeds_in_field1 = field1_weeds.difference(field2_weeds)
print(different_weeds_in_field1)

different_weeds_in_field2 = field2_weeds.difference(field1_weeds)
print(different_weeds_in_field2)
{'Dandelion'}
{'Foxtail'}
# We can also chain more variables if needed
field3_weeds = set(["Pigweed", "Clover"])
field1_weeds.union(field2_weeds).union(field3_weeds)
{'Clover', 'Crabgrass', 'Dandelion', 'Foxtail', 'Pigweed', 'Thistle'}
Note

For this particular example, you could leverage a set data structure to easily compare field notes from multiple agronomists collecting information across farmer fields in a given region and quickly determine dominant weed species.
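One possible way to sketch that comparison is shown below. The agronomist labels and species lists are hypothetical, and treating "dominant" as "reported by every agronomist" is an assumption made only for this example.

```python
# Hypothetical field notes from three agronomists
notes = {
    "agronomist_a": {"Dandelion", "Crabgrass", "Thistle"},
    "agronomist_b": {"Thistle", "Crabgrass", "Foxtail"},
    "agronomist_c": {"Crabgrass", "Pigweed", "Thistle"},
}

# Species reported by every agronomist (assumed here to indicate dominance)
reported_by_all = set.intersection(*notes.values())

# Species reported by at least one agronomist
reported_by_any = set.union(*notes.values())

print(sorted(reported_by_all))  # Sorted only to obtain a predictable print order
print(sorted(reported_by_any))
```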

Practice

  1. Create a list with the scientific names of four common grasses in the US Great Plains: big bluestem, switchgrass, Indian grass, and little bluestem.

  2. Using a periodic table, store in a dictionary the name, symbol, atomic mass, melting point, and boiling point of oxygen, nitrogen, phosphorus, and hydrogen. Then, write two separate Python statements to retrieve the boiling point of oxygen and hydrogen. Combined, these two elements can form water, which has a boiling point of 100 degrees Celsius. How does this value compare to the boiling point of the individual elements?

  3. Without editing the original definition of the dictionary that you created in the previous exercise, add the properties of a new element: carbon.

  4. Create a list of tuples encoding the latitude, longitude, and altitude of three national parks of your choice.