32  SQLite Database

Author

Andres Patrignani

Published

January 15, 2024

SQLite is a lightweight, disk-based database that doesn’t require a separate server. Python comes with a built-in sqlite3 module, which allows you to work with SQLite databases. SQL stands for Structured Query Language, which is a standardized programming language that is used to manage relational databases. To learn more about SQLite daabases, in this tutorial we will:

  1. create a simple database by adding and removing a few entries. This example illustrates how to start a new database from scratch without prior tabulated data.

  2. create a more advanced database using a real-world dataset and the Pandas library. This example illustrates how to convert an existing spreadsheet with data into a database that you can then continue to add, update, modify, or delete data entries.

Additional software

To access and inspect the database I recommend using an open source tool like sqlitebrowser. With this tool you can also create, design, and edit sqlite databases, but we will do some of these steps using Python.

Key commands

  • CREATE TABLE: A SQL command used to create a new table in a database.
  • INSERT: A SQL command used to add new rows of data to a table in the database.
  • SELECT: A SQL command used to query data from a table, returning rows that match the specified criteria.
  • UPDATE: A SQL command used to modify existing data in a table.
  • DELETE: A SQL command used to remove rows from a table in the database.

Data types

SQLite supports a variety of data types:

  • TEXT: For storing character data. SQLite supports UTF-8, UTF-16BE, and UTF-16LE encodings.

  • INTEGER: For storing integer values. The size can be from 1 byte to 8 bytes, depending on the magnitude of the value.

  • REAL: For storing floating-point values. It is a double-precision (8-byte) floating point number.

  • BLOB: Stands for Binary Large Object. Used to store data exactly as it was input, such as images, files, or binary data.

  • NUMERIC: This type can be used for both integers and floating-point numbers. SQLite decides whether to use integer or real based on the value’s nature.

  • BOOLEAN: SQLite does not have a separate boolean storage class. Instead, boolean values are stored as integers 0 (false) and 1 (true).

  • DATE and TIME: SQLite does not have a storage class set aside for storing dates and/or times. Instead, they are stored as TEXT (as ISO8601 strings), REAL (as Julian day numbers), or INTEGER (as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC).

In SQLite the datatype you specify for a column acts more like a hint than a strict enforcement, allowing for flexibility in the types of data that can be inserted into a column. This is a distinctive feature compared to more rigid type systems in other database management systems.

Set up a simple database

After importing the sqlite3 module we create a connection to a new SQLite database. If the file doesn’t exist, the SQLite module will create it. This is convenient since we don’t have to be constantly checking whether the database exists or worry about overwriting the database.

# Import modules
import sqlite3
# Connect to the database
conn = sqlite3.connect('soils.db')
# Create a cursor object using the cursor() method
cursor = conn.cursor()

# Create table
cursor.execute('''CREATE TABLE soils
                 (id INTEGER PRIMARY KEY, date TEXT, lat REAL, lon REAL, vwc INTEGER);''')

# Save (commit) the changes
conn.commit()

Add Data

In SQLite databases, the construction (?,?,?,?) is used as a placeholder for parameter substitution in SQL statements, especially with the INSERT, UPDATE, and SELECT commands. This construction offers the following advantages:

  1. SQL Injection Prevention: By using placeholders, you prevent SQL injection, a common web security vulnerability where attackers can interfere with the queries that an application makes to its database.

  2. Data Handling: It automatically handles the quoting of strings and escaping of special characters, reducing errors in SQL query syntax due to data.

  3. Query Efficiency: When running similar queries multiple times, parameterized queries can improve performance as the database engine can reuse the query plan and execution path.

Each ? is a placeholder that is replaced with provided data values in a tuple when the execute method is called. This ensures that the values are properly formatted and inserted into the database, enhancing security and efficiency.

# Insert a row of data
obs = ('2024-01-02', 37.54, -98.78, 38)
cursor.execute("INSERT INTO soils (date, lat, lon, vwc) VALUES (?,?,?,?)", obs1)

# Save (commit) the changes
conn.commit()

Add with multiple entries

You can insert multiple entries at once using executemany()

# A list of multiple crop records
new_data = [('2024-01-02', 36.54, -98.12, 18),
            ('2024-04-14', 38.46, -99.78, 21),
            ('2024-05-23', 38.35, -98.01, 29)]

# Inserting multiple records at a time
cursor.executemany("INSERT INTO soils (date, lat, lon, vwc) VALUES (?,?,?,?)", new_data)

# Save (commit) the changes
conn.commit()

# Retrieve all data
cursor.execute('SELECT * FROM soils;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 38)
(2, '2024-01-02', 36.54, -98.12, 18)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)

Query data

To query specific data from a table in an SQLite database using the SELECT statement, you can specify conditions using the WHERE clause. Here’s a basic syntax:

SELECT col1, col2, ... FROM table_name WHERE condition1 AND condition2;

To execute these queries remember to establish a connection, create a cursor object, execute the query using cursor.execute(query), and then use cursor.fetchall() to retrieve the results. Close the connection to the database once you’re done.

# Retrieve all data
cursor.execute('SELECT * FROM soils;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 38)
(2, '2024-01-02', 36.54, -98.12, 18)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
# Retrieve specific data
cursor.execute('SELECT date FROM soils WHERE vwc >= 30;')
for row in cursor.fetchall():
    print(row)
('2024-01-02',)
# Retrieve specific data (note that we need spacify the date as a string)
cursor.execute('SELECT lat,lon FROM soils WHERE date == "2024-03-07";')
for row in cursor.fetchall():
    print(row)
cursor.execute('SELECT * FROM soils WHERE vwc >=30 AND vwc < 40;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 38)

Modify data

You can update records that match certain criteria.

# Update quantity for specific date note that this will update both rows with the same date)
cursor.execute("UPDATE soils SET vwc = 15 WHERE date = '2024-01-02';")

# Save (commit) the changes
conn.commit()
Update SQLite

Note that the previous command updates both rows with the same date. To only specify a single row we need to be more specific in our command.

# Retrieve all data
cursor.execute('SELECT * FROM soils;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 15)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
# Update quantity for specific date note that this will update both rows with the same date)
cursor.execute("UPDATE soils SET vwc = 5 WHERE date = '2024-01-02' AND id = 2;")

# Save (commit) the changes
conn.commit()

# Retrieve all data
cursor.execute('SELECT * FROM soils;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 5)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)

Remove data

To remove records, use the DELETE statement.

# Delete the Wheat record
cursor.execute("DELETE FROM soils WHERE id = 3")

# Save (commit) the changes
conn.commit()

# Retrieve all data
cursor.execute('SELECT * FROM soils;')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 5)
(4, '2024-05-23', 38.35, -98.01, 29)

Add new column/header

# Add a new column
# Use the DEFAULT construction like: DEFAULT 'Unknown' 
# to populate new column with custom value
cursor.execute("ALTER TABLE soils ADD COLUMN soil_type TEXT;")

# Retrieve data one more time before we close the database
cursor.execute('SELECT * FROM soils')
for row in cursor.fetchall():
    print(row)
(1, '2024-01-02', 37.54, -98.78, 15, None)
(2, '2024-01-02', 36.54, -98.12, 5, None)
(4, '2024-05-23', 38.35, -98.01, 29, None)

Closing the connection

Once done with the operations, close the connection to the database.

conn.close()

Use Pandas to set up database

For this exercise we will use a table of sorghum yields for Franklin county, KS obtained in 2023. The dataset contains breeder brand, hybrid name, yield, moisture, and total weight. The spreadsheet contains metadata on the first line (which we are going to skip) and in the last few rows. From these last rows, we will use functions to match strings and retrieve the planting and harvest dates of the trial. We will then add this information to the dataframe, and then create an SQLite database.

Pandas provides all the methods to read, clean, and export the Dataframe to a SQLite database.

# Import modules
import pandas as pd
import sqlite3
df = pd.read_csv('../datasets/sorghum_franklin_county_2023.csv',
                skiprows=[0,1,3])
# Inspect first few rows
df.head()
BRAND NAME YIELD PAVG MOIST TW
0 POLANSKY 5719 137.2 107.8 14.7 59.1
1 DYNA-GRO M60GB88 135.3 106.3 14.0 57.9
2 DYNA-GRO GX22936 134.7 105.8 13.9 58.7
3 POLANSKY 5522 132.3 103.9 13.9 58.4
4 DYNA-GRO GX22932 131.8 103.6 14.5 59.1
# Inspect last few rows
df.tail(10)
BRAND NAME YIELD PAVG MOIST TW
18 DYNA-GRO M67GB87 120.5 94.7 13.9 56.1
19 DYNA-GRO M59GB94 117.7 92.5 13.8 57.7
20 NaN AVERAGE 127.2 100.0 14.1 58.3
21 NaN CV (%) 8.4 8.4 0.3 0.8
22 NaN LSD (0.05) 5.2 4.1 0.3 0.3
23 *Yields must differ by more than the LSD value... NaN NaN NaN NaN NaN
24 different. NaN NaN NaN NaN NaN
25 Planted 5-24-23 NaN NaN NaN NaN NaN
26 Harvested 11-15-23 NaN NaN NaN NaN NaN
27 Fertility 117-38-25-20 Strip till NaN NaN NaN NaN NaN
df.dropna(subset='BRAND', inplace=True) 
df.tail(10)
BRAND NAME YIELD PAVG MOIST TW
15 DYNA-GRO M63GB78 122.7 96.4 13.9 58.0
16 DYNA-GRO GX22937 121.3 95.4 14.2 58.4
17 DYNA-GRO GX22923 121.2 95.2 13.7 55.8
18 DYNA-GRO M67GB87 120.5 94.7 13.9 56.1
19 DYNA-GRO M59GB94 117.7 92.5 13.8 57.7
23 *Yields must differ by more than the LSD value... NaN NaN NaN NaN NaN
24 different. NaN NaN NaN NaN NaN
25 Planted 5-24-23 NaN NaN NaN NaN NaN
26 Harvested 11-15-23 NaN NaN NaN NaN NaN
27 Fertility 117-38-25-20 Strip till NaN NaN NaN NaN NaN
# Extract planting date
idx = df['BRAND'].str.contains("Planted")
planting_date_str = df.loc[idx, 'BRAND'].values[0]
planting_date = planting_date_str.split(' ')[1]
print(planting_date)
5-24-23
# Extract harvest date
idx = df['BRAND'].str.contains("Harvested")
harvest_date_str = df.loc[idx, 'BRAND'].values[0]
harvest_date = harvest_date_str.split(' ')[1]
print(harvest_date)
11-15-23
# Once we are done extracting metadata, let's remove the last few rows
df = df.iloc[:-5]
# Convert header names to lower case to avoid conflict with SQL syntax
df.rename(str.lower, axis='columns', inplace=True)
df.head()
brand name yield pavg moist tw planting_date harvest_date
0 POLANSKY 5719 137.2 107.8 14.7 59.1 5-24-23 11-15-23
1 DYNA-GRO M60GB88 135.3 106.3 14.0 57.9 5-24-23 11-15-23
2 DYNA-GRO GX22936 134.7 105.8 13.9 58.7 5-24-23 11-15-23
3 POLANSKY 5522 132.3 103.9 13.9 58.4 5-24-23 11-15-23
4 DYNA-GRO GX22932 131.8 103.6 14.5 59.1 5-24-23 11-15-23
# Add planting and harvest date to Dataframe to make it more complete
df['planting_date'] = planting_date
df['harvest_date'] = harvest_date
df.head()
brand name yield pavg moist tw planting_date harvest_date
0 POLANSKY 5719 137.2 107.8 14.7 59.1 5-24-23 11-15-23
1 DYNA-GRO M60GB88 135.3 106.3 14.0 57.9 5-24-23 11-15-23
2 DYNA-GRO GX22936 134.7 105.8 13.9 58.7 5-24-23 11-15-23
3 POLANSKY 5522 132.3 103.9 13.9 58.4 5-24-23 11-15-23
4 DYNA-GRO GX22932 131.8 103.6 14.5 59.1 5-24-23 11-15-23
# Use Pandas to turn DataFrame into a SQL Database

# Connect to SQLite database (if it doesn't exist, it will be created)
conn = sqlite3.connect('sorghum_trial.db')

# Write the data to a sqlite table
df.to_sql('sorghum_trial', conn, index=False, if_exists='replace') # to overwrite use option if_exists='replace'

# Close the connection
conn.close()

Connect, access all data, and close database

# Connect to SQLite database (if it doesn't exist, it will be created)
conn = sqlite3.connect('sorghum_trial.db')

# Create cursor
cursor = conn.cursor()

# Access all data
cursor.execute('SELECT * FROM sorghum_trial')
for row in cursor.fetchall():
    print(row)
    
# Access all data
print('') # Add some white space
cursor.execute('SELECT brand, name FROM sorghum_trial WHERE yield > 130')
for row in cursor.fetchall():
    print(row)
    
# Close the connection
conn.close()
('POLANSKY', '5719', 137.2, 107.8, 14.7, 59.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M60GB88', 135.3, 106.3, 14.0, 57.9, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22936', 134.7, 105.8, 13.9, 58.7, '5-24-23', '11-15-23')
('POLANSKY', '5522', 132.3, 103.9, 13.9, 58.4, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22932', 131.8, 103.6, 14.5, 59.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M72GB71', 130.9, 102.9, 14.5, 59.0, '5-24-23', '11-15-23')
('MATURITY CHECK', 'MED', 128.7, 101.1, 14.0, 58.2, '5-24-23', '11-15-23')
('DYNA-GRO', 'M71GR91', 128.7, 101.1, 14.4, 59.3, '5-24-23', '11-15-23')
('PIONEER', '86920', 128.1, 100.7, 13.9, 57.9, '5-24-23', '11-15-23')
('MATURITY CHECK', 'EARLY', 127.9, 100.5, 14.2, 58.8, '5-24-23', '11-15-23')
('POLANSKY', '5629', 126.5, 99.4, 13.8, 57.2, '5-24-23', '11-15-23')
('MATURITY CHECK', 'LATE', 126.2, 99.2, 14.2, 58.3, '5-24-23', '11-15-23')
('PIONEER', '84980', 125.5, 98.6, 14.1, 58.8, '5-24-23', '11-15-23')
('DYNA-GRO', 'M60GB31', 124.9, 98.2, 14.2, 59.2, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22934', 122.8, 96.5, 14.6, 59.5, '5-24-23', '11-15-23')
('DYNA-GRO', 'M63GB78', 122.7, 96.4, 13.9, 58.0, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22937', 121.3, 95.4, 14.2, 58.4, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22923', 121.2, 95.2, 13.7, 55.8, '5-24-23', '11-15-23')
('DYNA-GRO', 'M67GB87', 120.5, 94.7, 13.9, 56.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M59GB94', 117.7, 92.5, 13.8, 57.7, '5-24-23', '11-15-23')

('POLANSKY', '5719')
('DYNA-GRO', 'M60GB88')
('DYNA-GRO', 'GX22936')
('POLANSKY', '5522')
('DYNA-GRO', 'GX22932')
('DYNA-GRO', 'M72GB71')