# Import modules
import sqlite3
32 SQLite Database
SQLite is a lightweight, disk-based database that doesn’t require a separate server. Python comes with a built-in sqlite3
module, which allows you to work with SQLite databases. SQL stands for Structured Query Language, which is a standardized programming language that is used to manage relational databases. To learn more about SQLite daabases, in this tutorial we will:
create a simple database by adding and removing a few entries. This example illustrates how to start a new database from scratch without prior tabulated data.
create a more advanced database using a real-world dataset and the Pandas library. This example illustrates how to convert an existing spreadsheet with data into a database that you can then continue to add, update, modify, or delete data entries.
Additional software
To access and inspect the database I recommend using an open source tool like sqlitebrowser. With this tool you can also create, design, and edit sqlite databases, but we will do some of these steps using Python.
Key commands
CREATE TABLE
: A SQL command used to create a new table in a database.INSERT
: A SQL command used to add new rows of data to a table in the database.SELECT
: A SQL command used to query data from a table, returning rows that match the specified criteria.UPDATE
: A SQL command used to modify existing data in a table.DELETE
: A SQL command used to remove rows from a table in the database.
Data types
SQLite supports a variety of data types:
TEXT: For storing character data. SQLite supports UTF-8, UTF-16BE, and UTF-16LE encodings.
INTEGER: For storing integer values. The size can be from 1 byte to 8 bytes, depending on the magnitude of the value.
REAL: For storing floating-point values. It is a double-precision (8-byte) floating point number.
BLOB: Stands for Binary Large Object. Used to store data exactly as it was input, such as images, files, or binary data.
NUMERIC: This type can be used for both integers and floating-point numbers. SQLite decides whether to use integer or real based on the value’s nature.
BOOLEAN: SQLite does not have a separate boolean storage class. Instead, boolean values are stored as integers 0 (false) and 1 (true).
DATE and TIME: SQLite does not have a storage class set aside for storing dates and/or times. Instead, they are stored as TEXT (as ISO8601 strings), REAL (as Julian day numbers), or INTEGER (as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC).
In SQLite the datatype you specify for a column acts more like a hint than a strict enforcement, allowing for flexibility in the types of data that can be inserted into a column. This is a distinctive feature compared to more rigid type systems in other database management systems.
Set up a simple database
After importing the sqlite3
module we create a connection to a new SQLite database. If the file doesn’t exist, the SQLite module will create it. This is convenient since we don’t have to be constantly checking whether the database exists or worry about overwriting the database.
# Connect to the database
= sqlite3.connect('soils.db') conn
# Create a cursor object using the cursor() method
= conn.cursor()
cursor
# Create table
'''CREATE TABLE soils
cursor.execute( (id INTEGER PRIMARY KEY, date TEXT, lat REAL, lon REAL, vwc INTEGER);''')
# Save (commit) the changes
conn.commit()
Add Data
In SQLite databases, the construction (?,?,?,?)
is used as a placeholder for parameter substitution in SQL statements, especially with the INSERT
, UPDATE
, and SELECT
commands. This construction offers the following advantages:
SQL Injection Prevention: By using placeholders, you prevent SQL injection, a common web security vulnerability where attackers can interfere with the queries that an application makes to its database.
Data Handling: It automatically handles the quoting of strings and escaping of special characters, reducing errors in SQL query syntax due to data.
Query Efficiency: When running similar queries multiple times, parameterized queries can improve performance as the database engine can reuse the query plan and execution path.
Each ?
is a placeholder that is replaced with provided data values in a tuple when the execute
method is called. This ensures that the values are properly formatted and inserted into the database, enhancing security and efficiency.
# Insert a row of data
= ('2024-01-02', 37.54, -98.78, 38)
obs "INSERT INTO soils (date, lat, lon, vwc) VALUES (?,?,?,?)", obs1)
cursor.execute(
# Save (commit) the changes
conn.commit()
Add with multiple entries
You can insert multiple entries at once using executemany()
# A list of multiple crop records
= [('2024-01-02', 36.54, -98.12, 18),
new_data '2024-04-14', 38.46, -99.78, 21),
('2024-05-23', 38.35, -98.01, 29)]
(
# Inserting multiple records at a time
"INSERT INTO soils (date, lat, lon, vwc) VALUES (?,?,?,?)", new_data)
cursor.executemany(
# Save (commit) the changes
conn.commit()
# Retrieve all data
'SELECT * FROM soils;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 38)
(2, '2024-01-02', 36.54, -98.12, 18)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
Query data
To query specific data from a table in an SQLite database using the SELECT
statement, you can specify conditions using the WHERE
clause. Here’s a basic syntax:
SELECT col1, col2, ... FROM table_name WHERE condition1 AND condition2;
To execute these queries remember to establish a connection, create a cursor object, execute the query using cursor.execute(query)
, and then use cursor.fetchall()
to retrieve the results. Close the connection to the database once you’re done.
# Retrieve all data
'SELECT * FROM soils;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 38)
(2, '2024-01-02', 36.54, -98.12, 18)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
# Retrieve specific data
'SELECT date FROM soils WHERE vwc >= 30;')
cursor.execute(for row in cursor.fetchall():
print(row)
('2024-01-02',)
# Retrieve specific data (note that we need spacify the date as a string)
'SELECT lat,lon FROM soils WHERE date == "2024-03-07";')
cursor.execute(for row in cursor.fetchall():
print(row)
'SELECT * FROM soils WHERE vwc >=30 AND vwc < 40;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 38)
Modify data
You can update records that match certain criteria.
# Update quantity for specific date note that this will update both rows with the same date)
"UPDATE soils SET vwc = 15 WHERE date = '2024-01-02';")
cursor.execute(
# Save (commit) the changes
conn.commit()
Note that the previous command updates both rows with the same date. To only specify a single row we need to be more specific in our command.
# Retrieve all data
'SELECT * FROM soils;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 15)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
# Update quantity for specific date note that this will update both rows with the same date)
"UPDATE soils SET vwc = 5 WHERE date = '2024-01-02' AND id = 2;")
cursor.execute(
# Save (commit) the changes
conn.commit()
# Retrieve all data
'SELECT * FROM soils;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 5)
(3, '2024-04-14', 38.46, -99.78, 21)
(4, '2024-05-23', 38.35, -98.01, 29)
Remove data
To remove records, use the DELETE statement.
# Delete the Wheat record
"DELETE FROM soils WHERE id = 3")
cursor.execute(
# Save (commit) the changes
conn.commit()
# Retrieve all data
'SELECT * FROM soils;')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 15)
(2, '2024-01-02', 36.54, -98.12, 5)
(4, '2024-05-23', 38.35, -98.01, 29)
Add new column/header
# Add a new column
# Use the DEFAULT construction like: DEFAULT 'Unknown'
# to populate new column with custom value
"ALTER TABLE soils ADD COLUMN soil_type TEXT;")
cursor.execute(
# Retrieve data one more time before we close the database
'SELECT * FROM soils')
cursor.execute(for row in cursor.fetchall():
print(row)
(1, '2024-01-02', 37.54, -98.78, 15, None)
(2, '2024-01-02', 36.54, -98.12, 5, None)
(4, '2024-05-23', 38.35, -98.01, 29, None)
Closing the connection
Once done with the operations, close the connection to the database.
conn.close()
Use Pandas to set up database
For this exercise we will use a table of sorghum yields for Franklin county, KS obtained in 2023. The dataset contains breeder brand, hybrid name, yield, moisture, and total weight. The spreadsheet contains metadata on the first line (which we are going to skip) and in the last few rows. From these last rows, we will use functions to match strings and retrieve the planting and harvest dates of the trial. We will then add this information to the dataframe, and then create an SQLite database.
Pandas provides all the methods to read, clean, and export the Dataframe to a SQLite database.
# Import modules
import pandas as pd
import sqlite3
= pd.read_csv('../datasets/sorghum_franklin_county_2023.csv',
df =[0,1,3])
skiprows# Inspect first few rows
df.head()
BRAND | NAME | YIELD | PAVG | MOIST | TW | |
---|---|---|---|---|---|---|
0 | POLANSKY | 5719 | 137.2 | 107.8 | 14.7 | 59.1 |
1 | DYNA-GRO | M60GB88 | 135.3 | 106.3 | 14.0 | 57.9 |
2 | DYNA-GRO | GX22936 | 134.7 | 105.8 | 13.9 | 58.7 |
3 | POLANSKY | 5522 | 132.3 | 103.9 | 13.9 | 58.4 |
4 | DYNA-GRO | GX22932 | 131.8 | 103.6 | 14.5 | 59.1 |
# Inspect last few rows
10) df.tail(
BRAND | NAME | YIELD | PAVG | MOIST | TW | |
---|---|---|---|---|---|---|
18 | DYNA-GRO | M67GB87 | 120.5 | 94.7 | 13.9 | 56.1 |
19 | DYNA-GRO | M59GB94 | 117.7 | 92.5 | 13.8 | 57.7 |
20 | NaN | AVERAGE | 127.2 | 100.0 | 14.1 | 58.3 |
21 | NaN | CV (%) | 8.4 | 8.4 | 0.3 | 0.8 |
22 | NaN | LSD (0.05) | 5.2 | 4.1 | 0.3 | 0.3 |
23 | *Yields must differ by more than the LSD value... | NaN | NaN | NaN | NaN | NaN |
24 | different. | NaN | NaN | NaN | NaN | NaN |
25 | Planted 5-24-23 | NaN | NaN | NaN | NaN | NaN |
26 | Harvested 11-15-23 | NaN | NaN | NaN | NaN | NaN |
27 | Fertility 117-38-25-20 Strip till | NaN | NaN | NaN | NaN | NaN |
='BRAND', inplace=True)
df.dropna(subset10) df.tail(
BRAND | NAME | YIELD | PAVG | MOIST | TW | |
---|---|---|---|---|---|---|
15 | DYNA-GRO | M63GB78 | 122.7 | 96.4 | 13.9 | 58.0 |
16 | DYNA-GRO | GX22937 | 121.3 | 95.4 | 14.2 | 58.4 |
17 | DYNA-GRO | GX22923 | 121.2 | 95.2 | 13.7 | 55.8 |
18 | DYNA-GRO | M67GB87 | 120.5 | 94.7 | 13.9 | 56.1 |
19 | DYNA-GRO | M59GB94 | 117.7 | 92.5 | 13.8 | 57.7 |
23 | *Yields must differ by more than the LSD value... | NaN | NaN | NaN | NaN | NaN |
24 | different. | NaN | NaN | NaN | NaN | NaN |
25 | Planted 5-24-23 | NaN | NaN | NaN | NaN | NaN |
26 | Harvested 11-15-23 | NaN | NaN | NaN | NaN | NaN |
27 | Fertility 117-38-25-20 Strip till | NaN | NaN | NaN | NaN | NaN |
# Extract planting date
= df['BRAND'].str.contains("Planted")
idx = df.loc[idx, 'BRAND'].values[0]
planting_date_str = planting_date_str.split(' ')[1]
planting_date print(planting_date)
5-24-23
# Extract harvest date
= df['BRAND'].str.contains("Harvested")
idx = df.loc[idx, 'BRAND'].values[0]
harvest_date_str = harvest_date_str.split(' ')[1]
harvest_date print(harvest_date)
11-15-23
# Once we are done extracting metadata, let's remove the last few rows
= df.iloc[:-5] df
# Convert header names to lower case to avoid conflict with SQL syntax
str.lower, axis='columns', inplace=True)
df.rename( df.head()
brand | name | yield | pavg | moist | tw | planting_date | harvest_date | |
---|---|---|---|---|---|---|---|---|
0 | POLANSKY | 5719 | 137.2 | 107.8 | 14.7 | 59.1 | 5-24-23 | 11-15-23 |
1 | DYNA-GRO | M60GB88 | 135.3 | 106.3 | 14.0 | 57.9 | 5-24-23 | 11-15-23 |
2 | DYNA-GRO | GX22936 | 134.7 | 105.8 | 13.9 | 58.7 | 5-24-23 | 11-15-23 |
3 | POLANSKY | 5522 | 132.3 | 103.9 | 13.9 | 58.4 | 5-24-23 | 11-15-23 |
4 | DYNA-GRO | GX22932 | 131.8 | 103.6 | 14.5 | 59.1 | 5-24-23 | 11-15-23 |
# Add planting and harvest date to Dataframe to make it more complete
'planting_date'] = planting_date
df['harvest_date'] = harvest_date
df[ df.head()
brand | name | yield | pavg | moist | tw | planting_date | harvest_date | |
---|---|---|---|---|---|---|---|---|
0 | POLANSKY | 5719 | 137.2 | 107.8 | 14.7 | 59.1 | 5-24-23 | 11-15-23 |
1 | DYNA-GRO | M60GB88 | 135.3 | 106.3 | 14.0 | 57.9 | 5-24-23 | 11-15-23 |
2 | DYNA-GRO | GX22936 | 134.7 | 105.8 | 13.9 | 58.7 | 5-24-23 | 11-15-23 |
3 | POLANSKY | 5522 | 132.3 | 103.9 | 13.9 | 58.4 | 5-24-23 | 11-15-23 |
4 | DYNA-GRO | GX22932 | 131.8 | 103.6 | 14.5 | 59.1 | 5-24-23 | 11-15-23 |
# Use Pandas to turn DataFrame into a SQL Database
# Connect to SQLite database (if it doesn't exist, it will be created)
= sqlite3.connect('sorghum_trial.db')
conn
# Write the data to a sqlite table
'sorghum_trial', conn, index=False, if_exists='replace') # to overwrite use option if_exists='replace'
df.to_sql(
# Close the connection
conn.close()
Connect, access all data, and close database
# Connect to SQLite database (if it doesn't exist, it will be created)
= sqlite3.connect('sorghum_trial.db')
conn
# Create cursor
= conn.cursor()
cursor
# Access all data
'SELECT * FROM sorghum_trial')
cursor.execute(for row in cursor.fetchall():
print(row)
# Access all data
print('') # Add some white space
'SELECT brand, name FROM sorghum_trial WHERE yield > 130')
cursor.execute(for row in cursor.fetchall():
print(row)
# Close the connection
conn.close()
('POLANSKY', '5719', 137.2, 107.8, 14.7, 59.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M60GB88', 135.3, 106.3, 14.0, 57.9, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22936', 134.7, 105.8, 13.9, 58.7, '5-24-23', '11-15-23')
('POLANSKY', '5522', 132.3, 103.9, 13.9, 58.4, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22932', 131.8, 103.6, 14.5, 59.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M72GB71', 130.9, 102.9, 14.5, 59.0, '5-24-23', '11-15-23')
('MATURITY CHECK', 'MED', 128.7, 101.1, 14.0, 58.2, '5-24-23', '11-15-23')
('DYNA-GRO', 'M71GR91', 128.7, 101.1, 14.4, 59.3, '5-24-23', '11-15-23')
('PIONEER', '86920', 128.1, 100.7, 13.9, 57.9, '5-24-23', '11-15-23')
('MATURITY CHECK', 'EARLY', 127.9, 100.5, 14.2, 58.8, '5-24-23', '11-15-23')
('POLANSKY', '5629', 126.5, 99.4, 13.8, 57.2, '5-24-23', '11-15-23')
('MATURITY CHECK', 'LATE', 126.2, 99.2, 14.2, 58.3, '5-24-23', '11-15-23')
('PIONEER', '84980', 125.5, 98.6, 14.1, 58.8, '5-24-23', '11-15-23')
('DYNA-GRO', 'M60GB31', 124.9, 98.2, 14.2, 59.2, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22934', 122.8, 96.5, 14.6, 59.5, '5-24-23', '11-15-23')
('DYNA-GRO', 'M63GB78', 122.7, 96.4, 13.9, 58.0, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22937', 121.3, 95.4, 14.2, 58.4, '5-24-23', '11-15-23')
('DYNA-GRO', 'GX22923', 121.2, 95.2, 13.7, 55.8, '5-24-23', '11-15-23')
('DYNA-GRO', 'M67GB87', 120.5, 94.7, 13.9, 56.1, '5-24-23', '11-15-23')
('DYNA-GRO', 'M59GB94', 117.7, 92.5, 13.8, 57.7, '5-24-23', '11-15-23')
('POLANSKY', '5719')
('DYNA-GRO', 'M60GB88')
('DYNA-GRO', 'GX22936')
('POLANSKY', '5522')
('DYNA-GRO', 'GX22932')
('DYNA-GRO', 'M72GB71')