Evaluating IoTfM database performance
The Georgia Tech Manufacturing Institute (GTMI) Factory Information Systems Center (FIS) is a leader in Internet of Things for Manufacturing (IoTfM) research and education. This report briefly evaluates the performance of the IoTfM database. The main objective is to demonstrate how indexing the database can improve query performance and data retrieval.
Database preparation
The database used for this evaluation is a MySQL database hosted on AWS and maintained by IoTfM. The database contains a single table called Messages with the following columns: Id, messageId, dateTimeReceived, topic, payload, assetId, dateTime, dataItemId, value.
The table contains 9,310,176 rows and is indexed only on the Id column.
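For reference, the implied table structure can be sketched as a CREATE TABLE statement. The column types below are assumptions for illustration only (the report later notes that dateTime is a DateTime and dateTimeReceived a TimeStamp in the original database); the production schema may differ.

-- Illustrative sketch only: types marked as assumed are not taken from the production schema.
CREATE TABLE Messages (
    Id INT PRIMARY KEY,        -- assumed primary key; the report only states that Id is indexed
    messageId VARCHAR(255),    -- assumed type
    dateTimeReceived TIMESTAMP,
    topic VARCHAR(255),        -- assumed type
    payload TEXT,              -- assumed type
    assetId VARCHAR(255),      -- assumed type
    dateTime DATETIME,
    dataItemId VARCHAR(255),   -- assumed type
    value TEXT                 -- assumed type
);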
Method 1: Dump and import JSON file
We cannot add indexes to the database directly because we do not have the necessary permissions. To work around this, we chose to export the entire database and import it into a new database for testing. However, because of the database's size (approximately 3.0 GB), dumping it directly to a local machine was not feasible. As an alternative, we selected all the rows and saved them as either a CSV or a JSON file. When importing the data back into a new database on AWS, we encountered some issues: the CSV file caused problems with column assignment, while the JSON file, although a better option, was still slow to import.
To import the data from the JSON file into a local database, first create a database called db01 in MySQL using MySQL Workbench. Then place the exported JSON file Messages.json alongside the following Python script and run the script to import the data.
import mysql.connector
import json
from tqdm import tqdm

# Connect to MySQL. Replace *** with your password.
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="***",
    database="db01"
)
cursor = conn.cursor()

# Load JSON data
with open('./Messages.json') as json_file:
    data = json.load(json_file)

# Function to infer MySQL table schema from the first JSON record
def infer_schema(data):
    columns = []
    for key, value in data[0].items():
        # Check bool before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            columns.append(f"{key} BOOLEAN")
        elif isinstance(value, int):
            columns.append(f"{key} INT")
        elif isinstance(value, float):
            columns.append(f"{key} FLOAT")
        elif isinstance(value, str):
            columns.append(f"{key} TEXT")
        else:
            columns.append(f"{key} TEXT")  # Default type
    return columns

# Create table statement
table_name = "Messages"
columns = infer_schema(data)
create_table_stmt = f"CREATE TABLE IF NOT EXISTS {table_name} ({', '.join(columns)});"
cursor.execute(create_table_stmt)

# Insert the data row by row, committing once at the end
for entry in tqdm(data, desc="Importing data"):
    placeholders = ', '.join(['%s'] * len(entry))
    column_names = ', '.join(entry.keys())
    insert_stmt = f"INSERT INTO {table_name} ({column_names}) VALUES ({placeholders})"
    cursor.execute(insert_stmt, list(entry.values()))
conn.commit()

# Close the connection
cursor.close()
conn.close()
print("JSON data has been imported successfully.")
Method 2: Export and import SQL dump
Using the built-in MySQL Workbench export tool, we can export the database as a SQL dump file. This file can then be imported into a new database on AWS. The following steps are used to export and import the database:
Open MySQL Workbench and connect to the database.
Right-click on the database name and select Dump Data to SQL File.
Save the file as Messages.sql.
Open the new database on AWS and import the Messages.sql file.
This usually takes significantly less time than the JSON file method. However, if you don’t have the necessary permissions to export the database, this method is not feasible.
Evaluate query performance
To evaluate query performance with and without indexing, we imported the data into a MySQL database on a local machine. While the specific numbers may differ from those of the database on AWS, the overall trends should remain consistent.
In MySQL Workbench, in addition to the queries themselves, the following commands were used:
-- Specify the database to use
USE db01;

-- Show the size of each database
SELECT
    table_schema AS 'Database',
    SUM(data_length + index_length) / 1024 / 1024 AS 'Size (MB)'
FROM
    information_schema.TABLES
GROUP BY table_schema;

-- Turn on profiling and show profiles
SET profiling = 1;
SHOW PROFILES;

-- Modify the column types so that indexing can be applied
ALTER TABLE Messages MODIFY COLUMN assetId VARCHAR(255);
ALTER TABLE Messages MODIFY COLUMN dateTime VARCHAR(255);
ALTER TABLE Messages MODIFY COLUMN topic VARCHAR(255);

-- Create indexes
CREATE INDEX idx_assetId ON Messages (assetId);
CREATE INDEX idx_dateTime ON Messages (dateTime);
CREATE INDEX idx_topic ON Messages (topic);
CREATE INDEX idx_assetId_dateTime ON Messages (assetId, dateTime);

-- Drop indexes
DROP INDEX idx_assetId ON Messages;
DROP INDEX idx_dateTime ON Messages;
DROP INDEX idx_topic ON Messages;
DROP INDEX idx_assetId_dateTime ON Messages;
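After creating the indexes, it is worth confirming that they exist and, with profiling enabled, inspecting where a query spends its time. A minimal sketch using standard MySQL statements (the Query_ID value below is illustrative):

-- List the indexes currently defined on the table
SHOW INDEX FROM Messages;
-- Break down where time was spent for one profiled query;
-- replace 1 with the Query_ID reported by SHOW PROFILES
SHOW PROFILE FOR QUERY 1;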
To evaluate query performance, we used the following queries:
-- Query 1: Count records with a certain assetId.
-- Returns a total count of 4,404,385.
SELECT
    COUNT(*)
FROM
    Messages
WHERE
    assetId = 'Mazak-VC500AM';

-- Query 2: Count records within a certain dateTime range.
-- Returns a total count of 1,023,538.
SELECT
    COUNT(*) AS record_count
FROM
    Messages
WHERE
    dateTime BETWEEN '2021-01-01 00:00:00' AND '2021-12-31 23:59:59';

-- Query 3: Count records with a certain topic.
-- Returns a total count of 1,038,778.
SELECT
    COUNT(*)
FROM
    Messages
WHERE
    topic = 'Asset/Mazak-VC500AM/Heartbeat';

-- Query 4: Count records with a certain assetId and within a certain dateTime range.
-- Returns a total count of 1,023,531.
SELECT
    COUNT(*)
FROM
    Messages
WHERE
    assetId = 'Mazak-VC500AM'
    AND dateTime BETWEEN '2021-01-01 00:00:00' AND '2021-12-31 23:59:59';
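To confirm that a query actually uses an index, EXPLAIN can be run on the same statement; the key column of the output names the index the optimizer selects (idx_assetId for Query 1, if it is chosen). A minimal sketch:

-- Execution plan for Query 1; the key column shows the chosen index, e.g. idx_assetId
EXPLAIN
SELECT
    COUNT(*)
FROM
    Messages
WHERE
    assetId = 'Mazak-VC500AM';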
We ran the queries on the following configurations:
Config 1: MacBook Pro, Apple M3 Pro, 1 CPU 4.05 GHz 6+6 cores, 36 GB RAM
Config 2: AWS, db.t3.micro, 2 vCPU cores up to 3.1 GHz, 1 GB RAM. Used VARCHAR for the dateTime and dateTimeReceived fields, which differs from the original database.
Config 3: AWS, db.t3.micro, 2 vCPU cores up to 3.1 GHz, 1 GB RAM. Used the same data types as Config 4 (the original database): DateTime for dateTime and TimeStamp for dateTimeReceived (see the conversion sketch after this list).
Config 4: AWS, db.t3.medium, 2 vCPU cores up to 3.1 GHz, 4 GB RAM (original database)
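To reproduce Config 3 from the VARCHAR-based import, the two temporal columns can be converted back to native types. A minimal sketch, assuming every stored string parses as a valid 'YYYY-MM-DD HH:MM:SS' value:

-- Convert the imported string columns to the original native types
-- (assumes all stored values are valid datetimes within the TIMESTAMP range)
ALTER TABLE Messages MODIFY COLUMN dateTime DATETIME;
ALTER TABLE Messages MODIFY COLUMN dateTimeReceived TIMESTAMP;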
The results of the queries are as follows; each query was run three times, and the average time is reported with the standard deviation in parentheses:
Query | Time without indexing [s] | Time with indexing [s] | Folds faster with indexing | Time spent creating indexes [s] |
---|---|---|---|---|
Config 1, Query 1 | 2.30 (0.05) | 0.92 (0.01) | 2.49 (0.05) | 10.79 (0.05) |
Config 1, Query 2 | 2.30 (0.01) | 0.38 (0.01) | 6.10 (0.12) | 11.74 (0.02) |
Config 1, Query 3 | 2.58 (0.04) | 0.31 (0.01) | 8.24 (0.02) | 16.16 (0.02) |
Config 1, Query 4 | 2.73 (0.06) | 0.46 (0.01) | 5.95 (0.23) | 16.12 (0.25) |
Config 2, Query 1 | 52.33 (0.39) | 8.30 (1.12) | 6.38 (0.83) | 119.38 (5.37) |
Config 2, Query 2 | 52.12 (0.31) | 1.38 (0.06) | 37.82 (1.70) | 144.39 (2.30) |
Config 2, Query 3 | 52.22 (0.40) | 1.18 (0.13) | 44.58 (4.48) | 166.44 (3.61) |
Config 2, Query 4 | 55.59 (1.14) | 1.56 (0.06) | 35.76 (1.92) | 178.44 (2.52) |
Config 3, Query 1 | 43.10 (0.41) | 6.01 (0.54) | 7.21 (0.65) | 109.04 (1.04) |
Config 3, Query 2 | 42.87 (0.25) | 1.77 (0.59) | 26.22 (8.96) | 77.06 (1.11) |
Config 3, Query 3 | 42.87 (0.21) | 0.87 (0.05) | 49.21 (3.00) | 160.05 (1.02) |
Config 3, Query 4 | 43.61 (1.07) | 1.48 (0.14) | 29.62 (2.59) | 118.53 (2.39) |
Config 4, Query 1 | 24.77 (0.63) | | | |
Config 4, Query 2 | 29.52 (1.11) | | | |
Config 4, Query 3 | 24.84 (0.34) | | | |
Config 4, Query 4 | 26.94 (0.29) | | | |
One observation is worth noting. When idx_assetId_dateTime is the only index created on the assetId and dateTime columns, the performance of Queries 1, 2, and 4 improves significantly. However, when idx_assetId and idx_dateTime are also created as separate indexes, the performance gain disappears for Query 4. The likely reason is that MySQL generally uses only one index per table in a query, and it is not obvious which index the optimizer will choose. Therefore, creating a composite index on the columns used together in a query, and verifying that the optimizer actually uses it, can improve performance.
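When both the single-column and composite indexes exist, one way to investigate this is to check which index the optimizer picks and, if necessary, hint it toward the composite index. A sketch using standard MySQL syntax; whether this restores the speed-up depends on the data and server:

-- Hint the optimizer to use the composite index for Query 4
SELECT
    COUNT(*)
FROM
    Messages FORCE INDEX (idx_assetId_dateTime)
WHERE
    assetId = 'Mazak-VC500AM'
    AND dateTime BETWEEN '2021-01-01 00:00:00' AND '2021-12-31 23:59:59';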
The original dataset is shared at the following link: https://docs.google.com/spreadsheets/d/1s8hFXYUdbhYIvu6Ww1DvmTuTZAtB8261RfNHcSoXLnA/edit?gid=0#gid=0
Conclusion
Indexing the database can significantly improve query performance. The time spent creating indexes is a small investment compared with the time saved when running queries, and routinely scheduled index maintenance can help keep the database performant. The performance gain will vary with database size, hardware limitations, and query complexity, but in general, indexing is a good practice when working with large databases.