DB Geek Girl

Tuesday, October 15, 2019

MySQL storing time

Storing time in databases seems like a daunting task to many. The general principle with storing time in databases is - ALWAYS STORE TIME IN UTC. However for a lot of people this raises questions.. but my application expects time to be in US/Pacific, or my database server's time zone in PDT or most of my users are in US/Eastern. Some also ask that the global setting for MySQL variable time_zone be set to US/Pacific or US/Eastern to match their requirement.

From a DBA's perspective, I think 99% use cases meet the ALWAYS STORE TIME IN UTC guideline, and for 1% that don't an understanding of the right way of storing and reading/writing time is necessary to ensure that you are doing the right thing.

MySQL's datetime data types

MySQL provides two data types to store time - TIMESTAMP and DATETIME

TIMESTAMP stores time as unix timestamp, or seconds since epoch, regardless of what the time zone is.

DATETIME simply takes the date and stores it as a string like representation as YYYYMMDDHHmmSS

While this difference might not be very evident while writing data, it becomes clearer while reading data.

TIMESTAMP

Let's take an example of a table with TIMESTAMP to store date and time. My time_zone variables has the global setting 'UTC'

CREATE TABLE `test_timestamp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`time_created` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Inserting data into the table.

mysql> insert into test_timestamp (name, time_created) values ('a','2019-10-09 21:51:00');
Query OK, 1 row affected (0.01 sec)

Reading data.

mysql> select * from test_timestamp;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | a | 2019-10-09 21:51:00 |
+----+------+---------------------+
1 row in set (0.00 sec)

So far the difference isn't quite clear, since we wrote the data in UTC and are reading it in UTC.

Let's change the time zones around a little and read the same data.

mysql> set time_zone='US/Pacific';

mysql> select * from test_timestamp;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | a | 2019-10-09 14:51:00 |
+----+------+---------------------+
1 row in set (0.00 sec)

When the data is retrieved it is presented in the time zone, the session value of time_zone variable is set to. Well, what about if I write in one time zone (other than UTC) and read it in another.

mysql> set time_zone='US/Pacific';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test_timestamp (name, time_created) values ('b','2019-10-09 14:57:00');
Query OK, 1 row affected (0.01 sec)

mysql> select * from test_timestamp;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | a | 2019-10-09 14:51:00 |
| 2 | b | 2019-10-09 14:57:00 |
+----+------+---------------------+
2 rows in set (0.00 sec)

Lets switch back to UTC and read it.

mysql> set time_zone='UTC';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from test_timestamp;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | a | 2019-10-09 21:51:00 |
| 2 | b | 2019-10-09 21:57:00 |
+----+------+---------------------+
2 rows in set (0.00 sec)

How about reading it in US/Eastern ?

mysql> set time_zone='US/Eastern';
Query OK, 0 rows affected (0.01 sec)

mysql> select * from test_timestamp;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | a | 2019-10-09 17:51:00 |
| 2 | b | 2019-10-09 17:57:00 |
+----+------+---------------------+
2 rows in set (0.00 sec)

TL;DR

TIMESTAMP stores data in the format of seconds since epoch or unix time.
If you want the time you store to follow the time zone of user or application server, use TIMESTAMP
Change the time zone via session value of time_zone, rather than setting the global value
If it is easier, you can add the time_zone setting to your connection parameters or even if you expect it to be global to your init_connect.

DATETIME

Now let's take a look at what happens with DATETIME

CREATE TABLE `test_datetime` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`time_created` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Writing data in US/Pacific time zone

mysql> set time_zone='US/Pacific';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test_datetime (name, time_created) values ('c','2019-10-09 15:59');
Query OK, 1 row affected (0.01 sec)

mysql> select * from test_datetime;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | c | 2019-10-09 15:59:00 |
+----+------+---------------------+
1 row in set (0.00 sec)

Now let's try to read it in UTC.

mysql> set time_zone='UTC';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from test_datetime;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | c | 2019-10-09 15:59:00 |
+----+------+---------------------+
1 row in set (0.00 sec)

The data does not follow the time zone change. How about if we insert data in UTC and try to read it in other time zones?

mysql> set time_zone='UTC';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test_datetime (name, time_created) values ('d','2019-10-09 22:00');
Query OK, 1 row affected (0.01 sec)

mysql> select * from test_datetime;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | c | 2019-10-09 15:59:00 |
| 2 | d | 2019-10-09 22:00:00 |
+----+------+---------------------+
2 rows in set (0.00 sec)

mysql> set time_zone='US/Pacific';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from test_datetime;
+----+------+---------------------+
| id | name | time_created |
+----+------+---------------------+
| 1 | c | 2019-10-09 15:59:00 |
| 2 | d | 2019-10-09 22:00:00 |
+----+------+---------------------+
2 rows in set (0.00 sec)

The time value you stored always remains what you stored regardless of time zone changes while reading or writing it.

TL; DR

DATETIME stores the date time value as a string like literal YYYYMMDDHHmmSS
No matter what the time zone changes are while reading or writing the data you will always see what you stored.

What to use when

Use TIMESTAMP when you want the timestamp to follow time zone changes of the application users. Examples of this are - timestamps for log events, or a created at / modified at timestamp, timestamps for chat messages or comments etc., notifications
Use DATETIME when you don't want the timestamp to follow time zone changes. Examples of this are - Date of Birth

If using these two data types right, with setting time_zone on the session level cannot take care of your time zone issues, then and only then explore changing the time_zone variable globally.

On a lighter note, If you are in the business of using TIMESTAMP, have you heard of the Year 2038 problem? Nothing to freak out about yet, we are only 19 years away :) Anyone having Y2K flashbacks?

Tuesday, October 1, 2019

Quick fix for Replication Error "but the master has purged binary logs containing GTIDs"

I ran into this error recently, and wanted to share a quick fix.

This is what the problem looks like -

mysql> show slave status \G
..
Master_Log_File: mysql_binary_log.000382
Read_Master_Log_Pos: 341467211
Relay_Log_File: mysql_relay_log.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql_binary_log.000382
Slave_IO_Running: No
Slave_SQL_Running: Yes

Seconds_Behind_Master: NULL
..
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
Last_SQL_Errno: 0

..

Checking the binary on the master

mysql> show binary logs;
+-------------------------+------------+
| Log_name | File_size |
+-------------------------+------------+
| mysql_binary_log.000388 | 1073741987 |
| mysql_binary_log.000389 | 1073742774 |
| mysql_binary_log.000390 | 1074106024 |
| mysql_binary_log.000391 | 1073748407 |
| mysql_binary_log.000392 | 120633755 |
+-------------------------+------------+
5 rows in set (0.00 sec)

As we can see that the binary log mysql_binary_log.000382 has indeed been purged.
However I have a script that backups binary logs before it purges them. Since the database is in order of TBs, I will attempt to recover from this using those, rather than rebuild the database using a backup.

Copy the binary logs from the backup location to MySQL's binary log directory.

cp mysql_binary_log.00038* /mysql_binlog_dir

Be sure to change the ownership to mysql user

chown -R mysql:mysql /mysql_binlog_dir

Edit mysql_binary.index and add these files to it.

mysql_binary_log.000382
mysql_binary_log.000383
..
mysql_binary_log.000392

The logs still won't show up on MySQL. Lets try FLUSH LOGS;

mysql> flush logs;
Query OK, 0 rows affected (0.02 sec)

mysql> show binary logs;
+-------------------------+------------+
| Log_name | File_size |
+-------------------------+------------+
| mysql_binary_log.000382 | 1073897047 |
| mysql_binary_log.000383 | 1073860223 |
| mysql_binary_log.000384 | 1073742372 |
| mysql_binary_log.000385 | 1075249248 |
| mysql_binary_log.000386 | 1074009237 |
| mysql_binary_log.000387 | 115207752 |
| mysql_binary_log.000388 | 1073741987 |
| mysql_binary_log.000389 | 1073742774 |
| mysql_binary_log.000390 | 1074106024 |
| mysql_binary_log.000391 | 1073748407 |
| mysql_binary_log.000392 | 120924130 |
| mysql_binary_log.000393 | 6322 |
+-------------------------+------------+

12 rows in set (0.00 sec)

The logs now show up in MySQL.

Let us log in to the slave and try to resume replication now.

mysql> change master to master_auto_position=0;
Query OK, 0 rows affected (0.02 sec)

mysql> change master to master_log_file='mysql_binary_log.000382';
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> set global slave_parallel_workers=0;

There are warnings, if you use start slave until with parallel replications, you could run into issues.

start slave until master_log_file='mysql_binary_log.000393', master_log_pos=38204485;

MySQL replication now resumes

mysql> show slave status \G
..
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
..
Seconds_Behind_Master: 45844
..

Monitor the replication lag, until the slave catches up to previous position.

Then switch back to master_auto_position and parallel replication

mysql> change master to master_auto_position=1;
mysql> set global slave_parallel_workers=20;
mysql> start slave;

mysql> show slave status \G
..
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
..
Seconds_Behind_Master: 600
..

Eventually replication will catch up and things will be back to normal

mysql> show slave status \G
..
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
..
Seconds_Behind_Master: 0
..

It is a good practice to validate that the data remains consistent after this fix. One way to do this is to check whether the Executed_Gtid_Set that shows up in show slave status \G, contains all transactions from master. However this only validates that the transaction ID is present and not whether the actual data is consistent. If you want to validate if the actual data is consistent, this can be done using a tool like pt-table-checksum to check data consistency.

Wednesday, April 24, 2019

LDAP Authentication for MySQL

Why LDAP?

LDAP stands for Lightweight Directory Access Protocol. It is based on a client server model. A client queries LDAP server, which responds with an answer or with a pointer to where client can get more information. Most organizations have LDAP set up and configured for managing users and their credentials for internal applications. Using LDAP users can have single sign on for most applications.

When using LDAP with MySQL for authentication - the password management is offloaded to the LDAP service. Users that don't exist within the directory cannot access database. Password management functions such as enforcing a strong password, and password rotation can be off-loaded.

How to configure LDAP with MySQL?

If you are using Percona XtraDB i.e. Percona's flavor of MySQL, this can be easily done using the Percona PAM authentication plugin.

Installing Percona PAM authentication plugin

Percona PAM Authentication Plugin acts as a mediator between the MySQL server, the MySQL client, and the PAM stack. The server plugin requests authentication from the PAM stack, forwards any requests and messages from the PAM stack over the wire to the client (in cleartext) and reads back any replies for the PAM stack.

To install the plugin, verify that the plugin exists in the plugin directory. Then,

mysql> INSTALL PLUGIN auth_pam SONAME 'auth_pam.so';

Query OK, 0 rows affected (0.08 sec)

mysql> INSTALL PLUGIN auth_pam_compat SONAME 'auth_pam_compat.so';

Query OK, 0 rows affected (0.05 sec)

Configure Percona PAM to authenticate to LDAP

Create a file /etc/pam.d/mysqld with the following content -

auth required pam_ldap.so audit

account required pam_ldap.so audit

Configure users using LDAP

Users can be created in two ways -

Create user defined outside MySQL tables

LDAP authentication enables MySQL servers to accept connections from users defined outside the MySQL tables in LDAP directories

Proxy user support

LDAP authentication can return to MySQL a user name different from the operating system user, based on the LDAP group of the external user.
For example, an LDAP user named stacy can connect and have the privileges of the MySQL user named dba_users , if the LDAP group for stacy is dba_users.

Creating user defined outside of MySQL tables

mysql> CREATE USER 'yashada'@'%' IDENTIFIED WITH auth_pam;

Query OK, 0 rows affected (0.04 sec)

mysql> GRANT SELECT ON ycsb.* TO 'yashada'@'%';

Query OK, 0 rows affected (0.02 sec)

mysql> FLUSH PRIVILEGES;

Query OK, 0 rows affected (0.01 sec)

mysql -u yashada –p

Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.

Creating proxy users

If belonging to a group needs certain MySQL privileges, setup proxy users instead to map a user’s privilege to its defined group.

A good example of this is a DBA group, or a reporting_users group that needs read_only access. This offloads management of removal and addition of new users from the MySQL DBA and passes it on to LDAP.

mysql> CREATE USER ''@'' IDENTIFIED WITH auth_pam as 'mysqld,DBA=dba_users';

Query OK, 0 rows affected (0.05 sec)

mysql> CREATE USER dba_users@'localhost' IDENTIFIED BY 'somepassword';

Query OK, 0 rows affected (0.02 sec)

mysql> GRANT ALL PRIVILEGES ON *.* TO dba_users@'localhost';

Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> GRANT PROXY ON dba_users@'localhost' TO ''@'';

Query OK, 0 rows affected, 1 warning (0.02 sec)

mysql> FLUSH PRIVILEGES;

Query OK, 0 rows affected (0.05 sec)

When a user belonging to the group logs in -

mysql -u yashada -p

Enter password:

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 8587

mysql> show grants;

+-------------------------------------------------+

| Grants for @ |

+-------------------------------------------------+

| GRANT USAGE ON *.* TO ''@'' |

| GRANT PROXY ON 'dba_users'@'localhost' TO ''@'' |

+-------------------------------------------------+

2 rows in set (0.00 sec)

Tuesday, January 29, 2019

Passwordless authentication (supposedly) using mysql_config_editor

In the DBA operations world there are always challenges with passwords. How do you use the password for login, automation and operation scripts in a way that does not expose the password.

One of the MySQL utilities that addresses some of these questions is mysql_config_editor.

mysql_config_editor enables you to store authentication credentials in a login path file named .mylogin.cnf. On Linux these credentials are stored in the current user's home directory.

Installing mysql_config_editor

To install mysql_config_editor, all you need is the MySQL client installed. It can be particular to the MySQL flavor you use. I use Percona MySQL, so installing the Percona-Server-client rpm is enough for access to mysql_config_editor.

Common mysql_config_editor operations

To set up log in paths using mysql_config_editor

mysql_config_editor set --login-path=monitor --host=XXXX --port=3306 --user=monitor --password

Enter password:

To log in using mysql_config_editor

mysql --login-path=monitor

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 5087

...

mysql>

To check what logins are set up in mysql_config_editor

mysql_config_editor print --all

[mysqlconn]

user = root

password = *****

host = localhost

[monitor]

user = monitor

password = *****

host = XXXX

port = 3306

Strong Passwords

The advantages of not having to remember or enter passwords are quite a few. For example, now a password can be a secure non dictionary string, that meets password validation rules.

The password for monitor can be B*kA2aBntGYdvJaf

It can be added to mysql_config_editor once using

mysql_config_editor set --login-path=monitor --host=XXXX --port=3306 --user=monitor --password

Enter password:

And can be used thereafter by using -

mysql --login-path=monitor

Using the password in scripts

This means that we can run scripts without providing the password. For example, if we were writing a script to take a backup or promote a slave, that will run on the database server. Here is an example -

[root@ ~]# mysql --login-path=scripts -e "show slave status \G"

Slave_IO_State:

Master_Host: XXXX

Master_User: XXXX

Master_Port: 3306

Connect_Retry: 60

Master_Log_File:

Read_Master_Log_Pos: 4

Relay_Log_File: mysql_relay_log.000001

Risks and Caveats

While this is a comparatively better way of logins for automation scripts it is still not secure. This is the reason for the "supposedly" in the title of this post. While this is a convenient way, it is by no means completely secure.

[root@ ~]# mysql_config_editor print --all

[mysqlconn]

user = root

password = *****

host = localhost

[monitor]

user = monitor

password = *****

host = XXXX

port = 3306

We can read the contents of this encrypted file using the my_print_defaults utility.

[root@ ~]# my_print_defaults -s monitor

--user=monitor

--password=B*kA2aBntGYdvJaf

--host=XXXX

--port=3306

my_print_defaults is a part of standard MySQL install.

For this reason even though mysql_config_editor saves the password in an encrypted file, it is recommended that this only be used for "root" linux user, and the logins saved in the mysql_config_editor be restricted by 'user'@'localhost'. The root access in these servers also needs to be tightly controlled and regulated.

Resources:

mysql_config_editor documentation

Wednesday, December 19, 2018

Exchange Partition: An Archiving Strategy

As applications mature, tables tend to grow infinitely, specially the time series ones. They have tens of millions of rows and run up to a few TBs in size. Even though the tables have so much data, applications only need to access data that was recently saved. (Say within, the last year or so). A trivial task of adding a new index or changing the type of a column for such a humungous table, becomes a very painful task. I was charged with such a task recently and that got me thinking about archiving strategies.

Archiving is a good practice to keep the data to a required working set, while at the same time, maintaining the data that might be needed for legal or compliance reasons, or for certain less frequent workflows in the application, like looking at the 10 year purchase history of a customer.

Though archiving seems like a database problem, it cannot be done in vacuum. It needs buy in from application, legal and compliance to know what the archiving boundaries are, what are the access patterns and availability requirements for the archived data.

But solving for the database problem is something right up my alley, so I thought of testing a few approaches. One of my favorites for the ease of use is the "EXCHANGE PARTITION" feature for 5.7.

In MySQL 5.7, it is possible to exchange a table partition or subpartition with a table using –

ALTER TABLE pt EXCHANGE PARTITION p WITH TABLE nt

where

pt is the partitioned table

p is the partition of pt to be exchanged

and nt is a non partitioned table.

Privileges needed for the statement is the combination of ALTER TABLE and

TRUNCATE TABLE on both tables.

This provides a great opportunity for archival of partitioned tables.

Consider an online retail store, that logs its invoices or orders in a table. Here is an over simplified invoice table for the store.

CREATE TABLE `invoice` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`invoice_num` int(10) unsigned NOT NULL,

`stockcode` int(10) unsigned NOT NULL,

`invoice_date` datetime NOT NULL,

`price` decimal(10,2) DEFAULT NULL,

PRIMARY KEY (`id`,`invoice_date`)

) ENGINE=InnoDB

PARTITION BY RANGE (YEAR(invoice_date))

(PARTITION p2009 VALUES LESS THAN (2010) ENGINE = InnoDB,

PARTITION p2010 VALUES LESS THAN (2011) ENGINE = InnoDB,

PARTITION p2011 VALUES LESS THAN (2012) ENGINE = InnoDB,

PARTITION p2012 VALUES LESS THAN (2013) ENGINE = InnoDB,

PARTITION p2013 VALUES LESS THAN (2014) ENGINE = InnoDB,

PARTITION p2014 VALUES LESS THAN (2015) ENGINE = InnoDB,

PARTITION p2015 VALUES LESS THAN (2016) ENGINE = InnoDB,

PARTITION p2016 VALUES LESS THAN (2017) ENGINE = InnoDB,

PARTITION p2017 VALUES LESS THAN (2018) ENGINE = InnoDB,

PARTITION p2018 VALUES LESS THAN (2019) ENGINE = InnoDB,

PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB)

The queries on this table only need data from a year ago, to be available for queries, however for compliance and other non-frequent workflow we need to maintain older data. The EXCHANGE PARTITION gives us an opportunity to archive partitions, with quick DDL operations.

Consider the need to archive 2010 data, which is in p2010 partition.

mysql> select count(*) from invoice PARTITION (p2010);

+----------+

| count(*) |

+----------+

| 1111215 |

+----------+

1 row in set (0.38 sec)

To use EXCHANGE PARTITION the table to be partitioned pt and the non-partitioned table that the data will be archived into nt need to meet a few requirements -

- The non-partitioned table needs to have the same structure as the partitioned table.

- It cannot be a temporary table

- It cannot have foreign keys, or have any foreign keys that refer to it.

- There are no rows in nt that lie beyond the boundaries of the partition p.

Lets create the table invoice_2010 which will archive all invoices from 2010.

mysql> create table invoice_2010 like invoice;

Query OK, 0 rows affected (0.44 sec)

mysql> alter table invoice_2010 remove partitioning;

Query OK, 0 rows affected (0.38 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> show create table invoice_2010 \G

Table: invoice_2010

Create Table: CREATE TABLE `invoice_2010` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`invoice_num` int(10) unsigned NOT NULL,

`stockcode` int(10) unsigned NOT NULL,

`invoice_date` datetime NOT NULL,

`price` decimal(10,2) DEFAULT NULL,

PRIMARY KEY (`id`,`invoice_date`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8

1 row in set (0.00 sec)

Archiving the partition using EXCHANGE partition, can be done as –

mysql> select count(*) from invoice PARTITION (p2010);

+----------+

| count(*) |

+----------+

| 1111215 |

+----------+

1 row in set (0.38 sec)

mysql> ALTER TABLE invoice EXCHANGE PARTITION p2010 WITH TABLE invoice_2010;

Query OK, 0 rows affected (0.14 sec)

mysql> select count(*) from invoice PARTITION (p2010);

+----------+

| count(*) |

+----------+

| 0 |

+----------+

1 row in set (0.00 sec)

mysql> select count(*) from invoice_2010;

+----------+

| count(*) |

+----------+

| 1111215 |

+----------+

1 row in set (0.21 sec)

If needed the partition can also be moved back.

mysql> ALTER TABLE invoice EXCHANGE PARTITION p2010 WITH TABLE invoice_2010;

Query OK, 0 rows affected (0.73 sec)

mysql> select count(*) from invoice_2010;

+----------+

| count(*) |

+----------+

| 0 |

+----------+

1 row in set (0.00 sec)

mysql> select count(*) from invoice PARTITION (p2010);

+----------+

| count(*) |

+----------+

| 1111215 |

+----------+

1 row in set (0.23 sec)

If any of the rows in the non partitioned table violate the partitioning rule for the partition it is being exchanged with, you would get an error.

mysql> update invoice_2010 set invoice_date='2012-01-01' where id=10000009;

Query OK, 1 row affected (0.17 sec)

Rows matched: 1 Changed: 1 Warnings: 0

mysql> ALTER TABLE invoice EXCHANGE PARTITION p2010 WITH TABLE invoice_2010;

ERROR 1737 (HY000): Found a row that does not match the partition

In its current implementation a row by row validation, does a full table scan on the non partitioned table to evaluate if there is a row in the table that violates the partitioning rule. From the open worklog, it seems like there are plans to have the command use index instead of a full table scan but it isn't implemented yet. A workaround to this is using WITHOUT VALIDATION.

To avoid time consuming validation when exchanging a partition with a table that has many rows, it is possible to skip the row-by-row validation step by appending WITHOUT VALIDATION to the ALTER TABLE ... EXCHANGE PARTITION statement. However with this the onus lies on the engineer to verify that no partitioning rules are getting violated.

mysql> ALTER TABLE invoice EXCHANGE PARTITION p2010 WITH TABLE invoice_2010 WITHOUT VALIDATION;

Query OK, 0 rows affected (0.13 sec)

From a 20,000 ft view, the high level architecture for this involves

- Taking a upgradable metadata lock on both tables

- Verifying that the metadata matches (i.e both tables have the same structure)

- If WITHOUT VALIDATION is not used, verifying data in the non partitioned table

- Upgrading to a exclusive metadata lock for both tables

- Rename non-partitioned table to partition and partition to non-partitioned table.

- Releasing the metadata locks

It would have been nice if it was possible to append to the non partitioned table, rather than a exchange, but my guess is then it wouldn't be a metadata operation. The same can be achieved by an exchange to an intermediate table, followed by copy into the archive table.

As pertains to previous example, this would involve creating a table invoice_2010_intermediate. Exchanging p2010 from invoice table with invoice_2010_intermediate and doing a copy from invoice_2010_intermediate to invoice_2010.

But for what it does, I think it is a delightful approach to archiving data that you still need on your database server, for reading and archiving, but it doesn't change.

Resources :

Script for populating data into invoice table - https://gist.github.com/dontstopbelieveing/3c42338c8f5c756a526ab2f7bef5525e

-->

Friday, January 19, 2018

Crash Recovery : Scratching the surface : Part 3

In the last post we saw how the log is written during the operation of a database system. In this post we will go over the actual recovery mechanism and how the log is used.

Why recovery?

As we discussed in our first post on this topic, a recovery algorithm ensures that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist even after a crash, restart or error.

What exactly happens in recovery?

There are two processes that are essential for recovery that happen in the database system on an ongoing basis Checkpointing and Write-Ahead Logging.

What is a checkpoint?

A checkpoint is a snapshot of the DBMS state. By taking checkpoints periodically the work done during a recovery can be reduced. A begin_checkpoint record is written to indicate start of checkpoint. An end_checkpoint record is written consisting of transaction table and dirty page table and appended to the log. After the checkpoint process is complete a special master record is written containing LSN of the begin_checkpoint log record. When system comes back from the crash the restart process begins by locating the most recent checkpoint.

What is write ahead logging?

We saw this already in the last post, what it essentially means is that any change to database is first recorded in the log and record in the log must be written to stable storage (disk) before the change to the database is written to the disk.

Recovery proceeds in three phases - Analysis, Redo and Undo.

During Analysis the system determines which transactions were committed and which were not and essentially collects information about all transactions that were active (not committed or rolled back) at the time of the crash. Redo retraces or re-does all actions of the system and brings it back to the state that it was at the time of the crash. This is followed by the Undo stage

We will take a look at what happens in each phase in detail next.

Lets continue from the example in our last post,

LSN	prevLSN	transID	type	PageID	Length	Offset	Before	After
1	-	T1	update	P500	3	21	ABC	DEF
2	-	T2	update	P600	3		HIJ	KLM
3	2	T2	update	P500	3	20	GDE	QRS
4	1	T1	update	P505	3		TUV	WXY
5	3	T2	commit	-	-	-	-	-

Let us assume that T1 changes NOP to ABC on page P700, writes to the log record.

LSN	prevLSN	transID	type	PageID	Length	Offset	Before	After
1	-	T1	update	P500	3	21	ABC	DEF
2	-	T2	update	P600	3		HIJ	KLM
3	2	T2	update	P500	3	20	GDE	QRS
4	1	T1	update	P505	3		TUV	WXY
5	3	T2	commit	-	-	-	-	-
6	4	T1	update	P700	3		NOP	ABC

Let us look at a scenario when the system crashes before the last log record is written to stable storage.

What exactly happens in the Analysis phase?

The analysis phase begins by examining the most recent checkpoint, and initializing the dirty page table and transaction table to copies of those structures in the end_checkpoint records. The analysis then scans the log forward till it reaches the end of the log.

If a log record other than an end record for T is encountered an entry for T is added to transaction table if not already there. Entry for T is modified and LastLSN field of T is set to this record.

If an end log record for transaction T is encountered, T is removed from transaction table because it is no longer active.

If the log record is a commit record the status is set to commit, C otherwise to U indicating it needs to be undone

If a redo log record affecting page P is encountered and P is not in dirty page table an entry is added to dirty page table

At end of analysis table transaction table contains all transactions that were active at time of the crash.
In our scenario, let us assume the checkpoint was done at the beginning when both the transaction and dirty page table were dirty and analysis starts from the first log record.

Looking at LSN 1

update

P500

ABC

DEF

Transaction Table

transID	lastLSN
1	1

lastLSN - LSN of the most recent log record belonging to the transaction.
Dirty Page Table

pageID	recLSN
P500	1

recLSN - LSN of the first log record that caused the page to become dirty

Looking at LSN 2

update

P600

HIJ

KLM

Transaction Table

transID	lastLSN
1	1
2	2

Dirty Page Table

pageID	recLSN
P500	1
P600	2

Looking at LSN 3

update

P500

GDE

QRS

Transaction Table

transID	lastLSN
1	1
2	3

Dirty Page Table

pageID	recLSN
P500	1
P600	2

Looking at LSN 4

update

P505

TUV

WXY

Transaction Table

transID	lastLSN
1	4
2	3

Dirty Page Table

pageID	recLSN
P500	1
P600	2

Looking at LSN 5

commit

Transaction Table

transID	lastLSN
1	4

Dirty Page Table

pageID	recLSN
P500	1
P600	2
P505	4

Since the system crashed before log record 6 could be written to stable storage, the log record is not read in the Analysis phase at all. This is the state of transaction table and dirty page table at the end of the analysis phase.

What exactly happens in the Redo phase?

During the redo phase the system applies updates of all transactions committed or otherwise. If a transaction was aborted before the crash and its updates were undone, as indicated by compensation log records (CLRs), the actions described in CLRs are also reapplied.

The Redo phase starts with the smallest LSN in the dirty page table that was constructed in the Analysis phase. This LSN refers to the oldest update that may not have been written to the disk prior to the crash. Starting from this LSN, redo scans forward till it reaches the end of the log.

For each log record (update or CLR), redo checks the dirty page table

If the page is not in dirty page table the log record is ignored as this means all the changes to the page have been already written to the disk.

If the page is in the dirty page table but the recLSN (the LSN that made the page dirty) is greater than LSN of the record being checked the record is ignored. This means that the change was written to disk and a latter LSN made the page dirty.

It then retrieves the page and checks the most recent LSN on the page (pageLSN), if this is greater than or equal to LSN of the record being checked the record is ignored as this means the page already contains the changes from the LSN being checked.

For all other cases, the log action is redone, whether it is an update record or a CLR (records written during rollback / abort) The logged action is reapplied and the pageLSN on the page is set to the LSN of the redone record. No additional log record is written.

Considering the transaction table and the dirty page table at the end of our analysis phase in our example

Transaction Table

transID	lastLSN
1	4

Dirty Page Table

pageID	recLSN
P500	1
P600	2
P505	4

Redo phase starts with smallest LSN is dirty page table which is 1 and scans forward from the log.

Looking at LSN 1

update

P500

ABC

DEF

P500 is in the dirty page table, and recLSN which is 1, is equal to the LSN that is being checked. Therefore the system retrieves the page. It checks the pageLSN, which is less than 1 and therefore decides the action must be redone. It changes ABC to DEF.

Looking at LSN 2

update

P600

HIJ

KLM

P600 is in the dirty page table and recLSN is 2, which is equal to the LSN being checked. Therefore the system retrieves the page. It checks pageLSN (3) which is greater than the LSN being checked and therefore does not need to redo the update (Remember, T2 was committed?)

Looking at LSN 3

update

P500

GDE

QRS

This is the same case as the previous one and no redo is necessary.

Looking at LSN 4

update

P505

TUV

WXY

P505 is in the dirty page table. The pageLSN is less than the LSN being checked and therefore the update needs to be redone. It changes TUV to WXY

Looking at LSN 5

commit

Since this is not an update or CLR record no action needs to be done. At the end of the redo phase, an end record is written for T2.

What exactly happens in the Undo phase?

The aim of the undo phase is to undo the actions of all transactions that were active at the time of the crash, effectively aborting them. The Analysis phase identifies all transactions that were active at the time of the crash, along with their most recent LSN. (lastLSN) All these transactions must be undone in the reverse order of which they appear in the log. Therefore undo starts from the largest i.e. most recent LSN from the transactions to be undone.
For each log record

If the record is a CLR and the undoNextLSN values is not null, the undoNextLSN values is added to the logs records to undo. If the undonextLSN is null an end record is written for the transaction because it is completely undone and the CLR record is discarded.

If the record is an update a CLR is written and the corresponding action is undone just as if the system were doing a rollback and prevLSN in the update record is added to the set of records to be undone.

When the set of records to be undone is completely empty the undo phase is complete.

Once the undo phase is complete, the system is said to be “recovered” and can proceed with normal operations.

In our example, the transaction table from Analysis phase is -

Transaction Table

transID	lastLSN
1	4

The undo phase starts from log record with LSN 4 and creates a set of actions to undo

Looking at LSN 4

update

P505

TUV

WXY

WXY is changed to TUV

Set to Undo - {1} which is the prevLSN

A CLR is added

CLR

P505

WXY

TUV

where 1 is the undoNextLSN

Looking at LSN 1

update

P500

ABC

DEF

DEF is changed to ABC
Set to Undo - {} since prevLSN is null

A CLR is added

CLR

P500

DEF

ABC

Since the set of actions to be undone is zero, the undo is complete. T1 is removed from the transaction table. A checkpoint is taken and the system is in a recovered state and can proceed with normal operation.

What happens if a system crashes during crash recovery?

If there is a crash during crash recovery, the system can still do recovery as for every update that was undone, a CLR is written and the system needs to just redo the CLRs. This is why CLRs are an important part of recovery.

In our example if the system crashed after the change from WXY to TUV was done but before the change from DEF to ABC was done, when the system is in recovery state again, it would see the CLR for the WXY to TUV change in the redo phase and redo or repeat the change. The change from DEF to ABC would be done as a part of undo during the latter/second recovery.

In the next and hopefully last post on recovery, I’ll try to look at MySQL logs and source code related to the recovery component and see some of these things in action.

Pingback

Crash Recovery : Scratching the surface : Part 2