Friday, January 19, 2018

Crash Recovery : Scratching the surface : Part 3

In the last post we saw how the log is written during the operation of a database system. In this post we will go over the actual recovery mechanism and how the log is used.

Why recovery?

As we discussed in our first post on this topic, a recovery algorithm ensures that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist even after a crash, restart or error.


What exactly happens in recovery?

There are two processes that are essential for recovery that happen in the database system on an ongoing basis Checkpointing and Write-Ahead Logging.


What is a checkpoint?

A checkpoint is a snapshot of the DBMS state. By taking checkpoints periodically the work done during a recovery can be reduced. A begin_checkpoint record is written to indicate start of checkpoint. An end_checkpoint record is written consisting of transaction table and dirty page table and appended to the log. After the checkpoint process is complete a special master record is written containing LSN of the begin_checkpoint log record. When system comes back from the crash the restart process begins by locating the most recent checkpoint.

What is write ahead logging?

We saw this already in the last post, what it essentially means is that any change to database is first recorded in the log and record in the log must be written to stable storage (disk) before the change to the database is written to the disk.

Recovery proceeds in three phases - Analysis, Redo and Undo.

During Analysis the system determines which transactions were committed and which were not and essentially collects information about all transactions that were active (not committed or rolled back) at the time of the crash. Redo retraces or re-does all actions of the system and brings it back to the state that it was at the time of the crash. This is followed by the Undo stage
We will take a look at what happens in each phase in detail next.
Lets continue from the example in our last post,
LSN
prevLSN
transID
type
PageID
Length
Offset
Before
After
1
-
T1
update
P500
3
21
ABC
DEF
2
-
T2
update
P600
3

HIJ
KLM
3
2
T2
update
P500
3
20
GDE
QRS
4
1
T1
update
P505
3

TUV
WXY
5
3
T2
commit
-
-
-
-
-

Let us assume that T1 changes NOP to ABC on page P700, writes to the log record.
LSN
prevLSN
transID
type
PageID
Length
Offset
Before
After
1
-
T1
update
P500
3
21
ABC
DEF
2
-
T2
update
P600
3

HIJ
KLM
3
2
T2
update
P500
3
20
GDE
QRS
4
1
T1
update
P505
3

TUV
WXY
5
3
T2
commit
-
-
-
-
-
6
4
T1
update
P700
3

NOP
ABC
Let us look at a scenario when the system crashes before the last log record is written to stable storage.

What exactly happens in the Analysis phase?

The analysis phase begins by examining the most recent checkpoint, and initializing the dirty page table and transaction table to copies of those structures in the end_checkpoint records. The analysis then scans the log forward till it reaches the end of the log.
  • If a log record other than an end record for T is encountered an entry for T is added to transaction table if not already there. Entry for T is modified and LastLSN field of T is set to this record.
  • If an end log record for transaction T is encountered, T is removed from transaction table because it is no longer active.
  • If the log record is a commit record the status is set to commit, C otherwise to U indicating it needs to be undone
  • If a redo log record affecting page P is encountered and P is not in dirty page table an entry is added to dirty page table
At end of analysis table transaction table contains all transactions that were active at time of the crash.
In our scenario, let us assume the checkpoint was done at the beginning when both the transaction and dirty page table were dirty and analysis starts from the first log record.

Looking at LSN 1

1
-
T1
update
P500
3
21
ABC
DEF

Transaction Table
transID
lastLSN
1
1
lastLSN - LSN of the most recent log record belonging to the transaction.
Dirty Page Table
pageID
recLSN
P500
1
recLSN - LSN of the first log record that caused the page to become dirty


Looking at LSN 2
2
-
T2
update
P600
3

HIJ
KLM


Transaction Table
transID
lastLSN
1
1
2
2

Dirty Page Table
pageID
recLSN
P500
1
P600
2

Looking at LSN 3
3
2
T2
update
P500
3
20
GDE
QRS

Transaction Table
transID
lastLSN
1
1
2
3

Dirty Page Table
pageID
recLSN
P500
1
P600
2
Looking at LSN 4
4
1
T1
update
P505
3

TUV
WXY

Transaction Table
transID
lastLSN
1
4
2
3

Dirty Page Table

pageID
recLSN
P500
1
P600
2


Looking at LSN 5
5
3
T2
commit
-
-
-
-
-

Transaction Table
transID
lastLSN
1
4

Dirty Page Table
pageID
recLSN
P500
1
P600
2
P505
4

Since the system crashed before log record 6 could be written to stable storage, the log record is not read in the Analysis phase at all. This is the state of transaction table and dirty page table at the end of the analysis phase.



What exactly happens in the Redo phase?

During the redo phase the system applies updates of all transactions committed or otherwise. If a transaction was aborted before the crash and its updates were undone, as indicated by compensation log records (CLRs), the actions described in CLRs are also reapplied.
The Redo phase starts with the smallest LSN in the dirty page table that was constructed in the Analysis phase. This LSN refers to the oldest update that may not have been written to the disk prior to the crash. Starting from this LSN, redo scans forward till it reaches the end of the log.
For each log record (update or CLR), redo checks the dirty page table
  • If the page is not in dirty page table the log record is ignored as this means all the changes to the page have been already written to the disk.
  • If the page is in the dirty page table but the recLSN (the LSN that made the page dirty) is greater than LSN of the record being checked the record is ignored. This means that the change was written to disk and a latter LSN made the page dirty.
It then retrieves the page and checks the most recent LSN on the page (pageLSN), if this is greater than or equal to LSN of the record being checked the record is ignored as this means the page already contains the changes from the LSN being checked.
For all other cases, the log action is redone, whether it is an update record or a CLR (records written during rollback / abort) The logged action is reapplied and the pageLSN on the page is set to the LSN of the redone record. No additional log record is written.
Considering the transaction table and the dirty page table at the end of our analysis phase in our example
Transaction Table
transID
lastLSN
1
4

Dirty Page Table
pageID
recLSN
P500
1
P600
2
P505
4

Redo phase starts with smallest LSN is dirty page table which is 1 and scans forward from the log.

Looking at LSN 1
1
-
T1
update
P500
3
21
ABC
DEF
P500 is in the dirty page table, and recLSN which is 1, is equal to the LSN that is being checked. Therefore the system retrieves the page. It checks the pageLSN, which is less than 1 and therefore decides the action must be redone. It changes ABC to DEF.
Looking at LSN 2
2
-
T2
update
P600
3

HIJ
KLM
P600 is in the dirty page table and recLSN is 2, which is equal to the LSN being checked. Therefore the system retrieves the page. It checks pageLSN (3) which is greater than the LSN being checked and therefore does not need to redo the update (Remember, T2 was committed?)

Looking at LSN 3
3
2
T2
update
P500
3
20
GDE
QRS
This is the same case as the previous one and no redo is necessary.

Looking at LSN 4
4
1
T1
update
P505
3

TUV
WXY
P505 is in the dirty page table. The pageLSN is less than the LSN being checked and therefore the update needs to be redone. It changes TUV to WXY

Looking at LSN 5
5
3
T2
commit
-
-
-
-
-
Since this is not an update or CLR record no action needs to be done. At the end of the redo phase, an end record is written for T2.


What exactly happens in the Undo phase?

The aim of the undo phase is to undo the actions of all transactions that were active at the time of the crash, effectively aborting them. The Analysis phase identifies all transactions that were active at the time of the crash, along with their most recent LSN. (lastLSN) All these transactions must be undone in the reverse order of which they appear in the log. Therefore undo starts from the largest i.e. most recent LSN from the transactions to be undone.
For each log record

  • If the record is a CLR and the undoNextLSN values is not null, the undoNextLSN values is added to the logs records to undo. If the undonextLSN is null an end record is written for the transaction because it is completely undone and the CLR record is discarded.
  • If the record is an update a CLR is written and the corresponding action is undone just as if the system were doing a rollback and prevLSN in the update record is added to the set of records to be undone. 
When the set of records to be undone is completely empty the undo phase is complete.

Once the undo phase is complete, the system is said to be “recovered” and can proceed with normal operations.

In our example, the transaction table from Analysis phase is -

Transaction Table
transID
lastLSN
1
4
The undo phase starts from log record with LSN 4 and creates a set of actions to undo

Looking at LSN 4

4
1
T1
update
P505
3

TUV
WXY
WXY is changed to TUV
Set to Undo - {1} which is the prevLSN

A CLR is added

6
4
T1
CLR
P505
3

WXY
TUV
1
where 1 is the undoNextLSN


Looking at LSN 1
1
-
T1
update
P500
3
21
ABC
DEF
DEF is changed to ABC
Set to Undo - {} since prevLSN is null

A CLR is added

7
6
T1
CLR
P500
3

DEF
ABC
-


Since the set of actions to be undone is zero, the undo is complete. T1 is removed from the transaction table. A checkpoint is taken and the system is in a recovered state and can proceed with normal operation.


What happens if a system crashes during crash recovery?

If there is a crash during crash recovery, the system can still do recovery as for every update that was undone, a CLR is written and the system needs to just redo the CLRs. This is why CLRs are an important part of recovery.


In our example if the system crashed after the change from WXY to TUV was done but before the change from DEF to ABC was done, when the system is in recovery state again, it would see the CLR for the WXY to TUV change in the redo phase and redo or repeat the change. The change from DEF to ABC would be done as a part of undo during the latter/second recovery.


In the next and hopefully last post on recovery, I’ll try to look at MySQL logs and source code related to the recovery component and see some of these things in action.






















MySQL storing time

Storing time in databases seems like a daunting task to many. The general principle with storing time in databases is - ALWAYS STORE TIME ...