Wednesday, January 10, 2018

Crash Recovery : Scratching the surface : Part 2

Continuing where we left off in the last post, we saw why database systems use logs for recovery. Besides logs there are other supporting data structures that are needed for recovery.
Broadly speaking there are three data structures that are part of a recovery mechanism (We’ll discuss the ARIES mechanism here)
  • Log
  • Transaction Table
  • Dirty Page Table

What exactly is a log record?


As we discussed in the post before  log is nothing but a history of actions executed by the database system. It is divided into two parts log tail and log file. Log tail is most recent portion of the log that is kept in main memory and is periodically forced to stable storage (for example, when a transaction commits) Log file is a file stored on stable storage.

The structure of a log can be summarized as shown in the table below

LSN
prevLSN
transID
type
pageID
length
offset
Before-
image
After-
image


LSN is log sequence number, generally it is a monotonically increasing unique identifier. prevLSN is the LSN of the previous log record in the transaction, implementing the sequence of logs in a transactions as a linked list. It contains the transaction ID of the transaction and pageID of the page being changed, along with the before and after values.

Based on different type of log records, that we will discuss further certain fields may be missing but typically any log record written on an update should have this format.

Transaction table 

A transaction table can be conceptually depicted as -

TransID
LastLSN
Status

It contains a record for all the active transactions in the system. It contains the transaction id, status of the transaction (in progress, committed or aborted) and LastLSN the LSN of the most recent log record belonging to this transaction.



Dirty Page table

A dirty page is, a page with changes that are not yet written to disk. Dirty Page Table contains one entry for each dirty page and recLSN which is the LSN of the first log record that caused the page to become dirty.

A dirty page table can be conceptually depicted as -


pageID
recLSN
Status


What do logs look like when a transaction happens in the database?

Consider updating a value in database - 

Transaction T1 changes value of X from ABC to DEF. Suppose the data lives on page P500 at offset 21-23. The system will make an entry into the log, transaction table, and dirty page table.


Log


LSN
prevLSN
transID
type
pageID
Length
offset
Before
After
1
-
T1
update
P500

21
ABC
DEF

Transaction Table


transID
lastLSN
T1
1

Dirty Page Table


pageID
recLSN
P500
1

Transaction T2 changes ‘HIJ’ to ‘KLM’ on page P600

Log


LSN
prevLSN
transID
type
pageID
Length
offset
Before
After
1
-
T1
update
P500

21
ABC
DEF
2
-
T2
update
P600


HIJ
KLM

Transaction Table

transID
lastLSN
T1
1
T2
2

Dirty Page Table


pageID
recLSN
P500
1
P600
2


Transaction T2 changes bytes 20 through 22 from GDE to QRS on page P500 and Transaction T1 changes ‘TUV’ to ‘WXY’ on P505

Log

LSN
prevLSN
transID
type
pageID
Length
offset
Before
After
1
-
T1
update
P500
3
21
ABC
DEF
2
-
T2
update
P600
3

HIJ
KLM
3
2
T2
update
P500
3
20
GDE
QRS
4
1
T1
update
P505
3

TUV
WXY

Transaction Table


transID
lastLSN
T1
1
T2
4

Dirty Page Table


pageID
recLSN
P500
1
P600
2
P505
4



What happens when you commit a transaction?

If a system issues a  T2 commit, a commit type log record is written to the log. After this the log tail is flushed to disk and the transaction is considered committed. After this additional actions need to be taken like removing entry for T2 from transaction tables, when these actions are complete an end type log record is written to the log.

Log

LSN
prevLSN
transID
type
pageID
Length
offset
Before
After
1
-
T1
update
P500
3
21
ABC
DEF
2
-
T2
update
P600
3

HIJ
KLM
3
2
T2
update
P500
3
20
GDE
QRS
4
1
T1
update
P505
3

TUV
WXY
5
3
T2
commit
-
-
-
-
-
..








7
5
T2
end
-
-
-
-
-

What happens if you rollback or abort a transaction?


If a system issues a  T1 rollback or if T1 is aborted, the system would read the last log record belonging to T1, that is log record 4, it would then undo the changes made by T1, and write a log record of type “compensation log record” (CLR). CLR in addition to information fields of update records contain another field called undoNextLSN which points to the next record that has to be reversed as a part of the undo process.

LSN
prevLSN
transID
type
pageID
Length
offset
Before
After
Undo
NextLSN
1
-
T1
update
P500
3
21
ABC
DEF

2
-
T2
update
P600
3

HIJ
KLM

3
2
T2
update
P500
3
20
GDE
QRS

4
1
T1
update
P505
3

TUV
WXY

5
4
T1
abort
-
-
-
-
-

6
5
T1
CLR
P505
3

WXY
TUV
1
7
6
T1
CLR
P500
3

DEF
ABC
NULL
8
7
T1
end
-
-
-
-
-
-


Compensation log records are an integral part of a crash recovery mechanism because with write-ahead logging, it is very much possible that the log undoing the transaction that was aborted is written to stable storage but the actual pages have not been flushed to disk. It is also possible that a system crashes in the middle of recovery, when it has undone few changes of a transaction but not all. In such cases,  during crash recovery CLR contains the information needed to redo the rollback / undo. We will see more about this during the overview of the recovery process.  

In summary a log record is written whenever a page is updated or transaction is committed or aborted. Once a transaction is aborted or rolled back the updates are undone and the undo is written to the log in the form of a CLR. In the next post, I’ll try to unravel the crash recovery process. Till then, So long!
Pingback

Friday, January 5, 2018

Crash Recovery : Scratching the surface : Part 1

We hear terms like ‘recovery is slow’ , ‘system is in recovery mode’ and I felt that I don’t understand enough about what "recovery" entails. In an attempt to understand it, I did some dabbling in reading and wanted to condense it here on my blog. 

Where is your data?

It is reasonable to assume that most of your data resides on non volatile storage (think disk) and only parts of database are brought into volatile storage (memory) (these are called buffer blocks) at a time. Transactions input data blocks from disk to memory and output data blocks from memory to disk(physical blocks). This data transfer between memory and disk keeps happening during the lifetime of your database. 

A buffer block is eventually written out to disk either because the buffer manager needs the space for other purposes or the system wishes to reflect the changes in memory to disk, it does a force output of the block to disk.


Why Recovery?

RDBMS are ACID compliant. Atomicity requires that each transaction be all or nothing, if one part of transaction fails then the entire transaction fails and the database is left unchanged. Durability means that once a transaction is committed it will remain so. Atomicity and Durability are guaranteed in each and every situation including power failures, errors and crashes. 

Therefore when coming back from a crash it is important to ensure that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist. 

For example,
If a transaction T wants to transfer $100 from Account A having balance $500 to Account B having balance $200.
  • Transaction T will bring the blocks Ax and Bx containing data for A and B respectively into memory using operations read(Ax) and read(Bx) 
  • It performs operations on data in Ax and Bx and the balances for A and B are now $400 and $300 respectively. 
  • It then writes Ax back to persistent storage with the new value, while the operation write(Bx) is in progress the system crashes. 

If we did not have a mechanism for recovery, the system would come back from the crash with the values
A: $400
B: $200

Instead of the values that you would accept for ACID compliance
A: $400
B: $300

To ensure that database operates under ACID compliance and fail stop assumptions - i.e. hardware errors and bugs in the software bring software to a halt but do not corrupt non volatile storage, we need recovery algorithms that ensure that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist. 


How does recovery work?

There are two aspects to a recovery algorithm -
  • Actions taken during normal operations of a database to ensure that enough information exists to allow recovery from failures
  • Actions taken after a failure to recover the database to a state of ACID compliance
The first aspect is achieved using logs and a principle called “Write-Ahead logging”. Write-Ahead Logging means that any change to a database object is first recorded in the log. The record in the log must be written to stable storage before the change to the database object is written to disk. A transaction is considered committed only when all its log records including the commit record are written to stable storage. Thus the log is nothing but a history of actions executed by the database system. It is divided into two parts log tail and log file. Log tail is most recent portion of the log that is kept in main memory and is periodically forced to stable storage (for example, when a transaction commits) Log file is a file stored on stable storage.


Why use logs?

In a world where there are no logs, we would have to use a force approach. When a transaction commits all pages modified by the transaction will have to be written to the disk. In a no-force approach only the log records for the transaction need to be written to the disk. Log record is size is much smaller than the page size. Moreover log records are written sequentially to a file, so the forcing the log-tail to disk is only an append operation. This makes the cost of forcing the log-tail to disk much lesser than the cost of writing all changed pages to a disk when a transaction commits. 

In interest of keeping these blog posts small and readable I'll end this post here, in the next post, I'll go over how the logs and supporting tables are structured, how and when are the logs written, different types of logs and what exactly happens during recovery.

MySQL storing time

Storing time in databases seems like a daunting task to many. The general principle with storing time in databases is - ALWAYS STORE TIME ...