Friday, January 5, 2018

Crash Recovery : Scratching the surface : Part 1

We hear terms like ‘recovery is slow’ , ‘system is in recovery mode’ and I felt that I don’t understand enough about what "recovery" entails. In an attempt to understand it, I did some dabbling in reading and wanted to condense it here on my blog. 

Where is your data?

It is reasonable to assume that most of your data resides on non volatile storage (think disk) and only parts of database are brought into volatile storage (memory) (these are called buffer blocks) at a time. Transactions input data blocks from disk to memory and output data blocks from memory to disk(physical blocks). This data transfer between memory and disk keeps happening during the lifetime of your database. 

A buffer block is eventually written out to disk either because the buffer manager needs the space for other purposes or the system wishes to reflect the changes in memory to disk, it does a force output of the block to disk.

Why Recovery?

RDBMS are ACID compliant. Atomicity requires that each transaction be all or nothing, if one part of transaction fails then the entire transaction fails and the database is left unchanged. Durability means that once a transaction is committed it will remain so. Atomicity and Durability are guaranteed in each and every situation including power failures, errors and crashes. 

Therefore when coming back from a crash it is important to ensure that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist. 

For example,
If a transaction T wants to transfer $100 from Account A having balance $500 to Account B having balance $200.
  • Transaction T will bring the blocks Ax and Bx containing data for A and B respectively into memory using operations read(Ax) and read(Bx) 
  • It performs operations on data in Ax and Bx and the balances for A and B are now $400 and $300 respectively. 
  • It then writes Ax back to persistent storage with the new value, while the operation write(Bx) is in progress the system crashes. 

If we did not have a mechanism for recovery, the system would come back from the crash with the values
A: $400
B: $200

Instead of the values that you would accept for ACID compliance
A: $400
B: $300

To ensure that database operates under ACID compliance and fail stop assumptions - i.e. hardware errors and bugs in the software bring software to a halt but do not corrupt non volatile storage, we need recovery algorithms that ensure that all the changes that were part of a transaction that was not committed are rolled back and all the changes that are part of a transaction that was committed persist. 

How does recovery work?

There are two aspects to a recovery algorithm -
  • Actions taken during normal operations of a database to ensure that enough information exists to allow recovery from failures
  • Actions taken after a failure to recover the database to a state of ACID compliance
The first aspect is achieved using logs and a principle called “Write-Ahead logging”. Write-Ahead Logging means that any change to a database object is first recorded in the log. The record in the log must be written to stable storage before the change to the database object is written to disk. A transaction is considered committed only when all its log records including the commit record are written to stable storage. Thus the log is nothing but a history of actions executed by the database system. It is divided into two parts log tail and log file. Log tail is most recent portion of the log that is kept in main memory and is periodically forced to stable storage (for example, when a transaction commits) Log file is a file stored on stable storage.

Why use logs?

In a world where there are no logs, we would have to use a force approach. When a transaction commits all pages modified by the transaction will have to be written to the disk. In a no-force approach only the log records for the transaction need to be written to the disk. Log record is size is much smaller than the page size. Moreover log records are written sequentially to a file, so the forcing the log-tail to disk is only an append operation. This makes the cost of forcing the log-tail to disk much lesser than the cost of writing all changed pages to a disk when a transaction commits. 

In interest of keeping these blog posts small and readable I'll end this post here, in the next post, I'll go over how the logs and supporting tables are structured, how and when are the logs written, different types of logs and what exactly happens during recovery.

No comments:

Post a Comment

MySQL storing time

Storing time in databases seems like a daunting task to many. The general principle with storing time in databases is - ALWAYS STORE TIME ...