This post is just an introduction to B Trees and here we will discuss one of the complicated data structures, B Trees. I would advice the reader to get familiar with the Tree data structures, and Balanced Trees. For e.g. : Binary Search Trees, AVL Trees, Red Black Trees and 2-3-4 Trees.
If you are not familiar with any/few of the above mentioned data structures, then please do not worry, I will try to explain everything here from scratch and hopefully you might not need any external support. But in case you are trying to learn the above topics, please look into the Additional Read section at the bottom of this post.
I believe that the topic is huge and cannot be efficiently covered in one post, hence we will extend this to multiple posts.
Introduction to B Trees
The very first thing I would like to point out is, "The B in B Tree does not stand for binary, there are many other options which might be represented by the B. As no one has mentioned the real intention behind the name B Tree, it is quite ambiguous and here are few words which go nicely with B here Balanced, Broad, Bayer
There are reasons why these words make more sense in the name of a B Tree.
The tree is self balancing in nature and it employs split
operations in addition to copy and paste
to balance itself.
The breadth of the tree is at least two order of magnitude higher than the height of the tree. The tree expands horizontally and hence the name looks more appropriate.
The inventor of this data structure, he also invented the Red Black Trees
The motivation behind B Trees
You won't see B Trees in normal applications, this data structure was invented with a motivation to deal with huge amount of data. When I say huge, I mean it. It is ideal for dealing with data which won't fit in the Random Access Memory or Main Memory. In fact the data size can be as big as GBs and TBs. The data structure doesn't put any restrictions on the size of data it can handle. But there are limitations due to hardware which I will mention later.
So, looking at this property, it is more likely that the applications of B Tree mostly lie in the database and similar industries where the data is stored on disk and it needs to be accessed, read, modified and rewritten.
Why do we need B Tree particularly for the data stored on disk?
I would like to draw your attention to the various kinds of memory/storage options available to a modern computer. Here is a list of these options in the increasing order of time of access:
- Cache (L1, L2 & L3)
- RAM (Random Access Memory)
- Disk Drives (Solid State Disks)
- Disk Drives (Magnetic Disks)
The order of magnitude of access increases significantly from top to bottom. The size of the storage also increases significantly from top to bottom and the price decreases drastically from top to bottom.
The access time for RAM is in nano seconds where as the access time for Magnetic Disk is in milliseconds. Which means if a program has to read from the disk it would be slower by 100, 000 times compared to reading from the RAM.
Which also translates to "more number of Disk Operations means slower programs and vice versa".
How is data stored on a magnetic disk?
The magnetic disk contains magnetic material on the surface and it is divided into various tracks. The read/write head is positioned on one track at a time and can read a huge amount of data which is contiguous. The data is typical called a page and the size may vary from 1 megabyte to 4 megabytes or even more.
It is always advisable to read a data from the disk in a chunk and then filter it out in the main memory instead of accessing the disk every time to fetch one value from the disk and then test it.
Understanding the B Tree properties
The B Tree data structure outlines the following properties:
- The B Tree has nodes and each node has the following structure:
- It contains a list of keys such that the keys are arranged in ascending order.
- For n keys in the node, there will be n+1 children. This means, the n keys will mark boundaries to n+1 children.
- The B Tree is a search tree, which means for a key k in a node X all the elements in the predecessor children to k of node X will have values smaller than k.
- All leaves have the same depth, which is the height of the tree.
- Each B Tree is defined by minimum degree D( >=2 ), which is the lowest number of children an internal node must have. The number of children in a B Tree is variable, it can range anywhere from D to 2D.
- A direct result of the above property is that each node in a B Tree has variable number of keys and the number of keys may range from D-1 to 2D-1. The root can be an exception and it can have less than D-1 keys.
Here is the detailed image of a B Tree node.
On an average, the number of keys stored in a node can range from 50 to 2000. Typically the node stores data corresponding to one page so that the node can be populated in one DISK READ operation. That is why 2000 keys is a good number for a node size.
And as we discussed above the number of children of a node is one greater than the number of keys in the node. This will mean the following:
- If root has around 2000 keys, It has 2001 child nodes.
- Each of the node at level 1 will have 2000 keys, hence the total keys stored at the level 1 is 2000*2001 >= 4,000,000
- At level 2, the total number of keys would be 8,000,000,000 approx.
This shows, how small the height of the tree would be when the branching factor is in 1000s.
Here is how a typical B Tree would look like:
Advantages of B Trees
As discussed above the height of the B Tree is very small compared to the number of nodes due to its large branching factor. A search would mean evaluating one node from each level. If there are H levels in the tree, then we need to access H nodes at max, and each access is a DISK READ operation. This ensures that the number of DISK READ operations will be minimal.
The root can always reside in the main memory.
This is just an introductory post for B Trees, we understood the importance of a B Tree and also learnt about the structure of the B Tree. In the next post, we will learn about the balancing procedure of the B Tree and few of the CRUD operations on the B Tree.
In another related post, we will learn about the remaining operations on a B Tree.
Stay Connected and Stay subscribed.