Hierarchical Datastructure – detailed discussion

Introduction

I was planning to write an article on recursion before writing this article, but I got few requests to write about Tree Data Structure. So here I go, please treat this post as the starting of a series of articles based on Tree Data Structure in general, this might extend to a list of ten full length articles where we will learn in detail about Tree Data Structure, Types of Trees, How to store Trees in Memory, Variations of Trees, Tree Traversals, N-ary Trees and few real life examples of trees.

Purpose of the Article

This is just an introduction on Tree data structure, we will learn basic concepts on how to calculate some of the attributes of a tree, what is the depth of a tree and what is a tree traversal, what does complete tree means and what is the depth of a node. We will also understand the terminologies like ancestors, parent children, in order, pre order and post order.

What is a tree data structure?

Before understanding tree data structures , we must understand few categories of data structures, they are as below:

  • Linear Data structures.
  • Non-Linear or Hierarchical Data structures.

If the data is stored in a linear fashion, in which we can access the data sequentially then we call it a linear data structure. Few examples could be arrays, linked lists, queues, stacks etc.

Everything other than a linear data structure is called a non-linear data structure, hierarchical data structure can be considered as a sub-set (which is a major portion) of the non-linear data structure. Few examples could be trees, graphs, heaps, tries etc.

We will start with a tree and move to heaps, tries etc in subsequent articles.

The Tree Data Structure

The tree is a data structure which is hierarchical, it is very much similar to the real world trees with a very obvious difference, the tree data structure in computers has its ROOT at the top and leaves flow downwards.

Why do we require a tree data structure?

While learning about arrays or linked lists we discovered that searching an element in a linear data structure is mostly sequential, that means no matter what you do the search will always remain proportional to N that makes the running time θ(N), where N is the number of elements in the list or array. Of, course there are few optimizations but in most of the cases it is tough to achieve a running time less than θ(N).

Trees and other hierarchical data structures are meant to bring down the search time and make the storage running time much smaller than θ(N). This is the basic idea and also the need to implement such data structures.

Moreover, there are few other cases, consider a scenario where you want to store the relation between each of the family members right from great grandfather to kids and grand kids. We can definitely store this in a table like structure but, it might be making a lot of repetition of data all around. Also, while finding all the relations between two members it might be tough in a table and might need access to more than one row.

Hence, there is a need for hierarchical data structure.
tree1

The Basic Hierarchical Data Structure – Tree

In the tree data structure there is one node called ROOT which is at the top and it is considered as the ancestor of all the nodes in the tree, many leaves come out from the root which are considered as the children of the root.

Ancestor

If A is a node which can be called an ancestor of node N1 then A must be closer to the ROOT compared to N1 and when we start tracking downwards towards the leaves we much meet N1 in one or more steps.

There can be more than one ancestors for a given node N1. There is no ancestor for the ROOT.
In the above diagram, P2 is an ancestor of L3 and L4, also ROOT is an ancestor of L3, L4, L5 etc.

Parent

If P is a node which can be called the parent of node N1 then P must be closer to the ROOT compare to N1 and N1 must be linked to P and could be reachable. If we start at P and travel downwards toward the leaves then we must reach N1 in a single step.

There can be one and only one parent for a node N1 which lies immediately above N1. Hence, a parent is also an ancestor.
In the above diagram, P1 is an parent of C1 and C2, also ROOT is the parent of P1, P2 and P3.

Siblings

The nodes N1, N2, N3 etc can be called siblings if N1, N2, N3 have a common parent by the definition of parent above.

Children

Nodes N1 and N2 can be called children of P if P is the parent of N1 and N2. Similarly we can have grand children etc for an ancestor.
In the above diagram, C4 and C5 are children of P3, also L5 is a child of C4.

Depth of a Node

The depth of a node N1 in a tree is one more than the depth of its parent. Hence, we can define the depth of N1 as Depth(N1) = 1 + Depth(P) where P is the parent of N1. The depth of the root is zero by definition. If the nodes N1, N2 and N3 have same depth, they are said to be at the same level.

If there are three children C1, C2 and C3 of a node N1 and let their respective depths be D1 , D2 and D3 then by definition D1 = D2 = D3, which means that all the children of a given node will have the same depth and are considered to be at the same level. We can also define depth of a node N1 as the minimum distance of N1 from ROOT.

You must have notice, this is a recursive definition.

Note: I just would like the reader not to get confused or scared of the word recursion or recursive, I will try my best to make it as simple as possible or we may have a dedicated article on recursion on demand. For now, recursion is just a way of defining a variable of order n in terms of the variable itself with an order less than n. This precisely looks like the equation below:
X(K) = X(k-1) + 4
In such equations the variable X is present in both sides of the equation with the only difference that the parameter of X is reducing each time by a certain factor.
Same is our definition of Depth(N1) where Depth is present on both the sides of the equation.

In the above diagram, depth of C3 is 2 ad depth of L5 is 3

Height of a Node

The height of a node N1 in a tree is one more than the maximum height of all its children. For e.g. If there are three children C1, C2 and C3 they can absolutely have different heights. Lets assume they have heights H1, H2 and H3 respectively then height of N1 would be Height(N1) = 1 + Max(H1, H2, H3) where the Max function gives the maximum of all the parameters passed into it.

Also, we noticed that height of a node N1 depends on the size of the sub-tree it contains and this is a recursive definition as well. If the nodes L1, L2, L3 … Ln is the set of all such nodes for which N1 is an ancestor, and D1, D2, D3 … Dn are the individual distances of N1 from L1, L2, L3 … Ln respectively, then Height(N1) = Max(D1, D2, D3 … Dn).

In the above diagram, height of P2 is 2 and height of P3 is also 2.

Leaf Node

A node L1 can be called a leaf node if its height is zero, which also means it has no children.
In the above diagram, all the green color nodes are leaf nodes because they do not have children.

Terminologies for Trees, based on their number of children

In real world the children of a node in a tree are not uniformly distributed, means we can’t say that all the branches nodes have same number of children. On the contrary, we mostly design a tree data structure with pre-decided count for the maximum number M0 of children a node can have.

When M0 equals to two , it is called a Binary Tree,
for M0 equals to three, it is called a Ternary Tree,
for M0 equals to n, it is called a n-ary Tree

Most of our discussion we have will revolve around the Binary tree and all possible variations of a binary tree.

Complete Trees

We say a tree to be complete only when all its nodes except the leaf nodes have M0 children defined for that tree. That means if we consider a binary tree, then a complete binary tree would be one in which all the nodes except the leaf nodes have two children.

In a complete ternary tree, all the nodes except the leaf nodes will have three children and so on for a n-ary tree.

Total Number of nodes in a Tree

The total number of nodes add up to a sum of all the nodes including the ROOT node and the leaf nodes. To calculate that lets take a complete binary tree of height H.

This means there are H levels in the tree, also if we notice then each level contains two times the nodes in the previous level. The level 0 which is the root level has just one node (the ROOT itself).

So lets keep adding till level H and find out the total number of leaves.
Level 0 = 1
Level 1 = 2 * 1 = 2
Level 2 = 2 * ( 2 * 1 ) = 4
Level 3 = 2 * ( 2 * ( 2 * 1 ) ) = 8

Which means at level k it the number of nodes is 2k and at level H it is 2H. To add all of these nodes, it would be a sum of a geometric series with a common ratio 2 and up to H terms.

Total number of nodes in the tree would be 2H+1 as per the sum of geometric series.

Please Note:
Sum of geometric series a + ar + ar2 + ar3 + ar4 + ar5 + …. arn = a * (rn+1 – 1) / ( r-1) , if r > 1
Sum of geometric series a + ar + ar2 + ar3 + ar4 + ar5 + …. arn = a * (1 – rn+1) / (1 – r) , if r < 1

Similarly we can calculate the total number of nodes in a ternary or a n-ary tree. Moreover, we can only say the maximum number of nodes possible in a tree, we cannot give
the exact number if the tree is not complete.

We can also calculate the height of the tree (i.e. height of the root) if the total number of nodes in a complete binary tree is given using log2N , where N is the total number of nodes in the tree.

Moerover, we can generalise this to find the height of the a n-ary tree as well.

Binary Tree traversal

Traversing a binary tree mostly means visiting each node of the tree, there can be different conditions applied based on certain situation where we can have a restriction of just visiting each node once or may be visiting only the left nodes, you can imagine many more restrictions as well.

Mostly there are three types of traversals as listed below, please see that most of the readers are pretty confused about what these three mean and this is a potential case where we might get things wrong.

Pre Order Traversal

In this traversal the node is visited first, the left child of the node is visited second and the right child of the node is visited at the end.

In order Traversal

In this traversal the left child of a node is visited first, the node itself is visited second and the right child of the node is visited at the end.

Post order Traversal

In this traversal the left child of a node is visited first, the right child of the node is visited second and the node itself is visited at the end.

Point worth noting is that in each of the above traversals the relative order of visiting the left and right child is same, the left is always visited before the right. The only thing which changes is the the order in which we visit the node in consideration.

In the PREORDER traversal the node itself is the first one to be visited, in the INORDER the node is the second one to be visited and in the POSTORDER the node is the last one to be visited.

Conclusion

We conclude this article with a good understanding of the Tree data structure, as I mentioned in the beginning that , it is just an introduction there are many more to come in this series.

Hope this helps, happy reading. Don’t forget to subscribe to get updated articles and notifications.