A dictionary contains mapping information of (key, value) pairs. Given a key, the corresponding value can be found in the dictionary. It’s also one of Python’s built-in types. The (key, value) pairs can be dynamically added, removed, and modified. A dictionary also allows iteration through the keys, values, and (key, value) pairs.
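As a quick illustration, the basic dictionary operations look like this in Python (a minimal sketch; the keys and values are arbitrary examples):

```python
# Create a dictionary of (key, value) pairs
capitals = {"France": "Paris", "Japan": "Tokyo"}

# Add and modify entries dynamically
capitals["Italy"] = "Rome"   # add a new pair
capitals["Japan"] = "Kyoto"  # modify an existing value

# Look up the value corresponding to a key
print(capitals["France"])    # Paris

# Remove a pair
del capitals["Italy"]

# Iterate through the keys, values, and (key, value) pairs
for country in capitals:
    print(country)
for country, city in capitals.items():
    print(country, city)
```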
Data Structures are a specialized means of organizing and storing data in computers in such a way that we can perform operations on the stored data more efficiently. Data structures have a wide and diverse scope of usage across the fields of Computer Science and Software Engineering.
Image by author
Data structures are being used in almost every program or software system that has been developed. Moreover, data structures come under the fundamentals of Computer Science and Software Engineering. It is a key topic when it comes to Software Engineering interview questions. Hence as developers, we must have good knowledge about data structures.
In this article, I will be briefly explaining 8 commonly used data structures every programmer must know.
Traverse: Go through the elements and print them.
Search: Search for an element in the array. You can search for an element by its value or its index.
Update: Update the value of an existing element at a given index.
Inserting elements to an array and deleting elements from an array cannot be done straight away as arrays are fixed in size. If you want to insert an element to an array, first you will have to create a new array with increased size (current size + 1), copy the existing elements and add the new element. The same goes for the deletion with a new array of reduced size.
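The copy-based insertion described above can be sketched as follows (an illustrative example using a Python list as the fixed-size backing store):

```python
def insert_into_array(arr, index, value):
    """Insert by allocating a new array of size len(arr) + 1
    and copying the existing elements across."""
    new_arr = [None] * (len(arr) + 1)
    new_arr[:index] = arr[:index]      # copy the elements before the slot
    new_arr[index] = value             # place the new element
    new_arr[index + 1:] = arr[index:]  # copy the remaining elements
    return new_arr

print(insert_into_array([1, 2, 4], 2, 3))  # [1, 2, 3, 4]
```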
Used as the building blocks to build other data structures such as array lists, heaps, hash tables, vectors and matrices.
Used for different sorting algorithms such as insertion sort, quick sort, bubble sort and merge sort.
A linked list is a sequential structure that consists of a sequence of items in linear order which are linked to each other. Hence, you have to access data sequentially and random access is not possible. Linked lists provide a simple and flexible representation of dynamic sets.
Let’s consider the following terms regarding linked lists. You can get a clear idea by referring to Figure 2.
Elements in a linked list are known as nodes.
Each node contains a key and a pointer to its successor node, known as next.
The attribute named head points to the first element of the linked list.
The last element of the linked list is known as the tail.
Following are the various types of linked lists available.
Singly linked list — Traversal of items can be done in the forward direction only.
Doubly linked list — Traversal of items can be done in both forward and backward directions. Nodes consist of an additional pointer known as prev, pointing to the previous node.
Circular linked lists — Linked lists where the prev pointer of the head points to the tail and the next pointer of the tail points to the head.
Search: Find the first element with the key k in the given linked list by a simple linear search, and return a pointer to this element.
Insert: Insert a key to the linked list. An insertion can be done in 3 different ways; insert at the beginning of the list, insert at the end of the list and insert in the middle of the list.
Delete: Removes an element x from a given linked list. You cannot delete a node by a single step. A deletion can be done in 3 different ways; delete from the beginning of the list, delete from the end of the list and delete from the middle of the list.
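A minimal singly linked list sketch in Python, showing insertion at the beginning and a linear search (the class and method names here are illustrative):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None  # pointer to the successor node

class LinkedList:
    def __init__(self):
        self.head = None  # points to the first element

    def insert_at_beginning(self, key):
        node = Node(key)
        node.next = self.head  # new node points to the old head
        self.head = node       # head now points to the new node

    def search(self, key):
        current = self.head
        while current is not None:  # simple linear traversal
            if current.key == key:
                return current
            current = current.next
        return None

lst = LinkedList()
for k in [3, 2, 1]:
    lst.insert_at_beginning(k)
print(lst.search(2).key)  # 2
```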
Used for symbol table management in compiler design.
Used in switching between programs using Alt + Tab (implemented using Circular Linked List).
A stack is a LIFO (Last In First Out: the element placed last can be accessed first) structure which can be commonly found in many programming languages. This structure is named “stack” because it resembles a real-world stack, such as a stack of plates.
Image by congerdesign from Pixabay
Given below are the 2 basic operations that can be performed on a stack. Please refer to Figure 3 to get a better understanding of the stack operations.
Push: Insert an element on to the top of the stack.
Pop: Delete the topmost element and return it.
Furthermore, the following additional functions are provided for a stack in order to check its status.
Peek: Return the top element of the stack without deleting it.
isEmpty: Check if the stack is empty.
isFull: Check if the stack is full.
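These operations map directly onto a Python list (a minimal sketch; the optional capacity argument is an assumption made so that isFull has something to check):

```python
class Stack:
    def __init__(self, capacity=None):
        self.items = []
        self.capacity = capacity  # optional bound, so is_full is meaningful

    def push(self, item):
        self.items.append(item)  # insert on to the top of the stack

    def pop(self):
        return self.items.pop()  # delete the topmost element and return it

    def peek(self):
        return self.items[-1]    # return the top element without deleting it

    def is_empty(self):
        return len(self.items) == 0

    def is_full(self):
        return self.capacity is not None and len(self.items) >= self.capacity

s = Stack(capacity=3)
s.push(1)
s.push(2)
print(s.peek())  # 2
print(s.pop())   # 2
```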
Used for expression evaluation (e.g.: shunting-yard algorithm for parsing and evaluating mathematical expressions).
Used to implement function calls in recursive programming.
Given below are the 2 basic operations that can be performed on a queue. Please refer to Figure 4 to get a better understanding of the queue operations.
Enqueue: Insert an element to the end of the queue.
Dequeue: Delete the element from the beginning of the queue.
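In Python, collections.deque gives an efficient queue, since it supports removal from the beginning in constant time (a minimal sketch):

```python
from collections import deque

queue = deque()

# Enqueue: insert elements at the end of the queue
queue.append("a")
queue.append("b")
queue.append("c")

# Dequeue: delete the element from the beginning of the queue
first = queue.popleft()
print(first)        # a
print(list(queue))  # ['b', 'c']
```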
Used to manage threads in multithreading.
Used to implement queuing systems (e.g.: priority queues).
A Hash Table is a data structure that stores values which have keys associated with each of them. Furthermore, it supports lookup efficiently if we know the key associated with the value. Hence it is very efficient in inserting and searching, irrespective of the size of the data.
Direct Addressing uses the one-to-one mapping between the values and keys when storing in a table. However, there is a problem with this approach when there is a large number of key-value pairs. The table will be huge with so many records and may be impractical or even impossible to be stored, given the memory available on a typical computer. To avoid this issue we use hash tables.
A special function named as the hash function (h) is used to overcome the aforementioned problem in direct addressing.
In direct addressing, a value with key k is stored in slot k. Using the hash function, we calculate the index of the table (slot) to which each value goes. The value calculated using the hash function for a given key is called the hash value, and it indicates the index of the table to which the value is mapped.
h(k) = k % m
h: Hash function
k: Key of which the hash value should be determined
m: Size of the hash table (number of slots available). A prime value that is not close to an exact power of 2 is a good choice for m.
Consider the hash function h(k) = k % 20, where the size of the hash table is 20. Given a set of keys, we want to calculate the hash value of each to determine the index where it should go in the hash table. Consider we have the following keys, the hash and the hash table index.
1 → 1%20 → 1
5 → 5%20 → 5
23 → 23%20 → 3
63 → 63%20 → 3
From the last two examples given above, we can see that a collision can arise when the hash function generates the same index for more than one key. We can resolve collisions by selecting a suitable hash function h and using techniques such as chaining and open addressing.
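A minimal hash table with chaining, using the same h(k) = k % m as above (an illustrative sketch, not a production implementation):

```python
class HashTable:
    def __init__(self, m=20):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain (list) per slot

    def _hash(self, key):
        return key % self.m                  # h(k) = k % m

    def insert(self, key, value):
        chain = self.slots[self._hash(key)]
        for pair in chain:
            if pair[0] == key:               # key already present: update it
                pair[1] = value
                return
        chain.append([key, value])           # colliding keys share the chain

    def search(self, key):
        for k, v in self.slots[self._hash(key)]:
            if k == key:
                return v
        return None

table = HashTable(m=20)
table.insert(23, "x")
table.insert(63, "y")    # 23 and 63 both hash to index 3, so they chain
print(table.search(63))  # y
```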
Used to implement database indexes.
Used to implement associative arrays.
Used to implement the “set” data structure.
A tree is a hierarchical structure where data is organized hierarchically and linked together. This structure is different from a linked list, where items are linked in a linear order.
Various types of trees have been developed throughout the past decades, in order to suit certain applications and meet certain constraints. Some examples are binary search tree, B tree, treap, red-black tree, splay tree, AVL tree and n-ary tree.
A binary search tree (BST), as the name suggests, is a binary tree where data is organized in a hierarchical structure. This data structure stores values in sorted order.
Every node in a binary search tree comprises the following attributes.
key: The value stored in the node.
left: The pointer to the left child.
right: The pointer to the right child.
p: The pointer to the parent node.
A binary search tree exhibits a unique property that distinguishes it from other trees. This property is known as the binary-search-tree property.
Let x be a node in a binary search tree.
If y is a node in the left subtree of x, then y.key ≤ x.key
If y is a node in the right subtree of x, then y.key ≥ x.key
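A minimal binary search tree sketch showing insertion and search under the binary-search-tree property (names are illustrative; the parent pointer p is omitted here for brevity):

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key <= root.key:
        root.left = insert(root.left, key)    # smaller keys go left
    else:
        root.right = insert(root.right, key)  # larger keys go right
    return root

def search(root, key):
    while root is not None and root.key != key:
        # the BST property tells us which subtree to descend into
        root = root.left if key < root.key else root.right
    return root

root = None
for k in [8, 3, 6, 10]:
    root = insert(root, k)
print(search(root, 6).key)  # 6
```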
Binary Trees: Used to implement expression parsers and expression solvers.
Binary Search Tree: used in many search applications where data are constantly entering and leaving.
Heaps: used by JVM (Java Virtual Machine) to store Java objects.
Treaps: used in wireless networking.
A Heap is a special case of a binary tree where the values of parent nodes are compared with those of their children and the nodes are arranged accordingly.
Heaps can be of 2 types.
Min Heap — the key of the parent is less than or equal to those of its children. This is called the min-heap property. The root will contain the minimum value of the heap.
Max Heap — the key of the parent is greater than or equal to those of its children. This is called the max-heap property. The root will contain the maximum value of the heap.
Used in heapsort algorithm.
Used to implement priority queues as the priority values can be ordered according to the heap property where the heap can be implemented using an array.
Queue functions can be implemented using heaps within O(log n) time.
Used to find the kᵗʰ smallest (or largest) value in a given array.
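Python’s heapq module implements a min-heap on top of a list, and it also makes finding the kᵗʰ smallest value straightforward (a minimal sketch):

```python
import heapq

values = [9, 4, 7, 1, 3]

heap = values[:]
heapq.heapify(heap)  # rearrange the list to satisfy the min-heap property
print(heap[0])       # 1, the root holds the minimum value

heapq.heappush(heap, 0)         # insert while preserving the heap property
smallest = heapq.heappop(heap)  # remove and return the minimum (0)

# kth smallest value (here k = 3) via nsmallest
k = 3
kth = heapq.nsmallest(k, values)[-1]
print(kth)  # 4
```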
A graph consists of a finite set of vertices or nodes and a set of edges connecting these vertices.
The order of a graph is the number of vertices in the graph. The size of a graph is the number of edges in the graph.
Two nodes are said to be adjacent if they are connected to each other by an edge.
A graph G is said to be a directed graph if all its edges have a direction indicating what is the start vertex and what is the end vertex.
We say that (u, v) is incident from or leaves vertex u and is incident to or enters vertex v.
Self-loops: Edges from a vertex to itself.
A graph G is said to be an undirected graph if all its edges have no direction. It can go in both ways between the two vertices.
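A common way to represent a graph in code is an adjacency list: a dictionary mapping each vertex to the set of vertices adjacent to it (a minimal sketch; for an undirected graph each edge is recorded in both directions):

```python
# Undirected graph as an adjacency list
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

order = len(graph)  # order: the number of vertices
# size: the number of edges; summing degrees counts each edge twice
size = sum(len(adj) for adj in graph.values()) // 2

print(order, size)  # 4 4
```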
You can read more about graph algorithms from my article 10 Graph Algorithms Visually Explained.10 Graph Algorithms Visually ExplainedA quick introduction to 10 basic graph algorithms with examples and visualisationsmedium.com
Used to represent social media networks. Each user is a vertex, and when users connect they create an edge.
Used to represent web pages and links by search engines. Web pages on the internet are linked to each other by hyperlinks. Each page is a vertex and the hyperlink between two pages is an edge. Used for Page Ranking in Google.
Used to represent locations and routes in GPS. Locations are vertices and the routes connecting locations are edges. Used to calculate the shortest route between two locations.
An array is a structure of fixed size, which can hold items of the same data type. It can be an array of integers, an array of floating-point numbers, an array of strings or even an array of arrays (such as 2-dimensional arrays). Arrays are indexed, meaning that random access is possible.
Fig 1. Visualization of Basic Terminology of Arrays (Image by author)
Fig 2. Visualization of basic Terminology of Linked Lists (Image by author)
Fig 3. Visualization of basic Operations of Stacks (Image by author)
A queue is a FIFO (First In First Out: the element placed first can be accessed first) structure which can be commonly found in many programming languages. This structure is named “queue” because it resembles a real-world queue: people waiting in line.
Image by Sabine Felidae from Pixabay
Fig 4. Visualization of Basic Operations of Queues (Image by author)
Fig 5. Representation of a Hash Function (Image by author)
Fig 6. Visualization of Basic Terminology of Trees (Image by author)
Let us see how we can represent heaps. Heaps can be represented using trees as well as arrays. Figures 7 and 8 show how we can represent a binary heap using a binary tree and an array.
Fig 7. Binary Tree Representation of a Heap (Image by author)
Fig 8. Array Representation of a Heap (Image by author)
If a vertex is not connected to any other node in the graph, it is said to be isolated.
Fig 9. Visualization of Terminology of Graphs (Image by author)
Linear data structures store elements in a sequential manner. These are the most fundamental and widely used data structures.
Array/List – An array or list is a linear data structure where elements are stored in a sequential manner, numbered from 0 to n-1, where n is the size of the array. The array elements can be accessed using their index.
Stacks – A stack is a linear data structure that stores data in a Last In, First Out (LIFO) manner.
Queues – A data structure that stores data in a First In, First Out (FIFO) manner.
Linked Lists – A linear data structure that stores elements sequentially but cannot be accessed directly using an index. It consists of links to the next item, along with the data.
Non-linear data structures are more complex data structures that are not sequential in nature.
Trees – Trees store data in a tree-like manner. They consist of a root, which contains the data and links to child nodes. The child nodes may in turn contain more children. They are the building blocks of many other data structures.
Heaps – A heap is a tree, which is used as a priority queue. There are max-heaps, and min-heaps that contain the maximum value and the minimum value at the root node, respectively.
Hash Tables – Hash Tables are data structures that contain key-value pairs. They use the concept of hashing to determine the keys in the tables. Usually used for quick access of values.
Graphs – Graphs are complex data structures that implement a collection of nodes (vertices) connected together by edges.
Tuples – They are sequential data structures, that are similar to lists, but they’re immutable.
Dictionary – Dictionaries are Python-specific data structures that are similar to hash tables, and are used for quick access of values.
Sets – A collection of unordered distinct elements, that have quick access time.
Collections – Collections are a group of optimized implementations for data structures like dictionaries, maps, tuples, queues.
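A quick tour of these Python-specific structures and the collections module (a minimal sketch):

```python
from collections import Counter, defaultdict, namedtuple

point = (3, 4)                   # tuple: sequential but immutable
unique = {1, 2, 2, 3}            # set: unordered distinct elements
ages = {"alice": 30, "bob": 25}  # dictionary: quick access by key

counts = Counter("mississippi")  # optimized dict subclass for counting
groups = defaultdict(list)       # dict that creates missing values on demand
groups["evens"].append(2)
Point = namedtuple("Point", ["x", "y"])  # tuple with named fields

print(unique)        # {1, 2, 3}
print(counts["s"])   # 4
print(Point(1, 2).x) # 1
```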
Now that we have an overview of data structures, let’s delve into the nuances of each of them in the upcoming articles of the series. Here are some resources to get you started with interview and programming preparation.
There are numerous platforms like LeetCode, HackerRank, InterviewBit, TopCoder, HackerEarth, CodeChef, CodeForces, SPOJ, etc. to learn programming. Out of these, after a lot of trial and error I felt that a combination of HackerRank and LeetCode worked really well for preparing for interviews.
HackerRank
HackerRank is amazing because it has customized tracks for each programming language, and one each for data structures and algorithms. There are also customized filters for each DS type like Stack, Queue, etc.
LeetCode
LeetCode is an excellent website for company-specific preparation. If you pay for LeetCode Premium, you get company-specific questions. This is great for preparing for companies like Facebook, Microsoft, Netflix, Amazon, Apple, Google, etc. They also have contests and real-time mock interviews that boost your confidence and skills.
In my personal experience, implementing the data structures from scratch helps a lot. It helps you to understand the data structure in detail and helps in debugging, in case you run into any errors. Python offers built-in implementations for many data structures, but it is better to learn how to implement them at least once.
In this post we will be looking briefly at, and at a high-level, the various data types and data structures used in designing software systems, and from which specific types of algorithms can subsequently be built upon and optimized for.
There are many data structures, and even the ones that are covered here have many nuances that make it impossible to cover every possible detail. But my hope is that this will give you an interest to research them further.
A data type is an attribute of data which tells the compiler (or interpreter) how the programmer intends to use the data.
Scalar: basic building block (boolean, integer, float, char etc.)
Composite: any data type (struct, array, string etc.) composed of scalars or composite types (also referred to as a ‘compound’ type).
Abstract: data type that is defined by its behaviour (tuple, set, stack, queue, graph etc).
NOTE: You might also have heard of a ‘primitive’ type, which is sometimes confused with the ‘scalar’ type. A primitive is typically used to represent a ‘value type’ (e.g. pass-by-value semantics) and this contrasts with ‘reference types’ (e.g. pass-by-reference semantics).
If we consider a composite type, such as a ‘string’, it describes a data structure which contains a sequence of char scalars (characters), and as such is referred to as being a ‘composite’ type. Whereas the underlying implementation of the string composite type is typically implemented using an array data structure (we’ll cover data structures shortly).
Note: in a language like C the length of the string’s underlying array will be the number of characters in the string plus one extra slot for the ‘null terminator’.
An abstract data type (ADT) describes the expected behaviour associated with a concrete data structure. For example, a ‘list’ is an abstract data type which represents a countable number of ordered values, but again the implementation of such a data type could be implemented using a variety of different data structures, one being a ‘linked list’.
Note: an ADT describes behaviour from the perspective of a consumer of that type (e.g. it describes certain operations that can be performed on the data itself). For example, a list data type can be considered a sequence of values and so one available operation/behaviour would be that it must be iterable.
A data structure is a collection of data type ‘values’ which are stored and organized in such a way that it allows for efficient access and modification. In some cases a data structure can become the underlying implementation for a particular data type.
For example, composite data types are data structures that are composed of scalar data types and/or other composite types, whereas an abstract data type will define a set of behaviours (almost like an ‘interface’ in a sense) for which a particular data structure can be used as the concrete implementation for that data type.
When we think of data structures, there are generally four forms:
Linear: arrays, lists
Tree: binary, heaps, space partitioning etc.
Hash: distributed hash table, hash tree etc.
Graphs: decision, directed, acyclic etc.
Note: for a more complete reference, please see this Wikipedia article.
Let’s now take a look at the properties that make up a few of the more well known data structures.
An array is a finite group of data, which is allocated contiguous (i.e. sharing a common border) memory locations, and each element within the array is accessed via an index key (typically numerical, and zero based).
The name assigned to an array is typically a pointer to the first item in the array. Meaning that given an array identifier of arr which was assigned the value ["a", "b", "c"], in order to access the "b" element you would use the index 1 to look up the value: arr[1].
Arrays are traditionally ‘finite’ in size, meaning you define their length/size (i.e. memory capacity) up front, but there is a concept known as ‘dynamic arrays’ (and of which you’re likely more familiar with when dealing with certain high-level programmings languages) which supports the growing (or resizing) of an array to allow for more elements to be added to it.
In order to resize an array you first need to allocate a new slot of memory (in order to copy the original array element values over to), and because this type of operation is quite ‘expensive’ (in terms of computation and performance) you need to be sure you increase the memory capacity just the right amount (typically double the original size) to allow for more elements to be added at a later time without causing the CPU to have to resize the array over and over again unnecessarily.
One consideration that needs to be given is that you don’t want the resized memory space to be too large, otherwise finding an appropriate slot of memory becomes more tricky.
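The doubling strategy described above can be sketched as follows (illustrative Python; a real dynamic array would manage raw memory rather than a Python list):

```python
class DynamicArray:
    def __init__(self):
        self.capacity = 1
        self.length = 0
        self.data = [None] * self.capacity

    def append(self, value):
        if self.length == self.capacity:
            self._resize(self.capacity * 2)  # double rather than grow by 1
        self.data[self.length] = value
        self.length += 1

    def _resize(self, new_capacity):
        new_data = [None] * new_capacity  # allocate a new slot of memory
        for i in range(self.length):      # copy the original values over
            new_data[i] = self.data[i]
        self.data = new_data
        self.capacity = new_capacity

arr = DynamicArray()
for i in range(5):
    arr.append(i)
print(arr.length, arr.capacity)  # 5 8
```

Because the capacity doubles each time it runs out, the expensive copy happens rarely, and appends are cheap on average (amortized constant time).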
When dealing with modifying arrays you also need to be careful because this requires significant overhead due to the way arrays are allocated memory slots.
So if you imagine you have an array and you want to remove an element from the middle of the array, try to think about that in terms of memory allocation: an array needs its indexes to be contiguous, and so we have to re-allocate a new chunk of memory and copy over the elements that were placed around the deleted element.
These types of operations, when done at scale, are the foundation behind why it’s important to have an understanding of how data structures are implemented. The reason being, when you’re writing an algorithm you will hopefully be able to recognize when you’re about to do something (let’s say modify an array many times within a loop construct) that could ultimately end up being quite a memory intensive set of operations.
Note: interestingly I’ve discovered that in some languages an array (as in the composite data type) has been implemented using a variety of different data structures such as hash table, linked list, and even a search tree.
A linked list is different to an array in that the order of the elements within the list is not determined by a contiguous memory allocation. Instead, the elements of the linked list can be sporadically placed in memory, because each element of the list (also referred to as a ‘node’) consists of two parts:
the data
a pointer
The data is what you’ve assigned to that element/node, whereas the pointer is a memory address reference to the next node in the list.
Also unlike an array, there is no index access. So in order to locate a specific piece of data you’ll need to traverse the entire list until you find the data you’re looking for.
This is one of the key performance characteristics of a linked list, and is why (for most implementations of this data structure) you’re not able to append data to the list (because if you think about the performance of such an operation it would require you to traverse the entire list to find the end/last node). Instead linked lists generally will only allow prepending to a list as it’s much quicker. The newly added node will then have its pointer set to the original ‘head’ of the list.
There is also a modified version of this data structure referred to as a ‘doubly linked list’ which is essentially the same concept but with the exception of a third attribute for each node: a pointer to the previous node (whereas a normal linked list would only have a pointer to the following node).
Note: again, performance considerations need to be given for the types of operations being made with a doubly linked list, such as the addition or removal of nodes in the list, because you now have not only the pointers to the following node that need to be updated, but also the pointers back to a previous node that now also need to be updated.
The concept of a ‘tree’ in its simplest terms is to represent a hierarchical tree structure, with a root value and subtrees of children (with a parent node), represented as a set of linked nodes.
A tree contains “nodes” (a node has a value associated with it) and each node is connected by a line called an “edge”. These lines represent the relationship between the nodes.
The top level node is known as the “root” and a node with no children is a “leaf”. If a node is connected to other nodes, then the preceding node is referred to as the “parent”, and nodes following it are “child” nodes.
There are various incarnations of the basic tree structure, each with their own unique characteristics and performance considerations:
Binary Tree
Binary Search Tree
Red-Black Tree
B-tree
Weight-balanced Tree
Heap
Abstract Syntax Tree
A binary tree is a ‘rooted tree’ and consists of nodes which have, at most, two children. This is as the name suggests (i.e. ‘binary’: 0 or 1), so two potential values/directions.
Rooted trees suggest a notion of distance (i.e. distance from the ‘root’ node)
Note: in some cases you might refer to a binary tree as an ‘undirected’ graph (we’ll look at graphs shortly) if talking in the context of graph theory or mathematics.
Binary trees are the building blocks of other tree data structures (see also: this reference for more details), and so when it comes to the performance of certain operations (insertion, deletion etc) consideration needs to be given to the number of ‘hops’ that need to be made as well as the re-balancing of the tree (much the same way as the pointers for a linked list need to be updated).
A binary search tree is a ‘sorted’ tree, and is named as such because it helps to support the use of a ‘binary search’ algorithm for searching more efficiently for a particular node (more on that later).
To understand the idea of the nodes being ‘sorted’ (or ‘ordered’) we need to compare the left node with the right node. The left node should always be a lesser number than the right node, and the parent node should be the decider as to whether a child node is placed to the left or the right.
Consider the example image above, where we can see the root node is 8. Let’s imagine we’re going to construct this tree.
We start with 8 as the root node and then we’re given the number 3 to insert into the tree. At this point the underlying logic for constructing the tree will know that the number 3 is less than 8, and so it’ll first check to see if there is already a left node (there isn’t). In this scenario the logic will determine that the tree should have a new left node under 8 and assign it the value of 3.
Now if we give the number 6 to be inserted, the logic will find that again it is less than 8 and so it’ll check for a left node. There is a left node (it has a value of 3), and the value 6 is greater than 3. This means the logic will now check to see if there is a right node under 3 (there isn’t) and subsequently creates a new right node and assigns it the value 6.
This process continues on and on until the tree has been provided all of the relevant numbers to be sorted.
In essence what this sorted tree design facilitates is the means for an operation (such as lookup, insertion, deletion) to only take, on average, time proportional to the logarithm of the number of items stored in the tree.
So if there were roughly 1000 nodes in the tree, and we wanted to find a specific node, then the average case number of comparisons (i.e. comparing left/right nodes) would be about 10.
By using the logarithm we can see why: log2(1024) = 10, which is the inverse of the exponentiation 2^10 = 1024 (“2 raised to the power of 10”). So this says we’ll execute about 10 comparisons before finding the node we were after.
To break that down a bit further: the exponentiation calculation is 1024 = 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 2^10, so the “logarithm to base 2” of 1024 is 10.
The logarithm (i.e. the inverse function of exponentiation) of the number of nodes to base 2, abstracted to n, is denoted as log2(n), but typically the base 2 is omitted to give just log(n).
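We can verify this relationship with a couple of lines of Python (math.log2 is the base-2 logarithm):

```python
import math

# 2 raised to the power of 10 is 1024...
assert 2 ** 10 == 1024

# ...so the base-2 logarithm of 1024 is 10
print(math.log2(1024))  # 10.0

# For 1000 nodes, the average number of comparisons is about log2(1000)
print(math.log2(1000))  # just under 10
```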
When determining the ‘time complexity’ for operations on this type of data structure we typically use ‘Big O’ notation, and thus the Big O complexity would be defined as O(log n) for the average search case (which is good), but the worst case for searching would still be O(n) linear time (which is bad, and I’ll explain why in the next section on red-black trees).
Note: I’ve covered the basics of logarithm and binary search in a much older post about Big O notation, and so I’ll refer you to that for more details.
Similarly when considering complexity for a particular algorithm, we should take into account both ‘time’ and ‘space’ complexity. The latter is the amount of memory necessary for the algorithm to execute and is similar to time complexity in that we’re interested in how that resource (time vs space) will change and affect the performance depending on the size of the input.
The performance of a binary search tree is dependent on the height of the tree. Meaning we should aim to keep the tree as ‘balanced’ as possible, otherwise the logarithmic performance is lost in favor of linear time.
To understand why that is, consider data that is already in sorted order (each value larger than the last) stored in an array:
If we construct a binary search tree from this data, what we would ultimately end up with is a very ‘unbalanced’ tree in the sense that all the nodes would be to the right, and none to the left.
When we search this type of tree (which for all purposes is effectively a linked list) we would, worst case, end up with linear time complexity: O(n). To resolve that problem we need a way to balance the nodes in the tree.
This is where the concept of a red-black tree comes in to help us. With a red-black tree (due to it being consistently balanced) we get O(log n) for search/insert/delete operations (which is great).
Let’s consider the properties of a red-black tree:
Each node is either red or black.
The root node is always black.
All leaves are ‘NIL’ and should also be black.
All red nodes should have two black child nodes.
All paths from a given node to NIL must contain the same number of black nodes.
New nodes should be red by default (we’ll clarify below).
Note: when counting nodes we don’t include the root node, and we count each black node up to (and including) the NIL node.
The height of the tree is referred to as its ‘black-height’, which is the number of black nodes (not including the root) to the furthest leaf, and should be no longer than twice as long as the length of the shortest path (the nearest NIL).
These properties are what enable the red-black tree to provide the performance characteristics it has (i.e. O(log n)), and so whenever changes are made to the tree we want to aim to keep the tree height as short as possible.
On every node insertion, or deletion, we need to ensure we have not violated the red-black properties. If we do, then there are two possible steps that we have to consider in order to keep the tree appropriately balanced (which we’ll check in this order):
Recolour the node in the case of a red node no longer having two black child nodes.
Make a rotation (left/right) in the case where recolouring then requires a structural change.
The goal of a rotation is to decrease the height of the tree. The way we do this is by moving larger subtrees up the tree, and smaller subtrees down the tree. We rotate in the direction of the smaller subtree, so if the smaller side is the right side we’ll do a right rotation.
Note: there is an inconsistency between what node/subtree is affected by a rotation. Does the subtree being moved into the parent position indicate the direction or does the target node affected by the newly moved subtree indicate the direction (I’ve opted for the latter, as we’ll see below, but be aware of this when reading research material).
In essence there are three steps that need to be applied to the target node (T) being rotated, and this is the same for either a left rotation or a right rotation. Let’s quickly look at both of these rotation movements:
Left Rotation:
T’s right node (R) is unset & becomes T’s parent †
R’s original left node (L) is now orphaned.
T’s right node is now set to L.
† we now find R’s left pointer has to be set to T (in order for it to become the parent node), meaning R’s original left pointer is orphaned.
Right Rotation:
T’s left node (L) is unset & becomes T’s parent †
L’s original right node (R) is now orphaned.
T’s left node is now set to R.
† we now find L’s right pointer has to be set to T (in order for it to become the parent node), meaning L’s original right pointer is orphaned.
Let’s now visualize the movements for both rotations:
Left Rotation
Right Rotation
Note: rotations are confusing, so I recommend watching this short video for some examples and pseudo-code.
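The two rotation movements above can be sketched in a few lines of Python. The Node class and function names here are illustrative, not from any particular library:

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def rotate_left(t):
    """Rotate the subtree rooted at t to the left; return the new subtree root."""
    r = t.right       # T's right node (R) will become T's parent
    t.right = r.left  # R's orphaned left subtree becomes T's right node
    r.left = t        # R's left pointer is set to T
    return r

def rotate_right(t):
    """Mirror image of rotate_left."""
    l = t.left        # T's left node (L) will become T's parent
    t.left = l.right  # L's orphaned right subtree becomes T's left node
    l.right = t       # L's right pointer is set to T
    return l

# A right-heavy chain 1 -> 2 -> 3 becomes balanced after a left rotation.
root = rotate_left(Node(1, right=Node(2, right=Node(3))))
print(root.value, root.left.value, root.right.value)  # 2 1 3
```

A real red-black tree would also update parent pointers and colours; this sketch only shows the structural movement.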
A B-tree is a sorted tree that is very similar in essence to a red-black tree in that it is self-balancing and as such can guarantee logarithmic time for search/insert/delete operations.
A B-tree is useful for large read/writes of data and is commonly used in the design of databases and file systems, but it’s important to note that a B-tree is not a binary search tree because it allows more than two child nodes.
The reasoning for allowing multiple children for a node is to ensure the height of the tree is kept as small as possible. The rationale is that B-trees are designed for handling huge amounts of data which itself cannot exist in-memory, and so that data is pulled (in chunks) from external sources.
This type of I/O is expensive and so keeping the tree ‘fat’ (i.e. to have a very short height instead of lots of node subtrees creating extra length) helps to reduce the amount of disk access.
The design of a B-tree means that all nodes allow a set range for its children but not all nodes will need the full range, meaning that there is a potential for wasted space.
Note: there are also variants of the B-tree, such as B+ trees and B* trees (which we’ll leave as a research exercise for the reader).
A weight-balanced tree is a form of binary search tree and is similar in spirit to a weighted graph, in that individual nodes are ‘weighted’ to indicate the more likely successful route with regards to searching for a particular value.
Search performance is the driving motivation for using this data structure, which is typically used for implementing sets and dynamic dictionaries.
A binary heap tree is a binary tree, not a binary search tree, and so it’s not a sorted tree. It has some additional properties that we’ll look at in a moment, but in essence the purpose of this data structure is primarily to be used as the underlying implementation for a priority queue.
The additional properties associated with a binary heap are:
heap property: the node value is either greater (or lesser depending on the direction of the heap) or equal to the value of its parent.
shape property: the tree is complete; if the last level of the tree is incomplete, its nodes are filled from left to right.
The insertion and deletion operations yield a time complexity of O(log n).
Below are some examples of a max and min binary heap tree structure.
Max Heap:
Min Heap:
A hash table is a data structure which is capable of mapping ‘keys’ to ‘values’, and you’ll typically find this is abstracted and enhanced with additional behaviours by many high-level programming languages such that they behave like an ‘associative array’ abstract data type.
In Python it’s called a ‘dictionary’ and has the following structure (on top of which are functions such as del, get and pop etc. that can manipulate the underlying data):
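As a quick illustration of those operations (the data here is made up):

```python
# Common operations on a Python dictionary (the built-in hash table).
inventory = {'apples': 10, 'oranges': 5}

inventory['pears'] = 3              # insert a new (key, value) pair
inventory['apples'] = 12            # update an existing value
print(inventory.get('kiwis', 0))    # 0 -- safe lookup with a default
removed = inventory.pop('oranges')  # remove a key and return its value
del inventory['pears']              # remove a key without returning it

print(removed)    # 5
print(inventory)  # {'apples': 12}
```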
The keys for the hash table are determined by a hash function, but implementers need to be mindful of hash ‘collisions’, which occur when the hash function fails to produce a distinct or unique key for each item in the table.
The better the hash generation, the more distributed the keys will be, and thus less likely to collide. Also the size of the underlying array data structure needs to accommodate the type of hash function used for the key generation.
For example, if using modular arithmetic you might find the array needs to be sized to a prime number.
There are many techniques for resolving hashing collisions, but here are two that I’ve encountered:
Separate Chaining
Linear Probing
With this option our keys will contain a nested data structure, and we’ll use a technique for storing our conflicting values into this nested structure, allowing us to store the same hashed value key in the top level of the array.
With this option when a collision is found, the hash table will check to see if the next available index is empty, and if so it’ll place the data into that next index.
The rationale behind this technique is that because the hash table keys are typically quite distributed (e.g. they’re rarely sequential 0, 1, 2, 3, 4), it’s likely that you’ll have many empty elements, and you can use that empty space to store your colliding data.
Note: Linear Probing is suggested over Separate Chaining if your data structure is expected to be quite large.
Personally I don’t like the idea of the Linear Probing technique, as it feels like it’ll introduce more complexity and bugs. There is also a problem with this technique: it relies on the top-level data structure being an array. That’s fine if the key we’re constructing is numerical, but if you want strings for your keys then it won’t work very well, and you’ll need to be clever with how you implement this.
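A minimal sketch of the Separate Chaining approach, assuming a fixed table size and Python’s built-in hash function (the class and method names are my own):

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table (illustrative, not production)."""

    def __init__(self, size=11):
        self.size = size
        # Each slot holds a nested list of (key, value) pairs -- the "chain".
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # otherwise chain onto the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put('cat', 1)
table.put('dog', 2)
print(table.get('cat'))  # 1
```

Colliding keys simply end up in the same bucket’s chain, at the cost of a short linear scan within that bucket.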
A graph is an abstract data type intended to guide the implementation of a data structure following the principles of graph theory.
The data structure itself is non-linear and it consists of:
nodes: points on the graph (also known as ‘vertices’).
edges: lines connecting each node.
Note: an ‘undirected’ graph simply has no arrow heads, so the flow between nodes can go in either direction.
Some graphs are ‘weighted’ which means each ‘edge’ has a numerical attribute assigned to them. These weights can indicate a stronger preference for a particular flow of direction.
Graphs are used for representing networks (both real and electronic), such as streets on a map or friends on Facebook.
When it comes to searching a graph, there are two methods:
Breadth First Search: look at siblings.
Depth First Search: look at children.
Which approach you choose depends on the type of values you’re searching for. For example, relationship across fields would lend itself to BFS, whereas hierarchical tree searches would be better suited to DFS.
Book: Problem Solving with Algorithms and Data Structures using Python
For those who need to study the fundamental data structures and algorithms, I highly recommend going over the above textbook thoroughly first, and then coming back to the following content, or practicing on Leetcode or another platform.
Basic data structures:
Array
Linked List
Stack
Queue
Hash Table
Tree
Graph
Common Algorithm Types:
Brute Force
Search and Sort
Recursive
Backtracking
Dynamic Programming
Divide and Conquer
Greedy
Branch and Bound
Big O Notations:
It is critical that you understand and are able to calculate the Big O for the code you wrote.
The order of magnitude function describes the part of T(n) that increases the fastest as the value of n increases. Order of magnitude is often called Big-O notation (for “order”) and written as O(f(n)).
Basically, Big O measures how the number of basic operations (such as assignment statements) grows as the input size increases.
1.1 Array
An array (in Python it’s called a list) is a collection of items where each item holds a relative position with respect to the others.
1.2 Linked List
Similar to an array, but requires O(n) time on average to visit an element by index.
A linked list utilizes memory better than an array, since it can use discrete memory locations, whereas an array must use a contiguous block of memory.
1.3 Stack
Stacks are fundamentally important, as they can be used to reverse the order of items.
The order of insertion is the reverse of the order of removal.
Stacks maintain a FILO (first in, last out) ordering property.
In Python, calling pop on the end of the list takes O(1), but calling pop on the first element (or anywhere in the middle) is O(n).
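A quick illustration using a plain Python list as a stack, pushing and popping at the same end (the tail), where both operations are O(1):

```python
stack = []
stack.append('a')  # push
stack.append('b')
stack.append('c')

print(stack.pop())  # 'c' -- first in, last out
print(stack.pop())  # 'b' -- removal order reverses insertion order
print(stack)        # ['a']
```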
1.3.1 Arithmetic Expressions
Infix: the operator is in between the two operands that it is working on (i.e. A+B)
Fully Parenthesized expression: uses one pair of parentheses for each operator. (i.e. ((A + (B * C)) + D))
Prefix: all operators precede the two operands that they work on (i.e. +AB)
Postfix: operators come after the corresponding operands (i.e. AB+)
NOTE:
Only infix notation requires parentheses to determine precedence
The order of operations within prefix and postfix expressions is completely determined by the position of the operator and nothing else
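Because operator position fully determines the order of operations in postfix notation, a postfix expression can be evaluated with a simple stack. A minimal sketch (space-separated tokens, binary operators only):

```python
def evaluate_postfix(expression):
    """Evaluate a space-separated postfix expression using a stack."""
    stack = []
    for token in expression.split():
        if token in ('+', '-', '*', '/'):
            right = stack.pop()  # operands come off the stack in reverse order
            left = stack.pop()
            if token == '+':
                stack.append(left + right)
            elif token == '-':
                stack.append(left - right)
            elif token == '*':
                stack.append(left * right)
            else:
                stack.append(left / right)
        else:
            stack.append(float(token))  # operand: push it
    return stack.pop()

print(evaluate_postfix('3 4 5 * +'))  # 23.0  (infix: 3 + 4 * 5)
```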
1.4 Queue
A queue is structured as an ordered collection of items which are added at one end, called the “rear,” and removed from the other end, called the “front.”
Queues maintain a FIFO ordering property.
A deque, also known as a double-ended queue, is an ordered collection of items similar to the queue.
It has two ends, a front and a rear, and the items remain positioned in the collection.
New items can be added at either the front or the rear.
Likewise, existing items can be removed from either end.
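Python’s collections.deque implements exactly this double-ended behaviour, with O(1) appends and pops at both ends (unlike a plain list, where operations at the front are O(n)). A quick illustration:

```python
from collections import deque

queue = deque()
queue.append('a')      # add at the rear
queue.append('b')
queue.appendleft('z')  # a deque also allows adding at the front

print(queue.popleft())  # 'z' -- remove from the front (FIFO for plain queues)
print(queue.popleft())  # 'a'
print(queue.pop())      # 'b' -- remove from the rear
```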
1.5 Hash Table
A hash table is a collection of items which are stored in such a way as to make it easy to find them later.
Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0.
The mapping between an item and the slot where that item belongs in the hash table is called the hash function.
Remainder method takes an item and divides it by the table size, returning the remainder as its hash value (i.e. h(item) = item % 11)
load factor is the number of items divided by the table size
collision refers to the situation where multiple items have the same hash value
folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value.
mid-square method first squares the item, and then extracts some portion of the resulting digits. For example, 44^2 = 1936; extract the middle two digits 93, then perform the remainder step (93 % 11 = 5).
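The three hash methods above can be sketched as follows, with table size 11 to match the examples (the function names are my own):

```python
def remainder_hash(item, table_size=11):
    # h(item) = item % table_size
    return item % table_size

def folding_hash(digits, table_size=11):
    # Split the digit string into 2-digit pieces, add them, take the remainder.
    pieces = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
    return sum(pieces) % table_size

def mid_square_hash(item, table_size=11):
    # Square the item, extract the middle two digits, take the remainder.
    squared = str(item ** 2)
    middle = len(squared) // 2
    return int(squared[middle - 1:middle + 1]) % table_size

print(remainder_hash(44))           # 0  (44 % 11)
print(folding_hash('4365554601'))   # 1  (43+65+55+46+01 = 210; 210 % 11)
print(mid_square_hash(44))          # 5  (44^2 = 1936 -> 93 -> 93 % 11)
```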
Collision Resolution is the process of systematically placing the second item in the hash table when two items hash to the same slot.
Open addressing (linear probing): sequentially find the next open slot or address in the hash table
A disadvantage to linear probing is the tendency for clustering; items become clustered in the table.
Rehashing is one way to deal with clustering, which is to skip the slot when looking sequentially for the next open slot, thereby more evenly distributing the items that have caused collisions.
Quadratic probing: instead of using a constant “skip” value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is h, the successive values are h+1, h+4, h+9, h+16, and so on.
Chaining allows many items to exist at the same location in the hash table.
When collisions happen, the item is still placed in the proper slot of the hash table.
The initial size for the hash table has to be a prime number so that the collision resolution algorithm can be as efficient as possible.
1.6 Trees
A tree data structure has its root at the top and its leaves on the bottom.
Three properties of tree:
we start at the top of the tree and follow a path made of circles and arrows all the way to the bottom.
all of the children of one node are independent of the children of another node.
each leaf node is unique.
binary tree: each node in the tree has a maximum of two children.
A balanced binary tree has roughly the same number of nodes in the left and right subtrees of the root.
1.6.1 Tree Traversal: access the nodes of the tree
Tree traversal is the foundation of all tree related problems.
Here are a few different ways to traverse a tree:
BFS: Level-order
DFS: Pre-order, In-order, Post-order
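These four traversals can be sketched as follows, on an illustrative minimal Node class:

```python
from collections import deque

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def preorder(node):   # DFS: root -> left -> right
    return [node.value] + preorder(node.left) + preorder(node.right) if node else []

def inorder(node):    # DFS: left -> root -> right
    return inorder(node.left) + [node.value] + inorder(node.right) if node else []

def postorder(node):  # DFS: left -> right -> root
    return postorder(node.left) + postorder(node.right) + [node.value] if node else []

def level_order(node):  # BFS: visit the tree one level at a time
    result, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        result.append(current.value)
        queue.extend(child for child in (current.left, current.right) if child)
    return result

tree = Node(2, Node(1), Node(3))
print(preorder(tree))     # [2, 1, 3]
print(inorder(tree))      # [1, 2, 3]
print(postorder(tree))    # [1, 3, 2]
print(level_order(tree))  # [2, 1, 3]
```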
1.6.2 Binary Search Tree (BST)
BST Property (left subtree < root < right subtree):
The value in each node must be greater than (or equal to) any values stored in its left subtree.
The value in each node must be less than (or equal to) any values stored in its right subtree.
Inorder traversal in a BST visits the nodes in ascending order. Therefore, the inorder traversal is the most frequently used traversal method for a BST.
successor: the node that has the next-largest key in the tree
it has no more than one child
You could go over the Leetcode Binary Search Tree topic for details
1.6.3 Heap / Priority Queue / Binary Heap
Priority Queue:
the logical order of items inside a queue is determined by their priority.
The highest priority items are at the front of the queue and the lowest priority items are at the back.
Binary Heap: the classic way to implement a priority queue.
both enqueue and dequeue operations are O(log n)
min heap: the smallest key is always at the front
max heap: the largest key value is always at the front
complete binary tree: a tree in which each level has all of its nodes, except possibly the bottom level, which is filled from left to right
can be implemented using a single list
heap order property: In a heap, for every node x with parent p, the key in p is smaller than or equal to the key in x.
For example, the root of the tree must be the smallest item in the tree
When to use heap:
Priority Queue implementation
whenever need quick access to largest/smallest item
Instant access to the item
insertions are fast, allow in-place sorting
More details can be seen in this discussion
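Python’s built-in heapq module implements a list-backed min heap along exactly these lines; a quick illustration (the negate-the-keys max heap is a common workaround, not a heapq feature):

```python
import heapq

# heapq turns a plain list into a MIN heap.
heap = []
for value in [5, 1, 9, 3]:
    heapq.heappush(heap, value)  # O(log n) enqueue

print(heapq.heappop(heap))  # 1 -- the smallest key is always at the front
print(heapq.heappop(heap))  # 3

# For a MAX heap, negate the keys on the way in and out.
max_heap = [-v for v in [5, 1, 9, 3]]
heapq.heapify(max_heap)          # O(n) build from an existing list
print(-heapq.heappop(max_heap))  # 9 -- the largest key first
```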
1.6.4 More Trees
Parse tree can be used to represent real-world constructions like sentences or mathematical expressions.
A simple solution to keeping track of parents as we traverse the tree is to use a stack.
When we want to descend to a child of the current node, we first push the current node on the stack.
When we want to return to the parent of the current node, we pop the parent off the stack.
AVL Tree: a balanced binary tree, named for its inventors G.M. Adelson-Velskii and E.M. Landis.
For each node: balanceFactor = height(leftSubTree) − height(rightSubTree)
a subtree is left-heavy if balance_factor > 0
a subtree is right-heavy if balance_factor < 0
a subtree is perfectly in balance if balance_factor = 0
For simplicity we can define a tree to be in balance if the balance factor is -1, 0, or 1.
The number of nodes follows the pattern of the Fibonacci sequence; as the number of elements gets larger, the ratio F(i)/F(i-1) approaches the golden ratio, from which the time complexity is derived to be O(log n).
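The balance factor above can be computed with a naive O(n) height calculation; a minimal sketch on an illustrative Node class (a real AVL tree caches heights instead of recomputing them):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def height(node):
    # Height of an empty tree is -1, of a single leaf is 0.
    return -1 if node is None else 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    # balanceFactor = height(leftSubTree) - height(rightSubTree)
    return height(node.left) - height(node.right)

# A left-heavy chain: 3 -> 2 -> 1 (all left children).
tree = Node(3, left=Node(2, left=Node(1)))
print(balance_factor(tree))  # 2 -- outside [-1, 1], so an AVL tree would rotate
```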
Red-Black Tree
B+ Tree: N-ary tree
Trie
This is a common data structure in interviews
Binary Index Tree (Fenwick Tree)
1.7 Graph
1.7.1 Vocabulary and Definitions
Vertex (or Node): has a name, called the "key", and optional additional information, called the "payload"
Edge (or arc): it connects two vertices to show that there is a relationship between them.
A graph with one-way edges is called a directed graph (or digraph)
Weight: edges may be weighted to show that there is a cost to go from one vertex to another.
Path: a sequence of vertices that are connected by edges
Unweighted path length is the number of edges in the path, specifically n-1 for a path of n vertices
Weighted path length is the sum of the weights of all the edges in the path
Cycle: a path that starts and ends at the same vertex
A graph with no cycles is called an acyclic graph.
A directed graph with no cycles is called a directed acyclic graph (or DAG)
Graph: a graph G is composed of a set of vertices V and a set of edges E, written G = (V, E). Each edge is a tuple (v, w) where v, w ∈ V; a weighted edge carries its weight as an additional component.
1.7.2 Graph Representation
Adjacency Matrix (2D matrix)
Good when number of edges is large
Each of the rows and columns represent a vertex in the graph.
The value in the cell at the intersection of row v and column w indicates if there is an edge from vertex v to vertex w. It also represents the weight of the edge from vertex v to vertex w.
sparse: most of the cells in the matrix are empty
Adjacency List
space-efficient way for implementation
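A common Python sketch of an adjacency list uses a dictionary of dictionaries, mapping each vertex to its neighbours and edge weights (the graph here is made up). Only existing edges are stored, which is why this is space-efficient for sparse graphs:

```python
# vertex -> {neighbour: edge weight}
graph = {
    'A': {'B': 5, 'C': 2},
    'B': {'D': 1},
    'C': {'D': 7},
    'D': {},
}

print(graph['A'])         # {'B': 5, 'C': 2} -- A's adjacent vertices
print('D' in graph['B'])  # True -- there is an edge B -> D
print(graph['A']['C'])    # 2   -- the weight of edge A -> C
```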
1.7.3 Graph Algorithms
Graph traversal: BFS & DFS
2.1 Brute Force
Most common algorithm
Whenever you are facing a problem without many clues, you should solve it using brute force first, and observe the process and try to optimize your solution
2.2 Search
2.2.1 Sequential Search
Sequential Search: visit the stored value in a sequence (use loop)
2.2.2 Binary Search
Examine the middle item of an ordered list
The key idea is to repeatedly halve the search interval
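A minimal iterative sketch of binary search over a sorted list:

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    low, high = 0, len(items) - 1
    while low <= high:             # the search interval is [low, high]
        mid = (low + high) // 2    # examine the middle item
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            low = mid + 1          # discard the left half
        else:
            high = mid - 1         # discard the right half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))  # -1
```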
2.3 Sort
2.3.1 Bubble Sort
Compares adjacent items and exchanges those that are out of order.
Short bubble: stop early if it finds that the list has become sorted.
time complexity: O(n²)
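A sketch of bubble sort with the "short bubble" early exit described above:

```python
def short_bubble_sort(items):
    """Bubble sort that stops early once a pass makes no exchanges."""
    n = len(items)
    for end in range(n - 1, 0, -1):
        exchanged = False
        for i in range(end):
            if items[i] > items[i + 1]:  # adjacent items out of order
                items[i], items[i + 1] = items[i + 1], items[i]
                exchanged = True
        if not exchanged:                # list already sorted: stop early
            break
    return items

print(short_bubble_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```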
2.3.2 Selection Sort
Looks for the largest value as it makes a pass and, after completing the pass, places it in the proper location.
time complexity: O(n²)
2.3.3 Insertion Sort
Maintains a sorted sub-list in the lower positions of the list.
Each new item is then “inserted” back into the previous sub-list such that the sorted sub-list is one item larger.
time complexity: O(n²)
2.3.4 Shell Sort
Breaks the original list into a number of smaller sub-lists, each of which is sorted using an insertion sort.
the shell sort uses an increment i, sometimes called the gap, to create a sub-list by choosing all items that are i items apart.
After all the sub-lists are sorted, it finally does a standard insertion sort
time complexity falls between O(n) and O(n²); by changing the increment, a shell sort can perform at O(n^(3/2))
2.3.5 Merge Sort
A recursive algorithm that continually splits a list in half.
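A minimal recursive sketch (this slicing version is easy to read, though it copies sublists rather than merging in place):

```python
def merge_sort(items):
    """Recursively split the list in half, then merge the sorted halves."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # take the smaller head each time
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]     # append whatever remains

print(merge_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```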
2.3.6 Quick Sort
First selects a value (pivot value), and then use this value to assist with splitting the list.
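A minimal sketch using list comprehensions for the partition step (simple to read, though not in-place like a textbook quick sort):

```python
def quick_sort(items):
    """Pick a pivot, partition around it, and sort the partitions recursively."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]  # middle element as the pivot value
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)

print(quick_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```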
2.3.7 Heap Sort
Use the property of heap to sort the list
2.4 Recursion
Recursion is a method of solving problems that involves breaking a problem down into smaller and smaller sub-problems until you get to a small enough problem that it can be solved trivially. Usually recursion involves a function calling itself.
Three Laws of Recursion:
A recursive algorithm must have a base case.
A recursive algorithm must change its state and move toward the base case.
A recursive algorithm must call itself, recursively.
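The three laws can be illustrated with the classic factorial function:

```python
def factorial(n):
    if n <= 1:                       # law 1: a base case
        return 1
    return n * factorial(n - 1)      # laws 2 & 3: call itself, moving toward the base case

print(factorial(5))  # 120
```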
Recursive visualization: Fractal tree
A fractal is something that looks the same at all different levels of magnification.
A fractal tree: a small twig has the same shape and characteristics as a whole tree.
2.4.1 Recursive function in Python
When a function is called in Python, a stack frame is allocated to handle the local variables of the function.
When the function returns, the return value is left on top of the stack for the calling function to access.
Even though we are calling the same function over and over, each call creates a new scope for the variables that are local to the function.
2.5 Backtracking
a general algorithm for finding all (or some) solutions to constraint satisfaction problems (i.e. chess, puzzles, crosswords, verbal arithmetic, Sudoku, etc)
2.6 Dynamic Programming
Dynamic Programming (DP) is an algorithm technique which is usually based on a recurrent formula and one (or some) starting states.
A sub-solution of the problem is constructed from previously found ones.
Usually used to find the extreme cases such as shortest path, best fit, smallest set, etc.
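A minimal sketch of the idea using the Fibonacci sequence, built bottom-up from its starting states:

```python
def fibonacci(n):
    """Bottom-up DP: each sub-solution is built from previously found ones."""
    if n < 2:
        return n
    previous, current = 0, 1  # starting states F(0) and F(1)
    for _ in range(n - 1):
        # recurrent formula: F(i) = F(i-1) + F(i-2)
        previous, current = current, previous + current
    return current

print(fibonacci(10))  # 55
```

Compared to naive recursion, which recomputes the same sub-problems exponentially many times, this runs in O(n).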
2.7 Divide and Conquer
Divide: break into non-overlapping sub-problems of the same type (of problem)
Conquer: solve sub-problems
the algorithm is to keep dividing and conquering, and finally combine them to get the solution
the algorithm can be written in recursive or loop
2.8 Greedy
Greedy algorithm:
find a safe move first
prove safety
solve subproblem (which should be similar to original problem)
estimate running time
Additional notes:
The edges of a ‘directed’ graph have arrows indicating the direction and flow.
As more and more items hash to the same location, the difficulty of searching for the item in the collection increases.
Because a binary heap is a complete tree stored in a single list, the left child of a parent at position p is found at position 2p in the list, and the right child at position 2p+1.
A Graph object keeps a master list of all its vertices; each vertex is an element of the list with the vertex as ID and a list of its adjacent vertices as value.
| f(n) | Name |
| --- | --- |
| 1 | Constant |
| log n | Logarithmic |
| n | Linear |
| n log n | Log Linear |
| n² | Quadratic |
| n³ | Cubic |
| 2^n | Exponential |
| Infix Expression | Prefix Expression | Postfix Expression |
| --- | --- | --- |
| A + B | + A B | A B + |
| A + B * C | + A * B C | A B C * + |
| (A + B) * C | * + A B C | A B + C * |
| A + B * C + D | + + A * B C D | A B C * + D + |
| (A + B) * (C + D) | * + A B + C D | A B + C D + * |
| A * B + C * D | + * A B * C D | A B * C D * + |
| A + B + C + D | + + + A B C D | A B + C + D + |