Main article: Associative array
Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Ruby, Python, and PHP.
When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items.
When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.
Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications. In multi-node database systems, hash tables are commonly used to distribute rows amongst nodes, reducing network traffic for hash joins.
Main article: Cache (computing)
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.
Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.
Those structures can therefore be used to implement a set data structure,[43] which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.
Several dynamic languages, such as Perl, Python, JavaScript, Lua, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.
Main article: String interning
Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single string pool implemented as a hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.).
Main article: Transposition table
A transposition table is a complex hash table which stores information about each section that has been searched.[44]
Data can be stored and accessed in a number of ways, and one of the most important structures for doing so is the hash table. In Python, hash tables are implemented through the built-in dictionary data type. In this article, you will learn what hash tables and hashmaps are in Python and how you can implement them using dictionaries.
Before moving ahead, let us take a look at all the topics of discussion:
What is a Hash table or a Hashmap in Python?
Hash table vs Hashmap
Creating Dictionaries
Creating Nested Dictionaries
Performing Operations on Hash Tables using dictionaries
Accessing Values
Updating Values
Deleting Items
Converting a Dictionary into a Dataframe
In computer science, a hash table or a hashmap is a data structure that maps keys to values (it implements the abstract associative array data type). It makes use of a function that computes an index from each key; that index determines where the element is stored, searched for, inserted, or removed. This makes accessing data easy and fast. In general, hash tables store key-value pairs, and the index for each key is generated using a hash function.
Hash tables or hash maps in Python are implemented through the built-in dictionary data type. The position of each key in a dictionary is determined by a hashing function. Dictionaries are mutable, and since Python 3.7 they preserve insertion order.
An example of a dictionary can be a mapping of employee names and their employee IDs or the names of students along with their student IDs.
Moving ahead, let’s see the difference between the hash table and hashmap in Python.
Dictionaries can be created in two ways:
Using curly braces ({})
Using the dict() function
Dictionaries in Python can be created using curly braces as follows:
EXAMPLE:
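A minimal sketch of what the elided example presumably looked like (the variable name my_dict is illustrative); it prints the dictionary and its type name so it lines up with the output below:

```python
# Create a dictionary using curly braces
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

print(my_dict)
print(type(my_dict).__name__)
```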
OUTPUT:
{'Dave': '001', 'Ava': '002', 'Joe': '003'}
dict
Python has a built-in function, dict() that can be used to create dictionaries in Python. For example:
EXAMPLE:
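A sketch of the kind of code implied here, calling dict() with no arguments:

```python
# Create an empty dictionary using the dict() constructor
my_dict = dict()

print(my_dict)
print(type(my_dict).__name__)
```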
OUTPUT:
{}
dict
In the above example, an empty dictionary is created since no key-value pairs are supplied as a parameter to the dict() function. In case you want to add values, you can do as follows:
EXAMPLE:
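One way to supply key-value pairs to dict() that would produce the output below (keyword-argument and iterable-of-pairs forms also exist):

```python
# Create a dictionary by passing a mapping to dict()
my_dict = dict({'Dave': '001', 'Ava': '002', 'Joe': '003'})

print(my_dict)
print(type(my_dict).__name__)
```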
OUTPUT:
{'Dave': '001', 'Ava': '002', 'Joe': '003'}
dict
Nested dictionaries are basically dictionaries that lie within other dictionaries. For example:
EXAMPLE:
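A hedged sketch of a nested dictionary; the detail fields shown here (ID, age, department) are illustrative, not from the original example:

```python
# A nested dictionary: each employee name maps to another dictionary of details
emp_details = {
    'Dave': {'ID': '001', 'Age': 30, 'Department': 'Sales'},
    'Ava':  {'ID': '002', 'Age': 26, 'Department': 'Marketing'},
    'Joe':  {'ID': '003', 'Age': 31, 'Department': 'HR'},
}

print(emp_details)
```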
There are a number of operations that can be performed on hash tables in Python through dictionaries, such as:
Accessing Values
Updating Values
Deleting Element
The values of a dictionary can be accessed in many ways such as:
Using key values
Using functions
Implementing the for loop
Dictionary values can be accessed using the key values as follows:
EXAMPLE:
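A sketch of key-based access (shown as an interactive session so the quoted value matches the output below):

```python
>>> my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}
>>> my_dict['Dave']
'001'
```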
OUTPUT: '001'
There are a number of built-in functions that can be used such as get(), keys(), values(), etc.
EXAMPLE:
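A sketch using keys(), values(), and get() that lines up with the output below:

```python
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

print(my_dict.keys())       # all keys
print(my_dict.values())     # all values
print(my_dict.get('Dave'))  # value for a single key
```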
OUTPUT:
dict_keys(['Dave', 'Ava', 'Joe'])
dict_values(['001', '002', '003'])
001
The for loop allows you to access the key-value pairs of a dictionary easily by iterating over them. For example:
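A sketch of iterating over keys, values, and key-value pairs, matching the output below:

```python
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

print('All keys')
for key in my_dict:
    print(key)

print('All values')
for value in my_dict.values():
    print(value)

print('All keys and values')
for key, value in my_dict.items():
    print(key, ':', value)
```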
OUTPUT:
All keys
Dave
Ava
Joe
All values
001
002
003
All keys and values
Dave : 001
Ava : 002
Joe : 003
Dictionaries are mutable data types and therefore, you can update them as and when required. For example, if I want to change the ID of the employee named Dave from ‘001’ to ‘004’ and if I want to add another key-value pair to my dictionary, I can do as follows:
EXAMPLE:
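A sketch of updating an existing key and adding a new key-value pair, matching the output below:

```python
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

my_dict['Dave'] = '004'    # change an existing value
my_dict['Chris'] = '005'   # add a new key-value pair

print(my_dict)
```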
OUTPUT: {'Dave': '004', 'Ava': '002', 'Joe': '003', 'Chris': '005'}
There are a number of tools that allow you to delete items from a dictionary, such as the del keyword, pop(), popitem(), clear(), etc. For example:
EXAMPLE:
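A sketch that uses a few of these removal tools so only 'Joe': '003' is left, matching the output below (exactly which item is removed by which function is an assumption):

```python
my_dict = {'Dave': '004', 'Ava': '002', 'Joe': '003', 'Chris': '005'}

del my_dict['Dave']     # remove a specific key with del
my_dict.pop('Ava')      # remove a key and return its value
my_dict.popitem()       # remove the last-inserted pair ('Chris': '005')

print(my_dict)
```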
OUTPUT: {'Joe': '003'}
The above output shows that all the elements except ‘Joe: 003’ have been removed from the dictionary using the various functions.
As you saw previously, I created a nested dictionary containing employee names and their details. Now, to make a clear table out of that, I will use the pandas library to put everything into a DataFrame.
EXAMPLE:
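A hedged sketch of turning the nested employee dictionary into a pandas DataFrame (the detail fields are the illustrative ones used above, and pandas must be installed):

```python
import pandas as pd

emp_details = {
    'Dave': {'ID': '001', 'Age': 30, 'Department': 'Sales'},
    'Ava':  {'ID': '002', 'Age': 26, 'Department': 'Marketing'},
    'Joe':  {'ID': '003', 'Age': 31, 'Department': 'HR'},
}

# Each outer key becomes a column; transpose so employees become rows
df = pd.DataFrame(emp_details).transpose()
print(df)
```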
OUTPUT:
Hash tables are also called hash maps, maps, unordered maps, or dictionaries. A hash table is a structure that maps keys to values. This makes them extremely efficient for lookups because if you have the key, retrieving the associated value is a constant-time operation.
Hash tables have fast lookups (O(1)) on average. However, in the worst case, they have slow (O(n)) lookups. The slow lookups happen when there is a hash collision (two different keys hash to the same index).
Hash tables have fast insertions (O(1)) on average. However, in the worst case, they have slow (O(n)) insertions. Just like with the lookups, the worst case occurs due to hash collisions.
Hash tables have fast deletes (O(1)) on average. However, in the worst case, they have slow (O(n)) deletions. Just like with lookups and insertions, the worst case occurs due to hash collisions.
The space complexity of a hash table is linear (O(n)). Each key-value pair in the hash table will take up space in memory.
The main reason why hash tables are great is that they have constant-time (O(1)) lookup operations in the average case. That makes them great to use in any situation where you will be conducting many lookup operations. The second reason they are great is that they allow you to use any hashable object as a key. This means they can be used in many different scenarios where you want to map one object (the key) to another object (the value).
One weakness of the hash table is that the mapping goes only one way. So, if you know the key, it's incredibly efficient to retrieve the mapped value to that key. However, if you know the value and want to find the key that is mapped to that value, it is inefficient. Another weakness is that if your hash function produces lots of collisions, the hash table's time complexity gets worse and worse. This is because the underlying linked lists are inefficient for lookups.
A hash collision is when our hash function returns the same index given two different keys. Theoretically, there is no perfect hash function, though some are better than others. Thus, any hash table implementation has to have a strategy to deal with the scenario when two different keys hash to the same index. You can't just have the hash table overwrite the already existing value.
The most common strategy for dealing with hash collisions is not storing the values directly at an index of the hash table's array. Instead, the array index stores a pointer to a linked list. Each node in the linked list stores a key, value, and a pointer to the next item in the linked list.
The above is just one of the ways to deal with hash collisions. Hopefully, you can now see why all of our hash table operations become O(n) in the worst case. What is the worst case? The worst case is when all of the keys collide at the same index in our hash table. If we have ten items in our hash table, all ten items are stored in one linked list at the same index of our array. That means that our hash table has the same efficiency as a linked list in the worst case.
In your own words, explain how and why the time complexity of hash table operations degrades to O(n) in the worst case.
Hashing functions take an input (usually a string) and return an integer as the output. Let's say we needed to store five colors in our hash table. Currently, we have an empty table that looks like this:
Now, I need to assign an index given the name of a color. Our hash function will take the name of a color and convert it into an index.
So, hash functions convert the strings into indexes, and then we store the given string into the computed index of an array.
Hash Function + Array = Hash Table
Now that we know what a hash table is let's dive deeper into creating a hash function.
To convert a string into an integer, hashing functions operate on the individual characters that make up the string.
Let's use what we know to create a hashing function in Python.
In Python, we can encode strings into their bytes representation with the .encode() method. Once encoded, an integer represents each character.
Let's do this with the string hello:
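A quick sketch of that encoding step in an interactive session:

```python
>>> 'hello'.encode()
b'hello'
>>> list('hello'.encode())
[104, 101, 108, 108, 111]
```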
Now that we’ve converted our string into a series of integers, we can manipulate those integers somehow. For simplicity’s sake, we can use a simple accumulator pattern to get a sum of all the integer values.
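Summing those integers with a simple accumulator gives us a single number:

```python
>>> total = 0
>>> for byte in 'hello'.encode():
...     total += byte
...
>>> total
532
```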
Great! We turned a string into a number. Now, let's generalize this into a function.
We aren't done yet 🤪. As shown earlier, hello returns 532. But, what if our hash table only has ten slots? We have to make 532 a number less than 10 😱.
Remember the modulo operator %? We can use that in our hashing function to ensure that the integer the function returns is within a specific range.
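A sketch of a generalized hashing function that stays within the bounds of a ten-slot table (the function and parameter names are illustrative):

```python
def my_hashing_function(string, table_size=10):
    # Sum the byte values of the string, then wrap the sum into the
    # table's range with the modulo operator.
    total = 0
    for byte in string.encode():
        total += byte
    return total % table_size


print(my_hashing_function('hello'))  # 532 % 10 -> 2
```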
https://replit.com/@bgoonz/cs-unit-1-sprint-4-module-1-hash-function#main.py
We define a hash table as an empty array and a hash function as a function that takes a value and converts it into an array index where you will store that value. Let's put the two together. Let's implement a HashTable class where we can:
Insert values into a hash table
Retrieve values from a hash table
Delete values from a hash table
Let's start with the insert function. For an insert, I need to insert a value with an associated key. Let's store the instructors at Lambda and where they live. We want to store:
("Parth", "California")
("Beej", "Oregon")
("Dustin", "Utah")
("Ryan", "Utah")
Here's what our HashTable class looks like right now:
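A sketch of that starting point, using the djb2-based hash_index described just below (attribute names like capacity and storage are illustrative):

```python
class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity          # number of slots in the table
        self.storage = [None] * capacity  # the empty underlying array

    def hash_index(self, key):
        # Hash the key with djb2, then map the result to a valid index
        hash_value = 5381
        for char in key:
            hash_value = (hash_value * 33) + ord(char)
        return hash_value % self.capacity
```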
Let's break this down a little bit. Our init function takes in the length of a hash table and creates an empty array. Our hash_index function takes a key and computes an index using the famous djb2 hash function. Let's implement the other functions (put, delete, get).
The put Method
Let's create our put function. Before we code, let's break down what needs to happen:
Given a key and a value, insert the respective value into a hash table array using the hashed key to determine the storage location index.
Let's think about what we need to do:
Hash the key into an index using the hash function
Put the value into that index
You might be thinking, "What if two keys hash to the same index?" That's a great question, and we will worry about that later. It's a nifty solution 🎉. But for now, let's worry about hashing a key and storing a value.
First, let's call the hash function and store the return value in index. Next, let's insert the value at that index:
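Putting both steps together, a minimal put might look like this (no collision handling yet):

```python
class HashTable:
    # ... __init__ and hash_index as sketched above ...

    def put(self, key, value):
        # Step 1: hash the key into an index
        index = self.hash_index(key)
        # Step 2: store the value at that index
        self.storage[index] = value
```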
There we go! Given a key, we hashed it and inserted a value. Again, we will worry about colliding indices later 😀.
The delete Method
Next, let's write our delete method. What does this method need to do? We can think about it as the inverse of the put method that we just defined. The function will receive a key as its input, then pass that key through the hash function to get the index where the hash table's value needs to be deleted.
Let's start by getting the index by passing the key through the hashing function. Next, we need to delete the value from that index in our storage by setting it to None. Remember, we aren't dealing with collisions in this example. If we had to deal with collisions, this would be more complex.
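A matching sketch of delete under those assumptions:

```python
class HashTable:
    # ... __init__, hash_index, and put as sketched above ...

    def delete(self, key):
        # Hash the key to find where its value lives
        index = self.hash_index(key)
        # "Delete" by clearing the slot (no collision handling here)
        self.storage[index] = None
```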
The get Method
The last method we need to deal with is our get method. get is a simple method that retrieves the value stored at a specific key. The function needs to receive a key as an input, pass that key through the hashing function to find the index where the value is stored, and then return the value at that index.
Let's start by getting the index from the key. Next, we need to return the value that is stored at that index.
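And a sketch of get:

```python
class HashTable:
    # ... other methods as sketched above ...

    def get(self, key):
        # Hash the key, then return whatever is stored at that index
        index = self.hash_index(key)
        return self.storage[index]
```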
In this sprint, you will learn about number bases, character encoding, and how values are stored in memory. We'll also talk all about hash tables, a very powerful data structure for getting more performance and flexibility out of your algorithms. Finally, we'll look at optimizing searching through data, as well as the powerful programming technique of recursion.
This module will teach about RAM, number bases, fixed-width integers, how arrays are stored in memory, and character encoding. These basics allow you to better understand how other data structures are stored in memory and have an intuitive sense of the time complexity of specific data structures' operations.
In this module, you will learn about the properties of hash tables, hash functions, and how to implement a hash table in Python. Hash tables are arguably one of the most important and common data structures you use and encounter to solve algorithmic coding challenges.
In this module, you will learn about hash collisions and implement a complete hash table in Python that takes collisions into account.
This module will teach about logarithms, linear search, binary search, and recursion. This material sets the stage for learning about traversal techniques of trees and graphs. Additionally, using a binary search is a common optimization technique that may come up during a technical interview.
Your computer has something called random access memory (RAM). Sometimes, people say "memory" when referring to RAM.
One thing that might come to your mind is that there are different types of memory on your computer. What about storing things like videos, text documents, and applications? Are those in "memory"? There is a distinction between "storage" and "memory". Things like videos and files are stored on a disc, not in RAM. RAM is faster than disc storage, but there isn't as much space available. Disc storage has more space, but it is slower.
Think of RAM like a set of numbered, sequential mailboxes. Just like a set of mailboxes with numbered addresses, RAM is also sequential and has numbered addresses.
Now, just like you can put something in a mailbox, you can also put something in RAM. Things that you put in RAM, we can call variables. Each "box" in RAM has an address.
Each one of the "boxes" (memory addresses) in our set of mailboxes (RAM) holds 8 bits. You can think of each bit like a tiny switch that can either be "on" or "off." "On" is represented by a 1, and "off" is represented by a 0.
Bits are often thought about in groups. A group of 8 bits is called a byte. Each "box" in RAM can hold 1 byte (8 bits).
Now, a computer has more than just disc storage and RAM inside of it. There is also a processor. And, in between the processor and the RAM is something called a memory controller. The memory controller can access each box in RAM directly. It is as if the memory controller had tubes connected to each box of the set of mailboxes. Through those tubes, the memory controller can send and receive information directly to each box in RAM.
Why is the direct connection between the memory controller and each box in RAM meaningful? It's so that the memory controller can jump around to which box it wants to communicate with quickly. Even though the boxes are in sequential order, the memory controller doesn't have to go through the boxes in order. It can access the first one, then jump to one somewhere in the middle, and then access one at the end. Because there is a direct connection, this is done quickly.
Whenever you use a computer, you are very concerned with the speed of the computer you are using. So, computer designers made a way to optimize for speed when accessing items in RAM. Whenever a processor accesses a box in RAM, it also accesses and stores the boxes near it. Often, if you are accessing one thing in RAM, it's likely that the next thing you need to access is nearby. That's why keeping a copy of nearby items in the cache speeds things up.
Whenever the processor reads something (say, the player's position in an old adventure game) out of RAM, it adds it to the cache to use it again in the future. Then, when it needs something else from RAM, it will go to the cache for it. As you can see, the cache helps the processor by saving execution cycles required to go out and read something from RAM.
The processor, not RAM, has the actual cache. The memory controller keeps track of what goes into and comes out of the cache.
We can think of it in several ways. Perhaps, the processor can use the cache as a temporary area to keep a copy of its last actions just in case it needs to reread them.
There is one caveat — it is not as if "everything" goes out to RAM and then gets inserted into the cache. In reality, the cache holds only a handful of memory addresses from RAM. Also, note that these few memory addresses in the cache can be accessed faster than other storage locations.
Draw a model of how a processor interacts with the cache, memory controller, and RAM whenever it needs to read or write from memory.
Computers use the binary number system, so we will represent all of our variables in the binary number system.
Instead of 10 digits like 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0, the binary number system only has two possible digits, 1 and 0. Another way to think of it is that computers only have switches (bits) that can be in an "off state" or an "on state."
Before we try to understand the binary number system, let's review how the decimal number system works. Let's look at the number "1001" in decimal.
Even though there are two "1" digits in this number, they don't represent the same quantity. The leftmost "1" represents one thousand, and the rightmost "1" represents one unit. The "0"s in-between represent the tens place and the hundreds place.
So this "1001" in base ten represents "1 thousand, 0 hundreds, 0 tens, and 1 one."
Each successive digit in the base 10 number system is a power of ten. The ones place is 10^0 = 1. The tens place is 10^1 = 10. The hundreds place is 10^2 = 100. This pattern continues on and on.
This pattern holds for other number systems as well. In the binary system, each successive digit represents a different power of 2. The first digit represents 2^0 = 1. The second digit represents 2^1 = 2. The third digit represents 2^2 = 4. Again, this pattern continues on and on.
So, what if the number "1001" was in binary and not decimal? What would it represent then? Well, if we read it right to left, we have a "1" in the ones place, a "0" in the twos place, a "0" in the fours place, and a "1" in the eights place. We add these values up (8 + 0 + 0 + 1) which equals 9.
Below, is a table that shows how to count up to 8 in binary and decimal:
Decimal | Binary
0 | 0000
1 | 0001
2 | 0010
3 | 0011
4 | 0100
5 | 0101
6 | 0110
7 | 0111
8 | 1000
Convert the following decimal numbers into binary numbers:
25
63
9
111
We now know that things are stored in RAM using binary, and each "box" in RAM holds 1 byte (8 bits). What does that mean for what we can store in RAM? Let's say we have 1 byte of RAM to use. How many different numbers can we represent using only this 1 byte?
Remember that each digit in a binary number is a successive power of 2. If we have 8 bits to use, we can store 2^8 = 256 different numbers in 1 byte.
Let's see if we can find a pattern:
With one bit, we can express two numbers (0 and 1)
With two bits, for each of the first numbers (0 or 1), we can put a 0 or a 1 after it, so we can express four numbers
With three bits, we can express eight numbers.
Every time we add a new bit, we double the number of possible numbers we can express in binary. This pattern can be generalized as 2^n, and 2^8 = 256.
Often, computers use 4 bytes (32 bits) to represent our variables, meaning that we can express about 4 billion (2^32) possible values. Similarly, computers may use 8 bytes (64 bits) to represent our variables and can express over 18 quintillion (2^64) possible values.
The n in 2^n is called the bit size. Integers stored in 8 bits are called "8-bit" integers, integers stored in 16 bits are called "16-bit," and so on.
In theory, you could use less space to represent smaller integers. For instance, in binary, the number one is represented by 1. So, technically, to store one in binary, you only need one bit. But computers don't usually do this. Many integers take a fixed amount of space, no matter what number they might have in them. So, even though you only need one bit to represent the number one, the computer would still use 32 or 64 bits to do so.
So, if a variable represents a fixed-width integer, it doesn't matter if it has the value 0 or 123,456; the amount of space it takes up in RAM is the same.
The computer can store numbers like 3, 60000000, or -14 in 32 bits, one of the "fixed-width integers" we discussed earlier. All of these fixed-width integers take up constant space (O(1)).
Storing numbers as fixed-width integers introduces a trade-off. We have constant space complexity, and because each integer has a constant and expected number of bits, simple mathematical operations only take constant time. The cost of having an integer as fixed-width is that there is a limit to the number of integers you can represent.
What is the number of possible integer values you can store with 4 bytes? How did you make that calculation?
What is the number of possible integer values you can store with 8 bytes? How did you make that calculation?
When writing programs, you likely need to store several numbers, not just one integer.
So, let's say we wanted to write a program that allowed us to keep track of the number of hours we spent studying that day. We will round the number of hours to the nearest whole number to store them as fixed-width integers. Additionally, each day's hours will be represented by eight bits in binary.
So, we will start at memory address 0 in RAM, and each day, store the number of hours we studied in that "box" of RAM. For our first day that we are tracking, we store an 8-bit binary integer in "box" number 0. On the second day, we store an 8-bit binary integer in "box" number 1. This pattern continues.
Now, I'm sure you've already used an array when you are programming. An array is just an ordered sequential collection of data. Well, RAM is already structured like this. Right? Our days where we track the number of hours that we are studying are in sequential order in RAM.
Knowing this information, what can we do if we want to look up how many hours we studied on day 5 (index 4, because of zero-indexing)? Because all of the information is stored in sequential order, we can do simple math. If you are looking for the day 5 information (index 4), you take the starting address (0) and add 4 (the index). Or, if the starting address was 5 and you were looking for the 10th index, you'd go to memory address 15 (5 + 10).
This math works because we are using one "box" in memory for each day's record. If we were using 64-bit integers that take up 8 "boxes" in RAM, we would have to slightly adjust our math. In this case, we would have to multiply the index we were looking for by the number of bytes each record was stored in. So, if we were storing 64-bit integers (8 bytes) and wanted to find the item with index 4, and the starting index was 0, we would go to memory address 0 + (4 * 8) = 32.
Because accessing information from a specific index involves this simple mathematical computation, accessing items in an array is a constant time operation. For the mathematical computations to be consistent and straightforward, arrays have to follow specific rules. Each item in the array has to take up the same number of bytes in RAM. Also, each item has to be stored right next to the previous item in RAM. If there are any gaps or interruptions in the array, then the simple mathematical computation for accessing a particular item no longer works.
Let's say you need to store an array of 64-bit integers. Your array needs to have enough capacity for 24 integers. How many 1-byte slots of memory need to be allocated to store this array?
In this example, we will store some strings. A string, as we know, is just a bunch of characters or letters. One straightforward way to store a string is an array, so let's see how we can define some mappings to make it easier to store strings in arrays.
To use our 8-bit slots in memory, we need a way to encode each character in a string in 8-bits. One common character encoding to do this is called "ASCII". Here's how the alphabet is encoded in ASCII:
Letter | Encoding
A | 01000001
B | 01000010
C | 01000011
D | 01000100
E | 01000101
F | 01000110
G | 01000111
H | 01001000
I | 01001001
J | 01001010
K | 01001011
L | 01001100
M | 01001101
N | 01001110
O | 01001111
P | 01010000
Q | 01010001
R | 01010010
S | 01010011
T | 01010100
U | 01010101
V | 01010110
W | 01010111
X | 01011000
Y | 01011001
Z | 01011010
Since we can express characters as 8-bit integers, we can express strings as arrays of 8-bit characters.
For example, we could represent LAMBDA like so:
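Using the ASCII table above, each letter maps to one byte:
L = 01001100, A = 01000001, M = 01001101, B = 01000010, D = 01000100, A = 01000001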
Each character, once it was encoded, could be stored as one 8-bit slot in memory.
Draw out a model of a section of memory that stores the string "Computer Science" as an array of 8-bit ASCII characters.
Searching & Recursion
Logarithms are a way of looking differently at exponentials. I know that this is a bit of a vague definition, so let's look at an example: 2^5.
What does that mathematical expression mean? It's an abbreviation for repeated multiplication: 2 * 2 * 2 * 2 * 2 = 32.
What we are looking at above is two different ways to express an object that doubles in size with each iteration.
Another way to think about 2^5 = 32 is that 2 is the "growth rate" and 5 is the number of times you went through the growth (doubling). 32 is the final result.
Let's look at a few more expressions:
Now, to begin looking at logarithms, let's rewrite the exponential expressions above in logarithmic form.
Notice how we have essentially just moved around different pieces of the expression.
For our first expression, 2 was the "growth rate", 5 was the "time" spent growing, and 32 was where we ended up. When we rewrite this logarithmically, we have log_2 32 = 5.
In this case, 2 still represents the "growth rate" and 32 still represents where we end up. The 5 also still represents the "time" spent growing.
So, the difference between when we would use a logarithm and when we use exponentiation depends on what factors we know ahead of time. If you know the growth rate and you know how long you are growing, you can use exponentiation (2^5) to figure out where you end up (32). However, if you know the growth rate and where you end up but do not know the time spent growing, you can use a logarithm (log_2 32) to figure that out.
Logarithms have an inverse relationship with exponents, just like division and multiplication have an inverse relationship.
For example, if you know that you have one group of 5 items and you want to identify the total you would have if you had 4 of those groups instead of just one, you could express that with 5 * 4 = 20. However, if you knew that you had a total of 20 items and you wanted to know how many groups of 5 you could make out of your total, you could express that with 20 / 5 = 4.
In computer science, you often ask questions like "How many times must n be divided in half before we get to one?" or "How many times will we halve this collection before the collection has only one item?" To answer these questions, you can use logarithms! Halving is the inverse of doubling, so we can say that log_2 n would give us the answer we're seeking.
You will see this come up when analyzing the time complexity of specific algorithms. Any algorithm that doubles or halves a number or collection on each iteration of a loop is likely to have O(log n) time complexity. You will see this come up specifically when we talk about binary search and its time complexity. You will also see this come up in specific sorting algorithms (like merge sort). In simple terms, merge sort divides a collection in half and then merges the sorted halves. The fact that the algorithm repeatedly halves something is your clue that it includes a logarithm in its time complexity. One last place you're likely to see logarithms come up is with a perfect binary tree. One property of these binary trees is that the number of nodes doubles at each level.
Write a logarithmic expression that is identical to this exponential expression:
Write an exponential expression that is identical to this logarithmic expression:
What keywords should you look out for that might alert you that logarithms are involved?
Imagine that I've chosen a random number from 1 to 20. Then, you must guess the number. One approach would be to start guessing at 1 and increment your guess by 1 each time. If the number randomly selected was 20, then it would take you 20 guesses to get the correct answer. If the selected number was 1, then you would have the right answer on your very first guess. On average, you will get the correct answer on the 10th or 11th guess.
If the collection we are searching through is random and unsorted, linear search is the most efficient way to search through it. Once we have a sorted list, then there are more efficient approaches to use.
We want to write a simple program to conduct a linear search on a collection of data. Let's write a function that takes a list (arr) and an integer (target) as its input and returns the integer idx where the target is found. If the target does not exist in the arr, then the function should return -1.
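A sketch of such a linear search (names follow the spec above):

```python
def linear_search(arr, target):
    # Check each element in order until we find the target
    for idx in range(len(arr)):
        if arr[idx] == target:
            return idx
    # The target was not found anywhere in the list
    return -1


print(linear_search([4, 8, 15, 16, 23, 42], 16))  # 3
print(linear_search([4, 8, 15, 16, 23, 42], 7))   # -1
```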
The word recursion comes from the Latin word recurrere, meaning to run or hasten back, return, revert, or recur. Here are some online definitions of recursion:
Notice how the concept that is being defined, ancestors, shows up in its own definition. This is a recursive definition.
In programming, recursion has a very precise meaning. It refers to a coding technique in which a function calls itself.
Most programming problems are solvable without recursion. So, strictly speaking, recursion usually isn’t necessary.
However, some situations particularly lend themselves to a self-referential definition—for example, the definition of ancestors shown above. If you were devising an algorithm to handle such a case programmatically, a recursive solution would likely be cleaner and more concise.
On the other hand, recursion isn’t for every situation. Here are some other factors to consider:
For some problems, a recursive solution, though possible, will be awkward rather than elegant.
Recursive implementations often consume more memory than non-recursive ones.
In some cases, using recursion may result in slower execution time.
Typically, the readability of the code will be the biggest determining factor. But it depends on the circumstances. The examples presented below should help you get a feel for when you should choose recursion.
The same holds true if multiple instances of the same function are running concurrently. For example, consider the following definition:
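Based on the description that follows, the elided definition was presumably something like this:

```python
def function():
    x = 10          # x lives in this call's own namespace
    function()      # the function calls itself, with no way to stop
```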
When function() executes the first time, Python creates a namespace and assigns x the value 10 in that namespace. Then function() calls itself recursively. The second time function() runs, the interpreter creates a second namespace and assigns 10 to x there as well. These two instances of the name x are distinct from each other and can coexist without clashing because they are in separate namespaces.
As written, function() would in theory go on forever, calling itself over and over without any of the calls ever returning. In practice, of course, nothing is truly forever. Your computer only has so much memory, and it would run out eventually.
Technical note: You can find out what Python's recursion limit is with a function from the sys module called getrecursionlimit(), and you can change it with setrecursionlimit():
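A sketch of checking and raising the limit (1000 is CPython's usual default, but yours may differ):

```python
>>> from sys import getrecursionlimit, setrecursionlimit
>>> getrecursionlimit()
1000
>>> setrecursionlimit(2000)
>>> getrecursionlimit()
2000
```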
You can set it to be pretty large, but you can’t make it infinite.
There isn’t much use for a function to indiscriminately call itself recursively without end. It’s reminiscent of the instructions that you sometimes find on shampoo bottles: “Lather, rinse, repeat.” If you were to follow these instructions literally, you’d shampoo your hair forever!
This logical flaw has evidently occurred to some shampoo manufacturers, because some shampoo bottles instead say “Lather, rinse, repeat as necessary.” That provides a termination condition to the instructions. Presumably, you’ll eventually feel your hair is sufficiently clean to consider additional repetitions unnecessary. Shampooing can then stop.
Similarly, a function that calls itself recursively must have a plan to eventually stop. Recursive functions typically follow this pattern:
There are one or more base cases that are directly solvable without the need for further recursion.
Each recursive call moves the solution progressively closer to a base case.
You’re now ready to see how this works with some examples.
The first example is a function called countdown(), which takes a positive number as an argument and prints the numbers from the specified argument down to zero:
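A sketch that matches the description below:

```python
>>> def countdown(n):
...     print(n)
...     if n == 0:
...         return           # base case: stop the recursion
...     countdown(n - 1)     # recursive call moves toward the base case
...
>>> countdown(5)
5
4
3
2
1
0
```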
Notice how countdown() fits the paradigm for a recursive algorithm described above:
The base case occurs when n is zero, at which point recursion stops.
In the recursive call, the argument is one less than the current value of n, so each recursion moves closer to the base case.
Note: For simplicity, countdown() doesn't check its argument for validity. If n is either a non-integer or negative, you'll get a RecursionError exception because the base case is never reached.
The version of countdown() shown above clearly highlights the base case and the recursive call, but there's a more concise way to express it:
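One such more concise formulation:

```python
def countdown(n):
    print(n)
    if n > 0:
        countdown(n - 1)
```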
Here's one possible non-recursive implementation for comparison:
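A non-recursive sketch using a simple loop:

```python
>>> def countdown(n):
...     while n >= 0:
...         print(n)
...         n -= 1
...
>>> countdown(5)
5
4
3
2
1
0
```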
This is a case where the non-recursive solution is at least as clear and intuitive as the recursive one, and probably more so.
In other words, n! is the product of all integers from 1 to n, inclusive.
As with the example shown above, there are base cases that are solvable without recursion. The more complicated cases are reductive, meaning that they reduce to one of the base cases:
The base cases (n = 0 or n = 1) are solvable without recursion.
For values of n greater than 1, n! is defined in terms of (n - 1)!, so the recursive solution progressively approaches the base case.
The calculations of 4!, 3!, and 2! suspend until the algorithm reaches the base case where n = 1. At that point, 1! is computable without further recursion, and the deferred calculations run to completion.
Here's a recursive Python function to calculate factorial. Note how concise it is and how well it mirrors the definition shown above:
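A sketch of that recursive definition:

```python
>>> def factorial(n):
...     return 1 if n <= 1 else n * factorial(n - 1)
...
>>> factorial(4)
24
```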
Notice how all the recursive calls stack up. The function gets called with n = 4, 3, 2, and 1 in succession before any of the calls return. Finally, when n is 1, the problem can be solved without any more recursion. Then each of the stacked-up recursive calls unwinds back out, returning 1, 2, 6, and finally 24 from the outermost call.
Again, this shows that if a problem is solvable with recursion, there will also likely be several viable non-recursive solutions as well. You’ll typically choose based on which one results in the most readable and intuitive code.
Another factor to take into consideration is execution speed. There can be significant performance differences between recursive and non-recursive solutions. In the next section, you’ll explore these differences a little further.
timeit() first executes the commands contained in the specified <setup_string>. Then it executes <command> the given number of <iterations> and reports the cumulative execution time in seconds:
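A sketch of that timing call (the exact time will differ on your machine):

```python
from timeit import timeit

# Print the string one hundred times; timeit() returns the total elapsed
# seconds (just over 0.03 in the run described below).
elapsed = timeit("print(string)", setup="string = 'foobar'", number=100)
print(elapsed)
```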
Here, the setup parameter assigns string the value 'foobar'. Then timeit() prints string one hundred times. The total execution time is just over 3/100 of a second.
The examples shown below use timeit() to compare the recursive, iterative, and reduce() implementations of factorial from above. In each case, setup_string contains a setup string that defines the relevant factorial() function. timeit() then executes factorial(4) a total of ten million times and reports the aggregate execution time.
First is the recursive version, then the iterative implementation, and finally the version that uses reduce():
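A consolidated sketch of the three timing runs; the reported times vary from machine to machine, so no specific numbers are shown here:

```python
from timeit import timeit

recursive_setup = """
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)
"""

iterative_setup = """
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

reduce_setup = """
from functools import reduce
def factorial(n):
    return reduce(lambda a, b: a * b, range(1, n + 1), 1)
"""

# Each call runs factorial(4) ten million times and prints the elapsed seconds
print(timeit("factorial(4)", setup=recursive_setup, number=10_000_000))
print(timeit("factorial(4)", setup=iterative_setup, number=10_000_000))
print(timeit("factorial(4)", setup=reduce_setup, number=10_000_000))
```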
In this case, the iterative implementation is the fastest, although the recursive solution isn't far behind. The method using reduce() is the slowest. Your mileage will probably vary if you try these examples on your own machine. You certainly won't get the same times, and you may not even get the same ranking.
Does it matter? There's a difference of almost four seconds in execution time between the iterative implementation and the one that uses reduce(), but it took ten million calls to see it.
If you’ll be calling a function many times, you might need to take execution speed into account when choosing an implementation. On the other hand, if the function will run relatively infrequently, then the difference in execution times will probably be negligible. In that case, you’d be better off choosing the implementation that seems to express the solution to the problem most clearly.
For factorial, the timings recorded above suggest a recursive implementation is a reasonable choice.
Perhaps it might interest you to know how math.factorial() performs in the timing test. Wow! math.factorial() performs better than the best of the other three implementations shown above by roughly a factor of 10.
A function implemented in C will virtually always be faster than a corresponding function implemented in pure Python.
Suppose you wanted to count the number of leaf elements in this list (the lowest-level str objects) as though you'd flattened out the list. The leaf elements are "Adam", "Bob", "Chet", "Cat", "Barb", "Bert", "Alex", "Bea", "Bill", and "Ann", so the answer should be 10.
Just calling len() on the list doesn't give the correct answer. len() counts the objects at the top level of names, which are the three leaf elements "Adam", "Alex", and "Ann" and two sublists ["Bob", ["Chet", "Cat"], "Barb", "Bert"] and ["Bea", "Bill"].
What you need here is a function that traverses the entire list structure, sublists included. The algorithm goes something like this:
Walk through the list, examining each item in turn.
If you find a leaf element, then add it to the accumulated count.
If you encounter a sublist, then do the following:
Drop down into that sublist and similarly walk through it.
Once you’ve exhausted the sublist, go back up, add the elements from the sublist to the accumulated count, and resume the walk through the parent list where you left off.
Note the self-referential nature of this description: Walk through the list. If you encounter a sublist, then similarly walk through that list. This situation begs for recursion!
In the case of the names list, if an item is an instance of type list, then it's a sublist. Otherwise, it's a leaf item.
Now you have the tools in place to implement a function that counts leaf elements in a list, accounting for sublists recursively:
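A sketch of the whole thing: the names list reconstructed from the leaves and sublists described above, a recursive count_leaf_items(), and the checks discussed in this section:

```python
names = [
    "Adam",
    ["Bob", ["Chet", "Cat"], "Barb", "Bert"],
    "Alex",
    ["Bea", "Bill"],
    "Ann",
]


def count_leaf_items(item_list):
    """Recursively count the leaf (non-list) items in a nested list."""
    count = 0
    for item in item_list:
        if isinstance(item, list):
            count += count_leaf_items(item)  # drop down into the sublist
        else:
            count += 1                       # a leaf element
    return count


print(len(names))               # 5  -- counts only the top level
print(count_leaf_items(names))  # 10 -- counts every leaf element
```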
If you run count_leaf_items() on several lists, including the names list defined above, it returns the number of leaf elements in each (10 for names).
2^n = 64
log_2 64 = n
Example
For n = 1, the output should be fibonacciSimpleSum2(n) = true.
Explanation: 1 = 0 + 1 = F_0 + F_1.
For n = 11, the output should be fibonacciSimpleSum2(n) = true.
Explanation: 11 = 3 + 8 = F_4 + F_6.
For n = 60, the output should be fibonacciSimpleSum2(n) = true.
Explanation: 60 = 5 + 55 = F_5 + F_10.
For n = 66, the output should be fibonacciSimpleSum2(n) = false.
Input/Output
[execution time limit] 4 seconds (py3)
[input] integer n
Guaranteed constraints: 1 ≤ n ≤ 2 · 10^9.
[output] boolean
true if n can be represented as F_i + F_j, false otherwise.
Given an integer array nums sorted in ascending order, and an integer target.
Suppose that nums is rotated at some pivot unknown to you beforehand (i.e., [1,2,3,4,5,6,7] might become [4,5,6,7,1,2,3]).
You should search for target in nums and if found return its index, otherwise return -1.
Example 1: Input: nums = [6,7,1,2,3,4,5], target = 1 Output: 2
Example 2: Input: nums = [6,7,1,2,3,4,5], target = 3 Output: 4
Example 3: Input: nums = [1], target = 2 Output: -1
Your solution should have better than O(n) time complexity over the number of items in the list. There is an O(log n) solution. There is also an O(1) solution.
Note:
1 <= nums.length < 100
1 <= nums[i] <= 100
All values of nums are unique.
Numbers from 1 up to the length of the list will be contained in the list.
[execution time limit] 4 seconds (py3)
[input] array.integer nums
[input] integer target
[output] integer
Number Bases & Characters
Remember when we wondered what would happen if multiple keys hashed to the same index, and we said that we would worry about it later? Whelp, it's later 🤪.
Let's say we were given the key-value pair ("Ryan", 10). Our hash code then maps "Ryan" to index 3. Excellent, that works! Now let's say after we inserted ("Ryan", 10), we have to insert ("Parth", 12). Our hash code maps "Parth" to index 3. Uh oh! Ryan is already there! What do we do?? 😱
Ok, let's stop freaking out, and let's think about this. If we don't do anything, the value stored at index 3 will just get overwritten. Meaning if we try to retrieve the value associated with "Ryan", 12 will be returned instead of 10. That might not seem like a big deal, but what if we were returning passwords based on a user ID, and we returned someone else's password? That would be horrible.
Let's fix this problem. The most common way to solve this is with chaining. If we see multiple values hashed to an index, we will chain them in some data structure that can hold multiple items. In our case, we'll use Python's list type, but a more typical solution would use a linked list. We'll cover linked lists in a future module.
Ok, sounds ideal? But how does this work in code? Let's write some of it together.
Below is a partially filled out hash table class where we will be using HashTableEntry as our chain entries.
Take a look at the code below.
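A sketch of that starting point (the count attribute will matter later when we track the load factor):

```python
class HashTableEntry:
    """One entry in a chain: a key and its value."""
    def __init__(self, key, value):
        self.key = key
        self.value = value


class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = [None] * capacity  # each slot will hold a list (chain) of entries
        self.count = 0                    # number of items currently stored

    def hash_index(self, key):
        # djb2, wrapped into the table's range (same as earlier in the module)
        hash_value = 5381
        for char in key:
            hash_value = (hash_value * 33) + ord(char)
        return hash_value % self.capacity
```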
Let's implement the put method with collision resolution by chaining. What are the two cases we need to handle?
There are no entries at the index. Great! We can initialize the entry to a list with the new HashTableEntry in it.
There are multiple entries at the index. We need to check every entry in the chain. If the key in one of the entries is equal to the key we are passing in, we need to replace it. For instance, let's say we pass in ("Ryan", 12), and then we later pass in ("Ryan", 15). We would need to replace "Ryan"'s old value with 15. If there are no entries that match, we create a new entry at the end of the chain.
Ok, that might sound confusing. Let's start breaking it down into code.
First, we need to hash the key and start with the first entry at that index.
Next, we need to go through the chain. We need to check two conditions:
The current entry is not empty.
The key of the current entry is not equal to the key we are passing in.
Sweet! Now we need to check what happens when the loop breaks. It would only break for two reasons:
We reached an entry with the same key and need to replace the value.
We reached the end of the chain and need to create a new entry.
Let's write that in code!
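One way to write it, following the two cases and the loop conditions above (each slot holds a Python list of HashTableEntry objects, per the earlier note):

```python
class HashTable:
    # ... __init__ and hash_index as sketched above ...

    def put(self, key, value):
        index = self.hash_index(key)

        # Case 1: no entries at this index yet -- start a new chain
        if self.storage[index] is None:
            self.storage[index] = [HashTableEntry(key, value)]
            return

        chain = self.storage[index]
        i = 0
        # Walk the chain while entries remain and the keys don't match
        while i < len(chain) and chain[i].key != key:
            i += 1

        if i < len(chain):
            # The loop broke on a matching key: replace the old value
            chain[i].value = value
        else:
            # The loop broke at the end of the chain: append a new entry
            chain.append(HashTableEntry(key, value))
```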
Great! We created the put method.
What does runtime look like with linked list chaining?
The performance of hash tables for search, insertion, and deletion is constant time (O(1)) in the average case. However, as the chains get longer and longer, in the worst case, those same operations are done in linear time (O(n)). The more collisions that your hash table has, the less performant the hash table is. To avoid collisions, a proper hash function and a low load factor are crucial. What is a load factor?
The load factor of a hash table is trivial to calculate. You take the number of items stored in the hash table divided by the number of slots.
Hash tables use an array for storage. So, the load factor is the number of occupied slots divided by the length of the array. So, an array of length 10 with three items in it has a load factor of 0.3, and an array of length 20 with twenty items has a load factor of 1. If you use linear probing for collision resolution, then the maximum load factor is 1. If you use chaining for collision resolution, then the load factor can be greater than 1.
As the load factor of your hash table increases, so does the likelihood of a collision, which reduces your hash table's performance. Therefore, you need to monitor the load factor and resize your hash table when the load factor gets too large. The general rule of thumb is to resize your hash table when your load factor is greater than 0.7. Also, when you resize, it is common to double the size of the hash table. When you resize the array, you need to re-insert all of the items into this new hash table. You cannot simply copy the old items into the new hash table. Each item has to be rerun through the hashing function because the hashing function considers the size of the hash table when determining the index that it returns.
You can see that resizing is an expensive operation, so you don’t want to resize too often. However, when we average it out, hash tables are constant time (O(1)) even with resizing.
The load factor can also be too small. If the hash table is too large for the data it is storing, then memory is being wasted. So, in addition to resizing when the load factor gets too high, you should also resize when the load factor gets too low.
One way to know when to resize your hash table is to compute the load factor whenever an item is inserted or deleted into the hash table. If the load factor is too high or too low, then you need to resize.
We added a get_load_factor and a resize method to calculate the load factor and resize the hash table with a new capacity when necessary.
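A sketch of those two methods, continuing the HashTable class from above:

```python
class HashTable:
    # ... __init__, hash_index, and put as sketched above ...

    def get_load_factor(self):
        # Items stored divided by the number of slots
        return self.count / self.capacity

    def resize(self, new_capacity):
        old_storage = self.storage
        self.capacity = new_capacity
        self.storage = [None] * new_capacity
        self.count = 0

        # Every existing entry must be re-hashed, because hash_index
        # depends on the table's capacity
        for chain in old_storage:
            if chain is not None:
                for entry in chain:
                    self.put(entry.key, entry.value)
```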
Let's change our put method to resize when the load factor gets too high. Our current put method is the chained version sketched above.
To know when to resize, we need to correctly increment the count whenever we insert something new into the hash table. Let's go ahead and add that.
Next, we need to check if the load factor is greater than or equal to 0.7. If it is, we need to double our capacity and resize.
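Putting it together, a sketch of put that counts new items and resizes at a load factor of 0.7 or above:

```python
class HashTable:
    # ... other methods as sketched above ...

    def put(self, key, value):
        index = self.hash_index(key)

        if self.storage[index] is None:
            self.storage[index] = [HashTableEntry(key, value)]
            self.count += 1                      # brand-new item
        else:
            chain = self.storage[index]
            i = 0
            while i < len(chain) and chain[i].key != key:
                i += 1
            if i < len(chain):
                chain[i].value = value           # existing key: just overwrite
            else:
                chain.append(HashTableEntry(key, value))
                self.count += 1                  # brand-new item

        # Double the capacity once the table is 70% full
        if self.get_load_factor() >= 0.7:
            self.resize(self.capacity * 2)
```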
Fantastic, we did it!
Do we need to modify our delete and get methods to account for the new get_load_factor and resize methods? Why or why not?
The act or process of returning or running back
The act of defining an object (usually a function) in terms of that object itself
A method of defining a sequence of objects, such as an expression, function, or set, where some number of initial objects are given and each successive object is defined in terms of the preceding objects
A recursive definition is one in which the defined term appears in the definition itself. Self-referential situations often crop up in real life, even if they aren’t immediately recognizable as such. For example, suppose you wanted to describe the set of people that make up your ancestors. You could describe them this way:
Traversal of nested or tree-like data structures is another good example. Because these are nested structures, they readily fit a recursive definition. A non-recursive algorithm to walk through a nested structure is likely to be somewhat clunky, while a recursive solution will be relatively elegant. An example of this appears later in this tutorial.
When you call a function in Python, the interpreter creates a new namespace so that names defined within that function don’t clash with identical names defined elsewhere. One function can call another, and even if they both define objects with the same name, it all works out fine because those objects exist in separate namespaces.
Unfortunately, running function() as it stands produces a result that is less than inspiring, as the following shows:
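The result is a RecursionError traceback, roughly like this:

```python
>>> function()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in function
  File "<stdin>", line 3, in function
  ...
RecursionError: maximum recursion depth exceeded
```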
Python doesn’t allow that to happen. The interpreter limits the maximum number of times a function can call itself recursively, and when it reaches that limit, it raises a RecursionError, as you see above.
The next example involves the mathematical concept of factorial. The factorial of a positive integer n, denoted as n!, is defined as follows:
Factorial so lends itself to recursive definition that programming texts nearly always include it as one of the first examples. You can express the definition of n! recursively like this:
For example, recursive computation of 4! looks like this: 4! = 4 * 3! = 4 * (3 * 2!) = 4 * (3 * (2 * 1!)) = 4 * 3 * 2 * 1 = 24.
A little embellishment of this function with some print() statements would give a clearer idea of the call and return sequence.
Recursion isn’t necessary here. You could implement factorial() iteratively using a loop:
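An iterative sketch:

```python
>>> def factorial(n):
...     result = 1
...     for i in range(2, n + 1):
...         result *= i
...     return result
...
>>> factorial(4)
24
```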
You can also implement factorial using Python’s reduce(), which you can import from the functools module:
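A sketch using functools.reduce():

```python
>>> from functools import reduce
>>> def factorial(n):
...     return reduce(lambda a, b: a * b, range(1, n + 1), 1)
...
>>> factorial(4)
24
```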
To evaluate execution time, you can use a function called timeit() from a module that is also called timeit. This function supports a number of different formats, but you’ll use the following format in this tutorial:
Frankly, if you’re coding in Python, you don’t need to implement a factorial function at all. It’s already available in the standard math module:
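A quick check of the standard-library version:

```python
>>> from math import factorial
>>> factorial(4)
24
```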
Technical note: The fact that math.factorial() is so much speedier probably has nothing to do with whether it’s implemented recursively. More likely it’s because the function is implemented in C rather than Python. For more reading on Python and C, see these resources:
The next example involves visiting each item in a nested list structure. Consider the following Python list:
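Reconstructed from the leaves and sublists listed earlier in this module, the list presumably looks like this:

```python
names = [
    "Adam",
    ["Bob", ["Chet", "Cat"], "Barb", "Bert"],
    "Alex",
    ["Bea", "Bill"],
    "Ann",
]
```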
As the following diagram shows, names contains two sublists. The first of these sublists itself contains another sublist:
Recursion fits this problem very nicely. To solve it, you need to be able to determine whether a given list item is a leaf item or not. For that, you can use the built-in Python function isinstance().
As with the factorial example, adding some print() statements helps to demonstrate the sequence of recursive calls and return values.
For a given positive integer n, determine if it can be represented as a sum of two Fibonacci numbers (possibly equal).