Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Understanding Search Algorithms

Creating a Binary Search in Python Liam Pulsifer 10:27

Transcript
Discussion (1)

00:00 In this lesson, I’ll go through some actual search algorithms. Let’s get started.

00:06 Taking off our expert programmer hats for a moment, if I asked you to search something like a full backpack for an eraser, how would you do it? You might take something out, check to see if it’s the eraser, and then toss it back in if it’s not, and repeat. That’s the exact idea behind random search, which has an implementation like one you’ll find here on the right.

00:28 The idea is simple: just loop, get a random index and the random element at that index, and if the element equals what you’re searching for, return the random index.

00:37 You could also return the element or you could return True to indicate that that element is in fact in the list of items that you’re looking through.

00:45 Let’s check it out in some actual code.

00:49 I’ve got a file here where I’m going to put in the various search algorithms that I’ll demonstrate, and as you can see, it has a basic function and a couple of calls to that function to load the names from names.txt and sorted_names.txt.

01:02 So you can see the different algorithms’ behavior on some really long lists of actual items. So, let’s type this thing out. from random import randint because that’s the only function I’ll need.

01:14 And then I’ll define random_find() with some items and a search_val.

01:21 So, while True: rand_dex = randint(), and randint() is actually inclusive of both of the end points, so I’ll need to say 0 to the len(items) - 1, because these are the valid possible indices.

01:38 Now, rand_el (random element) equals items at rand_dex, and then if rand_el is equal to what you’re searching for in the first place, then return rand_dex.

01:55 Let’s see this thing work in the Python REPL. If you’re following along in a REPL, just remember to import as needed, so something like from search_algorithms import random_find, and this will make the function available to you in your REPL.

02:11 So here I am in the REPL and I’ve defined a familiar list of elements here. Let’s test random_find() with this els and I’ll just search for, let’s say, 4. It seems to run pretty quickly and it does return the index of what you’re looking for, so that’s good. Now let’s try it with names—basic_names and "Fred Astaire".

02:36 Now, this isn’t quite so good, because I’ve hit Enter but it seems to be hanging for a while. And this illustrates one of the really basic problems with this random_find(), which is that random_find() doesn’t actually narrow the search space at all, so it could be testing some of these different indices hundreds or even thousands of times before it gets to the one that it needs to actually find what it’s looking for. So, there it was—and it did find it after a long time—but another problem here is that if I call this even on a short array with an element that’s not actually in it, it will just hang forever.

03:13 So as you can see, random_find() is a really flawed algorithm here because it doesn’t actually allow you any opportunity to determine if something is or isn’t in the list of items unless it finds that element, because there’s no narrowing of the search space.

03:29 And that’s born out by the benchmark data.

03:33 Looking at these data, you can confirm what I already showed you in the demonstration, which is that random search just does not perform very well on large lists.

03:44 It seems to do okay on really short lists, but that’s really just a matter of chance. And even worse than its slow runtimes—which are really, really slow—are two things.

03:53 First, that it’s inconsistent—sometimes it might be fast, while other times it might be really slow. And the worst thing of all is that when something is missing, the only bound on the time it takes to run is whether you actually put a stop condition in there, which is what happens in this benchmark.

04:09 So, random search is really flawed. Let’s move over to linear search, which is a little bit more intuitive and it also fixes some of these disadvantages.

04:20 So, what I’m sure all of you intelligent listeners were thinking about as I talked about random search was that the problem with random search is that you don’t narrow the search space at all. Going back to the backpack analogy, once you’ve taken a ruler out of your backpack and eliminated it from contention because you know it’s not an eraser, you should never have to look at it again.

04:40 This is the basic idea of linear search. You just have to process the elements in your collection in a linear order so that you can’t repeat elements. That way, the worst thing that can happen is if the element isn’t in your list, you look at all of the elements in the list once, whereas with random search, you might look at all of the different elements many different times before you find what you’re looking for.

05:00 So, this code is really simple. Let’s take a look at it in actual Python files.

05:07 Okay, so, defining linear_find() with some items and a search value.

05:14 This code is super simple and it’s also going to give us a huge speed-up in performance. So, for dex, val in enumerate(items):

05:24 what you can just say is if val is equal to the actual search value, return dex, and then if none of those work, you can just return None.

05:34 You could also do -1 or any other value that works for you to actually indicate that the value was not found. Let’s see it work. If I run this on the familiar list of elements, it works just fine—pretty much instantaneous. And if I try it on something a little bit longer, let’s say looking through the basic_names list, it takes a second—it seemed like about a second to me—but it got me this huge index of 10 million.

06:00 So it was much faster than the random search. Now, of course, if you try this with a name that’s earlier in the list, just due to how linear search works, "Fred Astaire" will be found much more quickly than "Baoyin Liu" because the index is so much smaller. Now, one more thing to note is that something like "Baoyin Liu" in basic_names, using the in operator, will run much faster than the Python version because these basic operators like in are implemented in C under the hood, and so they run a lot faster than this pure Python implementation could ever work.

06:36 So, remember that there’s no need to reinvent the wheel for things like linear search, because it’s already implemented in the basic libraries of Python.

06:44 And the same thing goes for basic_names.index("Baoyin Liu")—it’ll find it much faster than this linear_find() function ever could.

06:53 Let’s take a look at some benchmark data. You can see right away from these data how much faster linear search is than a random search by Average Time, especially by Worst Time.

07:05 Something else you’ll notice is that the best average and worst time are much, much closer together than they were for random search, which is great for predictability and for being able to write your code confident in knowing how long certain operations will take. Something else you might notice, though, is that the element index seems to have an effect on how long the search actually takes, and if you plot these on a graph, you’ll see that there is exactly a linear correlation between the element index and the amount of search time.

07:35 This makes perfect sense because you’re iterating through the things you’re searching, so the number of comparisons is correlated directly with the element index.

07:44 This isn’t always great because even though this still runs pretty quickly for 10 million items, when you start to get on the data scale of a social media network or something like that, you can be talking about billions and even trillions of items, and so this linear algorithm will eventually grind to a halt when you add too much data. Now let’s look at binary search, which doesn’t fall victim to that same problem to quite the same degree.

08:10 Binary search lends itself to a bit of a different analogy here. Binary search is kind of like flipping through the pages of a book, because what it allows you to do is if the elements of your collection are sorted, it lets you eliminate about half of the possibilities with each comparison.

08:26 So, flipping through a book, if your current page is earlier than the one you want to be on, you can just discard all of the earlier pages and vice versa.

08:35 So if I were looking through this basic array here for the number 3, the way this would work is I’d inspect the middle element and see whether the middle element is less than, greater than, or equal to what I’m looking for.

08:47 And since it happens to be greater than, I can discard this element and anything that’s greater than it, because by the transitive property, anything greater than this element must also be greater than 3.

08:56 Then I either recurse or simply modify my bounds so that I can consider the three remaining elements. Then I check the middle again—it’s less than 3, I can discard this and anything less than it.

09:09 And then my last array here is actually just the one element array 3, and so I’ve found what I’m looking for. Now, the awesome thing about this is that the number of comparisons is so much smaller than it would be with a linear search. So for example, think about the worst case behavior for a linear search of this array.

09:28 If I’m searching for 7, I’ll need to do seven comparisons before I get to it, but to find 3—and actually to find 7 as well—with binary search, I’ll only do three comparisons.

09:39 So it’s really much easier in pretty much all cases. Looking at the benchmark data, you can see that that’s totally the case.

09:48 So, binary search has, frankly, amazing speed. No matter where in the actual input array that you’re looking, the average time is always on the order of around six or seven microseconds, which is just tiny! I mean, that’s just a unbelievably small fraction of a second.

10:03 And the number of comparisons is pretty much static. No matter what you’re actually looking for—and even if the item is missing—you still do about the same number of comparisons.

10:12 So, binary search uses the fact that the array is sorted to be able to give you just incredible speed. In the next lesson, I’ll get into how you can start to actually use binary search in your code.

Levi on Jan. 29, 2021

Liking the course so far! For anyone else following the video with the provided code, it seems that the dataset has changed since the video and one of the names contains a character that the load_names function wasn’t able to handle, namely Antonio Badú. I added encoding="utf8" to with open and it fixed the problem.

def load_names(path: str) -> List[str]:
    print(f"Loading names from path `{path}`...", end="", flush=True)
    with open(path,encoding='utf8') as text_file:
        names = text_file.read().splitlines()
        print("ok")
        return names

Become a Member to join the conversation.