Sets

Python Coding Interviews: Tips & Best Practices James Uejio 06:40

In this lesson, you’ll learn about sets. Sets are unsorted collections that allows for adding, removing, and checking membership in constant time. They do not allow duplicates. Here’s an example:

Python
      
        
      
    
>>> s = set()
>>> s.add('h')
>>> s.add('h')
>>> s.add('i')
>>> s
{'h', 'i'} # Could appear in any order
>>> set("hello")
{'e', 'h', 'l', 'o'}

If you want to learn more, check out Sets in Python.

You also heard about Big O analysis, which is a way to analyze the speed and memory usage of a function or block of code. See Wikipedia Time complexity for more information.

00:00 Let’s start this section off with a very useful data structure, the set. Sets are an unordered collection that can only contain unique values. Before I show you an example of a set, let’s look at a solution not using a set and see why sets are better.

00:15 Here’s a function count_unique() that takes in a string and counts the number of unique characters in s. So if we call count_unique() with the string "aabb", it’ll return 2, because there are two unique characters, and then if we have "abcdef" it will return 6, because are no duplicates.

00:31 Let’s solve this using a list. We will have seen_characters—or, I guess seen_c is better because it’s a shorter variable name, so you don’t have to type as much—which will be the characters that we’ve already seen in our string, and then at the very end, we’ll take the length of that. So, for c in s: if c not in seen_c: seen_c.append(c).

00:54 Then, return len(seen_c). I will run this using the built-in doctest module. This will look at the functions in the file, look at the docstring, look at the doctests, run them, and compare the output of the code that you wrote to the result, here.

01:09 If they are different, it will fail. And if they are the same, then it will return nothing. So, we run this, and it does nothing, which means our code worked.

01:19 You’ll learn about the doctest module more in-depth in a couple of videos. Let’s take a look at the Big O analysis of our code.

01:26 So, seen_c, instantiating a list, is constant time—I guess, there’s just constant time. for c in s will be O(n) time, where n is the length

01:36 of s. if c not in seen_c—okay. This is where it gets a little tricky. You might be thinking, “Okay, this is constant time,” because seen_c is always going to have a list—and we’re going to look inside that list—but you have to think about as s gets longer, does the length of seen_c get longer? Well, yes. Right here, seen_c only had 1 or 2 elements. Here, it could have up to 6.

02:00 So really, this line of code takes O(n) time, and that’s because as s grows, seen_c grows at some sort of proportion to that length.

02:10 So the final run time will be O(n * n), because you’re doing something that

02:16 takes O(n) time, O(n) times. This will be n * n, which is n squared. You can think of it like this. So, when we execute this for loop, c is the first character in s. We check

02:30 if c not in seen_c. Well, this is going to take zero calls, because seen_c is empty, so it won’t even have to loop through anything in seen_c.

02:40 Then, we go to the second character. Then, you have to loop through seen_c, compare it to the first element, check if they’re equal. Then, when we go to the second character—well, it depends.

02:50 If it’s all just the same character repeated, then you’ll still find it in seen_c, and that will only take one time. But most of the case, you’re not going to just have a string that’s all one character.

03:00 You might have a couple duplicates, then a single character, and more duplicates, or maybe the duplicates will be all spread out. And so seen_c will most likely have two elements.

03:11 Then, when you go to the fourth character, then you have to loop through seen_c, which most likely will have 3 characters already in it. Maybe it could have 2, because maybe you had found a duplicate beforehand, but either way—3 or 2, or something like that.

03:26 And then, when you get to the next character, it will be 4 or 3. And this sequence will go all the way n - 1, because once you get to the very last character, you will have to search through n - 1 values in seen_c.

03:41 So this sequence adds up to something—I’ll use, like, a curly—

03:48 something that is roughly n^2. I think it’s like (n^2 + 1) / 2, or something? But we only care about the largest polynomial, which will always be n^2.

04:00 So, sure. “Best case,” you could tell the interviewer, “Hey, best case, you have all repeated elements, so seen_c will only contain 1 character, so this will always take constant time and you’ll only take n time.

04:12 But for most cases, seen_c will be proportional to the length of s, which is n.” Okay. I know that was sort of a tangent, but it’s really important to see why we need to use sets.

04:23 So, let’s rewrite all this using a set.

04:27 Here, we’ll have seen_c—it will still be a variable that we will create, but it will be a set. So, sets are—again—an unordered collection of unique values.

04:36 So whenever any question talks about unique characters or getting rid of duplicates, your first instinct should be, “I need to use a set.”

04:44 So, let’s loop through s. if c not in seen_c—same code there—seen_c.add(c), and then return len(seen_c). So let’s save it, just make sure it works. It works! So doing the Big O analysis—O(1) time, O(n) time—didn’t change.

05:08 Now, this changed. So, to check membership in a set will always be O(1) time. That is because sets are similar to dictionaries, and so they hash the value and it can check if a value exists in O(1) time.

05:21 .add() also always takes O(1) time. So, the runtime now will always be O(n), and this is no matter what our string is. I’m going to show you one way to clean this code up even further by using set comprehensions.

05:33 Set comprehensions are exactly the same as list comprehensions, but because they’re sets, it will remove duplicates. Let’s comment this out. Let’s just add some lines, so you can sort of see the difference between the two, and then return len() of a list comprehension {c for c in s}.

05:54 Cool. So, this looped through each character in s, and then added c into our set, and then took the len(). So, the runtime would still be O(n), but this is objectively much cleaner than writing five lines of code.

06:06 There’s also one more solution that takes advantage of the fact that the set constructor can take in an iterable. return len(set(s))—so, s is a string, and strings are iterable, and this will convert the iterable to a set, and then take the len().

06:22 The runtime is still O(n), because this is the same as this code, except a little bit cleaner. There’s many more use cases of sets, and I’ll link a Real Python video below that will go into those details. In the next video, you’ll learn about generators, which are a great way to save memory in Python.

James Uejio RP Team on April 27, 2020

If you want to learn more, here is a Real Python walkthrough video on sets: Sets in Python

I also talk about Big-O analysis which is a way to analyze the speed and memory usage of a function or block of code. See Wikipedia Time complexity for more information.

drawdoowmij on May 3, 2020

Great information on Sets James – thanks!

Abhishek on May 10, 2020

Nice lesson James.

I have a question. Wouldn’t seen_c have an upper bound at 26, meaning all characters are already encountered at least once and the runtime would then be 26 * n which is essentially O(n)?

Thanks for your efforts, learning a lot!

James Uejio RP Team on June 27, 2020

Hi @Abhishek you’re absolutely right, I can’t believe I missed that. I’m so used to dealing with numbers that it slipped my mind. If you for example had a list of numbers, then it would be O(n^2) but with only letters, you’re right it would be O(n)

Gino Mempin on July 4, 2020

It might be good to show timeit outputs for simple examples:

In [4]: def count_unique_1(s): 
   ...:     seen_c = [] 
   ...:     for c in s: 
   ...:         if c not in seen_c: 
   ...:             seen_c.append(c) 
   ...:     return len(seen_c) 

In [5]: def count_unique_2(s): 
   ...:     seen_c = set() 
   ...:     for c in s: 
   ...:         if c not in seen_c: 
   ...:             seen_c.add(c) 
   ...:     return len(seen_c) 

In [6]: def count_unique_3(s): 
   ...:     return len({c for c in s}) 

In [7]: %%timeit 
   ...: count_unique_1("abcdef"*1000) 
   ...:  
448 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %%timeit 
   ...: count_unique_2("abcdef"*1000) 
   ...:  
184 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: %%timeit 
   ...: count_unique_3("abcdef"*1000) 
   ...:                                     
139 µs ± 458 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

James Uejio RP Team on July 11, 2020

Hi Gino, thank you for posting that great example! Hopefully people will see it.

nhojman on Oct. 28, 2020

Thanks for a great tutorial James!

A short note, wouldn’t it be an even cleaner approach to write the below instead of the set comprehension?

def count_unique(s):
    return len(set(s))

nhojman on Oct. 28, 2020

Please excuse my previous comment - I didn’t watch the video to the end. My bad!

James Uejio RP Team on Oct. 30, 2020

No problem Nick! Glad you’re thinking ahead. Thanks for kind words :).

avinashk2 on April 11, 2021

Why above timeit comment gave slower time for set than list? I mean count unique1 is faster than count unique2.

James Uejio RP Team on April 11, 2021

@avinashk2 I believe count_unique_1 is 450 µs per loop and count_unique_2 is 184 µs per loop so count_unique_1 is slower.

Become a Member to join the conversation.