Parse HTML Code With Beautiful Soup (Recap)

00:00 Okay folks, welcome to this wrap-up video for Part 3, where you learned about parsing the HTML that you scraped before. You did this parsing using Python and the Beautiful Soup library. Before I go into the details of the different ways of parsing your HTML that we talked about in this part, I want to give you a high-level overview of an important concept related to parsing: parsing is an iterative process. That means you start off by inspecting your page with the different methods that we talked about in Part 1.

00:31 So, you can inspect it by clicking through or using your developer tools, et cetera. Once you understand something about the structure of your website, you head over to your code with that information, type it in, and see if it gets you the information that you need.

00:46 But at any time in that parsing process, you want to be able to switch back over to your browser and use the inspection tools to figure out exactly which pieces you need.

00:57 This is really an iterative process, so you want to keep switching from one to the other and back again. Just keep in mind that both of these steps are important parts of the parsing process, because you need to understand your website in order to fetch the specific pieces of information that you need. Okay.

01:16 Then we started talking about actually parsing the HTML with Beautiful Soup. Before you’re able to do anything with it, you first need to create the Beautiful Soup object.

01:26 You do that by importing the BeautifulSoup class from bs4—that’s the package name of Beautiful Soup—and then passing the HTML content into the object constructor.

01:37 So, I grayed out the things that are related to scraping so that you can focus here on the parts you need in order to create this soup object. It’s a Beautiful Soup object, and soup is just a default name that’s often used when working with Beautiful Soup. Once you have the soup object, you’re ready to do some intuitive parsing using the Beautiful Soup library.
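For reference, here is a minimal sketch of that step. The requests call and the URL are just placeholders for whatever scraping code you used before, and "html.parser" is simply the parser that ships with Python:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder scraping step: fetch the page you want to parse.
    URL = "https://www.indeed.com/jobs?q=python"
    page = requests.get(URL)

    # Pass the scraped HTML content into the BeautifulSoup constructor.
    # "soup" is just the conventional name for the resulting object.
    soup = BeautifulSoup(page.content, "html.parser")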

01:58 The first thing you learned about was finding an element by ID. You did that with the .find() method, passing in the id argument.

02:06 Now, you could pass in anything else here as well. You could pass in the name of an HTML element, for example, but you have to keep in mind that if you use the .find() method, it only returns the first element that matches whatever criteria you put in here.
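As a quick sketch, with a made-up id value standing in for whatever id you found while inspecting your page:

    # Hypothetical id; use the one you found in your browser's developer tools.
    results = soup.find(id="resultsCol")

    # .find() returns only the first matching element, so pairing it with a
    # unique id gives you exactly one element to work forward with.
    print(results.prettify())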

02:20 So id is a very good one to use with .find(), because IDs are unique on a page, so if you search for an element by id, .find() is going to match just that specific element, and then you can work forward with it. Now, if you want to get several elements that match specific search criteria, then you should use something else. We talked about that other method when we looked at finding elements by HTML class name: it’s .find_all(), which returns a list of all of the matching elements.

02:53 Now, if you wanted to get, for example, all of the <div> elements on the page, you would just say .find_all('div')—and close the brackets.

03:01 But obviously, you can specify a bit more, and very often that makes sense if you want to scrape specific information from your page. In the example on indeed.com, you searched for all the <div> elements that have a specific class, which was called 'result'.

03:17 Also keep in mind that you need this little underscore (_) at the end of class_ to avoid any trouble with the reserved Python keyword class.
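In code, that looks roughly like this, using the 'result' class from the indeed.com example:

    # All <div> elements on the page:
    all_divs = soup.find_all("div")

    # Only the <div> elements with the class "result". Note the trailing
    # underscore in class_, which avoids clashing with the class keyword.
    job_cards = soup.find_all("div", class_="result")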

03:28 After finding HTML elements by class name, you next learned how you can actually extract the text from an HTML element using Beautiful Soup. You can do that with the .text attribute on any Beautiful Soup element.

03:42 Remember that the methods we talked about before, .find() and .find_all(), return Beautiful Soup elements. Well, .find_all() returns a list of Beautiful Soup elements, so you have to be careful with that.

03:54 But all of these Beautiful Soup elements have the .text attribute, which gives you the text content of that specific element. And this really works on any Beautiful Soup object, which makes it a very convenient way of accessing the text, which is often exactly what you’re looking for when you’re scraping a website.
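Here is a small sketch of that, continuing from the job_cards list above and assuming each result card contains an <h2> title element, which may differ on the page you’re scraping:

    for job_card in job_cards:
        # .find() returns None when nothing matches, so guard against that.
        title_element = job_card.find("h2")
        if title_element is not None:
            # .text gives you the element's text content as a plain string;
            # .strip() just removes the surrounding whitespace.
            print(title_element.text.strip())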

04:11 Now, there are also other pieces of information that you might want. For example, some information is hidden inside of HTML attributes, so next you learned how you can extract this information from an HTML attribute. You can do that using the square-bracket notation.

04:28 We did it, for example, on a title_link element—that was the <h2> element that contained the link. Inside of that link, you had the href attribute, which is where the URL is contained inside of an HTML link element.

04:43 This is the most common HTML attribute that you’re going to scrape, because you’re often looking for a URL, for example when you want to navigate onward to the new URLs that you’re fetching from a page.

04:54 But it could be any attribute that the HTML element you’re calling this on actually has. If you’re searching for an HTML attribute that doesn’t exist on that element, then you’ll get a KeyError.

05:06 So, keep in mind that this works for any HTML attribute, as long as the element actually has this attribute.
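As a sketch, continuing with the job_cards from above and assuming the first card contains an <a> element whose href attribute holds the URL you’re after:

    # Hypothetical structure: the first <a> inside the first result card.
    title_link = job_cards[0].find("a")

    # Square-bracket notation gives you the value of an HTML attribute.
    url = title_link["href"]
    print(url)

    # Asking for an attribute the element doesn't have raises a KeyError:
    try:
        title_link["data-does-not-exist"]
    except KeyError:
        print("This element has no such attribute.")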

05:13 And then, a final thing I want to imprint on you here in this recap video is that every page is special, which means that you have to work with each page individually and understand its particular structure in order to scrape it successfully.

05:27 You need to know where the information you’re interested in is located and how you can navigate the HTML structure of that page to target specifically the information that you want.

05:38 And this is really a thing that you need to keep in mind. You can’t just write one web scraper to scrape any page that you’re interested in, because every page is special.

05:48 They’re just these little special snowflakes, and you need to specifically get to know them so that you can then extract the information that you’re interested in.

05:57 So, keep that in mind when working on web scraping, and that about wraps up Part 3 on parsing. In the next part, I want to show you a couple of ways that you can take this project further, along with some ideas you can follow up on to make this project your own and make it relevant for yourself. See you in the next part!
