Extract Text From HTML Elements

Web Scraping With Beautiful Soup and Python Martin Breuss 04:37

00:00 In this lesson, you want to dig deeper into the HTML that you got returned from the previous lessons and extract just a specific piece of text from it.

00:11 Again, let’s start off by exploring a bit. Now we have access to one of these cards, and now let’s see if I can find the title. The title seems to be nested inside of an <h2> element, so a second-level heading, and then there’s a link in here and it seems like the link has some content—

00:32 there it is—which is the actual text that makes up the heading. Okay. So inside of the card, there is a <h2> element. Inside of the <h2> element, there’s a link element, and the link element contains the text.

00:51 With this understanding, I’m heading back to the code and let’s just go for the first one, the one we inspected before. Remember the jobs from up here.

01:00 Because we used .find_all(), jobs is a list. So if I want to access one single Beautiful Soup element, I can access it via the index on that list.

01:09 I could also save that to a variable, but we’re just exploring here, so I’m saying, “Give me the first Beautiful Soup object that got returned from before, and in there, find an <h2> element.” Okay!
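A minimal sketch of this step, assuming a made-up page. The HTML, the `card` class name, and the job titles below are hypothetical stand-ins, not the markup of the real site used in the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page.
html = """
<div class="card">
  <h2 class="title">
    <a href="/jobs/1" target="_blank">
      Data Engineer
    </a>
  </h2>
  <p>Example Corp</p>
</div>
<div class="card">
  <h2 class="title">
    <a href="/jobs/2" target="_blank">
      Python Developer
    </a>
  </h2>
  <p>Another Corp</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .find_all() returns a list-like collection of matching elements.
jobs = soup.find_all("div", class_="card")

# Index into the results to get one element, then search only inside it.
first_job = jobs[0]
title = first_job.find("h2")
print(title)
```

The printed result is the whole `<h2>` element, including the nested link and its attributes.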

01:21 So this slims it down quite a bit, but as we saw before, <h2> still contains a link and a bunch of other attributes on that link. So this is still not quite what we’re looking for, but because the thing that gets returned from a call like that is always another Beautiful Soup element, I can just keep calling .find() and it’s going to dig deeper. So now on the title, I’m going to call .find() for the link element. I’m going to print it out.
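Chaining .find() calls like this might look as follows. This is only a sketch: the HTML snippet and the `card` class name are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single job card.
html = """
<div class="card">
  <h2 class="title">
    <a href="/jobs/1" target="_blank">
      Data Engineer
    </a>
  </h2>
  <p>Example Corp</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card")

# Each .find() returns another Beautiful Soup element,
# so the next .find() searches inside the previous result.
title = card.find("h2")
link = title.find("a")
print(link)
```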

01:52 You see it cuts off these parts and anything that happens after the link, and returns to me only the link element here…

02:01 which obviously is still way too much. But now comes the helpful attribute on every Beautiful Soup object which is just .text, which gives you the content—so anything that’s in between the tags.

02:16 So, it cuts off all of these attributes in here—and you’re going to learn later how to specifically pick something out of the attributes, if that’s the information you want.

02:25 But very often all you want is the text, so if you run .text on an element, you get the text! And this looks already much more similar to the title that we’re looking for, and you can clean it up a bit with just a normal Python string method here.

02:40 I’m calling .strip() on it, which takes off the newline character here. And I think that’s all, yeah. But if there were something at the end, it would also take that off. And here we are!
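Here is a small sketch of .text followed by .strip(), using an invented link element. Called without arguments, Python's str.strip() removes all leading and trailing whitespace, which covers the newlines and the indentation around the title.

```python
from bs4 import BeautifulSoup

# Hypothetical link element with whitespace around its text.
html = """<a href="/jobs/1" target="_blank">
  Data Engineer
</a>"""

link = BeautifulSoup(html, "html.parser").find("a")

raw = link.text      # everything between the opening and closing tags
clean = raw.strip()  # leading and trailing whitespace removed

print(repr(raw))
print(repr(clean))   # 'Data Engineer'
```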

02:50 We got the string, the actual title of the job posting. Let’s see which one it is: 'Data Engineer'.

03:00 So, by just searching for it, I can say “data”—Engineer Summer Internship, and there it is. So, that’s the element that we’re currently looking at…

03:10 and we’re correctly getting its title.

03:13 So, what I did afterwards is just write a list comprehension for doing all of these steps. So, first finding <h2>, finding the link, and then getting the .text of it, and then also cleaning it up for each of the jobs inside of the job list.

03:30 And like this, I could run the thing to get all of the job titles from that specific page. Run this, and here’s the output. You can see, these are all of the jobs that are currently listed on this one search result page. Nice!
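The list comprehension described here could be sketched like this. The HTML, the `card` class name, and the job titles are hypothetical, and the sketch assumes every card contains an <h2> with a link inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page.
html = """
<div class="card">
  <h2 class="title">
    <a href="/jobs/1">
      Data Engineer
    </a>
  </h2>
</div>
<div class="card">
  <h2 class="title">
    <a href="/jobs/2">
      Python Developer
    </a>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
jobs = soup.find_all("div", class_="card")

# For each card: find the <h2>, find the link in it, take its text, clean it up.
titles = [job.find("h2").find("a").text.strip() for job in jobs]
print(titles)  # ['Data Engineer', 'Python Developer']
```

If a card had no <h2> or no link, .find() would return None and the chained call would raise an AttributeError, so this form fits only when the structure is consistent across cards.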

03:48 So, the take away from this one is that, first of all, you can keep drilling down because Beautiful Soup keeps returning Beautiful Soup objects. So, all of those methods that you’re going to learn that work on one of them are going to work on the next-down level of Beautiful Soup as well.

04:04 So, that’s very helpful, and you can search for an HTML element by just passing in the type, which is similar to what you did up here by passing in the type, but here we specified it a bit more—that it’s only the ones of that type with a specific class. Here, I want to find all of the <h2> elements in there because there is only one inside of each of those cards.

04:25 And then I just keep drilling down. Next, I want to only get that link element, and then I want to call this useful attribute .text on it that just gives me the text output.

jcool78758 on Dec. 15, 2020

I am coding along with the lecture. When I pull code from the Indeed website with the proper URL using the git.url command, it pulls a posting that is not on the page. Any idea what this could be?

Martin Breuss RP Team on Dec. 16, 2020

Hi @jcool78758, I don’t really know what you mean by:

using the git.url

It’s hard to say what you are seeing just from what you wrote. You could post your code and the results you are getting that are surprising, then I might be able to help out.

My suggestion is to try inspecting the Indeed website and searching for some text that is in the posting you didn’t expect. See if you can find it in the original HTML. It should be there somewhere :)

jcool78758 on Dec. 16, 2020

My educated guess is that there is something in the Indeed website, since the code is working properly on the Monster site. Is there a way I could attach files of the screenshot and the .py files from the Jupyter notebook to these comments?

Martin Breuss RP Team on Dec. 22, 2020

Hi again, you can’t attach images here, but you can upload them to a hosting service, e.g. Google Drive, and then add the link in this comment.

You could also add the link to your GitHub repo where you’re keeping the code, or add snippets of the code in here. If you do, remember to format them as code!

Did you follow my suggestion and search the content of the Indeed website to see whether the surprising content shows up, and where? What was the result?

Also keep in mind that the exact code that works for the Monster site will definitely not work for the Indeed site. They are built with different HTML, so each scraper needs to work for the specific site structure.
