Parse HTML Code With Beautiful Soup (Introduction)

Web Scraping With Beautiful Soup and Python Martin Breuss 01:52

00:00 So far, all you’ve done is get the information from the web onto your computer and into your script. Now, with parsing in Part 3, you’re getting to the actual meat of web scraping, where you decide which information you want to keep and pick out of that big, big, big soup of text that you received back from a website.

00:22 So, in this part of the course, you’re going to look at a couple of different ways of sifting through this information and getting out the bits and pieces that are interesting to you. First, you’re going to look at finding elements by ID, and by ID here, I mean the id attribute of an HTML element.
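
As a rough preview of finding an element by its id, the sketch below parses a small hand-written HTML snippet with Beautiful Soup. The markup and the id value `ResultsContainer` are made up for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page you scraped earlier.
html = """
<html><body>
  <div id="ResultsContainer">
    <p>Job listings go here</p>
  </div>
  <div id="footer">Footer text</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# An id is meant to be unique on a page, so find() is a natural fit:
# it returns the first matching element, or None if nothing matches.
results = soup.find(id="ResultsContainer")
missing = soup.find(id="does-not-exist")

print(results.name)
print(missing)
```

Checking for `None` before working with the result is a sensible habit, since a page's structure can change without notice.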

00:41 Then, you’re also going to look at finding elements by their HTML class name, which can match multiple elements. For example, if you think of the Indeed website, there are going to be cards that have a certain class associated with them.

00:53 So, you’re going to learn how to retrieve the information for all of those. Then, you want to be able to extract text from the HTML elements, because the elements themselves are still a bunch of code that isn’t actually what you’re interested in. That’s just the structure of the website.
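
To illustrate collecting every element that shares a class, here is a minimal sketch on invented markup. The class names `card`, `featured`, and `sidebar` are assumptions chosen for the example and do not reflect how Indeed actually structures its pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two job cards and one unrelated element.
html = """
<div id="results">
  <div class="card featured"><h2 class="title">Python Developer</h2></div>
  <div class="card"><h2 class="title">Data Engineer</h2></div>
  <div class="sidebar">Not a job card</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so Beautiful Soup uses class_.
# find_all() returns a list-like collection of every match, and an
# element with several classes matches if any one of them is "card".
cards = soup.find_all("div", class_="card")

print(len(cards))
for card in cards:
    print(card["class"])
```

Unlike `find()`, which stops at the first match, `find_all()` gives you all of the cards to loop over.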

01:09 So you want to be able to pick out the text, for example, the title of the job posting. You will also learn how to extract attributes from HTML elements. For example, that could be the link that leads you to another page where you can actually apply for the job, which is not part of the content of an HTML element but lives inside one of its attributes.
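
The difference between an element's text and its attributes can be sketched as follows. The job title, the class names, and the `example.com` URL are placeholders invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical job card with a title and an application link.
html = """
<div class="card">
  <h2 class="title">
    Python Developer
  </h2>
  <a class="apply" href="https://example.com/jobs/1/apply">Apply</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card")

# The text lives between the opening and closing tags.
# strip=True removes the surrounding whitespace and line breaks.
title = card.find("h2", class_="title").get_text(strip=True)

# The URL is not part of the text; it lives in the href attribute,
# which you can read with dictionary-style access.
link_element = card.find("a")
link = link_element["href"]

# .get() returns None instead of raising KeyError for a missing attribute.
target = link_element.get("target")

print(title)
print(link)
print(target)
```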

01:32 After going through these, you’re going to have a good overview and a set of tools that you can use to extract the information that’s relevant to you from the website content that you scraped in Part 2.

01:45 So, let’s get started and, in the next lesson, find the main element we’re interested in.

samsku1986 on Nov. 2, 2020

I want to get information from request_url but session.get is getting content from login_url.

What do I have to do to make it work?

from requests_html import HTMLSession 
import urllib.parse

login_url = 'http://l42-harmony-01.video54.local/login'
request_url = 'http://l42-harmony-01.video54.local/release_planners/969/results'
login_payload = {"username": "ssugatha", "password": "Elamsheri1!"}

session = HTMLSession() 
p = session.post(login_url, data=login_payload)

>>> print(p.html)
<HTML url='http://l42-harmony-01.video54.local/login'>
>>> print(p.history)
[]
>>> r = session.get(request_url, data=login_payload) 
>>> print(r.html)
<HTML url='http://l42-harmony-01.video54.local/login'>  # What is going on here?
>>> print(r.history)
[<Response [302]>]

Ricky White RP Team on Nov. 4, 2020

Hi @samsku1986. I think you may have accidentally posted this question on the wrong course.

If you have a question that is not about the course lesson itself, then I would recommend you post it in the Community Slack for a higher chance of getting the help you need. Good luck.
