Setting Up Your Work Environment

00:00 To get started on your pandas data cleaning adventure, you first need to set up your work environment. That means creating a folder to work from, setting up a virtual environment within that folder, and installing pandas and Jupyter into that virtual environment.

00:15 While this course is about pandas, Jupyter is a great tool for working with pandas and exploring your data interactively. You’ll be setting up VS Code to work with all this.

00:26 And then finally, you’ll be downloading the datasets. This video goes through these steps quite quickly, one at a time, but there are links below if you want to delve deeper into any of these topics.

00:39 This is a PowerShell prompt on a Windows 10 machine. The first thing you want to do is to navigate to a folder like you see here: rp/ for Real Python, but this can be any folder that you want.

00:50 It could be your user folder, or any folder in which you’ll create another folder to hold all the materials for this course. Once you’re in this folder, you can make a directory,

01:05 and then you can go into the directory. Once you’re there, you are ready to create your virtual environment, but first you want to check your Python version.

01:19 Note that some of the examples in this course will only work with Python 3.9 and above. It doesn’t have to be 3.10.3. It just needs to be 3.9 and above.
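The version requirement above can also be checked from inside Python. This is a minimal sketch, not part of the original video: the helper name meets_minimum is hypothetical, and the only fact taken from the transcript is the 3.9 minimum.

```python
import sys

# Minimum version mentioned in the transcript; older interpreters may fail
# on some of the course examples.
MIN_VERSION = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info is at least minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
print(sys.version)
print("New enough for this course:", meets_minimum())
```

Comparing only the major and minor numbers keeps the check in line with the transcript’s point that the exact patch release does not matter.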

01:34 With your virtual environment created, you can activate it. On Windows, this is done by running the Activate.ps1 script in \venv\Scripts\. On Linux or macOS, this will be source ./venv/bin/activate. In any case, make sure that the virtual environment is activated.
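One way to confirm that the activation worked is to ask the interpreter itself. The sketch below is an illustration rather than something shown in the video: in_virtual_environment is a hypothetical helper name, and it relies on the standard behavior that sys.prefix differs from sys.base_prefix inside an environment created with the venv module.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base install prefix.

    Both arguments default to the values of the running interpreter; they are
    parameters only so the logic can be tried with example paths.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


# Show which environment the current interpreter belongs to.
print("Interpreter prefix:", sys.prefix)
print("Running inside a virtual environment:", in_virtual_environment())
```

Running this with the python command of an activated environment should report True, and the printed prefix should point into the venv folder.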

02:01 With your virtual environment created and activated, now you can lay down some of the basic structure for what’s going to be covered in the course.

02:23 Okay, so that’s created a folder called data-sets/, where you’re going to be downloading the datasets. And it’s also created three files: books.py, olympics.py, and towns.py, where you’ll write the data cleaning script for each of the three datasets. Before starting up VS Code, you need to install pandas and Jupyter into your virtual environment.
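The transcript does not record the commands used to create this structure, so the following is only a sketch of an equivalent result using Python’s pathlib. The folder and file names come from the transcript; the function name create_course_layout is hypothetical.

```python
from pathlib import Path

# Names taken from the transcript.
DATA_FOLDER = "data-sets"
SCRIPT_NAMES = ("books.py", "olympics.py", "towns.py")


def create_course_layout(root="."):
    """Create the data-sets/ folder and the three empty script files under root."""
    root = Path(root)
    # exist_ok=True makes the function safe to run more than once.
    (root / DATA_FOLDER).mkdir(parents=True, exist_ok=True)
    for name in SCRIPT_NAMES:
        (root / name).touch(exist_ok=True)
    return root


if __name__ == "__main__":
    # Build the layout in the current working directory.
    create_course_layout()
```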

03:03 That’s a lot of packages. The install process might look slightly different and might take a while on your machine. But with that all successfully installed, now you can move on and start up VS Code.
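To confirm that the install landed in the active environment, you could query the installed package metadata. This is a hedged sketch: installed_versions is a hypothetical helper, and the distribution names "pandas" and "jupyter" are assumptions about what was installed, since the transcript does not show the exact install command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("pandas", "jupyter")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Print a short report for the current environment.
for name, installed in installed_versions().items():
    print(f"{name}: {installed or 'not installed'}")
```

A value of None for either name suggests the packages went into a different environment than the one currently active.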

03:20 On the left, you’ll see the empty folder and the empty files that you created. You’ll see the virtual environment. Ignore this .vscode/ folder.

03:30 These are just some settings to make it easier to record this video.

03:35 The first extension that you want to install is this official Microsoft Python extension. When you install this extension, it will come included with another extension, Pylance.

03:47 The other extension that you’ll want to install is the Jupyter extension, again by Microsoft. This will allow you to render Jupyter notebooks in VS Code. Note the version, and note that in this video, the pre-release version is being used because of some issues encountered on Macs with M1 processors. Switching to the pre-release version fixed those issues.

04:12 This extension comes bundled with the Jupyter Keymap and Jupyter Notebook Renderers. You should only need to install the main Jupyter extension, and everything else will come included. To get the datasets, navigate to the realpython/python-data-cleaning GitHub repository.

04:34 Once there, you’ll see that there is a folder called Datasets/, and in there, you can click on each of them with the middle mouse button, which will open three new tabs.

04:47 And once in one of the tabs, you can press the Download button here to download them individually. You can also clone the whole repository and move the datasets over to your local project.

04:59 Once you have downloaded the datasets, place them into the data-sets/ folder. Here on the left-hand side, you’ll see the names of the files. Make sure your files have the same names as you can see on the left here.
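The file names themselves are only visible on screen in the video, so they are not repeated here. As a small sketch for comparing your folder against what the video shows, the hypothetical helper below simply lists whatever files are in data-sets/.

```python
from pathlib import Path


def list_datasets(folder="data-sets"):
    """Return the sorted names of the files directly inside folder."""
    path = Path(folder)
    if not path.is_dir():
        # Nothing to list if the folder has not been created yet.
        return []
    return sorted(item.name for item in path.iterdir() if item.is_file())


# Print the file names so they can be compared with the ones in the video.
for name in list_datasets():
    print(name)
```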

05:12 If you haven’t used Jupyter notebooks in VS Code before, here’s a quick walkthrough of how that works. The reason you might want to use VS Code over Jupyter notebooks is that you can write your cleanup script and interact with your data in a very similar way to Jupyter notebooks all in one place, so you don’t have to have two windows open.

05:30 You can just split the screen within VS Code. Then you can also run portions of your cleanup script interactively and then interact with it like a notebook. Open up any of the .py files, write a comment, and in the comment, put two percentage signs (%%).

05:47 If you’ve installed the extension correctly, VS Code should be able to detect this, and it will give you some options to Run Cell | Run Below | Debug Cell. Now, within this cell, you can write a Python statement.

06:05 You can create more cells with the comment and percentage signs

06:13 and write different code there. Try clicking on the Run Cell button of one of these cells.

06:21 That should open up a window to the side, which is your interactive Jupyter kernel. It will try to connect to the Jupyter kernel, and then it will execute that cell.

06:33 Likewise, you can run the cell below that, and it will run everything in this cell. The keyboard shortcut for running a cell is Shift + Enter or Control + Enter. They both work.
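As an illustration of the cell markers described above, a .py file with two cells might look like the sketch below. The variable names and strings are made up for this example and do not come from the video.

```python
# %%
# First cell: a plain Python statement.
greeting = "Hello from the first cell"
print(greeting)

# %%
# Second cell: it can use names defined by cells that were run earlier.
shout = greeting.upper()
print(shout)
```

Because the markers are ordinary comments, the file still runs as a normal Python script from top to bottom when it is executed outside VS Code.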

06:45 So this allows you to write fleeting code in the bottom right-hand corner of the screen, which you can also run with Control + Enter or the play button, and which lets you interact with the code that’s already been written.

06:58 If you’re having trouble getting your interactive window to connect to your virtual environment, you can try clicking the button on the top right that says venv (Python 3.10), or it might say something else, depending on what your Python version is.

07:12 This will bring up the option to change the kernel for the interactive window. Make sure that your virtual environment is selected here and that you have installed pandas and Jupyter in the virtual environment.

07:22 There have been issues on Macs with M1 processors with selecting the right kernel, but updating the Jupyter extension and moving it to the pre-release version has fixed this in our tests.

07:33 Another solution on macOS is installing Conda. In our tests, using Conda seemed to work around this issue.

07:41 That was setting up your work environment. You created your folder and virtual environment, you installed pandas and Jupyter, you set up VS Code with its extensions, and you downloaded the datasets.

07:51 You’re now ready to get to some cleaning. In the next lesson, you’ll be exploring the first dataset, the Olympic data.

James on May 31, 2022

Hi, if I haven’t got the latest version of Python on my Windows laptop :/ is it possible to create a new environment with the latest version, or do I have to download and reinstall Python with the latest version? Would this cause any problems for other packages? Thanks

Bartosz Zaczyński RP Team on June 1, 2022

@James The venv module that allows for managing virtual environments is part of the Python distribution, so in order to use a more recent version of Python, you’d have to upgrade it first. That said, there’s a neat tool called pyenv for managing multiple Python versions using a single command. Unfortunately, it’s not available on Windows, but you can check out a similar project called pyenv-win, which might do the trick for you.