Understanding Text and Binary Files
00:00 In this lesson, you’re going to understand what it means to open a file in text or binary mode in Python. Broadly speaking, files on our computer can contain either human-readable text or binary data designed for machines, even when they both represent the same piece of information.
00:19 Some examples of text files might include your Python source files, HTML files, or CSV data files exported from a spreadsheet program. To give you an idea of binary files, think of audio and video data, images, or executable machine code, none of which are text. These can be sound waves, pixels, or instructions for a computer processor. By the way, don’t confuse plain text files with rich-text format documents such as Microsoft Word, LibreOffice Writer, or Google Docs.
01:03 Those elements don’t usually have a meaningful representation in text, as they take the form of numbers meant to be read by a computer program that knows how to display them. So even though what you’re looking at consists primarily of text, it is not considered a plain text file.
01:22 Here, over on the left, you have a sample text file stored in the comma-separated value format. It contains some personal expenses. When you import that file into the office software of your choice and save it as a spreadsheet, then you’ll end up with a binary file whose content under the surface might look similar to the one on the right.
01:42 These are numbers without any meaningful textual representation. When you try to open such a binary file in a text editor, then a few things can happen. First, your editor might recognize it’s dealing with a binary file.
01:56 It’ll just refuse to open it. Alternatively, it may try to map each number into a character, which will almost certainly result in a bunch of gibberish that doesn’t make any sense. Finally, your editor can display the values of the individual bytes, for example, using hexadecimal digits like here on the slide.
02:26 It’s only a matter of how you and your software decide to interpret these numbers, which to some extent is arbitrary. However, this also means that you can get things wrong in binary mode unless you know the underlying file structure.
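To make the point about interpretation concrete, here is a minimal sketch that looks at the same three bytes in three ways. The byte values are made up for illustration, and decoding them as ASCII is an assumption about what they mean, not something the bytes themselves tell you.

```python
# Three arbitrary byte values, chosen only for illustration.
data = bytes([72, 105, 33])

as_numbers = list(data)          # the raw byte values as integers
as_hex = data.hex(" ")           # hexadecimal digits, as a hex viewer might show them
as_text = data.decode("ascii")   # the same bytes interpreted as ASCII characters

print(as_numbers)  # [72, 105, 33]
print(as_hex)      # 48 69 21
print(as_text)     # Hi!
```

None of these views is more correct than the others. Which one is useful depends on what the file is supposed to contain.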
02:40 Many commercial programs deliberately use proprietary binary file formats without disclosing their internal structure to lock you into a particular product. As a result, it becomes difficult, if not impossible, to open your files using unofficial software unless someone successfully reverse-engineers the file format at hand. If you zoom in on the word cash, for example, in the text file on the left, then you won’t see any numbers just yet.
03:43 ASCII stands for American Standard Code for Information Interchange, and it’s by far the most common character encoding system used for English text documents. It’s also one of the oldest, although it’s not the only character encoding in use today.
Note that you can use Python’s ord() and chr() built-in functions to double-check if this number-character relationship holds. ord() returns the character’s ordinal value, while chr() returns the corresponding character.
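As a quick check of that relationship, the following sketch converts the word cash from the expenses example into its ordinal values and back again. The variable names are only illustrative.

```python
word = "cash"

# ord() maps each character to its ordinal value.
codes = [ord(character) for character in word]
print(codes)  # [99, 97, 115, 104]

# chr() maps each ordinal value back to its character.
restored = "".join(chr(code) for code in codes)
print(restored)  # cash
```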
When you open a file in Python, either with the built-in open() function or the Path.open() method, you have the choice of specifying whether you want Python to treat the file as a stream of human-readable characters or generic bytes. In other words, you can read the same file using either text or binary mode in Python.
04:37 In the text mode, Python will automatically take care of translating the sequences of bytes into meaningful characters wrapped in Python string objects, and it will let you read the text line by line, which, although possible, doesn’t make much sense in binary mode.
04:53 On the other hand, binary mode lets you read the raw bytes as integers from the file without any translation. This can be convenient if you want to manipulate the bytes directly, for example, when processing an image. Now, how do you specify which mode to open the file in Python?
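Here is a small sketch of the difference, assuming a throwaway file created in a temporary folder. The file name and its contents are hypothetical stand-ins for the expenses example, and the ASCII encoding is passed explicitly only to keep the result predictable.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "expenses.csv"  # hypothetical file name
    path.write_bytes(b"item,amount\ncoffee,3.50\n")

    # Text mode: bytes are decoded into strings, one per line.
    with path.open(mode="r", encoding="ascii") as file:
        lines = file.readlines()

    # Binary mode: you get the raw bytes without any translation.
    with path.open(mode="rb") as file:
        raw = file.read()

print(lines)      # ['item,amount\n', 'coffee,3.50\n']
print(type(raw))  # <class 'bytes'>
print(raw[0])     # 105, the integer value of the first byte ("i")
```

Indexing a bytes object returns an integer, which is what makes binary mode handy for working with individual byte values.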
By default, if you don’t pass any arguments to .open(), Python will open the file in text mode for reading. You can verify the file mode by inspecting the returned file object’s .mode attribute, and you can find out if it’s readable or writable by calling the corresponding methods. When you execute this code, you’ll see the letter r, which stands for reading, appear in the output.
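The code shown in the video isn’t part of this transcript, so the following is only a sketch of what such a check might look like. It assumes a temporary file with a hypothetical name, created just so there’s something to open.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "notes.txt"  # hypothetical file name
    path.write_text("hello\n", encoding="utf-8")

    # No arguments: Python falls back to text mode for reading.
    with path.open() as file:
        mode = file.mode
        can_read = file.readable()
        can_write = file.writable()

print(mode)       # r
print(can_read)   # True
print(can_write)  # False
```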
It is the default value for the mode argument, which you can set explicitly when calling the .open() method or function. When the mode argument doesn’t say otherwise, the file will be opened in text mode. Although text mode is assumed implicitly, you can include the extra letter code "t" to indicate the text mode more explicitly if you really want to. However, because the letter "t" is implied, you can leave it out and almost never use it again in practice.
06:12 Note that you can’t have both text and binary modes set at the same time because they’re mutually exclusive. You’ll learn about a few other letter codes for the available file modes in Python and when to use them in an upcoming lesson. Also, from now on, you’ll only be considering text files in this course, so you won’t have to worry about the binary mode anymore.
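To see the letter codes in action, here is a minimal sketch that opens a hypothetical temporary file with "rt", then with "rb", and finally with both "t" and "b" at once. It relies on the built-in open() raising a ValueError for the conflicting combination.

```python
import io
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "notes.txt"  # hypothetical file name
    path.write_text("hello\n", encoding="utf-8")

    # Explicit text mode: equivalent to plain "r".
    with open(path, mode="rt", encoding="utf-8") as file:
        is_text = isinstance(file, io.TextIOBase)

    # Binary mode: the file object is not a text stream.
    with open(path, mode="rb") as file:
        is_binary = not isinstance(file, io.TextIOBase)

    # Text and binary together: Python rejects the combination.
    try:
        open(path, mode="rtb")
    except ValueError as error:
        message = str(error)
    else:
        message = None

print(is_text, is_binary)  # True True
print(message)
```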
06:46 These default values can be different for different people depending on their operating system. Specifically, the two parameters that can cause problems are the file’s character encoding and line ending. Python will make a best guess when you don’t specify them, but it’s generally recommended to set them by hand.
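As a sketch of setting both parameters by hand, the example below writes and reads a hypothetical temporary file with an explicit UTF-8 encoding and explicit newline handling. The choice of UTF-8 and of "\n" line endings is an assumption made for illustration, not a recommendation from this lesson.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "menu.txt"  # hypothetical file name

    # newline="\n" writes line endings as-is, regardless of the platform.
    with open(path, mode="w", encoding="utf-8", newline="\n") as file:
        file.write("caf\u00e9\n")  # "café" followed by a line break

    # newline="" reads line endings back without translating them.
    with open(path, mode="r", encoding="utf-8", newline="") as file:
        text = file.read()

    raw = path.read_bytes()

print(len(text))  # 5 characters
print(len(raw))   # 6 bytes, because "é" takes two bytes in UTF-8
print(raw)        # b'caf\xc3\xa9\n'
```

Spelling out the encoding and the line ending means the same bytes end up on disk no matter whose computer runs the code.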