Using mmap

Python mmap: Doing File I/O With Memory Mapping Christopher Trudeau 11:41

00:00 In the previous lesson, I explained the vast difference in speed between the various parts of your computer, gave an overview of different kinds of memory, and briefly touched on file I/O. In this lesson, I’ll show you how all those things come together when you use the mmap module to map file contents into memory blocks.

00:20 If you were to write a function that did edits to a file, that function would likely read the file into a Python object (a string, a byte array, or something similar), make changes to the object, then serialize that object back onto the disk.

00:35 It might be into the same file, overwriting its contents. It might be to a new file or to a temp file that then gets renamed. But the results would be similar: new bytes on disk.
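The read, change, and write-back pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from the lesson: the helper name `replace_in_file` and the sample file are made up for the example, and UTF-8 encoding is assumed.

```python
import tempfile
from pathlib import Path


def replace_in_file(path, old, new):
    """Hypothetical helper: read a file, edit the copy in memory, write it back."""
    # Step 1: read the whole file into a Python string.
    with open(path, mode="r", encoding="utf-8") as file_obj:
        text = file_obj.read()

    # Step 2: change the in-memory object.
    text = text.replace(old, new)

    # Step 3: serialize the object back to disk, overwriting the file.
    with open(path, mode="w", encoding="utf-8") as file_obj:
        file_obj.write(text)


# Stand-in file so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("spam and eggs\n", encoding="utf-8")
    replace_in_file(demo_path, "spam", "toast")
    edited = demo_path.read_text(encoding="utf-8")

print(edited)
```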

00:46 Instead, the mmap module provides an alternative way. It reads the file into a block of memory, which is abstracted by an mmap object, then operates directly on that object, meaning both the memory representation and the disk representation change.
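As a rough sketch of that idea, the snippet below maps a small stand-in file with write access and changes four bytes in place. The file name and contents are assumptions for illustration, and write access is covered properly later in the course. Note the restriction: a slice assignment on an mmap object has to keep the length the same.

```python
import mmap
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_bytes(b"spam and eggs\n")

    # The file must be opened for reading and writing to map it writable.
    with open(demo_path, mode="r+b") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_WRITE) as mmap_obj:
            # Replace four bytes with four bytes; the map cannot grow or shrink here.
            mmap_obj[0:4] = b"SPAM"
            # Ask for the change to be written through to the file.
            mmap_obj.flush()

    # Read the file back the ordinary way to see the change on disk.
    on_disk = demo_path.read_bytes()

print(on_disk)
```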

01:05 This is both kind of simpler and kind of more complicated. You’ve got fewer steps happening, so you might get a performance gain, but you’re a little more restricted in what kinds of things you can do.

01:15 Let’s go play with mmap in the REPL, and I’ll show you what I mean. In the top window here, I have a simple function that reads a file and reports how many characters were read.

01:27 All of this is being done inside of a context manager—that’s the with statement—so that the file automatically will be closed upon exiting the context block. Now, into the REPL in the bottom window.
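The function on screen isn't reproduced in the transcript, so here is a minimal sketch of what such a reader might look like. It assumes UTF-8 text and returns the count as well as printing it; the small sample file is a stand-in for the much larger quixote file.

```python
import tempfile
from pathlib import Path


def file_read(filename):
    """Read a whole text file and report how many characters it holds."""
    # The with statement closes the file when the block exits.
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        text = file_obj.read()
        print(len(text))
        return len(text)


# Stand-in data so the sketch can run anywhere.
sample_text = "In a village of La Mancha\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text(sample_text, encoding="utf-8")
    char_count = file_read(sample)
```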

01:44 I’ve put a filename into a variable. This quixote file is a text file containing ten copies of the History of Don Quixote pulled off of Project Gutenberg.

01:55 All told, it’s twenty-four megabytes of data uncompressed. It’s ten copies because I wanted something a bit larger than a single copy. Now I’ll import the function in the top window …

02:10 and call it. So far, so good. Twenty-three million characters. Now let’s look at the mmap equivalent. New function up top. The first thing you’ll probably notice is there are two context managers here.

02:28 Like before, the file to be read is opened. That’s line 5. The new bit is where the file handle from the open file is used in a call to create an mmap object from the mmap module. Like files, this object has to be closed. So like files, it gets put in a context manager to make sure everything is cleaned up automatically. mmap doesn’t use a file handle.

02:52 It uses a file number, which you can get from the file handle itself. In addition to the file number, it also takes a length and an access flag. If you give a length of zero, like I did here, you get back a block of memory the same size as the file being mapped.

03:09 The access flag is similar to the mode indicator in opening a file. I’ll go into much more detail about this flag later. Inside of the mmap context block, I’m doing pretty much the same thing as I did in the file_read() function.

03:25 I’m reading the whole thing into a variable then figuring out how long it is. All right, let’s do this.
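Here is a sketch of the mmap version as described, with the two nested context managers. The function name `mmap_read` and the sample file are assumptions; the on-screen code may differ in details such as the open mode.

```python
import mmap
import tempfile
from pathlib import Path


def mmap_read(filename):
    """Map a file into memory, read all of it, and report the byte count."""
    with open(filename, mode="rb") as file_obj:
        # mmap wants the file number, not the file object itself.
        # length=0 maps the whole file; ACCESS_READ makes the map read-only.
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            data = mmap_obj.read()
            print(len(data))
            return len(data)


# Stand-in data so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_bytes(b"spam and eggs\n")
    byte_count = mmap_read(sample)
```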

03:34 Imported it in …

03:39 and called it. Pretty similar. You’ll notice the amount of data looks different. mmap objects represent bytes, not strings. Python strings are Unicode, and that means a single character may take up more than a single byte once it’s encoded.

03:54 That’s why the size is different. One of the key reasons for using mmap over vanilla file operations is performance, but there’s a big asterisk beside that special offer.

04:06 Let’s time the two functions and see the difference. I’m going to import the timeit library to do the timings. And now I’ll time the file_read() function.

04:25 Using timeit, the function got run three times, returning the results in the list printed at the bottom here. Pretty consistent: 0.067 seconds twice, and slightly faster the third time.
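One way to get three timings back as a list is `timeit.repeat()`. The sketch below assumes that is roughly what the REPL session did; the exact call isn't shown in the transcript. It times a cut-down `file_read()` against a stand-in file, so the numbers won't match the ones quoted here.

```python
import tempfile
import timeit
from pathlib import Path


def file_read(filename):
    """Read a whole text file and return its length in characters."""
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        return len(file_obj.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("Nobody expects the Spanish Inquisition\n" * 10_000, encoding="utf-8")

    # repeat=3 produces three separate timings;
    # number=1 runs the call once for each timing.
    timings = timeit.repeat(lambda: file_read(sample), repeat=3, number=1)

# A list of three floats, each a duration in seconds.
print(timings)
```

The same call works for an mmap-based reader by swapping the function inside the lambda.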

04:39 Let’s do it again with mmap.

04:51 That’s all over the shop, isn’t it? The first time is far worse than the vanilla code, but the second and third are significant improvements. This is where it gets messy.

05:00 There are a bunch of variables impacting the outcome. First, you’ll get different performance based on file size. Second, you’ll get different performance on different hardware due to what kinds of caches you have.

05:14 Third is how your OS has implemented the mmap call. Depressingly for me, there is a known issue in the macOS mmap call that makes it significantly slower than running Linux on the same hardware.

05:27 A colleague of mine running the same code on Windows was consistently getting a tenfold improvement. Do note that what I’m doing here is just using mmap to read some data and stuff it into a Python object. Although this might get you a performance boost, it is still stuffing things into a Python object.

05:45 Depending on what you’re doing, you may be able to stay inside of the mapped block, and that is where you’ll see better gains. More on that later as well.
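As a small taste of staying inside the mapped block, the sketch below searches the mapping directly and copies out only the bytes it needs, rather than reading the whole file into a Python object first. The sample file and search term are made up for illustration.

```python
import mmap
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_bytes(b"Nobody expects the Spanish Inquisition\n")

    with open(sample, mode="rb") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            # find() searches the mapped bytes without copying the file.
            position = mmap_obj.find(b"Spanish")
            # Slicing copies only the requested bytes into a bytes object.
            word = mmap_obj[position:position + len(b"Spanish")]

print(position, word)
```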

05:56 In that little demo, I yada yadaed the whole characters and bytes thing. Let’s dig into it a bit more. The mmap call uses a byte array representation. That means it sees everything as the bytes that make up the block, regardless of what the data represents. In the case of a Unicode string, a single character may be more than a single byte.

06:18 That means you have to be careful how you read or write your data. The boundaries between characters might not be what you expect. If you’re dealing with text data that is pure ASCII, you can get away with a one character-one byte assumption, but otherwise you need to be careful. If you’d asked me before running the previous code, I would’ve sworn the Don Quixote file was pure ASCII.

06:42 But the character count didn’t match the byte count, so there’s something in there outside of the ASCII range—over seventy kilobytes of something, in this case. Let’s go back into the REPL and see how this can mess you up.
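The mismatch between characters and bytes is easy to reproduce with a single non-ASCII character. The string below is made up for illustration and isn't taken from the Don Quixote file; UTF-8 encoding is assumed.

```python
text = "Señor Quixote"
encoded = text.encode("utf-8")

# The character count and the byte count differ by one,
# because "ñ" takes two bytes in UTF-8.
print(len(text))
print(len(encoded))

# Collect anything outside the ASCII range (code points above 127).
non_ascii = sorted({char for char in text if ord(char) > 127})
```

Running the same kind of check over a whole file is one way to track down which characters are responsible for a difference like the one above.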

06:57 I have three functions for you to compare. The first one reads the file as text and prints out some data. Let me just import it and run it.

07:13 Okay. I ran it on monty.txt, which has 39 characters of content. The first character is N, the sixth character is y, and the whole string is Nobody expects the Spanish Inquisition. Watch out. I hear they tickle. Now for function number two.
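A sketch of what the text-reading function might look like, using the `text_pieces()` name mentioned later in this lesson. The body is an assumption, and so is the stand-in content for monty.txt, which is chosen here to come out at 39 characters including the trailing newline.

```python
import tempfile
from pathlib import Path


def text_pieces(filename):
    """Read a file as text and show its length and a few pieces of it."""
    with open(filename, mode="r", encoding="utf-8") as file_obj:
        text = file_obj.read()

    print(len(text))  # length in characters
    print(text[0])    # first character
    print(text[5])    # sixth character
    print(text)       # the whole string, trailing newline included
    return text


# Stand-in for monty.txt; the real file's exact contents are assumed.
with tempfile.TemporaryDirectory() as tmp_dir:
    monty = Path(tmp_dir) / "monty.txt"
    monty.write_bytes(b"Nobody expects the Spanish Inquisition\n")
    text = text_pieces(monty)
```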

07:35 This one’s similar to the text case, but this time I’m reading the file as binary. Importing it …

07:46 running the new function on the same monty.txt file …

07:54 Okay. 39 bytes, just like the 39 characters. That’d be that ASCII thing. First byte is 4e, which is the hex code for capital N.

08:04 And the sixth byte is hex 79, which I then conveniently show as a character and which is still the letter y. And finally printing it out, you get a string representation of the bytes.

08:17 Because this is a chunk of binary rather than a string, Python prints it using the byte notation, a quoted value with a b prefix. And you can see the newline at the end of it. That pesky newline?

08:32 Yeah, it was there before, but in the string version, it caused the gap between the output and the next REPL prompt. Subtle. You could easily miss that. Let’s try these two functions with some different data.
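The binary version might look something like this. The name `binary_pieces` is hypothetical, since the transcript doesn't name the function, and the stand-in file contents are assumed as before.

```python
import tempfile
from pathlib import Path


def binary_pieces(filename):
    """Read a file as bytes and show its length and a few pieces of it."""
    with open(filename, mode="rb") as file_obj:
        data = file_obj.read()

    print(len(data))                   # length in bytes
    print(hex(data[0]))                # indexing bytes gives an int: 0x4e
    print(hex(data[5]), chr(data[5]))  # 0x79, shown as a character too
    print(data)                        # b'...' notation, with a visible \n
    return data


# Stand-in for monty.txt; the real file's exact contents are assumed.
with tempfile.TemporaryDirectory() as tmp_dir:
    monty = Path(tmp_dir) / "monty.txt"
    monty.write_bytes(b"Nobody expects the Spanish Inquisition\n")
    data = binary_pieces(monty)
```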

08:45 I have another file that I’m going to load.

08:51 Looking at snake.txt as a string, this time the length is 26. Remember that’s in characters. The first character is a cute little snake, the sixth is an e, and the whole thing is filled with emoji goodness.

09:07 Before showing you the binary, I want to show you some info about the file.

09:19 This rather lengthy bit of code creates a Path object, calls the .stat() method on that Path object, and then gets the size value out of the resulting object.

09:29 The 39 here is how many bytes the file is. monty.txt had 39 characters and was 39 bytes. Now let’s try the snake file.

09:43 Hmm, it says 35 bytes, but text_pieces() said it was 26 characters. That’s important. The emojis in the file take up more than one byte each, making the total different.
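The size check can be sketched like this. The contents of the stand-in snake.txt below are made up, so the counts differ from the 35 and 26 quoted above, but the gap between bytes and characters shows up the same way.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    snake = Path(tmp_dir) / "snake.txt"
    # One emoji, a space, four ASCII letters, and a newline.
    snake.write_bytes("🐍 hiss\n".encode("utf-8"))

    # stat() returns file metadata; st_size is the size in bytes.
    size_in_bytes = snake.stat().st_size

    # Reading as text and taking len() counts characters instead.
    char_count = len(snake.read_text(encoding="utf-8"))

print(size_in_bytes, char_count)
```

The three-byte difference comes entirely from the emoji, which needs four bytes in UTF-8 but counts as one character.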

09:58 Let’s try the byte-reading function on snake.txt.

10:06 There are 35 bytes, which matches what the Path object said. The first byte is f0. What’s an f0? Well, the first character is the Python emoji, which takes up four bytes.

10:19 The f0 is just the first of those four. The sixth byte is an ASCII character though, so you can see the m. The string representation of this gets quite messy because the bytes are printed as, well, bytes instead of the ASCII equivalents.

10:35 The \xf0, \x9f, \x90, and \x8d all get combined to make the snake character. All right, onto our third function. This is a course on mmap, after all.

10:53 The new function up top here accomplishes the same thing as the binary reader, but it uses mmap. I’ll import it …

11:06 and run it …

11:11 and you see the same kind of result as the binary reader. To recap, when you’re using mmap, you’re in byte land. Everything you do is operating on a giant byte array.
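A sketch of an mmap-based reader along these lines is below. The name `mmap_pieces` and the stand-in file contents are assumptions. It also shows how the first four bytes decode back into the single snake character.

```python
import mmap
import tempfile
from pathlib import Path


def mmap_pieces(filename):
    """Map a file and show its length, first byte, and first character."""
    with open(filename, mode="rb") as file_obj:
        with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_obj:
            size = len(mmap_obj)     # length in bytes
            first_byte = mmap_obj[0] # indexing gives an int, like bytes
            # The snake emoji spans four bytes, so decode them together.
            first_char = mmap_obj[0:4].decode("utf-8")
            print(size)
            print(hex(first_byte))
            return size, first_byte, first_char


# Stand-in for snake.txt; the real file's contents are assumed.
with tempfile.TemporaryDirectory() as tmp_dir:
    snake = Path(tmp_dir) / "snake.txt"
    snake.write_bytes("🐍 hiss\n".encode("utf-8"))
    size, first_byte, first_char = mmap_pieces(snake)
```

Decoding a slice that cuts through the middle of a multi-byte character, such as `mmap_obj[0:2]`, would raise a `UnicodeDecodeError`, which is the kind of boundary problem to watch for in byte land.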

11:22 If the data you’re playing with is a Unicode string, you need to be careful.

11:29 There are even more Monty Python quotes coming your way. Next up, I’ll dive deeper into the mmap call and show you many of the operations you can do on your mapped block.
