Doing mmap Operations

Python mmap: Doing File I/O With Memory Mapping Christopher Trudeau 12:24

Transcript

00:00 In the previous lesson, I gave you a rough introduction to using mmap and some of the consequences of interacting with files that way. In this lesson, I’ll dive deeper, covering most of mmap’s methods. Mentally, I think of the mmap call as a function that returns a handle.

00:17 But I think that’s just my bias when it comes to things named with small letters, or it could be because I used to do this in C, where that is the case. It’s actually a class, and you’re constructing an mmap object. There is something a little unusual about the mmap object, though.

00:35 It has different __init__() calls, depending on what operating system you’re using. There are some common values that work on all platforms, and if you’re worried about code compatibility, you should stick to these.

00:47 The common arguments to the initializer are fileno. You saw this one. It is the file number you get out of the file handle to indicate what you are mapping. length.

00:58 This is the size of the memory block you want to create. You can use 0 to indicate you want a block the same size as the file that fileno points to. access.

01:09 This is one of several kinds of flags that affect how the mapping is done. This is the only flag that is common to both Windows and Unix. The choices for this flag are: ACCESS_READ, the block is read-only.

01:22 ACCESS_WRITE, you can read from and write to the block, and when you write to the block, the underlying file gets updated. ACCESS_COPY, you can read from and write to the block, but when you write to the block, you are only impacting the memory copy, not the corresponding file. And ACCESS_DEFAULT.

01:42 On Windows, this is the same as ACCESS_WRITE. On Unix, it is a reference to another flag, which I’ll touch on in a minute. The final argument is offset. This defaults to 0, and if you set it higher, the memory block will start mapping that many bytes from the start of the file.
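As a minimal sketch of the common arguments described above, the following maps a throwaway file three ways and then uses offset. The file names and contents are made up for illustration. One assumption beyond the lesson text: offset has to be a multiple of mmap.ALLOCATIONGRANULARITY, so the second file is padded accordingly.

```python
import mmap
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"hello mmap")

with open(path, "r+b") as f:
    # length=0 maps the whole file; ACCESS_READ gives a read-only block.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as ro:
        read_only_view = ro[:5]

    # ACCESS_COPY accepts writes, but they only change the in-memory copy.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_COPY) as cp:
        cp[:5] = b"HELLO"
        copy_view = cp[:5]

with open(path, "rb") as f:
    after_copy = f.read()  # the file itself is unchanged

with open(path, "r+b") as f:
    # ACCESS_WRITE writes through to the underlying file.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE) as rw:
        rw[:5] = b"HELLO"

with open(path, "rb") as f:
    after_write = f.read()

# offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so this second
# hypothetical file is padded to one granule before the interesting bytes.
big_path = os.path.join(tempfile.mkdtemp(), "padded.bin")
with open(big_path, "wb") as f:
    f.write(b"\0" * mmap.ALLOCATIONGRANULARITY + b"tail")

with open(big_path, "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ,
                   offset=mmap.ALLOCATIONGRANULARITY) as tail_map:
        tail_view = tail_map[:]  # only the bytes after the offset
```

Sticking to fileno, length, access, and offset like this keeps the code portable between Windows and Unix.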

02:02 If you’re on Windows, there is an additional argument called tagname. Windows allows you to map the same file many times, each instance getting a different tag.

02:12 This allows you to construct objects with different parameters—say, different offsets—each with its own tag. You should avoid this if you want cross-OS compatible code. And as I mentioned before, the ACCESS_DEFAULT flag means the same as ACCESS_WRITE when you’re in Windows land.

02:33 Unix has other things you can configure. The first is flags. This is an alternate to the access argument, and you can only use one or the other.

02:44 The choices for flags are MAP_SHARED—this means blocks can be shared across processes—and MAP_PRIVATE. This means blocks are private to the process and copy on write.

02:57 There are other flags as well, but they aren’t common across all Unix variants. Details are available in the documentation. Like with tag names in Windows, you’re better off not using this if you can avoid it. The prot flag indicates the protection mode.

03:13 The choices for it are PROT_READ, meaning the block can be read from, and PROT_WRITE, meaning the block can be written to. These flags can be or’ed together.

03:25 If you’re using ACCESS_DEFAULT on a Unix box, it behaves based on what flags you’ve given the prot argument. As prot defaults to PROT_READ or’ed with PROT_WRITE, this is similar to the Windows behavior unless you change the value of prot.
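A rough sketch of the Unix-only arguments, assuming a throwaway file with made-up contents. The hasattr guard is my own addition so the snippet does nothing harmful on platforms where these constants are missing.

```python
import mmap
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"hello mmap")

private_view = None
if hasattr(mmap, "MAP_PRIVATE"):  # flags and prot are Unix-only
    with open(path, "r+b") as f:
        # MAP_PRIVATE is copy-on-write: changes stay inside this process.
        # The two PROT_* values are or'ed together to allow read and write.
        with mmap.mmap(
            f.fileno(),
            0,
            flags=mmap.MAP_PRIVATE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        ) as mm:
            mm[:5] = b"HELLO"
            private_view = mm[:5]

with open(path, "rb") as f:
    on_disk = f.read()  # untouched, because the mapping was private
```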

03:43 You’ve seen the basics of mapping a file to a memory block and getting at its contents, but you can also search within the block. It even has a search-from-the-right variant. You can use regular expressions. And you can do file-like operations: reading bytes or lines, writing bytes or lines, and managing the file pointer using seek.

04:10 Another demo, another Monty Python quote. You’d think it had something to do with the name of the language.

04:17 The text above is in a file that I’m going to map to. Through the magic of television or whatever this is, I’ll show you what would be in the file as I do things to it in the REPL.

04:26 Let me get ready to do some mapping by importing. As I want to see what is happening in the REPL as I go along, I’m not going to use a context manager. This is bad practice in code.

04:39 You should use a with statement if you’re programming.

04:55 All right, the file is open and mapped. Let’s quickly review the arguments to the object being constructed. The first argument is the file number, which is achieved by calling .fileno() on the handle returned by the open call. Setting length to 0 says to get a block the same size as the parrot.txt file.

05:17 And as I’m going to write to this block, the access is set accordingly. Speaking of, notice that the open call uses "r+", meaning I can read and write to the file.

05:29 If I tried to create an mmap object in write mode, and the file was only open for reading, I’d get an exception. With everything open, let’s go looking for some stuff.
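The setup just described might look roughly like this. It is a sketch against a throwaway file with made-up contents rather than the parrot.txt from the demo, and it opens the file in binary mode ("r+b"), which is my choice. The second half tries the failing combination and only assumes the exception is some kind of OSError.

```python
import mmap
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"hello mmap")

# Open for reading and writing, then map the whole file for writing.
f = open(path, "r+b")
mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_WRITE)
mapped_size = len(mm)  # same size as the file, because length=0
mm.close()
f.close()

# Asking for a writable mapping on a read-only handle raises an exception.
error_type = None
with open(path, "rb") as read_only_file:
    try:
        mmap.mmap(read_only_file.fileno(), length=0,
                  access=mmap.ACCESS_WRITE).close()
    except OSError as exc:
        error_type = type(exc).__name__
```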

05:42 The .find() method searches the memory block for the bytes you give it. Remember everything is a byte array here, so you have to search for bytes, not a string. If I hadn’t used the b prefix here, I would get an exception.

05:57 The 5 in response tells me that I found "no" at index 5. That’s the piece that I have conveniently highlighted for you. Again, television magic, or whatever this is. As this is a byte array, it is zero-indexed, so index 5 means the sixth byte.

06:18 The .tell() method indicates where the file pointer is currently positioned. The .find() method doesn’t move the file pointer, so it is still at its default position at the front of the file. I can move the file pointer by using the .seek() method.

06:35 This has put it in the eleventh position. Calling .tell() shows you that it got moved. Now, if I call .find() again,

06:47 the finding starts at the file pointer position. The second instance of "no" is at position 48. Because the file pointer was to the right of the first instance, the second instance is what was found.

07:03 You can also use .rfind() to search from the

07:10 right-hand side instead. That would be the last instance of the word "parrot".
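To make the interplay of .find(), .tell(), .seek(), and .rfind() concrete, here is a small sketch. The sample text is invented for illustration, so the positions differ from the ones quoted in the demo.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
data = b"no more, no less: a parrot is a parrot"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(data)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm.find(b"no")          # bytes, not str
        pointer_after_find = mm.tell()  # find() leaves the pointer alone

        mm.seek(first + 1)              # step past the first match
        second = mm.find(b"no")         # search now starts at the pointer

        mm.seek(0)                      # reset to the front
        last_parrot = mm.rfind(b"parrot")  # search from the right
```

Passing a str such as "no" instead of b"no" raises a TypeError, which is the exception mentioned earlier.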

07:19 Let me reset the file pointer. And I’m going to look for "parrot" from the left.

07:24 It indicates that "parrot" is at position 38. Let’s take a look at that position. You can use the mmap object like the byte array that it represents, grabbing position 38 directly. A quick look at an ASCII table will tell you that 112 is the letter p.

07:44 You can also do slices. Notice that what is coming back is a binary value. It’s nice and readable because this binary contains ASCII codes, but it would be a bunch of hex values if you got out of the printable ASCII range.
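A short sketch of indexing and slicing, again on an invented sample file: a single index gives back an integer byte value, while a slice gives back bytes.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"a dead parrot")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"parrot")
        one_byte = mm[pos]        # an int: the ASCII code of "p"
        word = mm[pos:pos + 6]    # a bytes object
        letter = chr(one_byte)    # convert the code back to a character
```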

08:00 You may recall that I opened the file for writing. Well, let’s write something.
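A sketch of what such a write can look like, using slice assignment on an invented sample file. The replacement has to be exactly as long as the slice it overwrites. The second assignment deliberately gets this wrong, and catching both IndexError and ValueError is a cautious assumption on my part about which exception is raised.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"this parrot is no more")

size_mismatch_rejected = False
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        pos = mm.find(b"parrot")
        mm[pos:pos + 6] = b"magpie"    # six bytes replace six bytes
        try:
            mm[pos:pos + 6] = b"crow"  # four bytes into a six-byte slice
        except (IndexError, ValueError):
            size_mismatch_rejected = True

with open(path, "rb") as f:
    on_disk = f.read()  # the change reached the file
```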

08:09 By directly addressing this slice, I can overwrite the bytes. The parrot is now a dead magpie instead. Mm, pie. Anyhow, if you want more complex searching behavior, the mmap object can work with the

08:33 regex library. This regex—they’re so friendly and readable—looks for a five-letter word.

08:42 The .findall() operation on the regex returns a list of each of the five-letter words in the block.
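A comparable sketch with the re module, on invented sample text. The pattern here is my own stand-in for "a five-letter word" and may differ from the one used in the demo. Note that it is a bytes pattern (rb"..."), because the block holds bytes.

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"hello there parrot, sleep tight")

# Exactly five ASCII letters with a word boundary on each side.
five_letters = re.compile(rb"\b[a-zA-Z]{5}\b")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        words = five_letters.findall(mm)  # a list of bytes objects
```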

08:48 Let’s do some file-like operations. First, I’ll reset the pointer to the beginning.

08:58 I can read some bytes. Note that this has moved the file pointer forward.

09:07 Reading again gives you the subsequent seven bytes.

09:11 Two times seven is fourteen, so I’m now at position 14. You can use .readline() instead of .read(). .readline() reads from the current position to the next newline character.

09:24 I find this a bit weird. I have to keep remembering I’m dealing with binary, but then here’s a handy thing that is usually for text situations. .readline() also advances the file pointer, so you could loop on this and process your block this way if you wanted to.
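The reads described above can be sketched as follows, on an invented two-line file. The last part shows the loop idea, using the two-argument form of iter() to call .readline() until it returns empty bytes.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first line\nsecond line\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(0)
        chunk1 = mm.read(7)           # first seven bytes
        chunk2 = mm.read(7)           # the next seven bytes
        pos = mm.tell()               # two reads of seven bytes each
        rest_of_line = mm.readline()  # up to and including the newline

        mm.seek(0)
        lines = list(iter(mm.readline, b""))  # stop at end of block
```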

09:40 How about writing? Let’s go back to the start … and calling the .write() method

09:52 overwrites the block. I appear to have my sketches intermixed here. Like .read(), .write() moves the file pointer, this time by the number of bytes written.
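A minimal sketch of .write() on an invented sample file: the bytes land at the file pointer, overwrite what was there, and the pointer advances by the number of bytes written.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"this parrot is no more")

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        mm.seek(0)               # back to the start
        mm.write(b"THAT")        # overwrites the first four bytes
        pos_after_write = mm.tell()

with open(path, "rb") as f:
    on_disk = f.read()
```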

10:06 As I’m not using a context manager, I now need to clean up after myself. mmap doesn’t guarantee to write immediately when you call its methods. It might get buffered. If you’re doing an active process where you want to make sure things get written, you call .flush(), and when you’re done with the memory block, you close it and then the file as well. You don’t have to flush if you’re closing, like I did here. Closing will automatically flush any buffers.

10:36 I just wanted to show you the .flush() method.
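For comparison, here is a sketch of the manual cleanup next to the context-manager form recommended earlier, on an invented sample file. Both the file object and the mmap object work in a with statement.

```python
import mmap
import os
import tempfile

# Hypothetical sample text; not the file used in the demo.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"this parrot is no more")

# Manual style: flush if you need the data written now, then close both.
f = open(path, "r+b")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
mm[:4] = b"THAT"
mm.flush()   # push pending changes to the file
mm.close()   # close the block first...
f.close()    # ...then the file
was_closed = mm.closed

with open(path, "rb") as f:
    after_manual = f.read()

# Context-manager style: both objects are closed on leaving the block.
with open(path, "r+b") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm2:
    mm2[:4] = b"this"

with open(path, "rb") as f:
    after_with = f.read()
```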

10:40 In that demo, you saw me use .write() to overwrite portions of the file. You can also write a single byte value using the .write_byte() method.

10:49 There is also a .move() call that copies bytes from one location to another in the buffer. So far, everything I’ve shown you was within the boundary of the block.
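A small sketch of .write_byte() and .move() on an invented ten-byte file. The argument order for .move() is destination, source, count.

```python
import mmap
import os
import tempfile

# Hypothetical sample data, chosen so positions are easy to read off.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        mm.seek(0)
        mm.write_byte(ord("X"))     # takes an int, writes one byte
        pos_after_byte = mm.tell()  # the pointer moved forward by one
        after_byte = mm[:]

        mm.move(5, 0, 3)            # copy 3 bytes from index 0 to index 5
        after_move = mm[:]
```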

11:00 If you want to go past the boundary, you would use the .resize() method, or you would, if your OS supports it. The .resize() call corresponds to a different underlying C method, which isn’t available in most operating systems. Linux implements it, though, in some cases.

11:18 This is the problem of coding with these kinds of low-level primitives: you’re bound by what is there. If your operating system doesn’t support the .resize(), then you’re stuck.

11:28 You have to stay with the boundaries of the block that you loaded. You might have noticed that I didn’t do any mid-block cutting. Everything was an overwrite. You can’t just snip out part of the block.

11:41 If your OS supports the .resize() call, you can achieve the same thing by copying everything after the clipping point to where you want to start the clipping, and then shrinking the block.

11:53 If your OS doesn’t support .resize(), then you’re stuck with copying the content somewhere else and then overwriting the file with your new content.
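Both approaches can be sketched in one helper. The function name snip, the sample text, and the set of exceptions caught when .resize() is unavailable are all assumptions of mine, so treat this as an illustration rather than a definitive recipe.

```python
import mmap
import os
import tempfile


def snip(path, start, end):
    """Remove bytes [start, end) from the file (hypothetical helper)."""
    kept = None
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
        try:
            new_size = len(mm) - (end - start)
            # Slide everything after the cut down to where the cut begins.
            mm.move(start, end, len(mm) - end)
            try:
                mm.resize(new_size)  # shrink the block and the file
                method = "resize"
            except (SystemError, OSError):
                # No resize support: copy the wanted bytes somewhere else.
                kept = mm[:new_size]
                method = "rewrite"
        finally:
            mm.close()
        if kept is not None:
            # Overwrite the file with the copied content.
            f.seek(0)
            f.write(kept)
            f.truncate()
    return method


# Hypothetical sample text; not the file used in the demo.
data = b"this is an ex-parrot"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(data)

cut_at = data.find(b"ex-")
method_used = snip(path, cut_at, cut_at + 3)

with open(path, "rb") as f:
    result = f.read()
```

Which branch runs depends on the operating system, which is why the helper reports the method it used.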

12:02 If the kind of editing you’re doing sticks to straight overwrites, you’re going to get a performance boost through the use of mmap. The more you have to pull parts out and create copies, the more this advantage will diminish.

12:16 There’s one more use of mmap, and that’s to share memory between processes. Next up, I’ll show you how that is done.
