2012-12-05

Writing a stream to a zipfile in Python, harder than you think!

So here's the problem, you have a stream (a file-like object) in Python and you want to spool the contents of it into a zip archive. Sounds like a common requirement? It turns out to be very hard. I propose a solution here with hooks.

There are two methods for writing data to a zip file in the Python zipfile module.

ZipFile.write(filename[, arcname[, compress_type]])

and

ZipFile.writestr(zinfo_or_arcname, bytes[, compress_type])

The first takes the name of a file, opens it and spools the contents in to the archive in 8K chunks. Sounds like a good fit for what I want except that I have a file-like object, not a file name, and ZipFile.write won't accept that. I could create a temporary file on disk and write my data to that, then pass the name of the file instead but that supposes (a) that I have access to the file system for writing and (b) I don't mind spooling the data twice, once to the disk and once back out again for storage in the archive.

Before you protest, the ZipFile object only requires a file-like object with support for seek and tell, it doesn't actually have to be a file in the file system so (a) is still a valid scenario. We will have to ditch any clever ideas of spooling a zip file directly over network connections though. A closer look at the implementation shows us that once the data has been compressed and written out to the archive the stream is wound back to the archive entry's header to update information about the compressed and uncompressed sizes. Still, even if you are buffering the output at least you are dealing with the smaller compressed data and not the original uncompressed source.

So if ZipFile.write doesn't work for streams what about using ZipFile.writestr instead? This takes the data as a string of bytes (in memory). For larger files this is unlikely to be practicable. I did wonder about tricking this method with a string-like object but even if I could do this the method will still attempt to create an ordinary string with the entire compressed data which won't work for large streams.

Solution 1

The first solution is taken from a suggestion on StackOverflow. The idea is to wrap the ZipFile object and write a new method. Clearly that would be something good for the module maintainers to consider but it requires considerable copying of code. If I'm going to be so dependent on the internals of the ZipFile object implementation I might as well look to see if there is a better way.

Solution 2

Looking at the ZipFile implementation the write method is clearly very close to what I want to do. If only it would accept a file-like object! A closer look reveals that it only does two things with the passed filename. It calls os.stat and then, shortly afterwards, calls open to get a file-like object.

This got me thinking whether or not I could trick the write method in to accepting something other than the name of a file. I created an object (which I called a VirtualFilePath) and gave it a stat and open method. The implementation is not important, but this object essentially wraps my file-like object simulating these two operating system functions.

Unfortunately, I can't pass a VirtualFilePath to the operating system open function. I'll get an error that it wasn't expecting an instance. The same goes for os.stat. However, I can write hooks to intercept these calls and redirect the calls to my methods if the argument is a VirtualFilePath. This is basically what my solution looks like:

import os,__builtin__

stat_pass=os.stat
open_pass=__builtin__.open

def stat_hook(path):
 if isinstance(path,VirtualFilePath):
  return path.stat()
 else:
  return stat_pass(path)

def open_hook(path,*params):
 if isinstance(path,VirtualFilePath):
  return path.open(*params)
 else:
  return open_pass(path,*params)

class ZipHooks(object):
 hookCount=0
 
 def __init__(self):
  if not ZipHooks.hookCount:
   os.stat=stat_hook
   __builtin__.open=open_hook
  ZipHooks.hookCount+=1
  
 def __enter__(self):
  return self
 
 def __exit__(self, type, value, traceback):
  self.Unhook()
      
 def Unhook(self):
  ZipHooks.hookCount-=1
  if not ZipHooks.hookCount:
   os.stat=stat_pass
   __builtin__.open=open_pass

This code adds hooks which detect my VirtualFilePath object when it is passed to open or stat and redirects those calls. To make it easier to manage the hooks we create a ZipHooks object with __enter__ and __exit__ methods allowing it to be used in a 'with' statement like this:

with ZipHooks() as zh:
 # add stuff to an archive using VirtualFilePath here

There's one final detail to clear up. stat is supposed to return the size of the file but what if I don't know it because I'm reading data from a stream? In fact, closer inspection of the ZipFile.write method's implementation shows that it doesn't really rely on the size returned by stat as it monitors both compressed and uncompressed sizes and re-stuffs the header when it back-tracks.

The only other bits of stat that ZipFile.write is interested in is the modification date of the file and the mode (which it uses to determine if the file is really a directory). So if your file-like object isn't very file-like at all it won't matter too much because you only have to fake these fields in the stat result.