Fast zips
The goal: “process” a zip file as fast as possible.
- Loop over every file in the zip
- Read the file into a byte array or something
- Mutate it
- Write it back to the zip
- Ideally the destination zip should contain files in the same order
There are three ways to read zip files in the Java standard library:
ZipInputStream, dates from Java 1.1
- Reads the zip file forwards, looking at the Local File header (LOC).
- Ignores the zip central directory
- Usage: call getNextEntry() to read the next local file header, then pull the file bytes directly from the ZipInputStream. Cute
- Serial behavior. There is only one ZipInputStream and it does all the inflating. You can pass it between threads, I guess, but you can’t read more than one file at a time.
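For reference, the whole "process" loop from the goal can be sketched with ZipInputStream on the read side and ZipOutputStream on the write side. This is only a serial baseline to measure against, not a fast version. The class name `SerialZipRewrite` and the `mutate` parameter are made up for illustration, and `readAllBytes()` assumes Java 9 or newer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.function.UnaryOperator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: a serial read-mutate-write loop over an in-memory zip.
class SerialZipRewrite {
    /** Reads entries front to back, applies `mutate` to each file's bytes, writes them out in the same order. */
    static byte[] rewrite(byte[] zipBytes, UnaryOperator<byte[]> mutate) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // Use a fresh ZipEntry: reusing the old one would carry over stale size/CRC values.
                out.putNextEntry(new ZipEntry(entry.getName()));
                if (!entry.isDirectory()) {
                    // readAllBytes() stops at the end of the current entry, not the end of the zip.
                    out.write(mutate.apply(in.readAllBytes()));
                }
                out.closeEntry();
            }
        } // closing the ZipOutputStream is what writes the central directory
        return result.toByteArray();
    }
}
```

Because the entries are copied in the order they are read, the destination keeps the source order for free. Timestamps and other per-entry metadata are dropped in this sketch.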
ZipFile, dates from Java 1.1
- Starts out by reading the Central Directory, located at the end of the file
- Usage: open the ZipFile, call entries() (or stream()), and call getInputStream(entry) for the entries you want
- Provides random access
- Maybe parallelizable? Try stream().parallel() (entries() returns an Enumeration, which has no parallelStream())…
- Note that getInputStream has a synchronized(this) for some reason
- ONLY works with on-disk zips (the constructors take a File or a path name, never an InputStream)! AAAAAAA
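A rough sketch of the "maybe parallelizable" idea with ZipFile: stream the entries, go parallel, and inflate each one on whatever thread picks it up. The names `ParallelZipFileRead`, `readAll` and `inflate` are made up here, and whether this actually scales (given the locking inside ZipFile) is something to measure rather than assume. `Map.entry` and `readAllBytes()` assume Java 9 or newer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: reads all file entries of an on-disk zip using a parallel stream.
class ParallelZipFileRead {
    /** Returns (name, bytes) pairs in central-directory order. */
    static List<Map.Entry<String, byte[]>> readAll(Path zipPath) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            return zip.stream()                      // ordered: follows the central directory
                    .parallel()
                    .filter(e -> !e.isDirectory())
                    .map(e -> Map.entry(e.getName(), inflate(zip, e)))
                    .collect(Collectors.toList());   // toList keeps encounter order even in parallel
        }
    }

    /** Opens one entry's stream and reads it fully on the calling thread. */
    private static byte[] inflate(ZipFile zip, ZipEntry entry) {
        try (InputStream in = zip.getInputStream(entry)) {
            return in.readAllBytes();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Collecting into a list rather than a concurrent map is deliberate: it keeps the entry order, which matters if the destination zip should list files the same way.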
ZipFileSystem (the zip provider behind FileSystems.newFileSystem), dates from Java 7
- Fairly complicated and “heavyweight”
- Allows treating an on-disk zip like a file system
- Only works with on-disk zips!
- Parallel behavior: reading operations put the inflater input stream on the thread that’s doing the reading
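A minimal sketch of reading through the zip file system, assuming an on-disk zip. The class and method names (`ZipFsRead`, `readAll`, `read`) are made up for illustration. The file list is gathered first and then read with a parallel stream, so each `Files.readAllBytes` call runs on the thread doing the read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: mounts a zip as a file system and reads every regular file in it.
class ZipFsRead {
    /** Returns a map from absolute in-zip path (e.g. "/dir/b.txt") to file contents. */
    static Map<String, byte[]> readAll(Path zipPath) throws IOException {
        // The cast picks the (Path, ClassLoader) overload, which exists on every Java version since 7.
        try (FileSystem fs = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(fs.getPath("/"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            // Reads must finish before the file system is closed, so collect inside the try block.
            return files.parallelStream()
                    .collect(Collectors.toConcurrentMap(Path::toString, ZipFsRead::read));
        }
    }

    private static byte[] read(Path path) {
        try {
            return Files.readAllBytes(path);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

This sketch makes no attempt to recover the original entry order: the walk follows the directory tree, and the result is an unordered map.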
There are 2 ways to write them?
ZipOutputStream
- Simple
- Serial behavior: there is only one Deflater
- Writes the central directory after writing all the files
ZipFileSystem (again)
- Parallel behavior: writing operations put the deflater output stream on the thread that’s doing the writing
- Spills to disk if crossing a threshold on the size of the file
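A sketch of the write side through the zip file system, again under assumptions: the names `ZipFsWrite` and `writeAll` are invented, the target path must not exist yet, and `Map.of` needs Java 9 or newer. The `"create"` property is the zip provider's documented way to ask for a new archive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper: creates a new zip and writes its entries from multiple threads.
class ZipFsWrite {
    /** Creates a zip at `zipPath` (must not exist yet) containing the given name -> bytes pairs. */
    static void writeAll(Path zipPath, Map<String, byte[]> files) throws IOException {
        URI uri = URI.create("jar:" + zipPath.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            files.entrySet().parallelStream().forEach(e -> {
                try {
                    Path target = fs.getPath(e.getKey());
                    if (target.getParent() != null) {
                        // createDirectories tolerates another thread creating the same directory first.
                        Files.createDirectories(target.getParent());
                    }
                    Files.write(target, e.getValue());
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            });
        } // closing the file system is what flushes the archive to disk
    }
}
```

Entry order in the resulting zip is whatever the file system decides, so this route does not obviously satisfy the "same order" wish from the goal.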
This thing (gregsh/parallel-zip)
Someone figured out how to reach into ZipOutputStream internals so you can parallelize the deflating procedure
https://github.com/gregsh/parallel-zip
Basically the trick:
- map: compress files individually with little ZipOutputStreams, but don’t close them (so the central directory stays unwritten)
- reduce: make a ZipOutputStream, dump the compressed bytes directly into it while manually adding them to the ZipOutputStream’s list of already-written entries, then close the zip (so it writes the central directory)
How can it be improved?
- Maybe reimplement the zip header writing code (instead of leveraging ZipOutputStream) and use DeflaterOutputStream
- Compute the CRC-32 (zip uses CRC-32, not Adler-32) in one pass instead of using CheckedOutputStream… the idea is that we context-switch less, I dunno if it makes sense though
- Tried profiling this and pretty sure the difference was within the margin of error
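To make the "write the headers yourself" idea concrete, here is a minimal sketch. It is not how gregsh/parallel-zip works (that project reuses ZipOutputStream internals); it is a from-scratch variant of the same map/reduce shape. All names (`ParallelZipSketch`, `Packed`, `pack`, `write`, `zip`) are made up. Assumptions: no Zip64 (fewer than 65535 entries, everything well under 2 GiB), a fixed 1980-01-01 timestamp, and the whole archive held in memory.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Hypothetical sketch: deflate entries in parallel, then write zip headers by hand.
class ParallelZipSketch {
    /** Result of the "map" step: raw deflate bytes plus what the headers need. */
    static final class Packed {
        final byte[] name; final byte[] data; final long crc; final int rawSize;
        Packed(byte[] name, byte[] data, long crc, int rawSize) {
            this.name = name; this.data = data; this.crc = crc; this.rawSize = rawSize;
        }
    }

    /** map: checksum and deflate one file; touches no shared state, so any thread can run it. */
    static Packed pack(String name, byte[] raw) {
        CRC32 crc = new CRC32();
        crc.update(raw, 0, raw.length);  // one bulk call instead of a CheckedOutputStream wrapper
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate, no zlib wrapper
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            while (!deflater.finished()) {
                buf.write(chunk, 0, deflater.deflate(chunk));
            }
            return new Packed(name.getBytes(StandardCharsets.UTF_8), buf.toByteArray(), crc.getValue(), raw.length);
        } finally {
            deflater.end();
        }
    }

    private static ByteBuffer le(int size) {
        return ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
    }

    private static void put(ByteArrayOutputStream out, byte[]... parts) {
        for (byte[] part : parts) out.write(part, 0, part.length);
    }

    /** The 26 bytes shared by local and central headers ("version needed" through "extra length"). */
    private static byte[] commonFields(Packed e) {
        return le(26)
                .putShort((short) 20)        // version needed: 2.0
                .putShort((short) 0x0800)    // flags: names are UTF-8
                .putShort((short) 8)         // method: deflate
                .putShort((short) 0)         // DOS time 00:00:00
                .putShort((short) 0x21)      // DOS date 1980-01-01
                .putInt((int) e.crc)
                .putInt(e.data.length)       // compressed size
                .putInt(e.rawSize)           // uncompressed size
                .putShort((short) e.name.length)
                .putShort((short) 0)         // no extra field
                .array();
    }

    /** reduce: concatenate local headers + data, then the central directory, then the end record. */
    static byte[] write(List<Packed> entries) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream central = new ByteArrayOutputStream();
        for (Packed e : entries) {
            int offset = out.size();
            byte[] common = commonFields(e);
            put(out, le(4).putInt(0x04034b50).array(), common, e.name, e.data);
            put(central,
                    le(6).putInt(0x02014b50).putShort((short) 20).array(), // signature, version made by
                    common,
                    le(14).putInt(10, offset).array(), // comment len, disk, attributes = 0; offset in last 4 bytes
                    e.name);
        }
        int centralOffset = out.size();
        put(out, central.toByteArray());
        put(out, le(22).putInt(0x06054b50).putInt(0)  // signature, two zeroed disk-number fields
                .putShort((short) entries.size()).putShort((short) entries.size())
                .putInt(central.size()).putInt(centralOffset)
                .array());                            // trailing comment length stays 0
        return out.toByteArray();
    }

    /** Deflates in parallel; pass a LinkedHashMap to keep entry order. */
    static byte[] zip(Map<String, byte[]> files) {
        List<Packed> packed = files.entrySet().parallelStream()
                .map(e -> pack(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        return write(packed);
    }
}
```

Only the deflating runs in parallel; the header writing in `write` is serial but does almost no work per entry. Whether this beats the ZipOutputStream-based trick is an open question, per the profiling note above.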
On zip
Java’s ZipInputStream is not actually conforming, because the central directory is supposed to be the source of truth for a zip file: if the central directory doesn’t refer to an entry, that entry is supposed to count as “deleted”, yet ZipInputStream will still return it because it only looks at local file headers. In practice, though: meh.