Fast zips

The goal: “process” a zip file as fast as possible.

Loop over every file in the zip
Read the file into a byte array or something
Mutate it
Write it back to the zip
Ideally the destination zip should contain files in the same order

There are 3 ways to read zip files in the Java standard library

ZipInputStream, dates from Java 1.1

Reads the zip file forwards, looking at the Local File header (LOC).
Ignores the zip central directory
Usage: read the next local file entry with getNextEntry(), then pull the file bytes directly from the ZipInputStream. Cute
Serial behavior. There is only one ZipInputStream and it does all the inflating. You can pass it between threads, I guess, but you can’t read more than one file at a time
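
A minimal sketch of that usage, assuming Java 9+ (for readAllBytes()). The class and method names (ZipStreamRead, readAll) are made up for illustration. The LinkedHashMap keeps entries in the order they appear in the file, which matters for the "same order" goal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipStreamRead {
    /** Reads every file entry into memory, in the order the entries appear in the stream. */
    static Map<String, byte[]> readAll(InputStream in) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() parses the next local file header (LOC)
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Reads (and inflates) up to the end of the current entry only
                contents.put(entry.getName(), zin.readAllBytes());
            }
        }
        return contents;
    }
}
```

Everything goes through the one stream, so there is nothing to hand to a second thread until an entry has been fully read.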

ZipFile, dates from Java 1.1

Starts out by reading the Central Directory, located at the end of the file
Usage: init the zip file, call entries() and pick out input streams you want
Provides random access
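Maybe parallelizable? entries() returns an Enumeration, so there is no entries().parallelStream(); try stream().parallel() or Collections.list(entries()).parallelStream()…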
- Note that getInputStream has a synchronized(this) for some reason
ONLY works with on-disk zips! The constructors take a File or a path String, never an InputStream. AAAAAAA
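
A sketch of the parallel attempt, assuming Java 9+. ZipFileParallelRead and readAll are illustrative names, not anything from the JDK. Whether this is actually faster depends on how much locking the JDK in use does inside ZipFile, so treat it as something to benchmark rather than a known win.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipFileParallelRead {
    /** Inflates entries on the common fork-join pool, then restores central directory order. */
    static Map<String, byte[]> readAll(Path zipPath) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            // entries() is an Enumeration; copy it to a List to get a parallel stream
            List<? extends ZipEntry> entries = Collections.list(zip.entries());
            Map<String, byte[]> unordered = new ConcurrentHashMap<>();
            entries.parallelStream()
                    .filter(e -> !e.isDirectory())
                    .forEach(e -> {
                        try (InputStream in = zip.getInputStream(e)) {
                            unordered.put(e.getName(), in.readAllBytes());
                        } catch (IOException ex) {
                            throw new UncheckedIOException(ex);
                        }
                    });
            // Re-key in the original entry order
            Map<String, byte[]> ordered = new LinkedHashMap<>();
            for (ZipEntry e : entries) {
                if (!e.isDirectory()) {
                    ordered.put(e.getName(), unordered.get(e.getName()));
                }
            }
            return ordered;
        }
    }
}
```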

ZipFileSystem (the zip file system provider), dates from Java 7

Fairly complicated and “heavyweight”
Allows treating an on-disk zip like a file system
Only works with on-disk zips!
Parallel behavior: reading operations put the inflater input stream on the thread that’s doing the reading
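
A sketch of reading through the zip file system, assuming Java 9+ and that the jdk.zipfs module is present (it is in a stock JDK). ZipFsRead and readAll are made-up names. Keys come back as absolute paths inside the zip, such as "/a.txt", and Files.walk order is not promised to match the order of entries in the archive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ZipFsRead {
    /** Mounts the zip as a file system and reads each regular file on whichever thread picks it up. */
    static Map<String, byte[]> readAll(Path zipPath) throws IOException {
        // The cast picks the (Path, ClassLoader) overload, which exists on every JDK since 7
        try (FileSystem fs = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(fs.getPath("/"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            Map<String, byte[]> contents = new ConcurrentHashMap<>();
            files.parallelStream().forEach(p -> {
                try {
                    // Inflating happens here, on the calling thread
                    contents.put(p.toString(), Files.readAllBytes(p));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            return contents;
        }
    }
}
```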

There are 2 ways to write them?

ZipOutputStream

Simple
Serial behavior, sadly: there is only one Deflater
Writes the central directory after writing all the files
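
For reference, the serial write path looks roughly like this. ZipStreamWrite and writeAll are illustrative names. Entries land in the zip in the map's iteration order, so passing a LinkedHashMap preserves the source order.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipStreamWrite {
    /** Writes entries one after another; all deflating happens on the calling thread. */
    static void writeAll(Map<String, byte[]> entries, OutputStream out) throws IOException {
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                zout.putNextEntry(new ZipEntry(e.getKey())); // writes the local file header
                zout.write(e.getValue());                    // deflates through the single Deflater
                zout.closeEntry();
            }
        } // close() finishes the stream, which is when the central directory is written
    }
}
```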

ZipFileSystem (again)

Parallel behavior: writing operations put the deflater output stream on the thread that’s doing the writing
Spills to disk if crossing a threshold on the size of the file
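
A sketch of the parallel write path, assuming Java 9+ (Map.of) and a target path that does not exist yet. ZipFsWrite and writeAll are made-up names; the "create" property is the zip file system provider's documented option for creating a new archive. Nothing here controls the order of entries in the resulting zip.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

class ZipFsWrite {
    /** Creates a new zip at zipPath and writes each entry on whichever thread picks it up. */
    static void writeAll(Map<String, byte[]> entries, Path zipPath) throws IOException {
        URI uri = URI.create("jar:" + zipPath.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            entries.entrySet().parallelStream().forEach(e -> {
                try {
                    Path target = fs.getPath(e.getKey());
                    if (target.getParent() != null) {
                        Files.createDirectories(target.getParent());
                    }
                    Files.write(target, e.getValue());
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            });
        } // closing the file system is what flushes the archive to disk
    }
}
```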

This thing

Someone figured out how to reach into ZipOutputStream internals so you can parallelize the deflating procedure

https://github.com/gregsh/parallel-zip

Basically the trick:

map: compress files individually with little ZipOutputStreams, but don’t close them (so the central directory stays unwritten)
reduce: make a ZipOutputStream, dump the compressed bytes directly into it while manually adding them to the ZipOutputStream’s list of already-written entries, then close the zip (so it writes the central directory)

How can it be improved?

Maybe reimplement the zip header writing code (instead of leveraging ZipOutputStream) and use DeflaterOutputStream
Compute the CRC-32 (zip uses CRC-32, not Adler-32) in one pass instead of using CheckedOutputStream… the idea is that we context-switch less, I dunno if it makes sense though
- Tried profiling this and pretty sure the difference was within the margin of error
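
A rough sketch of what "reimplement the header writing" could look like, with the same map/reduce shape but no reaching into ZipOutputStream. This is not the parallel-zip code; HandRolledParallelZip, Packed, pack, header and write are all made-up names. Assumptions: no Zip64 (under 65,535 entries, everything under 4 GiB), every entry deflated, a fixed 1980-01-01 timestamp, UTF-8 names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

class HandRolledParallelZip {
    /** Output of the map step: one entry, already deflated (hypothetical helper type). */
    static final class Packed {
        final byte[] name, data;
        final int crc, rawSize;
        Packed(byte[] name, byte[] data, int crc, int rawSize) {
            this.name = name; this.data = data; this.crc = crc; this.rawSize = rawSize;
        }
    }

    /** map: CRC-32 plus raw deflate of one entry; touches no shared state. */
    static Packed pack(String name, byte[] raw) {
        CRC32 crc = new CRC32();
        crc.update(raw, 0, raw.length); // straight over the array, no CheckedOutputStream
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = no zlib wrapper
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(buf, deflater)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // close() does not release a caller-supplied Deflater
        }
        return new Packed(name.getBytes(StandardCharsets.UTF_8), buf.toByteArray(),
                (int) crc.getValue(), raw.length);
    }

    /** Builds a local (30-byte) or central (46-byte) header, followed by the name. */
    static byte[] header(boolean central, Packed p, int locOffset) {
        ByteBuffer b = ByteBuffer.allocate((central ? 46 : 30) + p.name.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(central ? 0x02014b50 : 0x04034b50);           // signature
        if (central) b.putShort((short) 20);                   // version made by
        b.putShort((short) 20).putShort((short) 0x0800);       // version needed; flag: UTF-8 names
        b.putShort((short) 8);                                 // method: deflate
        b.putShort((short) 0).putShort((short) 0x21);          // DOS time 00:00, date 1980-01-01
        b.putInt(p.crc).putInt(p.data.length).putInt(p.rawSize);
        b.putShort((short) p.name.length).putShort((short) 0); // name length, extra length
        if (central) {
            b.putShort((short) 0).putShort((short) 0);         // comment length, disk number
            b.putShort((short) 0).putInt(0);                   // internal, external attributes
            b.putInt(locOffset);                               // offset of this entry's local header
        }
        return b.put(p.name).array();
    }

    /** reduce: concatenate headers and deflated bytes, then the central directory. */
    static void write(Map<String, byte[]> entries, OutputStream out) throws IOException {
        // An ordered source plus toList() keeps the map's iteration order
        List<Packed> packed = entries.entrySet().parallelStream()
                .map(e -> pack(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        int[] offsets = new int[packed.size()];
        int pos = 0;
        for (int i = 0; i < packed.size(); i++) {
            Packed p = packed.get(i);
            offsets[i] = pos;
            byte[] loc = header(false, p, 0);
            out.write(loc);
            out.write(p.data);
            pos += loc.length + p.data.length;
        }
        int cdStart = pos;
        for (int i = 0; i < packed.size(); i++) {
            byte[] cen = header(true, packed.get(i), offsets[i]);
            out.write(cen);
            pos += cen.length;
        }
        ByteBuffer end = ByteBuffer.allocate(22).order(ByteOrder.LITTLE_ENDIAN);
        end.putInt(0x06054b50).putShort((short) 0).putShort((short) 0) // signature, disk numbers
                .putShort((short) packed.size()).putShort((short) packed.size())
                .putInt(pos - cdStart).putInt(cdStart)                 // directory size and offset
                .putShort((short) 0);                                  // comment length
        out.write(end.array());
    }
}
```

Sizes and the CRC are known before the local header is written, so no data descriptor is needed, and both ZipInputStream and ZipFile should be able to read the result. Holding every compressed entry in memory is the obvious cost.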

On zip

Java’s ZipInputStream is not strictly conforming, because the central directory is supposed to be the source of truth for zip files. Like, if the central directory doesn’t refer to a file, it’s supposed to be “deleted”, but ZipInputStream never reads the central directory and will still hand that entry back. In practice, though: meh.