Saturday, March 7th, 2009...20:05


Jump to Comments

I was recently cc’d on an e-mail containing a code review request, just so I’d be aware of an upcoming change to the data returned from an API used by my code. A little while later one of my co-workers commented, “he wrote a whole parser, very C like.” Prior to that, I hadn’t planned on looking at the code since I just had to be on the lookout for the new data returned; however, the API is implemented in Java so writing a parser that was “very C like” seemed odd.

The job of the new code was to, one time, at application start-up read and organize a binary file of delimited data into memory so it could be queried by ID. The file consisted of multiple records delimited by a special character. Each record in turn had fields delimited by another special character. The file in question currently stood at about 2MB in size.

Looking at the code, two things caught my eye. This person had written their own functions to chunk an InputStream into sections of 1k byte arrays (including sleeps and retries if an IOException occurred) and then written more code to build up chunks of byte arrays between field delimiters for conversion into Strings. This code will be running in a 32bit JVM with access to 1.5GB of memory. When I asked him why he didn’t just read the entire file into memory, he cited performance/efficiency concerns along with the possibility that the file might grow too large to fit in memory. Believe me when I tell you this file will NEVER be too large to fit into memory of a relatively current computer (if you must know it contains biographies of musicians in Spanish and German [and it’s not user generated data]).

He refused to budge on the philosophical belief that simpler code is better than over-engineered/pre-optimized code, I tried the, “don’t reinvent the wheel” argument, mentioning that he’d basically written part of the standard After some research, he agreed and poof, 30 lines of cruft turned into 5. I tried to get him to stop the other buffering/memory allocation he was doing for each field and just read the data into an ArrayList or some other object that grows automatically and again, he played the performance card. This time he claimed the conversion from byte to Byte was wasteful and an ArrayList was not a natural object to use when reading a stream of bytes. I don’t know about you, but an ArrayList seems to be a perfectly natural object to use when collecting, in order, an unknown number of anything. He also stated, that the ArrayList was, internally, doing the same thing he was with regard to resizing arrays. I thought I’d ask him why he’s reinventing the wheel, but I knew he’d just fall back on the byte vs Byte auto-boxing performance issue.

I’ve yet to decide if/how to enlighten him that an ArrayList is not doing the same thing, internally, that his code was. His code was allocating a byte array of size 1024, reading bytes into it and when full, creating a new byte array of current size + 1024, and merging the data. An ArrayList uses a non-linear growth calculation when it fills up ((current size * 3) / 2 + 1) I believe at the time of writing). I was too busy and didn’t really care to discuss further with him the performance impact of his simplistic memory allocation scheme. I had 3000 line configuration files to attend to 🙁

Comments are closed.