Another data point.
When I first experimented with adding DWA to Nuke using OpenEXR 2.2.0, I had to patch the configure script so I could enable F16C instructions for gcc 4.1.2. After doing so, VTune showed the copyFromFrameBuffer function, in the half-to-float conversion, accounting for roughly 30% or more of the CPU time when reading files from a local SSD. (Aside: a number of other namespace-related fixes were needed too; all of these are in the latest OpenEXR versions.) I came to the conclusion that to make the performance any better, it would need an F16C-based half-to-float conversion function rather than going via the LUT, at least on CPUs that support those instructions. I also have some notes about testing memory-mapped reading, but no conclusions.
This was not the case with F16C disabled, where other functions appeared higher in the profile. Total performance was lower without F16C (no surprise); copyFromFrameBuffer only bubbled to the top because F16C reduced the cost of the other functions.
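To make the idea concrete, here is a minimal sketch of what an F16C-based half-to-float path with a runtime fallback could look like. It is not the OpenEXR code: the names halfToFloat, halfToFloatF16C, halfBitsToFloatScalar and cpuHasF16C are hypothetical, the scalar fallback is plain bit manipulation standing in for the LUT, and the fast path assumes GCC or Clang on x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: the F16C path is only built with GCC/Clang on x86.
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <immintrin.h>
#define HAVE_F16C_PATH 1
#else
#define HAVE_F16C_PATH 0
#endif

// Portable scalar conversion of IEEE 754 binary16 bits to float
// (a stand-in for the table lookup mentioned above).
static float halfBitsToFloatScalar(std::uint16_t h)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                              // signed zero
        } else {
            int e = -1;                               // subnormal: renormalize
            do { mant <<= 1; ++e; } while ((mant & 0x400u) == 0);
            bits = sign | (static_cast<std::uint32_t>(112 - e) << 23)
                        | ((mant & 0x3FFu) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);     // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

#if HAVE_F16C_PATH
// CPUID leaf 1, ECX bit 29 reports F16C; the AVX check also covers OS support.
static bool cpuHasF16C()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & (1u << 29)) != 0 && __builtin_cpu_supports("avx");
}

// Converts 8 halfs per iteration with VCVTPH2PS; the tail goes through scalar code.
__attribute__((target("avx,f16c")))
static void halfToFloatF16C(const std::uint16_t* src, float* dst, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(h));
    }
    for (; i < n; ++i) dst[i] = halfBitsToFloatScalar(src[i]);
}
#endif

// Dispatcher: uses F16C when the CPU reports it, otherwise the scalar path.
void halfToFloat(const std::uint16_t* src, float* dst, std::size_t n)
{
    assert(n == 0 || (src != nullptr && dst != nullptr));
#if HAVE_F16C_PATH
    static const bool useF16C = cpuHasF16C();
    if (useF16C) { halfToFloatF16C(src, dst, n); return; }
#endif
    for (std::size_t i = 0; i < n; ++i) dst[i] = halfBitsToFloatScalar(src[i]);
}
```

The runtime check means the same binary can run on CPUs without F16C, at the cost of one branch per call. Whether this actually beats the LUT would need to be measured on the target hardware.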
I didn't try RLE compression.
Kevin