So it's holding two tiles, per thread, per open tiled input file!
2 x RGBA half 64^2 tiles -> 64k per thread per file. Multiply by 1000 files x 16 threads -> ~1 GB, just for this one source of overhead, not counting anything else like header data or other allocations.
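To double-check that arithmetic, here is a small compile-time sketch. The constant names are made up for illustration; the figures (RGBA, half, 64x64 tiles, 1000 files, 16 threads) are just the ones from the estimate above.

```cpp
#include <cstddef>

// Hypothetical names; the values mirror the back-of-envelope estimate.
constexpr std::size_t kChannels       = 4;    // R, G, B, A
constexpr std::size_t kBytesPerSample = 2;    // half float
constexpr std::size_t kTileEdge       = 64;   // 64 x 64 pixel tile
constexpr std::size_t kTilesHeld      = 2;    // two tiles retained
constexpr std::size_t kOpenFiles      = 1000;
constexpr std::size_t kThreads        = 16;

// One tile: 64 * 64 pixels * 4 channels * 2 bytes = 32 KiB.
constexpr std::size_t kBytesPerTile =
    kTileEdge * kTileEdge * kChannels * kBytesPerSample;

// Two tiles retained per thread per open file = 64 KiB.
constexpr std::size_t kBytesPerThreadPerFile = kTilesHeld * kBytesPerTile;

// Across all files and threads: 64 KiB * 1000 * 16, roughly 1 GB.
constexpr std::size_t kTotalBytes =
    kBytesPerThreadPerFile * kOpenFiles * kThreads;

static_assert(kBytesPerTile == 32 * 1024, "one tile is 32 KiB");
static_assert(kBytesPerThreadPerFile == 64 * 1024, "two tiles are 64 KiB");
static_assert(kTotalBytes == 1048576000, "about 1 GB in total");
```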
For 64k (two reasonably sized tiles), maybe it would be better to do a stack allocation only when the extra decode buffer is actually needed, so there would be no call to malloc/free and no retained memory? Fall back to a true malloc only for the rare case of huge tiles where a stack allocation doesn't seem safe.
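Roughly what I have in mind, as a minimal sketch and not a patch against the actual library code. All names here (`kStackLimit`, `fits_on_stack`, `with_scratch`, `fill_and_check`) are hypothetical, and the 64 KiB threshold is an assumption that would need checking against the stack size of whatever worker threads call into the decoder.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Assumed threshold: buffers up to 64 KiB live on the stack.
constexpr std::size_t kStackLimit = 64 * 1024;

// True when a scratch buffer of this size would use the stack path.
constexpr bool fits_on_stack(std::size_t bytes) {
    return bytes <= kStackLimit;
}

// Calls fn(buffer, bytes) with a temporary scratch buffer. Nothing is
// retained after the call: the stack array goes out of scope, and the
// heap fallback is released by unique_ptr.
template <typename Fn>
auto with_scratch(std::size_t bytes, Fn&& fn) {
    if (fits_on_stack(bytes)) {
        // Common case: no allocator call at all.
        alignas(std::max_align_t) unsigned char stack_buf[kStackLimit];
        return fn(stack_buf, bytes);
    }
    // Rare case (huge tiles): heap allocation, freed on return.
    std::unique_ptr<unsigned char[]> heap_buf(new unsigned char[bytes]);
    return fn(heap_buf.get(), bytes);
}

// Stand-in for a decode step: fills the scratch buffer and reads back
// the last byte, just to show the buffer is usable on both paths.
inline unsigned fill_and_check(std::size_t bytes) {
    return with_scratch(bytes, [](unsigned char* buf, std::size_t n) {
        std::memset(buf, 7, n);
        return static_cast<unsigned>(buf[n - 1]);
    });
}
```

With this shape, `fill_and_check(64 * 1024)` never touches malloc, while something like `fill_and_check(1 << 20)` takes the heap path but still frees the buffer before returning.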