Re: [Openexr-devel] UTF-8
From: David Aguilar
Subject: Re: [Openexr-devel] UTF-8
Date: Wed, 14 Nov 2012 18:52:47 -0800

On Wed, Nov 14, 2012 at 2:57 PM, Florian Kainz <address@hidden> wrote:
>
> The problem is that a channel or attribute name such as
> "grün" could be represented as the character sequence
>
> 0067 0072 0075 0308 006E
> (g, r, u, combining diaeresis, n)
>
> or as
>
> 0067 0072 00FC 006E
> (g, r, u with diaeresis, n).
>
> Typographically the representations look identical, but
> string comparisons would treat them as different.
> I can't imagine users being happy to be told that a file
> contains, for example, a "grün" channel of type HALF, and
> a "grün" channel of type FLOAT, where the only difference
> between the names is how they are represented as Unicode.
>
> As far as I can tell, either string comparison needs to
> perform some normalization on the fly, or the strings that
> are compared must already be normalized.
>
> Yes, normalization is a headache, but with Unicode there is
> not a one-to-one correspondence between the character sequence
> stored in a string and the typographical representation of
> that string.
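
To make the two spellings above concrete, here is a minimal Python sketch. It uses Python's standard unicodedata module purely for illustration; nothing here is part of OpenEXR or IlmImf, and the variable names are made up for the example.

```python
import unicodedata

# Two spellings of the same visible text "grün".
decomposed = "gru\u0308n"   # g, r, u, COMBINING DIAERESIS (U+0308), n
precomposed = "gr\u00fcn"   # g, r, LATIN SMALL LETTER U WITH DIAERESIS (U+00FC), n

# A plain comparison sees two different code point sequences.
print(decomposed == precomposed)            # False
print(len(decomposed), len(precomposed))    # 5 4

# The UTF-8 byte sequences differ as well.
print(decomposed.encode("utf-8").hex())     # 677275cc886e
print(precomposed.encode("utf-8").hex())    # 6772c3bc6e

# After normalizing both sides to NFC, the strings compare equal.
nfc_a = unicodedata.normalize("NFC", decomposed)
nfc_b = unicodedata.normalize("NFC", precomposed)
print(nfc_a == nfc_b)                       # True
```

So a comparison only treats the two names as the same channel if both sides have been normalized first, either on the fly or before they were stored.
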
I understand the point of normalization,
but I do not think it is the responsibility of the library.
From the application's point of view, if it hands one Unicode
representation to the library and later asks the library for what
it stored, it may get back a different representation.
That would be a hard bug to track down.
Similarly, someone who stores filenames in headers and expects to get
back byte-for-byte identical strings will run into problems when they
find that the filenames do not exist (because they use a different
form).
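
A minimal Python sketch of that round-trip concern follows. The two "store" functions are hypothetical stand-ins for a header writer, not IlmImf calls, and the decomposed filename is just an assumed example of what an application might hold.

```python
import unicodedata

def store_verbatim(name: str) -> bytes:
    """Hypothetical header writer that keeps the caller's bytes as-is."""
    return name.encode("utf-8")

def store_normalized(name: str) -> bytes:
    """Hypothetical header writer that silently converts to NFC."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

# A filename in decomposed form (u followed by U+0308).
filename = "gru\u0308n.exr"
original_bytes = filename.encode("utf-8")

# Verbatim storage returns exactly what was put in.
print(store_verbatim(filename) == original_bytes)    # True

# Normalizing storage returns different bytes, so a byte-for-byte
# lookup of the returned name can fail to match the original.
print(store_normalized(filename) == original_bytes)  # False
```
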
Is it not easier to treat the data like raw bytes and not care?
I'm in favor of UTF-8 as a recommendation.
I'm on the fence about enforcing it in the library (it couldn't hurt).
I am not overly excited about pushing normalization issues into the library.
What's the driving benefit of forcing a particular normalization?
The user used a particular form. Why not use it as-is?
Presumably the rest of their app uses it too, so leaving data as-is
lets them make the call.
> Florian
>
>
>
> David Aguilar wrote:
>>
>> On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <address@hidden> wrote:
>>>
>>> The ACES image container specification, meant to be compatible with OpenEXR,
>>> prescribes UTF-8 for the representation of strings. Therefore I suggest
>>> that OpenEXR adopt the following rules:
>>>
>>> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
>>> This includes attribute names and strings contained in attributes,
>>> for example, channel names.
>>>
>>> - Text strings stored in files must be in Normalization Form C (NFC,
>>> canonical decomposition followed by canonical composition).
>>
>>
>> I would stay far away from dealing with normalization issues.
>>
>> Poke around on OS X and its broken HFS filesystem to see why:
>>
>> http://radsoft.net/rants/20080405,00.shtml
>>
>> If the library verified UTF-8, that would be enough IMO.
>>
>> Imagine some poor sucker who goes and stores Unicode filenames in a
>> header. It's not fun to have a library silently "fix" things for you.
>>
>> What's the upside of doing the normalization? How about just leave it
>> as-is? That way the code can stay simple. Whatever you put in can be
>> byte-for-byte identical to what you get out.
>>
>> Other than that, UTF-8 all the way as the "recommended" encoding.
>>
>>> - Where text strings need to be collated, strcmp() is used to compare
>>> the corresponding char sequences: string A comes before (or is less
>>> than) string B if
>>>
>>> strcmp(A,B) < 0
>>>
>>> (Note: this is not ambiguous; the C99 standard specifies that strcmp()
>>> interprets the bytes that make up a string as unsigned.)
>>>
>>> - Text strings passed to the IlmImf library must be encoded as UTF-8
>>> and in Normalization Form C.
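
The collation rule above can be sketched in Python, where comparing bytes objects is a lexicographic comparison of unsigned byte values, much like strcmp() on strings without embedded NULs. The helper name collates_before is an assumption for this example only.

```python
def collates_before(a: str, b: str) -> bool:
    """Hypothetical helper: True if a sorts before b when their UTF-8
    bytes are compared as unsigned values, which is the case where
    strcmp(A, B) would return a negative value."""
    return a.encode("utf-8") < b.encode("utf-8")

names = ["gr\u00fcn", "gruen", "G", "z", "gru\u0308n"]

# Sorting by UTF-8 bytes gives the same order as sorting by code points,
# because UTF-8 preserves code point order under bytewise comparison.
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))
by_code_points = sorted(names)
print(by_bytes == by_code_points)               # True

# ASCII "u" (0x75) sorts before the lead byte of "ü" (0xC3).
print(collates_before("gruen", "gr\u00fcn"))    # True
```

Note that the decomposed and precomposed spellings of "grün" land at different positions in this ordering, which is one reason the proposal pairs the strcmp() rule with a single normalization form.
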
>>>
>>> As far as I can tell, these rules are entirely compatible with all
>>> existing versions of the IlmImf library. Users whose writing system
>>> includes non-ASCII Unicode characters can continue to employ the
>>> existing library versions without change.
>>>
>>> Future versions of the library should verify that text strings are
>>> valid UTF-8. In addition, the library should either verify that
>>> strings are normalized to NFC, or normalize to NFC on the fly.
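
As a rough illustration of the two checks proposed above, here is a minimal Python sketch. The function names is_valid_utf8 and is_nfc are hypothetical and do not correspond to anything in IlmImf; the checks rely only on Python's built-in strict UTF-8 decoder and the standard unicodedata module.

```python
import unicodedata

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_nfc(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and already in NFC."""
    if not is_valid_utf8(data):
        return False
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text) == text

precomposed = b"gr\xc3\xbcn"    # UTF-8, U+00FC
decomposed = b"gru\xcc\x88n"    # UTF-8, u + U+0308
latin1 = b"gr\xfcn"             # ISO 8859-1 bytes, not UTF-8

print(is_valid_utf8(precomposed), is_nfc(precomposed))  # True True
print(is_valid_utf8(decomposed), is_nfc(decomposed))    # True False
print(is_valid_utf8(latin1), is_nfc(latin1))            # False False
```

Verifying UTF-8 alone leaves the bytes untouched; only the NFC check (or an on-the-fly normalization) would ever reject or alter a string that is already valid UTF-8.
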
>>
>>
>> If we treat them like raw bytes then we really don't care about the
>> encoding, do we? (that's why I said, "recommended")
>>
>> It would be nice if the thing stayed agnostic.
>>
>> Is there a reason why it needs to enforce the encoding,
>> or is a strong recommendation to use UTF-8 good enough?
--
David