[CF-metadata] Pre-proposal for "charset"

Chris Barker chris.barker at noaa.gov
Wed Feb 22 11:03:48 MST 2017

NOTE: this looks like it got tacked on to another thread -- please start a
new thread for a new topic. (or Gmail messed up...)

Sorry for being dense here, but I'm confused. I see in the netCDF(4) spec:

The atomic external types supported by the netCDF interface are:
NC_CHAR 8-bit character byte
NC_STRING variable length character string *

So shouldn't one use a 2-D (or higher dim) array of NC_CHAR type if that's
indeed what you have?

Or is this about supporting netcdf3, which doesn't (I don't think) have a
string type?

It does have a BYTE type, which I would be inclined to use for a CHAR. But
then I suppose you'd need to tell readers that it was intended to be a

Other notes:

Do folks want/need to support full Unicode characters? If so I think you'd
need a 4 byte type -- call it NC_UCHAR? -- and anything else would be
variable-length, which would kind of kill the whole point of a character

Small note: I'd prefer "encoding" to "charset" -- at least if you want to
support "full" unicode, rather than only one-byte-per-char encodings.

> > The only charsets which are recommended are "ISO-8859-1" and "UTF-8".

UTF-8 is problematic because it uses a variable number of bytes per
character (codepoint?).
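
To make the variable-width point concrete, here is a small Python sketch (the sample characters are my own choice, not from the proposal) that counts how many bytes a few code points take in UTF-8:

```python
# Sample code points chosen for illustration: ASCII letter, Latin-1 letter,
# Euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

So a fixed-length char dimension sized in bytes does not correspond to a fixed number of characters.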

If we want to support proper Unicode, then we need to either:

use a variable-length string type (the netcdf 4 NC_STRING type?)

or

use 4 bytes per char.
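
As a rough illustration of the 4-bytes-per-char option, this Python sketch uses UTF-32 as the fixed-width encoding (my assumption for what "4 bytes per char" would mean in practice; nothing in netCDF defines such a type today):

```python
words = ["It", "Book", "5 \u20ac"]

# "utf-32-le" fixes the byte order explicitly, so no byte-order mark is
# prepended and every code point occupies exactly 4 bytes.
utf32_sizes = {w: len(w.encode("utf-32-le")) for w in words}

for w, size in utf32_sizes.items():
    assert size == 4 * len(w)
    print(f"{w!r}: {len(w)} code points -> {size} bytes")
```

The cost is obvious: mostly-ASCII data takes four times the space it would in a one-byte encoding.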

Since UTF-8 is a superset of ASCII, it can be dangerous -- folks can say
"this is UTF-8", and if they only happen to use the ASCII subset, all works
fine, and then someone goes and tries to put a weird high-codepoint
character in there, and all goes to heck.

I see that netcdf4 supports UTF-8 for names within the file (variable
names, dimension names, etc), but that works because the number of bytes is
known and constant once created.

Again, I'm maybe speaking from ignorance; I haven't dug into Unicode and CF
and netCDF in any depth at all.

> > --- An Example: Encoding three Strings: "It", "Book", and "5 €".
> > The Unicode code point for the Euro symbol is 20AC (in hexadecimal),
> > which is 8364 (in decimal).
> > The Euro symbol is encoded in UTF-8 as 3 bytes: E2 82 AC (in
> > hexadecimal).
> > So a file would store these strings in a char array as:
> >   dimensions
> >     words = 3;
> >     strLen = 5;
> >   char myWords[words][strLen] = "It[0][0][0]", "Book[0]", "5 [E2][82][AC]";
> >     charset = "UTF-8";

This is tough -- how do you know what strLen should be? You could get UTF-8
characters chopped off mid-sequence if it was too short.

Though I suppose that's a problem for the file writer to figure out.
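
For what it's worth, here is one way a writer might handle it, sketched in plain Python with the three strings from the quoted example. The NUL padding mirrors the "[0]" fill shown there; the variable names are mine, and this is not a netCDF library call:

```python
words = ["It", "Book", "5 \u20ac"]

# Encode first, then size the string dimension in bytes, not characters.
encoded = [w.encode("utf-8") for w in words]
str_len = max(len(b) for b in encoded)

# Pad each row with NUL bytes up to the fixed row length.
rows = [b.ljust(str_len, b"\x00") for b in encoded]

# A reader strips the padding before decoding.
decoded = [r.rstrip(b"\x00").decode("utf-8") for r in rows]

# If strLen were one byte too short, the Euro sign would be cut in the
# middle of its 3-byte sequence and decoding would fail.
try:
    encoded[2][: str_len - 1].decode("utf-8")
    truncation_failed = False
except UnicodeDecodeError:
    truncation_failed = True

print(str_len, rows, decoded, truncation_failed)
```

Sizing by character count instead (4 for "Book") would have picked a strLen that chops the Euro sign.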



