[CF-metadata] Pre-proposal for "charset"
chris.barker at noaa.gov
Wed Feb 22 12:06:03 MST 2017
On Wed, Feb 22, 2017 at 10:38 AM, Bob Simons - NOAA Federal <
bob.simons at noaa.gov> wrote:
> As for needing a different subject for the email: I'm lumping together 2
> new related attribute names: "charset=..." and "data_type=string|char" so
> that the information stored in char variables in netcdf-3 files can be
> easily and unambiguously interpreted.
somehow it got smashed in with the thread about geometries.. maybe that was
my email client. But anyway, away we go!
> You are correct. My proposal is for netcdf-3 files since they only support
> chars, not true strings.
so maybe make it clear that for netcdf4, one should use strings? I'm not
sure if there is anything in CF now that is 3 vs 4 specific...
> As for "encoding" vs "charset", I'm open to different names. I chose
> "charset" because that is the name used in HTML and is widely used in other
> places. Yes, XML uses "encoding". To me, the word "charset" seems
> preferable because it is more specific than "encoding" (which also has a
> more general purpose meaning).
not a biggie -- +0 for encoding from me.
> As for full Unicode support via UTF-8 vs UTF-16:
well, UTF-16 is the worst option -- let's never use that! UCS-4 is the way
to go if you want full Unicode support and a constant number of bytes per
character, though it "wastes" space.
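To put rough numbers on the bytes-per-character trade-off, here is a minimal Python sketch. The sample characters and the helper name `bytes_per_char` are made up for illustration, and UTF-32 stands in for UCS-4 since that is the codec name Python exposes:

```python
# Compare how many bytes one character needs in each encoding.
# Sample characters are illustrative: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def bytes_per_char(ch, encoding):
    # The explicit little-endian codecs are used so no byte-order mark is added.
    return len(ch.encode(encoding))

for ch in samples:
    sizes = {enc: bytes_per_char(ch, enc)
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(repr(ch), sizes)
```

Only the UTF-32 column is constant (4 bytes each); UTF-8 ranges from 1 to 4 bytes, and UTF-16 needs 2 bytes for most characters but 4 for the emoji, so it is neither compact for ASCII nor fixed width.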
> Since netcdf-3 only supports 8bit chars, the 16bit UTF-16 is not an option.
well, sure, but at the binary level a CHAR is simply an unsigned 8-bit
integer -- so you could stuff any encoding into an array of CHAR.
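A small sketch of that point, in plain Python rather than any netCDF library: a list of 0-255 integers stands in for a char array, and the helper names `encode_to_chars` and `decode_chars` are hypothetical. It also shows why a reader needs to be told which encoding was used:

```python
def encode_to_chars(text, charset):
    # Each element is an unsigned 8-bit value, standing in for one stored char.
    return list(text.encode(charset))

def decode_chars(chars, charset):
    # Reinterpret the stored 8-bit values as text in the given charset.
    return bytes(chars).decode(charset)

stored = encode_to_chars("caf\u00e9", "utf-8")
print(stored)                               # [99, 97, 102, 195, 169]
print(decode_chars(stored, "utf-8"))        # the original text
print(decode_chars(stored, "iso-8859-1"))   # mojibake: two characters replace the e-acute

# A wider encoding fits in 8-bit storage too, it just takes more elements.
wide = encode_to_chars("caf\u00e9", "utf-32-le")
print(len(wide))                            # 16
```

The same five stored values decode to different text depending on the assumed charset, which is the ambiguity an explicit attribute would remove.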
> But UTF-8 is the only way I know of to support full Unicode using only
> 8bit chars for the underlying storage.
see above, but:
> It is very widely used. Every modern piece of software that can read or
> write text files supports it. It is the default for both XML and HTML 5.
yeah, it really is the best compromise -- and becoming the universal form
for data interchange.
> If the file writer doesn't need full Unicode, they can use "ISO-8859-1"
> (which is compatible with 7bit ASCII)
I'd vote for UTF-8 and ISO-8859-1 as the only options (Or the HIGHLY
RECOMMENDED options, at least).
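On the ASCII compatibility mentioned above, a quick Python check (the sample strings are made up, not taken from any real file): pure 7-bit ASCII text produces identical bytes in ASCII, ISO-8859-1 and UTF-8, and the encodings only diverge once a non-ASCII character shows up.

```python
text = "sea_water_temperature"  # plain 7-bit ASCII, hypothetical example value

as_ascii = text.encode("ascii")
as_latin1 = text.encode("iso-8859-1")
as_utf8 = text.encode("utf-8")
same_bytes = as_ascii == as_latin1 == as_utf8
print(same_bytes)  # True

accented = "Gr\u00f6nland"  # contains one non-ASCII character (o with diaeresis)
latin1_len = len(accented.encode("iso-8859-1"))  # one byte per character
utf8_len = len(accented.encode("utf-8"))         # the non-ASCII character takes two bytes
print(latin1_len, utf8_len)  # 8 9
```

So existing ASCII-only char data would read the same under either charset label; the label only matters for the bytes above 127.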
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov