GRIB renaming in 4.3

What's in a name?

Unidata Users Committee GRIB recommendations for WMO consideration

The Unidata Users Committee has asked the WMO to consider establishing a web registry of GRIB and BUFR tables. Your comments on this idea would be helpful. And if you have any influence with the WMO, please use it!

DAP4 Commentary: DAP4 On-The-Wire Format

Background

The current DAP2 clients use two different approaches to managing the packet of data that is sent by the server.

The C++ libdap library uses what I will call an "eager" evaluation method. By this I mean that the whole packet is processed when received, is decomposed into its constituent parts (e.g. data arrays, sequence records, etc) and those parts are used to annotate the parsed DDS.


netCDF Identifiers and Character Escape Mechanisms (sigh!)

Ideally, netCDF should allow any printable UTF-8 character to be used in an identifier. Currently, that is almost the case, with forward slash being the exception because of the syntax of HDF5 identifiers.

More and more, the netCDF API is being used as a wrapper for a wide variety of other formats: HDF5, HDF4, GRIB, BUFR, DAP2, DAP4, etc. In defining translations between netCDF and these other formats, it is necessary to implicitly or explicitly define netCDF identifiers from the schemas of these other formats.

The canonical example is HDF5. In HDF5, many API functions take a path, which is a sequence of identifiers separated by '/'. A path may be absolute ("/g1/g2/x") or relative ("y"). There appears to be no way in HDF5 to specify an identifier containing '/'; such names are always interpreted as paths. So, if one naively defined, through the netCDF-4 API, a variable named "/x/y", there is no apparent way to get it defined properly in HDF5. This fact has led to the current, IMO undesirable, restriction that netCDF identifiers may not contain '/'.

Super Escapes

This situation is going to recur as the netCDF API is used to wrap other data formats. What we will need is a mechanism by which we can convert an identifier containing arbitrary UTF-8 characters into another identifier drawn from some rather restricted set of legal identifier characters. In addition, I would impose the rule that the conversion be invertible.

This kind of "super-escaping" is very hard because in the worst case, we are likely to encounter the situation where legal identifier characters are restricted to something like the alphanumerics plus underscore.
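To make the invertibility requirement concrete, here is a minimal sketch of one possible escaping scheme. It is an assumption for illustration only, not a scheme proposed in this post: ASCII letters and digits pass through unchanged, and every other UTF-8 byte, including the underscore itself, is written as an underscore followed by two hexadecimal digits. The names super_escape and super_unescape are hypothetical.

```python
# Hypothetical scheme: bytes in SAFE pass through; every other UTF-8 byte
# (including '_' itself, so that decoding is unambiguous) becomes _XX.
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
)


def super_escape(name: str) -> str:
    """Map an arbitrary identifier onto letters, digits, and underscore."""
    out = []
    for byte in name.encode("utf-8"):
        if byte in SAFE:
            out.append(chr(byte))
        else:
            out.append("_%02X" % byte)
    return "".join(out)


def super_unescape(escaped: str) -> str:
    """Invert super_escape; assumes the input was produced by it."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "_":
            # An underscore always introduces exactly two hex digits.
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(escaped[i]))
            i += 1
    return raw.decode("utf-8")
```

Under this scheme the troublesome name "/x/y" becomes "_2Fx_2Fy", which contains no '/'. The sketch ignores further restrictions a target format might impose, such as a maximum identifier length or a ban on leading digits.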

DAP4 Commentary: DDX Lexical Elements

This document describes the lexical elements that occur in the DAP4 grammar.

Within the Relax-NG (rng) DAP4 grammar, there are markers for occurrences of primitive types such as integers, floats, or strings. The markers typically look like this when defining an attribute that can occur in the DAP4 DDX.

<attribute name="namespace"><data type="string"/></attribute>

The "<data type="string"/>" specifies the lexical class for the values that this attribute can have. In this case, the namespace attribute is defined to have a String value. Similar notation is used for values occurring as text within an xml element. The lexical specification later in this document defines the legal lexical structure for such lexical items. Specifically, it defines the format of the following lexical items.
  1. Constants, namely: string, float, integer, and character.
  2. Identifiers

The specification is written using the regular expression notation of ISO/IEC 9945-2:2003 Information technology -- Portable Operating System Interface (POSIX) -- Part 2: System Interfaces. This is the POSIX extended regular expression specification.

I have augmented it in the following ways.

  1. Names are assigned to regular expressions using the notation
    name = regular-expression

  2. Named expressions can be used in subsequent regular expressions by using the notation {name}. Such occurrences are equivalent to textually substituting the expression associated with name for the {name} occurrence, more or less like a macro.
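The two augmentations can be sketched as a small macro expander. This is only an illustration under stated assumptions: the function name expand_definitions is hypothetical, definitions must appear before they are used, and each substituted expression is wrapped in a non-capturing group so that operator precedence survives the textual substitution. The sample rules are adapted to Python syntax (unquoted ll and LL, and HEXCHAR restricted to hexadecimal digits).

```python
import re

# A {name} reference: braces around an identifier. Empty braces and numeric
# repetition counts such as {2} do not match and are left untouched.
REFERENCE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_definitions(definitions):
    """Expand {name} references in an ordered list of (name, regex) pairs."""
    expanded = {}
    for name, expr in definitions:
        expanded[name] = REFERENCE.sub(
            lambda m: "(?:" + expanded[m.group(1)] + ")", expr
        )
    return expanded


RULES = [
    ("HEXCHAR", "[0-9a-fA-F]"),
    ("INTTYPE", "[BbSsLl]|ll|LL"),
    ("HEXSTRING", "0[xX]{HEXCHAR}+"),
    ("HEXINT", "{HEXSTRING}{INTTYPE}?"),
]
PATTERNS = expand_definitions(RULES)
```

With these rules, re.fullmatch(PATTERNS["HEXINT"], "0x1FLL") succeeds, while a reference to a name that has not yet been defined raises a KeyError.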

DAP4 Lexical elements

Notes:
  1. The definition of {UTF8} is deferred to the next section.

  2. Comments are indicated using the "//" notation.

  3. Standard xml escape formats (&xDD;) are assumed to be allowed anywhere.

Basic character set definitions

CONTROLS   = [\x00-\x1F] // ASCII control characters
WHITESPACE = [ \r\n\t\f]+
HEXCHAR    = [0-9a-fA-F]
// ASCII printable characters
ASCII      = [0-9a-zA-Z !"#$%&'()*+,\-./:;<=>?@[\\\]^_`|{}~]

ASCII characters that may appear unescaped in Identifiers

This is assumed to be basically all ASCII printable characters except the space, '.', '/', the double quote, the single quote, and '&'. Occurrences of these characters are assumed to be representable using the standard xml '&xx;' notation.

IDASCII    = [0-9a-zA-Z!#$%'()*+,\-:;<=>?@[\\\]^_`|{}~]

The numeric classes: integer and float

INTEGER    = {INT}|{UINT}|{HEXINT}
INT = [+-][0-9]+{INTTYPE}?
UINT = [0-9]+{INTTYPE}?
HEXINT = {HEXSTRING}{INTTYPE}?
INTTYPE = ([BbSsLl]|"ll"|"LL")
HEXSTRING = (0[xX]{HEXCHAR}+)

FLOAT = ({MANTISSA}{EXPONENT}?)|{NANINF}
EXPONENT = ([eE][+-]?[0-9]+)
MANTISSA = [+-]?[0-9]*\.[0-9]*
NANINF = (-?inf|nan|NaN)

The Character classes

STRING     = ([^"&]|{XMLESCAPE})*
CHARACTER = ([^'&]|{XMLESCAPE})

Note that the character type only supports ASCII characters because it can only hold a single 8-bit byte.

The Identifier class

ID         = {IDCHAR}+
IDCHAR = ({IDASCII}|{XMLESCAPE}|{UTF8})
XMLESCAPE = &x{HEXCHAR}{HEXCHAR};

Note that the above lexical element classes are not disjoint. For example, the sequence of characters 1234 can be an identifier, a float, or an integer. So the order of testing is assumed to be this.

  1. INTEGER
  2. FLOAT
  3. ID
  4. STRING
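The testing order can be sketched as a small classifier. This is a rough Python transcription under several assumptions: the decimal point in MANTISSA is taken literally, hexadecimal digits are limited to 0-9, a-f, and A-F, and the identifier class is approximated at the level of Python strings as "any character other than controls, space, '.', '/', either quote, and '&', or an xml escape" instead of the byte-level {IDASCII} and {UTF8} classes. The function name classify is hypothetical.

```python
import re

INTTYPE = r"(?:[BbSsLl]|ll|LL)"
INTEGER = (
    rf"(?:[+-][0-9]+{INTTYPE}?"        # INT
    rf"|[0-9]+{INTTYPE}?"              # UINT
    rf"|0[xX][0-9a-fA-F]+{INTTYPE}?)"  # HEXINT
)
EXPONENT = r"(?:[eE][+-]?[0-9]+)"
FLOAT = rf"(?:[+-]?[0-9]*\.[0-9]*{EXPONENT}?|-?inf|nan|NaN)"
XMLESCAPE = r"&x[0-9a-fA-F][0-9a-fA-F];"
# \x22 is the double quote and \x27 the single quote.
ID = rf"(?:[^\x00-\x20\x7F./\x22\x27&]|{XMLESCAPE})+"
STRING = rf"(?:[^\x22&]|{XMLESCAPE})*"

# The order of this list is the order of testing given above.
TEST_ORDER = [
    ("INTEGER", re.compile(INTEGER)),
    ("FLOAT", re.compile(FLOAT)),
    ("ID", re.compile(ID)),
    ("STRING", re.compile(STRING)),
]


def classify(text):
    """Return the name of the first class that matches all of text, or None."""
    for name, pattern in TEST_ORDER:
        if pattern.fullmatch(text):
            return name
    return None
```

Because INTEGER is tested first, classify("1234") returns "INTEGER" even though the same characters would also be accepted as an identifier.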

UTF-8 Character Encodings

Our discussion of the UTF-8 character encoding is based on this document: https://www.w3.org/2005/03/23-lex-U

The most correct (validating) version of the UTF8 character set is as follows.

UTF8 =   ([\xC2-\xDF][\x80-\xBF])
       | (\xE0[\xA0-\xBF][\x80-\xBF])
       | ([\xE1-\xEC][\x80-\xBF][\x80-\xBF])
       | (\xED[\x80-\x9F][\x80-\xBF])
       | ([\xEE-\xEF][\x80-\xBF][\x80-\xBF])
       | (\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF])
       | ([\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF])
       | (\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF])

The lines of the expression cover the UTF8 characters as follows:
  1. non-overlong 2-byte
  2. excluding overlongs
  3. straight 3-byte
  4. excluding surrogates
  5. straight 3-byte
  6. planes 1-3
  7. planes 4-15
  8. plane 16

Note that ASCII and control characters are not included.
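As a sanity check, the validating expression can be transcribed into a Python bytes pattern, one alternative per line of the expression. This is a sketch, not part of the specification; the names UTF8_STRICT and is_utf8_char are hypothetical.

```python
import re

# One alternative per line of the validating expression, in the same order.
UTF8_STRICT = re.compile(
    rb"(?:[\xC2-\xDF][\x80-\xBF]"         # non-overlong 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"       # excluding overlongs
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"        # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"       # excluding surrogates
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"        # straight 3-byte
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"    # planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"        # planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"   # plane 16
)


def is_utf8_char(data: bytes) -> bool:
    """True if data is exactly one non-ASCII character in valid UTF-8."""
    return UTF8_STRICT.fullmatch(data) is not None
```

For example, is_utf8_char("é".encode("utf-8")) is True, while the overlong sequence b"\xC0\x80", the surrogate encoding b"\xED\xA0\x80", and the plain ASCII byte b"A" are all rejected.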

The above reference also defines some alternative regular expressions.

The most relaxed version of UTF8 is this.

UTF8 =   ([\xC0-\xD6].)
       | ([\xE0-\xEF]..)
       | ([\xF0-\xF7]...)

The partially relaxed version of UTF8 is this.

UTF8 =   ([\xC0-\xD6][\x80-\xBF])
       | ([\xE0-\xEF][\x80-\xBF][\x80-\xBF])
       | ([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])

We deem it acceptable to use this last relaxed expression for validating UTF-8 character strings.
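To see what the relaxation gives up, the following sketch transcribes the partially relaxed expression exactly as written above into a Python bytes pattern and compares it with Python's own strict UTF-8 decoder. The names UTF8_PARTIAL, accepts_partial, and python_accepts are hypothetical.

```python
import re

# The partially relaxed expression, transcribed as written above.
UTF8_PARTIAL = re.compile(
    rb"(?:[\xC0-\xD6][\x80-\xBF]"
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"
    rb"|[\xF0-\xF7][\x80-\xBF]{3})"
)


def accepts_partial(data: bytes) -> bool:
    """True if data is one character according to the relaxed expression."""
    return UTF8_PARTIAL.fullmatch(data) is not None


def python_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The relaxed form checks only the lead byte and the continuation-byte ranges, so it accepts the overlong sequence b"\xC0\x80" and the surrogate encoding b"\xED\xA0\x80", both of which python_accepts rejects. It still rejects a malformed continuation byte such as b"\xC3\x41".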

Unidata Developer's Blog
A weblog about software development by Unidata developers*