Quincey Koziol wrote:
Hi John,
Hi Quincey, some thoughts on your proposal:
1. A few notes on naming differences between the netCDF and HDF5 data model:
A netCDF *Variable* is a multidimensional array of primitive
values, roughly corresponding to an HDF5 *Dataset*.
Yup.
A netCDF *Dimension* is a named array index. Dimensions are globally
scoped, so they can be shared. A Variable specifies its dimensionality by
referencing a set of Dimensions; this set corresponds to an HDF5
*Dataspace*. There is no exact HDF5 equivalent to a Dimension as I
understand it. The fact that Variables can share Dimensions adds an
important meaning to netCDF files.
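
To make the sharing point concrete, here is a minimal sketch in plain Python. It models the data model only and is not the netCDF or HDF5 API; the class names `Dimension` and `Variable`, the helper `shared_dimensions`, and the example variables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """A named array index with a length (hypothetical model, not a library type)."""
    name: str
    length: int


@dataclass
class Variable:
    """A multidimensional array described only by its name and its Dimensions."""
    name: str
    dims: tuple  # Dimension objects, in index order

    @property
    def shape(self):
        # The shape is derived from the referenced Dimensions, never stored twice.
        return tuple(d.length for d in self.dims)


def shared_dimensions(a, b):
    """Return the names of the Dimensions that both Variables reference."""
    names_b = {d.name for d in b.dims}
    return [d.name for d in a.dims if d.name in names_b]


# Dimensions are defined once, at file scope (illustrative names and sizes).
time_dim = Dimension("time", 4)
lat_dim = Dimension("lat", 3)
lon_dim = Dimension("lon", 5)

# Several Variables reference the same Dimension objects.
temperature = Variable("temperature", (time_dim, lat_dim, lon_dim))
pressure = Variable("pressure", (time_dim, lat_dim, lon_dim))
station_id = Variable("station_id", (time_dim,))
```

Because `temperature` and `pressure` reference the very same `Dimension` objects, a reader can tell that they lie on the same grid without comparing shapes, which is the extra meaning that sharing adds.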
This document introduces dimensions as an optional method of composing
a dataspace in HDF5, so they ought to be completely analogous to netCDF
dimensions.
sorry, I didn't realize you were defining dimensions separately from
dimension scales. that's very good, from my POV.
One possible difference is that I wasn't planning on naming the dimensions
within a dataspace. They were just going to be indexed by their rank within
the dataspace (i.e. the 0th dimension, the 1st dimension, etc). This could
reference a named dimension through an indirect dimension (see the shareability
document), but the actual dimensions in the dataspace weren't planned on having
names associated with them.
only shared dimensions need be named.
Do you think this is an important requirement? Does the netCDF API
require that the dimensions in a dataspace for a dataset have names, or
will having shared dimensions using the names of dimension objects in the
grouping hierarchy be sufficient?
netCDF only has shared dimensions, so they are always named.
A netCDF *Coordinate Variable* is a 1D Variable whose name matches
its dimension's name, and whose values are monotonic. This corresponds
to your proposed *Dimension Scale*. Note that a netCDF Dimension
describes array indices, whereas a Coordinate Variable / Dimension Scale
describes the coordinate values assigned to each index of the corresponding
Dimension.
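
The Coordinate Variable definition above can be sketched as a small check in plain Python. The function name `is_coordinate_variable` and its arguments are hypothetical, and treating "monotonic" as strictly increasing or strictly decreasing is an assumption; the text above only says "monotonic".

```python
def is_coordinate_variable(name, dim_names, values):
    """Check the three conditions: 1D, name matches its dimension, monotonic values.

    name      -- the variable's name
    dim_names -- names of the dimensions the variable is defined on, in order
    values    -- the variable's values as a flat sequence
    """
    # Must be 1D, and the variable name must equal its single dimension's name.
    if len(dim_names) != 1 or dim_names[0] != name:
        return False
    # Assumption: monotonic means strictly increasing or strictly decreasing.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


# lat(lat) with increasing values qualifies.
ok = is_coordinate_variable("lat", ("lat",), [-90.0, 0.0, 90.0])
# lat(x, y) is 2D, so it is not a Coordinate Variable in this sense.
two_d = is_coordinate_variable("lat", ("x", "y"), [10.0, 20.0, 30.0])
# time(time) with values that go up and then down is not monotonic.
unsorted = is_coordinate_variable("time", ("time",), [0, 2, 1])
```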
Yes, I designed the new HDF5 Dimension Scale model to be compatible
with netCDF Coordinate Variables (ideally, Dimension Scales will be a superset
of Coordinate Variables). I'm still not totally pleased with the term "scale"
and somewhat lean toward using netCDF's "coordinates" term since that more
accurately describes their true meaning, but since HDF4 used "scale", I may end
up sticking with the term... :-/
2. So, generally I like your Dimension Scale proposal. The main things
we need are 1) shared Dimensions even when there's no coordinate
variable (perhaps a Dimension Scale without the values?),
Actually, the HDF5 Dimensions will be able to be shared by different
dataspaces without involving any Dimension Scales.
good
2) each Dimension Scale must have a name;
Yes, that's the primary method of indexing them from a dimension. I
imagine we may have an API function to get the n'th scale, but that's not
a requirement at this point.
good
and 3) a Variable/Dataset can specify
its dimensionality/Dataspace by listing the Dimensions (or their names).
I'm planning on adding API functions for "composing" a dataspace from
dimensions and then that "composed" dataspace could be used to create datasets.
good
3. While 1D Coordinate Variables / Dimension Scales are the common case,
there are also datasets that need different kinds of coordinate systems,
including multidimensional coordinate variables. I am eager that netCDF
/ HDF5 can support these, but I think they can be built on top of the
current functionality, and so we can leave them out of this discussion
so as to keep things from getting too complicated. (for more details on
those ideas, see chapter 3.1 of the java-netcdf user manual).
As I mentioned to Russ and Ed last week, I think that having support for
coordinate systems (I was calling them "multi-dimensional scales" at the time)
is an important feature to include. I've printed the java-netcdf user
manual and will be using it for reference during further iterations on the
HDF5 dimension scale design to try to include this concept. I imagine that I'll
associate them with the dataspace directly instead of hanging them off the
dimensions (since the dataspace can be multi-dimensional and the dimensions are
1-D by definition).
Also, I was considering cutting the ability of dimensions to have multiple
scales associated with them (to simplify things), but glancing through the
java-netcdf information, it looks like that may be an important feature.
What's your opinion about how critical that is and how often it is used?
Quincey
I think there are 2 interesting examples if you try to handle coordinate
systems in a general way:
1. float lat(x,y) and float lon(x,y) assign latitude and longitude
coordinates to points on a projection plane. this is the
"multidimensional" case.
2. lat(sample), lon(sample), altitude(sample) might be a coordinate
system for variable O3(sample). this is the "1D trajectory" case.
So, what I came up with is that a coordinate system for a
variable/dataset is a collection of "coordinate axes" which can have any
dimensionality, but whose dimensions must all appear in the set of
dimensions used by the variable/dataset. Adding this info to the
dataspace is exactly right.
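
As a rough illustration of that rule, the sketch below checks that every coordinate axis only uses dimensions that the variable itself uses. It is plain Python, not a netCDF or HDF5 call; the function name `valid_coordinate_system` is hypothetical, and the data variable dimensions `(time, x, y)` used for the projection case are an assumed example, since only the axes are given above.

```python
def valid_coordinate_system(var_dims, axes):
    """Return True if every axis's dimensions appear among the variable's dimensions.

    var_dims -- names of the dimensions used by the variable/dataset
    axes     -- mapping of coordinate axis name -> tuple of its dimension names
    """
    allowed = set(var_dims)
    return all(set(dims) <= allowed for dims in axes.values())


# Multidimensional case: lat(x,y) and lon(x,y) on a projection plane.
projected = {"lat": ("x", "y"), "lon": ("x", "y")}

# 1D trajectory case: lat(sample), lon(sample), altitude(sample) for O3(sample).
trajectory = {
    "lat": ("sample",),
    "lon": ("sample",),
    "altitude": ("sample",),
}

# Assumed data variable on (time, x, y): the 2D axes are acceptable.
grid_ok = valid_coordinate_system(("time", "x", "y"), projected)
# O3(sample) with the trajectory axes is acceptable.
traj_ok = valid_coordinate_system(("sample",), trajectory)
# O3(sample) cannot use lat(x,y)/lon(x,y): x and y are not among its dimensions.
mismatch = valid_coordinate_system(("sample",), projected)
```

Note that the axes may have any dimensionality; the only constraint checked is the subset relation on dimension names.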
Because the common case is that all or most of the variables/datasets in
a file use the same coordinate system, it's nice to factor this
information out. So if the dataspace can be shared and the coordinate
system can be associated with the dataspace, that would be party time
most excellent.
BTW, a mathematical formulation behind this (a little out of date but
useful if you like formalisms) is at
http://www.unidata.ucar.edu/staff/caron/papers/CoordMath.htm
there's still one piece that you *might* want to tackle. the above is a
framework for general coordinate systems. our users generally want
georeferencing coordinate systems. this involves identifying which of
the coordinate axes correspond to the x, y, z, and t coordinates. this can
be a big can of worms; e.g., if you've ever looked at GIS specs, they are
complex. We have developed a set of very simple specs that so far have
satisfied most of our datasets, using "attribute conventions" outside
any explicit library support. I can understand if you don't want to add
any more complications. However, I will say that IMHO getting
georeferencing coordinate systems clearly specified (i.e., not having to use
attribute conventions) would be a huge win for our communities, and one
that's really doable.