================ Here's the NCSA reply to the Unidata reply ========
>To: Russ Rew <russ@xxxxxxxxxxxxxxxx>
>From: mfolk@xxxxxxxxxxxxx
>Subject: Re: Comments on netCDF/HDF Draft Design
>Cc: netcdf-hdf@xxxxxxxxxxxxxxxx
>
>netcdf-hdf group:
>
>This is a response to the response from the
>Unidata team to the netCDF/HDF Design Document.
>The original response was posted to
>netcdf-hdf@xxxxxxxxxxxxxxxx on April 21.
>
>Mike and Chris
>
>=============================================================
>
>Russ et al:
>
>Thanks for your response to the "netCDF/HDF Design Document." Now that
>we have to really get the project going, things aren't nearly so simple,
>and this kind of feedback is extremely useful.
>
>We have gone over your response, and I've put together some
>responses and clarifications, which follow.
>
>Mike & Chris
>
>>Mike,
>>
>...
>>
>>First, it is not clear from the draft what you intend with regard to data
>>stored in the current netCDF format. More specifically, will it be possible
>>to use tools written to the planned netCDF/HDF interface on archives of
>>current netCDF files? Or will such archives have to be converted to HDF
>>first? We would naturally prefer that the implementation be able to
>>recognize whether it is dealing with HDF or netCDF files. Symmetrically,
>>you should expect that if a program uses the netCDF/HDF interface
>>implementation to create a file, then our library should be able to deal
>>with it, (though we currently don't have the resources to be able to commit
>>to this). In fact this goal could be stated more strongly:
>>
>> Data created by a program that uses the netCDF interface should be
>> accessible to other programs that use the netCDF interface without
>> relinking.
>>
>>This principle covers portability of data across different platforms, but
>>also implies that a consolidated library must handle both netCDF/HDF and
>>netCDF file formats and must maintain backward compatibility for archives
>>stored using previous versions of the formats. This level of
>>interoperability may turn out to be impractical, but we feel it is a
>>desirable goal.
>
>We agree that it is desirable that users not have to think about (or even
>know) how their data is organized. The difficulties involved in
>maintaining two or more storage formats are ones we already have to
>deal with just within HDF. There are instances where we've
>developed newer, better ways of organizing a particular object.
>It isn't fun, but so far it's been manageable.
>
>What worries me about this policy over the long term is the
>cumulative work involved as new platforms get introduced, and
>new versions of operating systems and programming languages
>are introduced. As these sorts of things happen, we would
>prefer not to be committed to supporting all of the old
>"outdated" formats.
>
>Initially we definitely will support both old and new netCDF formats.
>We just don't want to guarantee that we will carry it over to new
>platforms and machines.
>
>There is another issue that has to do with supporting "old"
>things. Based on feedback we're getting from loyal HDF users, we'll
>probably want to extend that idea to data models, too. For example,
>some heavy users would rather stick with the predefined SDS model
>than the more general netCDF model. In a sense, that's no problem
>since netCDF provides a superset of SDS. We might define SDS as a
>standard netCDF data abstraction for a certain range of applications.
>The same has been suggested of raster images.
>
>Still, this kind of thing could be very confusing to users trying to
>decide whether to use one or the other interface. In addition, we would
>want all software to know that it can treat something stored as an
>SDS the same way it treats an equivalent netCDF object.
>
>I suspect you people have already faced this problem with differently
>defined netCDFs. My guess would be that the problem is manageable if
>the number of different abstractions is small. I'd be interested in
>your observations.
>
>> It seems to be implied by the sentence on page 4:
>>
>> A hybrid implementation will also give HDF users the power of the netCDF
>> tools while at the same time making the HDF tools available to netCDF users.
>>
>>Note that one possible way to achieve this goal is to recognize the file
>>type when a file is opened, and use the current implementations of the HDF
>>and netCDF libraries as appropriate. A new flag used when creating a file
>>could specify which type of file representation was desired.
>
>Yes, this would be a way to do it. I would like to encourage one
>format only, however, because in the long run it would make for
>greater interoperability among programs.
>
>>
>>This use of two different representations for data accessed by the same
>>interface can be justified if each representation has clear benefits;
>>otherwise, we should agree on using the single superior representation and
>>relegating the other to read-only support as long as useful archives in that
>>form exist. If a representation based on VSets is superior to the current
>>netCDF representation in some ways and inferior in other significant ways,
>>then the use of both representations is likely to continue. For example, it
>>may be possible to support variable-length records with the VSet
>>implementation at the cost of slower hyperslab access. In such a case,
>>users would benefit most if the alternative tradeoffs captured in the two
>>different representations were available from a single library at file
>>creation time.
>
>Good example. I think there will be times when the current netCDF
>format is definitely superior. For example, suppose I have three variables
>with the unlimited dimension, stored in an interleaved fashion.
>If I access a hyperslab of "records", taking the same slab from all
>three variables, I might be able to avoid the three seeks I would have
>to make using the Vset approach (as currently designed--could change).
>
>Another option would be to implement the netCDF physical format as an
>option within HDF, so that the basic file format would still be HDF
>but the physical storage would follow the old netCDF scheme. (This
>is a little tricky for the example I've given, and may be really dumb.)
>We already have the option of different physical storage schemes for
>individual objects (contiguous, linked blocks, and external), so the
>concept is there, sort of.
>
>>Although it may be too early to determine the advantages or
>>disadvantages of one representation over the other, perhaps it needs to be
>>made more clear how the benefits of the VSet-based implementation compare
>>with the implementation costs and the potential space and performance
>>penalties discussed in section 3.
>
>Good idea. We will try to expand that section. Meantime, it would
>help us if you could share with us anything you've written on why
>you chose the format you did. We have tried to determine the strengths
>and weaknesses of the current format, but you have certainly thought
>about it more than we have.
>
>>
>>We could not determine from the draft whether this project includes
>>resources for rewriting existing HDF tools to use the netCDF/HDF
>>interface.
>
>That isn't covered in the draft, but in the NSF proposal we say we'll do
>that during the second year of the project. With the EOS decision and
>possible extra funding, we may do it sooner. It depends a lot
>on what EOS decides should be given priority.
>
>We've already had meetings with our tool developers and others about
>doing this, and it seems pretty straightforward, especially if we
>ignore attributes that NCSA tools don't yet know about.
>By the way, Ben Domenico mentioned some time ago that he might
>assign somebody the task of adapting X-DataSlice to read netCDF.
>Did that ever happen?
>
>>If so, will these tools also use other HDF interfaces or low-level HDF
>>calls? If so, they may not be very useful to the netCDF user community.
>
>Good point. We now have a situation in which any of a
>number of different types of data can be usefully read by the
>same tool. 8-bit raster, 24-bit raster, 32-bit float, 16-bit
>integer, etc., all can be thought of as "images." How we sort
>this out, or let the users sort it out, is going to be tricky.
>
>>This is a question of completeness of the interface. If the netCDF/HDF
>>interface is still missing some functionality needed by the tools and
>>requiring the use of other HDF interfaces, perhaps it would be better to
>>augment the netCDF/HDF interface to make it completely adequate for such
>>tools.
>
>This is an issue that we now need to really tackle. It highlights
>the fact that HDF has a number of interfaces (and correspondingly a
>number of data models, I guess), whereas netCDF presents a single
>data model (I guess). There are pros and cons to each approach,
>which we probably should explicate at some point. Pros and cons
>aside, netCDF seems to cover a goodly portion of what
>the other HDF interfaces cover. The SDS interface obviously
>fits well into netCDF. The raster image interface can
>be described in terms of netCDF (8-bit for sure, 24-bit
>less well), though it seems to work so well with its current organization
>that we'll have to think hard about whether to convert it to netCDF.
>Palettes, same. Annotations, maybe not as well, especially when we
>support appending to annotations and multiple annotations per
>object.
>
>What's left is Vsets, which we put in to support unstructured
>grids, as well as providing a general grouping structure. Vsets
>have become very popular, and seem to fill a number of needs. I
>think the SILO extensions to netCDF may actually give us a nice
>"extended" netCDF that will cover many of the high level applications
>of Vsets.
>
>We never did think of Vsets as being a high level interface,
>but rather as a collection of routines that would
>facilitate building complex organizations for certain applications,
>such as graphics and finite element applications. SILO appears
>to give us that higher level extension.
>
>
>>
>>Here are some more specific comments on the draft design document, in order
>>of appearance in the draft document:
>>
>>On page 1, paragraph 1, you state:
>>
>> [netCDF] has a number of limiting factors. Foremost among them are
>> problems of speed, extensibility and supporting code.
>>
>>If the netCDF model permitted more extensibility by allowing users to define
>>their own basic data types, for example, it might be impractical to write
>>fully general netCDF programs like the netCDF operators we have specified.
>>There is a tradeoff between extensibility and generality of programs that
>>may be written to a particular data model. The ultimate extensibility is to
>>permit users to write any type of data to a file, e.g. fwrite(), but then
>>no useful high-level tools can be written that exploit the data model; it
>>becomes equivalent to a low-level data-access interface. The lack of
>>extensibility may thus be viewed as a carefully chosen tradeoff rather than
>>a correctable disadvantage.
>
>Good point. Highlights the fact that HDF concentrated in its early
>days on providing a format that would support a variety of data models,
>whereas CDF went for a single, more general model, taking the position
>that the file format was not nearly as important. Also highlights
>the fact that, for the time being at least, we feel there is enough
>value in the multiple-model/extensibility aspects of HDF that we
>want to keep them. netCDF would be one of several data models
>supported in HDF, at least initially.
>
>>
>>On page 2, paragraph 2:
>>
>> The Unidata implementation only allows for a single unlimited dimension
>> per data set. Expectations are that the HDF implementation will not have
>> such a limitation.
>>
>>We are somewhat skeptical about the practicality of supporting both multiple
>>unlimited dimensions and efficient direct-access to hyperslabs. Consider a
>>single two-dimensional array with both dimensions unlimited. Imagine
>>starting with a 2 by 2 array, then adding a third column (making it 2 by 3),
>>then adding a third row, (making it 3 by 3), then adding a fourth column
>>(making it 3 by 4), and so on, until you have an N by N array. Keeping the
>>data contiguous is impractical, because it would require about 2*N copying
>>operations, resulting in an unacceptably slow O(N**3) access algorithm for
>>O(N**2) data elements. The alternative of keeping each incremental row and
>>column in its own VData would mean that accessing either the first row or
>>the first column, for example, would require O(N) reads, and there would be
>>no easy way of reading all the elements in the array by row or by column
>>that did not require multiple reads for many of the data blocks. With the
>>current implementation, each row requires only 1 read and all the elements
>>in the array may be read efficiently from the N row records.
>
>Yes, this was less clear in the paper than it should have been. For
>exactly the reasons you have outlined above, the restriction that
>any variable could only have a single unlimited dimension would have
>to remain. However, it should be possible to have a variable X
>dependent on unlimited dimension 'time' and a variable Y dependent on
>unlimited dimension 'foo' in the same file.
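>The copying cost in the quoted example can be checked with a quick
>simulation (an illustrative sketch, not library code):

```python
# Sketch: cost of keeping a 2-D array contiguous in row-major order
# while both dimensions grow, as in the quoted 2x2 -> NxN example.

def copies_to_grow(n):
    """Grow a contiguous row-major array from 2x2 to n x n by
    alternately adding a column and a row; return the total number of
    elements copied.  Adding a row just appends at the end of the
    block; adding a column forces relaying out every existing
    element to widen each row."""
    rows = cols = 2
    copied = 0
    while cols < n or rows < n:
        if cols < n:
            copied += rows * cols   # full relayout on column append
            cols += 1
        if rows < n:
            rows += 1               # append at end: no copying
    return copied

# Copy cost grows roughly as N**3 for only N**2 data elements:
print(copies_to_grow(10))                          # 284
print(copies_to_grow(200) / copies_to_grow(100))   # ~8x for doubling N
```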
>
>>
>>
>>Most netCDF programs we have seen use direct access to hyperslabs, and we
>>think maintaining efficient direct access to hyperslabs of multidimensional
>>data should be an important goal. If you can eliminate the current netCDF
>>restriction of only a single unlimited dimension while preserving efficient
>>hyperslab access, we would be very impressed.
>
>So would we :-).
>
>>
>>Page 2, paragraph 5:
>>
>> One of the primary drawbacks of the existing Unidata implementation is
>> that it is based on XDR.
>>
>>This is another case where a particular tradeoff can be viewed as a drawback
>>or a feature, depending on the requirements. Use of a single specific
>>external data format is an advantage when maintaining the code, comparing
>>files written on different platforms, or supporting a large number of
>>platforms. Use of native format and converters, as in HDF, means that the
>>addition of a new platform requires writing conversions to all other
>>existing representations, whereas netCDF requires only conversion to and
>>from XDR. The performance of netCDF in some common applications relates
>>more to the stdio layer below XDR than to XDR: the buffering scheme of stdio
>>is not optimal for styles of access used by netCDF. We have evidence that
>>this can be fixed without abandoning XDR or the advantages of a single
>>external representation.
>
>Just one clarification here: HDF offers native mode only on the
>condition that there will be no conversion. Some day we might
>offer conversions from and to all representations, but not now. We've
>only gotten a little flak about that.
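>The scaling argument behind the single-external-format position in
>the quoted paragraph can be made concrete (illustrative arithmetic
>only):

```python
# Sketch: how many converters each approach needs as platforms are
# added.

def converters_single_external(n_platforms):
    # Each platform needs one conversion to and one from the common
    # external form (e.g. XDR): growth is linear in platforms.
    return 2 * n_platforms

def converters_pairwise_native(n_platforms):
    # Full native-to-native support needs a directed converter for
    # every ordered pair of distinct platforms: quadratic growth.
    return n_platforms * (n_platforms - 1)

for n in (4, 8, 16):
    print(n, converters_single_external(n), converters_pairwise_native(n))
```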
>>
>...
>
>>Page 4, paragraph 6:
>>
>> For instance, it will then be possible to associate a 24-bit raster image
>> with a [netCDF] variable.
>>
>>We're not sure how it would be possible to access such data using the
>>existing netCDF interface. For example, if you used ncvarget(), would you
>>have to provide the address of a structure for the data to be placed in? If
>>other new types are added, how can generic programs handle the data? What
>>is returned by ncvarinq() for the type of such data? Do you intend that
>>attributes can have new types like "24-bit raster image" also? As for
>>storing 24-bit data efficiently, we have circulated a proposal for packed
>>netCDF data using three new reserved attributes that would support this.
>>
>
>Yeah. Good questions. We haven't tackled them yet.
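>For what it's worth, scale/offset packing of the general sort the
>quoted proposal hints at can be sketched as follows (the scheme and
>names here are illustrative guesses, not the actual proposal):

```python
# Sketch of scale/offset packing: floats stored as small integers,
# with the scale and offset recorded as attributes of the variable.

def pack(values, scale, offset):
    """Pack floats into integers: stored = round((v - offset) / scale)."""
    return [round((v - offset) / scale) for v in values]

def unpack(stored, scale, offset):
    """Recover approximate originals: v = stored * scale + offset."""
    return [s * scale + offset for s in stored]

temps = [21.3, 22.7, 19.9]            # degrees C
scale, offset = 0.1, 0.0              # pack to tenths of a degree
stored = pack(temps, scale, offset)
print(stored)                         # [213, 227, 199]
print(unpack(stored, scale, offset))  # approximately the originals
```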
>
>
>...
>
>>Page 8, paragraph 1:
>>
>> The current VGroup access routines would require a linear search through
>> the contents of a VGroup when performing lookup functions. ... Because a
>> variable's VGroup may contain other elements (dimensions, attributes, etc.
>> ...) it is not sufficient to go to the Xth child of the VGroup when
>> looking for the Xth record.
>>
>>As stated above, we think it is very important to preserve direct access to
>>netCDF data, and to keep hyperslab access efficient.
>>
>
>For the time being, we have decided to place all of a record
>variable's data into a single VData. In doing so, we have retained
>fast hyperslab access (in fact it is even faster because all of a
>variable's data is contiguous). As a side note, VDatas are able to
>efficiently store and retrieve 8-bit data.
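>The offset arithmetic that makes the single-VData layout fast can be
>sketched as follows (illustrative Python, not actual VData code):

```python
# Sketch: with all of a record variable's data contiguous in one
# block, any element of a hyperslab maps to a flat offset by simple
# row-major arithmetic -- no per-record lookup or linear search.

def flat_offset(index, shape):
    """Row-major flat offset of a multidimensional index."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

shape = (5, 4, 3)                      # e.g. (record, lat, lon)
print(flat_offset((2, 1, 0), shape))   # 2*12 + 1*3 + 0 = 27
```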
>
>It is not yet clear whether people will require the flexibility of
>storing data in separate objects. If it does seem that users wish to
>be able to store data in a distributed fashion, we will add that capability
>later. Rather than using a 'threshold' as outlined in the draft you
>received, we are now leaning towards providing a reserved attribute
>that the user can set to indicate whether they require all of the data
>to be in a single VData or in multiple ones.
>
>The problem with representing this information at the level of an
>attribute is how to differentiate between "user" and "system"
>attributes. For instance, if someone writes out some data, goes
>into redef() mode, changes the "contiguousness" / packing /
>fill-values, and then tries to write more data, things are going to
>be all messed up.
>
>Are there plans to logically separate the two types of attributes
>(i.e. define_sys_attr() and define_user_attr())? Or is the distinction
>just based on syntactic convention (i.e. names with leading
>underscores...)? What happens when the user wants a mutable attribute
>whose name has a leading underscore?
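>As we understand it, the leading-underscore convention amounts to no
>more than this (illustrative sketch; _FillValue is one such reserved
>name):

```python
# Sketch: the syntactic convention where a leading underscore marks an
# attribute name as "system" (reserved) rather than "user".

def is_system_attribute(name):
    return name.startswith("_")

print(is_system_attribute("_FillValue"))   # True
print(is_system_attribute("units"))        # False
```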
>
>>Page 8, paragraph 6:
>>
>> Furthermore, Unidata is in the process of adding operators to netCDF,
>> which may be lost by adopting SILO as a front-end.
>>
>>The netCDF operators do not currently involve any extensions to the netCDF
>>library; they are written entirely on top of the current library interface.
>>It is possible that we will want to add an additional library function later
>>to provide more efficient support for some of the netCDF operators (e.g.
>>ncvarcpy() which would copy a variable from one netCDF file to another
>>without going through the XDR layer). We agree with your decision to use
>>the Unidata netCDF library rather than SILO as the "front-end".
>
>Because SILO was developed at Lawrence Livermore, it will be
>impossible to use the existing SILO code in any public domain
>software. We are currently investigating whether we will even be able
>to use the *ideas* developed within the Lab in the public domain.
>
>We plan to release a description of the SILO interface over the netCDF
>/ HDF mailing list in the near future to see if anyone has different
>suggestions about how to model mesh data within the context of netCDF.
>
>
>
>>
>>We have set up a mailing list here for Unidata staff who are interested in
>>the netCDF/HDF project: netcdf-hdf@xxxxxxxxxxxxxxxx. Feel free to send
>>additional responses or draft documents to that address or to individual
>>Unidata staff members.
>>
>>----
>>Russ Rew russ@xxxxxxxxxxxxxxxx
>>Unidata Program Center
>>University Corporation for Atmospheric Research
>>P.O. Box 3000
>>Boulder, Colorado 80307-3000
>
>