Developments in the NetCDF C Library for the 4.1.2 Release

There have been many performance improvements in the upcoming netCDF-4.1.2 release.

One improvement is a complete refactor of all netCDF-4 memory structures. The metadata of a netCDF file now occupies as little memory as possible. I have added many more Valgrind tests, and the HDF5 team has worked hard to track down memory issues in HDF5. (Most were not really bugs, just code doing things that Valgrind doesn't like.)

Minimizing memory use is particularly important on high-performance platforms. If you run a program on 10,000 processors, and each of them uses too much memory for the metadata, that adds up to a lot of wasted memory, and HPC users have better uses for it.

The biggest improvement in performance came from a rewrite of the way that netCDF-4 reads the HDF5 file. The code has been rewritten in terms of the H5Literate() function, and this has resulted in a huge performance gain. Here's an email from Russ quantifying this gain:

From: Russ Rew <russ-AT-unidata.ucar-DOT-edu>
Subject: timings of nc_open speedup
To: ed-AT-unidata.ucar-DOT-edu
Date: Thu, 23 Sep 2010 15:23:12 -0600
Organization: UCAR Unidata Program
Reply-to: russ-AT-unidata.ucar-DOT-edu                                                                                                                                                    

Ed,

On Jennifer Adams's file, here are the before and after timings on buddy (on the file and a separate copy, to defeat caching):

  real  0m32.60s
  user  0m0.15s
  sys   0m0.46s

  real  0m0.14s
  user  0m0.01s
  sys   0m0.02s

which is a 233x speedup.

Here's before and after for test files I created that have twice as many levels as Jennifer Adams's and much better compression:

  real  0m23.78s
  user  0m0.24s
  sys   0m0.60s

  real  0m0.05s
  user  0m0.01s
  sys   0m0.01s

which is a 475x speedup.  By using even more levels, the speedup becomes arbitrarily large, because now nc_open takes a fixed amount of time that depends on the amount of metadata, not the amount of data.

--Russ

As Russ notes, the speedup can be made arbitrarily large if we tailor the input file correctly. But Jennifer's file is a real one, and at 18.4 gigabytes (name: T159_1978110112.nc4) it is a real disk-buster, yet it has a simple metadata structure. A speedup of more than 200x is nice. We had been talking about a new file open mode that would skip reading the metadata, all because opening these files was taking so long. I guess I don't have to code that up now, so this fix saved at least a couple of weeks of work! (Not to mention that netCDF-4 will now work much better for these really big files, which are becoming more and more common.)

Here's the ncdump -h of this lovely test file:

netcdf T159_1978110112 {
dimensions:
        lon = 320 ;
        lat = 160 ;
        lev = 11 ;
        time = 1581 ;
variables:
        double lon(lon) ;
                lon:units = "degrees_east" ;
                lon:long_name = "Longitude" ;
        double lat(lat) ;
                lat:units = "degrees_north" ;
                lat:long_name = "Latitude" ;
        double lev(lev) ;
                lev:units = "millibar" ;
                lev:long_name = "Level" ;
        double time(time) ;
                time:long_name = "Time" ;
                time:units = "minutes since 1978-11-01 12:00" ;
        float temp(time, lev, lat, lon) ;
                temp:missing_value = -9.99e+08f ;
                temp:longname = "Temperature [K]" ;
                temp:units = "K" ;
        float geop(time, lev, lat, lon) ;
                geop:missing_value = -9.99e+08f ;
                geop:longname = "Geopotential [m^2/s^2]" ;
                geop:units = "m^2/s^2" ;
        float relh(time, lev, lat, lon) ;
                relh:missing_value = -9.99e+08f ;
                relh:longname = "Relative Humidity [%]" ;
                relh:units = "%" ;
        float vor(time, lev, lat, lon) ;
                vor:missing_value = -9.99e+08f ;
                vor:longname = "Vorticity [s^-1]" ;
                vor:units = "s^-1" ;
        float div(time, lev, lat, lon) ;
                div:missing_value = -9.99e+08f ;
                div:longname = "Divergence [s^-1]" ;
                div:units = "s^-1" ;
        float uwnd(time, lev, lat, lon) ;
                uwnd:missing_value = -9.99e+08f ;
                uwnd:longname = "U-wind [m/s]" ;
                uwnd:units = "m/s" ;
        float vwnd(time, lev, lat, lon) ;
                vwnd:missing_value = -9.99e+08f ;
                vwnd:longname = "V-wind [m/s]" ;
                vwnd:units = "m/s" ;
        float sfp(time, lat, lon) ;
                sfp:missing_value = -9.99e+08f ;
                sfp:longname = "Surface Pressure [Pa]" ;
                sfp:units = "Pa" ;

// global attributes:
                :NCO = "4.0.2" ;
}

Special thanks to Jennifer Adams of the GrADS project. Not only did she provide this great test file, but she also built my branch distribution and tested the fix for me! Thanks Jennifer! Thanks also to Quincey of the HDF5 team for helping me sort out the best way to read an HDF5 file.
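
For readers curious what reading the file "in terms of H5Literate()" looks like, here is a minimal sketch of the pattern against the HDF5 1.8 API. This is my illustration, not the actual netCDF-4 code, and the file name is a placeholder; the point is that each object is visited once and only its metadata is queried.

#include <hdf5.h>
#include <stdio.h>

/* Callback invoked once per link in the group. */
static herr_t
visit_link(hid_t group, const char *name, const H5L_info_t *info,
           void *op_data)
{
   H5O_info_t oinfo;
   int *count = (int *)op_data;
   (void)info;     /* unused in this sketch */

   /* One metadata query per object; the data are never read. */
   if (H5Oget_info_by_name(group, name, &oinfo, H5P_DEFAULT) < 0)
      return -1;   /* a negative return aborts the iteration */

   printf("%s: %s\n", name,
          oinfo.type == H5O_TYPE_GROUP ? "group" :
          oinfo.type == H5O_TYPE_DATASET ? "dataset" : "other");
   (*count)++;
   return 0;       /* zero means keep iterating */
}

int
main(void)
{
   int count = 0;
   hid_t file, root;

   /* "example.h5" is a placeholder file name. */
   if ((file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT)) < 0)
      return 1;
   root = H5Gopen2(file, "/", H5P_DEFAULT);

   /* Walk all links in the root group in a single pass. */
   H5Literate(root, H5_INDEX_NAME, H5_ITER_INC, NULL, visit_link, &count);
   printf("%d objects\n", count);

   H5Gclose(root);
   H5Fclose(file);
   return 0;
}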

Now I just have to make sure that parallel I/O is working OK, and then 4.1.2 will be ready for release!

NetCDF-4.1 Released!

Now we wait for any bugs to be reported...

It is always a tremendous amount of effort to make a netCDF release. There are so many people depending on it, and I really don't want to mess it up.

The 4.1 release, which I did last Friday, was my sixth netCDF release. This release has a lot of new features - more than usual, since we have had more than the usual number of programmers working on netCDF. It hasn't been just me and (part-time) Russ, as in previous releases. This is the first release that contains work from Dennis, and what a large piece of work it does contain: the new OPeNDAP client.

In addition, the 4.1 release contains a new utility, nccopy, which copies netCDF files, changing format along the way if desired. (Russ wrote nccopy.) There is also a new ncgen, written by Dennis, which is finally up to speed on all the netCDF-4 extensions.

As for me, I put in a layer that can read many (but not all) HDF4 files, a layer that can use parallel-netcdf for parallel I/O to classic and 64-bit offset files, and a modification that allows most existing HDF5 data files to be read by netCDF. At the last minute, I changed the default chunking and caching settings so that the users producing the giant AR-5 datasets would get good performance. I also added the nc-config utility, which helps users come up with correct compiler flags when building netCDF programs, and two new libraries are now part of the distribution: UDUNITS (Steve Emmerson) and libcf (yours truly).

Probably we should have released 6 months ago, and held some of those features back for another release. The more new features, the harder it is to test and release.

But it's out now, and we have started working on release 4.2. For that release we have more limited ambitions, and I hope it will be out this year.

Proof New Default Chunk Cache in 4.1 Improves Performance

A last minute change before the 4.1 release ensures that this common case will get good performance.

There is a terrible performance hit if your chunk cache is too small to hold even one chunk, and your data are deflated.

Since the default HDF5 chunk cache size is 1 MB, this is not hard to do.

So I have added code so that, when a file is opened, if a variable's data are compressed and its chunk size is greater than the default chunk cache size, then that variable's chunk cache is increased to a multiple of the chunk size.

The code looks like this:

/* Is this a deflated variable with a chunksize greater than the
 * current cache size? */
if (!var->contiguous && var->deflate)
{
   /* Compute the size of one chunk, in bytes. */
   chunk_size_bytes = 1;
   for (d = 0; d < var->ndims; d++)
      chunk_size_bytes *= var->chunksizes[d];
   if (var->type_info->size)
      chunk_size_bytes *= var->type_info->size;
   else
      chunk_size_bytes *= sizeof(char *);

#define NC_DEFAULT_NUM_CHUNKS_IN_CACHE 10
#define NC_DEFAULT_MAX_CHUNK_CACHE 67108864

   /* If the cache can't hold even one chunk, expand it to hold
    * several, up to a 64 MB maximum, and reopen the dataset with
    * the new cache settings. */
   if (chunk_size_bytes > var->chunk_cache_size)
   {
      var->chunk_cache_size = chunk_size_bytes * NC_DEFAULT_NUM_CHUNKS_IN_CACHE;
      if (var->chunk_cache_size > NC_DEFAULT_MAX_CHUNK_CACHE)
         var->chunk_cache_size = NC_DEFAULT_MAX_CHUNK_CACHE;
      if ((retval = nc4_reopen_dataset(grp, var)))
         return retval;
   }
}

I am setting the chunk cache to 10 times the chunk size, up to 64 MB max. Reasonable? Comments are welcome.
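
For comparison, an application can set the default (file-level) chunk cache itself, before the file is opened, using the public nc_set_chunk_cache() call. Here's a minimal sketch; the cache size, number of slots, and preemption value are illustrative rather than the library defaults, and the file name is one of the test files used below.

#include <netcdf.h>

int
main(void)
{
   int ncid, retval;

   /* Set the default chunk cache used for files opened after this
    * call: 64 MB, 1009 slots, preemption 0.75 (illustrative values). */
   if ((retval = nc_set_chunk_cache(64 * 1024 * 1024, 1009, 0.75f)))
      return retval;

   if ((retval = nc_open("pr_A1_z1_256_128_256.nc", NC_NOWRITE, &ncid)))
      return retval;

   /* ... read some data ... */

   return nc_close(ncid);
}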

The timing results show a clear difference. First, two runs without any per-variable caching; the second run sets a 64 MB file-level chunk cache, which speeds things up considerably. (The last number in each row is the average read time for a horizontal layer, in microseconds.)

bash-3.2$ ./tst_ar4_3d  pr_A1_z1_256_128_256.nc 
256     128     256     1.0             1       0           836327       850607

bash-3.2$ ./tst_ar4_3d -c 68000000 pr_A1_z1_256_128_256.nc
256     128     256     64.8            1       0           833453       3562

Without the cache it is over 200 times slower.

Now here's the same run with the automatic per-variable caching turned on:

bash-3.2$ ./tst_ar4_3d  pr_A1_z1_256_128_256.nc 
256     128     256     1.0             1       0           831470       3568

In this run, although no file-level cache was set, I got the same response time. That's because, when opening the file, the netCDF library noticed that this deflated variable had a chunk size bigger than the default cache size, and opened a bigger cache.

All of this work is in support of the general netCDF user writing very large files, and specifically in support of the AR-5 effort.

The only downside is that, if you open up a file with many such variables, and you have very little memory on your machine, you will run out of memory.
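
If that is a concern, a program can inspect and shrink the cache for individual variables after the file is open, using the public per-variable cache calls. Here's a minimal sketch; the ncid and varid are assumed to come from the caller, and the 4 MB cap is illustrative.

#include <netcdf.h>

/* Cap the chunk cache of one variable. */
int
shrink_var_cache(int ncid, int varid)
{
   size_t size, nelems;
   float preemption;
   int retval;

   /* See what cache the library chose when the file was opened. */
   if ((retval = nc_get_var_chunk_cache(ncid, varid, &size, &nelems,
                                        &preemption)))
      return retval;

   /* If it's bigger than 4 MB, dial it back down. */
   if (size > 4 * 1024 * 1024)
      if ((retval = nc_set_var_chunk_cache(ncid, varid, 4 * 1024 * 1024,
                                           nelems, preemption)))
         return retval;

   return NC_NOERR;
}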

Could Code Review Work at Unidata?

Recently I was reminded that I used to be a proponent of code review...

I am (still) a big fan of code review. But code review only works under certain circumstances, and I think those circumstances are hard to find at Unidata.

  • There must be 3 or more reviewers (in addition to the programmer who wrote the code). They must all be programmers in daily practice with the same language and environment as the reviewed code, and working on the same project.
  • Code is judged against written, reviewed requirements.
  • Code that satisfies the requirements with the least number of functions, variables, and lines of code (within reason) is best. (Occam's Razor for code.)
  • There must be full buy-in from the programmers and all their management. This is a very expensive process.
  • All released code (except test code) must be reviewed.
  • Major (user-affecting) and minor (bad coding) defects are identified but not fixed at review. There is no discussion about potential fixes - a problem is identified and the review moves on.
  • All major and minor problems must be fixed by the original programmer, and the code re-submitted for re-review. (The re-review is usually much quicker, but still necessary.)
  • Reviews should be one hour at a time, with one or two more hours of mandatory preparation by all reviewers.
  • Records are kept, and the project manager follows up.
  • A unanimous decision is required for all defects and to pass the review.
  • No supervisors or spectators ever.

As you can see, to meet all these conditions is no small feat. In fact, I have hardly ever seen it done. When I have seen it done, it has worked very well. Ideally, it becomes a place where all the project programmers learn from each other. The best practices spread through the project code, and bad practices wither away.

How can this work at Unidata? We don't have the programmers!

If we got more programmers, I would still think that other, less expensive software process improvements should come first (like written requirements, and requirements review).

However, I would be willing to participate in any code review that anyone organizes at Unidata as long as two conditions are met:
  1. It must be part of a serious effort - not just a casual thing. There must be feedback of the review into the product. (That is, something must be done with the results of the review.) And there must be written requirements so I know what the code is supposed to do.
  2. My project(s) must be compensated in some way with time. I need someone else to do the quarter-day a week of work I would give up to make time for review. For example, I would do plenty of good reviewing for anyone who answered 3 netCDF support questions a week!

Why I Don't Think the Number of Processors Affects the Wall Clock Time of My Tests...

Using the taskset command, I demonstrate that my benchmarks run on one processor.

Recently Russ raised a question: are my wall clock times wrong because I have many processors on my machine? I believe the answer is no.

Firstly, without special effort, Linux will not spread a single-threaded process across more than one processor. When I run my benchmarking program, tst_ar4, I see from the top command that one processor goes to > 90% use, and all the others remain at 0%.

Secondly, I confirmed this with the taskset command, which I had never heard of before. It limits a process to one processor (or any other number), and in fact lets you pick which processors are used. Here are some timing results showing that I get about the same times using the taskset command as I do without it, on a compressed file read (a small affinity-checking sketch in C follows the timing output):

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       304398    4568

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       306810   4553

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       292707   4616

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       293713   4567
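
For the record, here's a small Linux-only sketch (my illustration, using sched_getaffinity) that prints which processors the kernel will let the calling process run on; it could be dropped into the benchmark to double-check the taskset results.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int
main(void)
{
   cpu_set_t mask;
   int i;

   /* Ask the kernel which CPUs this process may run on. */
   if (sched_getaffinity(0, sizeof(mask), &mask))
      return 1;

   printf("allowed CPUs:");
   for (i = 0; i < CPU_SETSIZE; i++)
      if (CPU_ISSET(i, &mask))
         printf(" %d", i);
   printf("\n");
   return 0;
}

Run under taskset -c 4, it prints just "allowed CPUs: 4", confirming that the benchmark is confined to a single processor.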