Kurt,
thanks for the complete bug report, great job.  the "no encoding" pragma
was inserted into the perl decoders, but it is commented out because
versions of perl earlier than 5.8.0 fail with "Can't locate encoding.pm".
the next release, due tomorrow, includes a note about uncommenting the
"no encoding" pragma.
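
for sites that run the same script under both old and new perls, one
possible way to avoid hand-editing is to apply the pragma inside an eval,
so a missing encoding.pm is not fatal.  this is only a sketch, not what
the decoders release does: disable_encoding_pragma is a hypothetical
helper name, and the code assumes that calling encoding->unimport has the
same effect as writing "no encoding;" directly.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the decoders package.
# Returns 1 if the "no encoding" pragma could be applied, 0 otherwise.
sub disable_encoding_pragma {
    # Per the note above, perls older than 5.8.0 cannot locate encoding.pm.
    return 0 if $] < 5.008;
    my $ok = eval {
        require encoding;        # dies if encoding.pm cannot be located
        encoding->unimport();    # the call that "no encoding;" makes
        1;
    };
    return $ok ? 1 : 0;
}

# Run at compile time, as a literal "no encoding;" line would.
BEGIN { disable_encoding_pragma() }
```

the return value could be logged once at startup so it is clear whether
the pragma actually took effect on a given system.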
thanks,
robb...
On Wed, 20 Apr 2005, Unidata Support wrote:
>
> ------- Forwarded Message
>
> >To: <support@xxxxxxxxxxxxxxxx>
> >From: "Hanson, Kurt" <khanson@xxxxxxx>
> >Subject: Unidata decoders - syn2nc bug re Unicode
> >Organization: UCAR/Unidata
> >Keywords: 200504201720.j3KHK5v2016989
>
> I imagine this will end up in Robb Kambic's inbox... if so, hello again
> Robb.
>
> We've been occasionally experiencing issues with syn2nc. The problem is
> that once every week or two, the syn2nc log will suddenly begin filling
> with messages about Unicode:
>
> Malformed UTF-8 character (unexpected continuation byte 0x8e, with no
> preceding start byte) in index at /dicast2-papp/DICAST/tmp/syn2nc_308
> line 618, <STDIN> chunk 1.
> Malformed UTF-8 character (unexpected non-continuation byte 0x2a,
> immediately after start byte 0xf6) in index at
> /dicast2-papp/DICAST/tmp/syn2nc_308 line 618, <STDIN> chunk 1.
>
> The log file grows without bound until finally the disk partition fills,
> hobbling the entire system.
>
> I think I understand the problem and have a fix. The problem appears to
> be due to some garbage characters in several of the synoptic messages
> from today for a single site -- FBSK in Botswana. (I imagine that all of
> the problems we've ever seen are from this site.)
>
> The issue is that since 5.8.0, Perl has some automatic support for
> handling Unicode characters. Once Perl sees a character outside of the
> range [0,127], it assumes that the text data is Unicode rather than
> ASCII. Since the garbage characters from today's FBSK data did not
> conform to Unicode rules, Perl itself (rather than syn2nc) generated the
> messages.
>
> So the magic fix I installed is to put a "no encoding;" line (pragma)
> into the syn2nc script. This ensures that Perl doesn't try to guess what
> sort of character set the text is in -- it just passes the data up to
> the application level in raw form. That's what we need with syn2nc.
>
> Scope:
> * We experience this problem on a Linux RedHat Enterprise 3.0 Athlon
> system running Perl 5.8.0.
> * We do not experience it on a Solaris 8 system running Perl 5.8.0.
>
> Testing:
> When I pipe the attached synoptic file
> synoptic.20050420.1200.asc.FBSK_210 into the pristine syn2nc on the
> Linux system, the log file grows without bound. When I pipe it into my
> patched version, the log file size remains stable, and the file never
> gets any Unicode error messages.
>
> Discussion:
> Thinking beyond the low-level Perl issue, I'm not sure what syn2nc
> should do when it encounters the garbage characters... Nor am I sure
> what it actually does -- I'd dig into the script itself to find that out
> but I'm running short on time. What do you think?
>
> Also, I'd be curious to hear whether you see the garbage characters in
> your FBSK synoptics for today. It's possible but unlikely that the
> garbage is not due to the FBSK sensor itself but due to some
> communications issue that is WSI-specific.
>
> I'm attaching a few things:
> * syn2nc.new -- an updated version of syn2nc from the 3.0.9 version of
> the decoders package.
> * syn2nc.patch -- a diff of my version vs the pristine 3.0.9
> * synoptic.20050420.1200.asc.FBSK_210 -- message #210 from today's
> synoptic feed, containing garbage characters 0x8e and others in line 6
> of the file.
>
> Relevant Perl references:
> * Unicode intro: http://perldoc.perl.org/perluniintro.html
> * encoding pragma: http://perldoc.perl.org/encoding.html
>
> Whew. I think that's about everything! Feel free to contact me.
>
> Kurt Hanson
> Senior Software Engineer & Scientific Analyst
> WSI Corporation
> 400 Minuteman Rd.
> Andover, MA 01810
> my phone: 978.983.6549
> www.wsi.com
>
>  <<syn2nc.new>>  <<syn2nc.patch>> 
> <<synoptic.20050420.1200.asc.FBSK_210>>
>
==============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
rkambic@xxxxxxxxxxxxxxxx                   WWW: http://www.unidata.ucar.edu/
==============================================================================