bug-gawk

Re: stripping of CR characters in --csv mode


From: Ed Morton
Subject: Re: stripping of CR characters in --csv mode
Date: Tue, 4 Apr 2023 13:48:50 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

Andy - got it, thanks for clarifying. Yes, that does look like a bug to me too FWIW.

Talking of bugs - cph1968, in addition to your sample input and output piped to `cat -v` or similar, please also let us know which version of gawk you're running. There were some FPAT bugs in older versions, so if you are encountering a problem using FPAT, maybe it's due to one of those.

    Ed.

On 4/4/2023 11:23 AM, Andrew J. Schorr wrote:
Hi Ed,

I think the intent is merely to strip and ignore carriage returns that appear
just before a LF record terminator. So it should all work painlessly regardless
of whether the file's records are terminated with only LF or the combination CR
LF.

However, the current code appears to have a bug whereby it strips CR
characters regardless of where they appear in the file.

Here's some sample input where the first field contains an embedded CR inside
quotes:

bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | hexdump -vC
00000000  22 62 65 66 6f 72 65 43  52 0d 61 66 74 65 72 43  |"beforeCR.afterC|
00000010  52 22 0a                                          |R".|
00000013

And when I run it through gawk --csv, the CR is unceremoniously dropped:

bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | ./gawk --csv '{print $1}' | hexdump -vC
00000000  62 65 66 6f 72 65 43 52  61 66 74 65 72 43 52 0a  |beforeCRafterCR.|
00000010

That seems like a bug to me, but perhaps I am confused.
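
For comparison, here is a minimal sketch of the intent described above (drop a CR only when it sits immediately before the LF record terminator), written with plain awk rather than gawk's --csv code. The function name strip_trailing_cr is hypothetical, and the sketch assumes one record per line, so it does not handle quoted fields that contain embedded newlines.

```shell
# Hypothetical helper: remove a CR only when it is the last character of a
# record, i.e. directly before the LF. CRs elsewhere in the record are kept.
strip_trailing_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

# Same shape of input as the sample above, but CR LF terminated.
printf '"beforeCR\rafterCR"\r\n' | strip_trailing_cr | od -c
```

With this input the CR between beforeCR and afterCR survives and only the one before the LF is removed.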

Regards,
Andy

On Tue, Apr 04, 2023 at 11:05:48AM -0500, Ed Morton wrote:
Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says, but
that's just one CSV "standard". In practice most CSVs created/used on Unix
end with LF alone, and if there's a CR before the LF then it's just another
character unless you write code to remove it.

If the CSV file format as used by --csv defines the record terminator as CR LF
and --csv strips the CRs, then its output would no longer be valid CSV by that
same definition, so that's a surprising choice. Does that mean it'll fail if the
input is just LF-terminated, as most Unix files are (in which case you
couldn't write `awk --csv 'foo' input | awk --csv 'bar'`)?

     Ed.

On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:

     Hi Ed,

     The CSV file format defines the record terminator as CR LF, so the new
     --csv option does in fact strip CRs.

     Regards,
     Andy

     On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:

         Are you sure in the FPAT output you're not just seeing the expected
         effects of there being a CR in your data? The `--csv` output is the
         one that looks wrong to me if you have `CR`s at the end of each
         line, unless `--csv` is documented to strip `CR`s from the output.

         Please provide the input file you used as it's hard to tell what's
         going on from just the output. Also pipe the output to `cat -v` or
         `od -c` or similar so we can see where the CRs are in the output but
         my best guess right now is that `FPAT` is retaining the CRs as
         expected while `--csv` is stripping them (which may or may not be
         expected - I'm not familiar with that option).
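
As a rough sketch of the kind of check suggested above, the following counts how many records end in a CR, which is a quick way to tell whether a file is CR LF terminated before reading the `od -c` dump. The helper name count_crlf_records and the two-line sample input are made up for illustration; they are not taken from the reporter's file.

```shell
# Hypothetical helper: print the number of input records whose last
# character is a CR (that is, lines terminated by CR LF).
count_crlf_records() {
    awk '/\r$/ { n++ } END { print n + 0 }'
}

# Made-up two-line sample with CR LF line endings.
printf 'ID,CASRN\r\n1,50-00-0\r\n' | count_crlf_records

# Byte-level view of the same data, so the position of each \r is visible.
printf 'ID,CASRN\r\n1,50-00-0\r\n' | od -c
```

A count of zero would suggest plain LF terminators, in which case any CR seen in the output came from somewhere else in the data.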

             Ed.

         On 4/4/2023 5:12 AM,cph1968@proton.me  wrote:

             the regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV
             file records end in both CR and NL [0x0D 0x0A]. I believe this is a common
             feature of Windows files.
             A simple fix is however to use the gawk --csv option.

             ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk

                 ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
                 F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
                 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
                 F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE

             note here that the last '>' is first character on the next line.

             output using the --csv option:
             ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
             <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
             NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
             <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
             NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>

             much better :-)

             ❯ cat print-fields.awk
             {
                 print "<" $0 ">"
                 printf("NF = %s ", NF)
                 for (i = 1; i <= NF; i++) {
                     printf("<%s>", $i)
                 }
                 print ""
             }



             from section 4.7.1:

             BEGIN {
                  fp[0] = "([^,]+)|(\"[^\"]+\")"
                  fp[1] = "([^,]*)|(\"[^\"]+\")"
                  fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
                  FPAT = fp[fpat+0]
             }



             kind regards,

             cph1968




