Re: stripping of CR characters in --csv mode

bug-gawk

From:	Ed Morton
Subject:	Re: stripping of CR characters in --csv mode
Date:	Tue, 4 Apr 2023 16:49:23 -0500
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

FWIW with this input file ("^M" represents the CRs courtesy of `cat -v`):

---
$ cat -v file
ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY^M
1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE^M
---

and this script:

---
$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"([^\"]|\"\")+\")"
}
{
    print "<" $0 ">"
    printf("NF = %s ", NF)
    for (i = 1; i <= NF; i++) {
        printf("<%s>", $i)
    }
    print ""
}
---

I get this output using gawk 5.2.1 in bash 5.2.15 on cygwin:

---
$ awk -f tst.awk file | cat -v
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY^M>
NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY^M>
<1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE^M>
NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE^M>
---

which is the correct, expected output.

    Ed.
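
For readers who do not want the trailing CR in the last field, one option is to remove it before the fields are examined. The following is a minimal sketch, not something posted in this thread: `strip_trailing_cr` is a hypothetical helper name, and it relies only on plain awk `sub()`, so it touches a CR only when it is the last character of a record and leaves a CR anywhere else in the line alone.

```shell
# Hypothetical helper (not from the thread): remove a CR only when it is
# the last character of a record, i.e. the CR of a CR LF terminator.
strip_trailing_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

# A CR LF terminated record with a CR embedded in the first field.
# od -c makes the surviving embedded CR visible as \r.
printf 'before\rafter,x\r\n' | strip_trailing_cr | od -c
```

Assuming the helper is defined in the current shell, `strip_trailing_cr < file | awk -f tst.awk` would feed records without the trailing CR to the FPAT script shown earlier.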

On 4/4/2023 1:48 PM, Ed Morton wrote:

Andy - got it, thanks for clarifying. Yes, that does look like a bug to me too FWIW.
Talking of bugs - cph1968, in addition to your sample input and output piped to `cat -v` or similar, please also let us know which version of gawk you're running as there were some FPAT bugs in older versions so if you are encountering a problem using FPAT maybe it's due to one of those.
    Ed.

On 4/4/2023 11:23 AM, Andrew J. Schorr wrote:
Hi Ed,
I think the intent is merely to strip and ignore carriage returns that appear just before a LF record terminator. So it should all work painlessly regardless of whether the file's records are terminated with only LF or the combination CR LF.

However, the current code appears to have a bug whereby it strips and
removes CR characters regardless of where they appear in the file.
Here's some sample input where the first field contains an embedded CR inside
quotes:
bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1,"afterCR"}' | hexdump -vC
00000000  22 62 65 66 6f 72 65 43  52 0d 61 66 74 65 72 43  |"beforeCR.afterC|
00000010  52 22 0a                                          |R".|
00000013

And when I run it through gawk --csv, the CR is unceremoniously dropped:
bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1,"afterCR"}' | ./gawk --csv '{print $1}' | hexdump -vC
00000000  62 65 66 6f 72 65 43 52  61 66 74 65 72 43 52 0a  |beforeCRafterCR.|
00000010

That seems like a bug to me, but perhaps I am confused.

Regards,
Andy
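
Andy's reproduction can also be repeated without `unix2dos`. The sketch below is an illustration under assumptions, not part of the original report: `make_input` and `count_cr` are hypothetical helper names, and `printf` is used to emit the same 19 bytes as the first hexdump. Counting the CR bytes before and after a filter gives a quick check on whether that filter drops the embedded CR.

```shell
# Hypothetical helper (not from the thread): count the CR bytes on stdin.
count_cr() {
    tr -cd '\r' | wc -c | tr -d ' '
}

# Hypothetical helper: emit a quoted field with one embedded CR, LF terminated
# (the same bytes as the first hexdump, built without unix2dos).
make_input() {
    printf '"beforeCR\rafterCR"\n'
}

make_input | count_cr                        # CR bytes going in
make_input | awk '{ print }' | count_cr      # CR bytes after a plain awk pass
```

Replacing the plain `awk '{ print }'` stage with `./gawk --csv '{print $1}'` from the message above would show whether that build removes the CR.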

On Tue, Apr 04, 2023 at 11:05:48AM -0500, Ed Morton wrote:
Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says but that's just one CSV "standard" and in practice most CSVs created/used on Unix end with LF alone and if there's a CR before the LF then it's just another character unless you write code to remove it.
If the CSV file format as used by --csv defines the record terminator as CR LF and --csv strips the CRs then its output would no longer be valid CSV by that same definition so that's a surprising choice. Does that mean it'll fail if the input is just LF-terminated as most Unix files are (and in which case you couldn't write `awk --csv 'foo' input | awk --csv 'bar'`)?

     Ed.

On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:

     Hi Ed,
     The CSV file format defines the record terminator as CR LF, so the new --csv
     option does in fact strip CRs.

     Regards,
     Andy

     On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
         Are you sure in the FPAT output you're not just seeing the expected effects of there being a CR in your data? The `--csv` output is the one that looks wrong to me if you have `CR`s at the end of each line, unless `--csv` is documented to strip `CR`s from the output.
         Please provide the input file you used as it's hard to tell what's going on from just the output. Also pipe the output to `cat -v` or `od -c` or similar so we can see where the CRs are in the output but my best guess right now is that `FPAT` is retaining the CRs as expected while `--csv` is stripping them (which may or may not be expected - I'm not familiar with that option).

             Ed.

         On 4/4/2023 5:12 AM,cph1968@proton.me  wrote:
             the regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV file records end in both CR and NL [0H0D 0H0A]. I believe this is a common feature of Windows files.
             A simple fix is however to use the gawk --csv option.

             ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk

ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
F = 1<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
                 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
                 F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
             note here that the last '>' is first character on the next line.
             output using the --csv option:
             ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
             <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
             NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
             <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
             NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>

             much better :-)

             ❯ cat print-fields.awk
             {
                 print "<" $0 ">"
                 printf("NF = %s ", NF)
                 for (i = 1; i <= NF; i++) {
                     printf("<%s>", $i)
                 }
                 print ""
             }



             from section 4.7.1:

             BEGIN {
                  fp[0] = "([^,]+)|(\"[^\"]+\")"
                  fp[1] = "([^,]*)|(\"[^\"]+\")"
                  fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
                  FPAT = fp[fpat+0]
             }



             kind regards,

             cph1968
