Re: stripping of CR characters in --csv mode
From: arnold
Subject: Re: stripping of CR characters in --csv mode
Date: Wed, 05 Apr 2023 04:11:27 -0600
User-agent: Heirloom mailx 12.5 7/5/10
The stripping of \r whether or not followed by \n is on purpose.
It keeps the code simple and in practice isn't likely to be a
problem.
Let's close out the discussion of CSV, FPAT, Life, the Universe
and Everything, please.
Thanks,
Arnold
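
Editorial aside, not part of Arnold's message: for readers who need an embedded CR preserved, the narrower behaviour Andy describes below (dropping a CR only when it directly precedes the LF record terminator) can be approximated in plain awk without --csv. This is a minimal sketch under that assumption. The function name strip_trailing_cr and the sample input are hypothetical, and it does no CSV field parsing at all.

```shell
# Hypothetical helper: remove a CR only when it is the last character
# of a record, i.e. the CR of a CR LF terminator. A CR elsewhere in the
# record is left untouched.
strip_trailing_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

# Sample record with an embedded CR and a CR LF terminator; od shows
# which bytes survive.
printf 'beforeCR\rafterCR\r\n' | strip_trailing_cr | od -An -c
```

The embedded CR stays in the record while the terminating one is dropped, which is the opposite trade-off from the one --csv makes for the sake of simpler code.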
"Andrew J. Schorr" <aschorr@telemetry-investments.com> wrote:
> Hi Ed,
>
> I think the intent is merely to strip and ignore carriage returns that
> appear just before a LF record terminator. So it should all work
> painlessly regardless of whether the file's records are terminated with
> only LF or the combination CR LF.
>
> However, the current code appears to have a bug whereby it strips and
> removes CR characters regardless of where they appear in the file.
>
> Here's some sample input where the first field contains an embedded CR inside
> quotes:
>
> bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | hexdump -vC
> 00000000 22 62 65 66 6f 72 65 43 52 0d 61 66 74 65 72 43 |"beforeCR.afterC|
> 00000010 52 22 0a |R".|
> 00000013
>
> And when I run it through gawk --csv, the CR is unceremoniously dropped:
>
> bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | ./gawk --csv '{print $1}' | hexdump -vC
> 00000000 62 65 66 6f 72 65 43 52 61 66 74 65 72 43 52 0a |beforeCRafterCR.|
> 00000010
>
> That seems like a bug to me, but perhaps I am confused.
>
> Regards,
> Andy
>
> On Tue, Apr 04, 2023 at 11:05:48AM -0500, Ed Morton wrote:
> > Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says
> > but that's just one CSV "standard" and in practice most CSVs
> > created/used on Unix end with LF alone and if there's a CR before the
> > LF then it's just another character unless you write code to remove it.
> >
> > If the CSV file format as used by --csv defines the record terminator
> > as CR LF and --csv strips the CRs then its output would no longer be
> > valid CSV by that same definition so that's a surprising choice. Does
> > that mean it'll fail if the input is just LF-terminated as most Unix
> > files are (and in which case you couldn't write
> > `awk --csv 'foo' input | awk --csv 'bar'`)?
> >
> > Ed.
> >
> > On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:
> >
> > Hi Ed,
> >
> > The CSV file format defines the record terminator as CR LF, so the new
> > --csv option does in fact strip CRs.
> >
> > Regards,
> > Andy
> >
> > On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
> >
> > Are you sure in the FPAT output you're not just seeing the expected
> > effects of there being a CR in your data? The `--csv` output is the
> > one that looks wrong to me if you have `CR`s at the end of each
> > line, unless `--csv` is documented to strip `CR`s from the output.
> >
> > Please provide the input file you used as it's hard to tell what's
> > going on from just the output. Also pipe the output to `cat -v` or
> > `od -c` or similar so we can see where the CRs are in the output but
> > my best guess right now is that `FPAT` is retaining the CRs as
> > expected while `--csv` is stripping them (which may or may not be
> > expected - I'm not familiar with that option).
> >
> > Ed.
> >
> > On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
> >
> > The regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the
> > CSV file records end in both CR and NL [0x0D 0x0A]. I believe this is
> > a common feature of Windows files.
> > A simple fix is however to use the gawk --csv option.
> >
> > ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
> >
> > ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> > F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> >
> > note here that the last '>' is first character on the next line.
> >
> > output using the --csv option:
> > ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
> > <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
> > NF = 10
> > <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
> > <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
> > NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
> >
> > much better :-)
> >
> > ❯ cat print-fields.awk
> > {
> >     print "<" $0 ">"
> >     printf("NF = %s ", NF)
> >     for (i = 1; i <= NF; i++) {
> >         printf("<%s>", $i)
> >     }
> >     print ""
> > }
> >
> >
> >
> > From section 4.7.1:
> >
> > BEGIN {
> >     fp[0] = "([^,]+)|(\"[^\"]+\")"
> >     fp[1] = "([^,]*)|(\"[^\"]+\")"
> >     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
> >     FPAT = fp[fpat+0]
> > }
> >
> >
> >
> > kind regards,
> >
> > cph1968
> >
> >
> >
>
> --
> Andrew Schorr e-mail: aschorr@telemetry-investments.com
> Telemetry Investments, L.L.C. phone: 917-305-1748
> 152 W 36th St, #402 fax: 212-425-5550
> New York, NY 10018-8765