Re: stripping of CR characters in --csv mode
From: Andrew J. Schorr
Subject: Re: stripping of CR characters in --csv mode
Date: Tue, 4 Apr 2023 12:23:07 -0400
User-agent: Mutt/1.5.21 (2010-09-15)
Hi Ed,
I think the intent is merely to strip and ignore carriage returns that appear
just before a LF record terminator. So it should all work painlessly regardless
of whether the file's records are terminated with only LF or the combination CR
LF.
However, the current code appears to have a bug whereby it strips and
removes CR characters regardless of where they appear in the file.
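A minimal sketch of that intended behavior, assuming a POSIX awk. The function name strip_trailing_cr is hypothetical, and this is not how gawk implements --csv. It is purely line-based and does no quote parsing, so it would also drop a CR that sits directly before an LF embedded in a quoted field:

```shell
# Hypothetical helper: remove a CR only when it is the last character of a
# line, i.e. directly before the LF record terminator. CRs elsewhere in the
# line are left untouched.
strip_trailing_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

# Embedded CR in the middle of the field, CR LF at the end of the record.
# tr makes the surviving CR visible as "@".
printf '"beforeCR\rafterCR"\r\n' | strip_trailing_cr | tr '\r' '@'
```

With this filter, LF-only input passes through unchanged, so it can be applied regardless of which line ending the file uses.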
Here's some sample input where the first field contains an embedded CR inside
quotes:
bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | hexdump -vC
00000000  22 62 65 66 6f 72 65 43  52 0d 61 66 74 65 72 43  |"beforeCR.afterC|
00000010  52 22 0a                                          |R".|
00000013
And when I run it through gawk --csv, the CR is unceremoniously dropped:
bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | ./gawk --csv '{print $1}' | hexdump -vC
00000000  62 65 66 6f 72 65 43 52  61 66 74 65 72 43 52 0a  |beforeCRafterCR.|
00000010
That seems like a bug to me, but perhaps I am confused.
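For anyone who wants to check this without reading hex dumps, here is a small sketch that counts the CR bytes on standard input, assuming only POSIX tr and wc. The function name count_cr is hypothetical. Running it before and after a pipeline stage shows whether that stage dropped any CRs:

```shell
# Hypothetical helper: print the number of CR (0x0d) bytes read from stdin.
count_cr() {
    # Delete every byte except CR, count what remains, then strip the
    # padding that some wc implementations add around the number.
    tr -cd '\r' | wc -c | tr -d ' '
}

# One embedded CR plus one CR before the final LF.
printf '"beforeCR\rafterCR"\r\n' | count_cr
```

If the count after a stage is lower than the number of CR LF terminators alone would explain, an embedded CR was removed.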
Regards,
Andy
On Tue, Apr 04, 2023 at 11:05:48AM -0500, Ed Morton wrote:
> Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says but
> that's just one CSV "standard" and in practice most CSVs created/used on Unix
> end with LF alone and if there's a CR before the LF then it's just another
> character unless you write code to remove it.
>
> If the CSV file format as used by --csv defines the record terminator as CR LF
> and --csv strips the CRs then its output would no longer be valid CSV by that
> same definition so that's a surprising choice. Does that mean it'll fail if the
> input is just LF-terminated as most Unix files are (and in which case you
> couldn't write `awk --csv 'foo' input | awk --csv 'bar'`)?
>
> Ed.
>
> On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:
>
> Hi Ed,
>
> The CSV file format defines the record terminator as CR LF, so the new --csv
> option does in fact strip CRs.
>
> Regards,
> Andy
>
> On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
>
> Are you sure in the FPAT output you're not just seeing the expected
> effects of there being a CR in your data? The `--csv` output is the
> one that looks wrong to me if you have `CR`s at the end of each
> line, unless `--csv` is documented to strip `CR`s from the output.
>
> Please provide the input file you used as it's hard to tell what's
> going on from just the output. Also pipe the output to `cat -v` or
> `od -c` or similar so we can see where the CRs are in the output but
> my best guess right now is that `FPAT` is retaining the CRs as
> expected while `--csv` is stripping them (which may or may not be
> expected - I'm not familiar with that option).
>
> Ed.
>
> On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
>
> the regex fp[2] in section 4.7.1 (below) doesn't quite cut it if
> the CSV file records end in both CR and NL [0x0D 0x0A]. I believe this is a
> common feature of Windows files.
> A simple fix is however to use the gawk --csv option.
>
> ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
>
> ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> F = 1
> <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
>
> note here that the last '>' is first character on the next line.
>
> output using the --csv option:
> ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
> <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
> NF = 10
> <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
> <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
> NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
>
> much better :-)
>
> ❯ cat print-fields.awk
> {
>     print "<" $0 ">"
>     printf("NF = %s ", NF)
>     for (i = 1; i <= NF; i++) {
>         printf("<%s>", $i)
>     }
>     print ""
> }
>
>
>
> >from section 4.7.1:
>
> BEGIN {
>     fp[0] = "([^,]+)|(\"[^\"]+\")"
>     fp[1] = "([^,]*)|(\"[^\"]+\")"
>     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
>     FPAT = fp[fpat+0]
> }
>
>
>
> kind regards,
>
> cph1968
>
>
>
--
Andrew Schorr e-mail: aschorr@telemetry-investments.com
Telemetry Investments, L.L.C. phone: 917-305-1748
152 W 36th St, #402 fax: 212-425-5550
New York, NY 10018-8765