Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says,
but that's just one CSV "standard". In practice most CSVs created/used
on Unix end with LF alone, and if there's a CR before the LF then it's
just another character unless you write code to remove it.
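For illustration, the "code to remove it" can be as small as the sketch below. It uses plain POSIX awk rather than --csv, and `strip_cr` is a name made up for this example; it only drops a single trailing CR per record and does not handle CRs inside quoted fields.

```shell
# Hypothetical helper (not part of gawk): drop one trailing CR from each
# record, so CRLF- and LF-terminated input both come out LF-terminated.
strip_cr() {
    awk '{ sub(/\r$/, "") } 1' "$@"
}

# Mixed input: the first record ends in CR LF, the second in LF alone.
# od -c makes it visible that no \r survives in the output.
printf 'a,b\r\nc,d\n' | strip_cr | od -c
```

Because LF-only records pass through unchanged, a stage like this can be chained with itself or placed ahead of any other awk script.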
If the CSV file format as used by --csv defines the record terminator
as CR LF and --csv strips the CRs, then its output would no longer be
valid CSV by that same definition, so that's a surprising choice. Does
that mean it'll fail if the input is just LF-terminated, as most Unix
files are (in which case you couldn't write
`awk --csv 'foo' input | awk --csv 'bar'`)?
Ed.
On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:
Hi Ed,
The CSV file format defines the record terminator as CR LF, so the new
--csv option does in fact strip CRs.
Regards,
Andy
On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
Are you sure in the FPAT output you're not just seeing the expected
effects of there being a CR in your data? The `--csv` output is the one
that looks wrong to me if you have `CR`s at the end of each line,
unless `--csv` is documented to strip `CR`s from the output.
Please provide the input file you used as it's hard to tell what's
going on from just the output. Also pipe the output to `cat -v` or
`od -c` or similar so we can see where the CRs are in the output, but
my best guess right now is that `FPAT` is retaining the CRs as expected
while `--csv` is stripping them (which may or may not be expected - I'm
not familiar with that option).
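As a rough sketch of that kind of check: the snippet below uses made-up sample data and ordinary `-F,` splitting as a stand-in for `FPAT`, and `last_field_len` is a hypothetical helper name. It shows the CR being kept as part of the last field, and `od -c` revealing where it sits.

```shell
# Hypothetical helper: print the length of the last comma-separated field.
last_field_len() {
    awk -F, '{ print length($NF) }' "$@"
}

# With CR LF endings the CR is counted as part of the last field.
printf 'x,ACTIVE\r\n' | last_field_len   # prints 7 ("ACTIVE" plus the CR)
printf 'x,ACTIVE\n'   | last_field_len   # prints 6

# od -c shows the CR explicitly as \r just before the \n.
printf 'x,ACTIVE\r\n' | od -c
```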
Ed.
On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
The regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the
CSV file records end in both CR and NL [0x0D 0x0A]. I believe this is
a common feature of Windows files.
A simple fix, however, is to use the gawk --csv option.
❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
F = 1
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
Note here that the last '>' is the first character on the next line.
output using the --csv option:
❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
NF = 10
<ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
<1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
much better :-)
❯ cat print-fields.awk
{
    print "<" $0 ">"
    printf("NF = %s ", NF)
    for (i = 1; i <= NF; i++) {
        printf("<%s>", $i)
    }
    print ""
}
From section 4.7.1:
BEGIN {
    fp[0] = "([^,]+)|(\"[^\"]+\")"
    fp[1] = "([^,]*)|(\"[^\"]+\")"
    fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
    FPAT = fp[fpat+0]
}
kind regards,
cph1968