Re: manual section 4.7.1
From: Andrew J. Schorr
Subject: Re: manual section 4.7.1
Date: Tue, 4 Apr 2023 11:48:03 -0400
User-agent: Mutt/1.5.21 (2010-09-15)
Hi Ed,
The CSV file format defines the record terminator as CR LF, so the new --csv
option does in fact strip CRs.
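
For scripts that cannot use --csv, one workaround is to strip the trailing CR
before any field splitting happens. The following is only a minimal sketch: the
helper name strip_cr is made up for illustration, the sample data is invented,
and plain -F, splitting stands in for the FPAT patterns from the manual (it does
not handle quoted fields, or CRs embedded inside them).

```sh
# Hypothetical helper (not part of gawk): remove one trailing CR per record.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }'
}

# Demo on two CRLF-terminated records. Without strip_cr, the last field
# would still carry the CR and the closing ">" would overwrite column one.
printf 'ID,CASRN\r\n1,50-00-0\r\n' | strip_cr |
    awk -F, '{ printf "NF = %d <%s>\n", NF, $NF }'
```

The same filter could be placed in front of an FPAT-based script, for example
`head -n 2 TSCAINV_022023.csv | strip_cr | gawk -f print-fields.awk`.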
Regards,
Andy
On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
> Are you sure in the FPAT output you're not just seeing the expected
> effects of there being a CR in your data? The `--csv` output is the
> one that looks wrong to me if you have `CR`s at the end of each
> line, unless `--csv` is documented to strip `CR`s from the output.
>
> Please provide the input file you used as it's hard to tell what's
> going on from just the output. Also pipe the output to `cat -v` or
> `od -c` or similar so we can see where the CRs are in the output but
> my best guess right now is that `FPAT` is retaining the CRs as
> expected while `--csv` is stripping them (which may or may not be
> expected - I'm not familiar with that option).
>
> Ed.
>
> On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
> >the regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV file
> >records end in both CR and NL [0x0D 0x0A]. I believe this is a common
> >feature of Windows files.
> >A simple fix, however, is to use the gawk --csv option.
> >
> >❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
> >>ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> >>F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> >>1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> >>F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> >note here that the last '>' is printed as the first character of the line
> >(the CR returns the cursor to column one).
> >
> >output using the --csv option:
> >❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
> ><ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
> >NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
> ><1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
> >NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
> >
> >much better :-)
> >
> >❯ cat print-fields.awk
> >{
> >    print "<" $0 ">"
> >    printf("NF = %s ", NF)
> >    for (i = 1; i <= NF; i++) {
> >        printf("<%s>", $i)
> >    }
> >    print ""
> >}
> >
> >
> >from section 4.7.1:
> >BEGIN {
> >    fp[0] = "([^,]+)|(\"[^\"]+\")"
> >    fp[1] = "([^,]*)|(\"[^\"]+\")"
> >    fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
> >    FPAT = fp[fpat+0]
> >}
> >
> >
> >
> >kind regards,
> >
> >cph1968
> >