Re: manual section 4.7.1
From: cph1968
Subject: Re: manual section 4.7.1
Date: Wed, 05 Apr 2023 06:23:05 +0000
hi all,
for your perusal, here is a hex dump of the first two lines of the file (8 MB).
Records are terminated by 0x0d 0x0a (CR LF).
❯ head -n 2 TSCAINV_022023.csv| hexdump -C
00000000 49 44 2c 43 41 53 52 4e 2c 63 61 73 72 65 67 6e |ID,CASRN,casregn|
00000010 6f 2c 55 49 44 2c 45 58 50 2c 43 68 65 6d 4e 61 |o,UID,EXP,ChemNa|
00000020 6d 65 2c 44 45 46 2c 55 56 43 42 2c 46 4c 41 47 |me,DEF,UVCB,FLAG|
00000030 2c 41 43 54 49 56 49 54 59 0d 0a 31 2c 35 30 2d |,ACTIVITY..1,50-|
00000040 30 30 2d 30 2c 35 30 30 30 30 2c 2c 2c 46 6f 72 |00-0,50000,,,For|
00000050 6d 61 6c 64 65 68 79 64 65 2c 2c 2c 2c 41 43 54 |maldehyde,,,,ACT|
00000060 49 56 45 0d 0a |IVE..|
00000065
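For anyone who cannot use --csv, a minimal sketch of stripping the trailing CR by hand before looking at the fields. The helper name show_fields and the sample records are made up for illustration, and the plain -F, split does not handle quoted fields; it only shows that removing the CR from $0 is enough to clean up the last field.

```shell
# Hypothetical helper: print the fields of each record after removing a trailing CR.
show_fields() {
    awk -F, '
    {
        sub(/\r$/, "")          # drop a trailing CR; changing $0 re-splits the fields
        printf("NF = %d ", NF)
        for (i = 1; i <= NF; i++)
            printf("<%s>", $i)
        print ""
    }'
}

# Two CRLF-terminated sample records, shaped like the dump above.
printf 'ID,CASRN,ACTIVITY\r\n1,50-00-0,ACTIVE\r\n' | show_fields
```

The same sub() call should work in a gawk program that sets FPAT, since assigning to $0 re-splits the record with whatever field-splitting mode is in effect.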
kind regards,
cph1968
------- Original Message -------
On Tuesday, April 4th, 2023 at 17:48, Andrew J. Schorr
<aschorr@telemetry-investments.com> wrote:
> Hi Ed,
>
> The CSV file format defines the record terminator as CR LF, so the new --csv
> option does in fact strip CRs.
>
> Regards,
> Andy
>
> On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
>
> > Are you sure in the FPAT output you're not just seeing the expected
> > effects of there being a CR in your data? The `--csv` output is the
> > one that looks wrong to me if you have `CR`s at the end of each
> > line, unless `--csv` is documented to strip `CR`s from the output.
> >
> > Please provide the input file you used as it's hard to tell what's
> > going on from just the output. Also pipe the output to `cat -v` or
> > `od -c` or similar so we can see where the CRs are in the output but
> > my best guess right now is that `FPAT` is retaining the CRs as
> > expected while `--csv` is stripping them (which may or may not be
> > expected - I'm not familiar with that option).
> >
> > Ed.
> >
> > On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
> >
> > > the regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV
> > > file records end in both CR and NL [0x0D 0x0A]. I believe this is a
> > > common feature of Windows files.
> > > A simple fix, however, is to use the gawk --csv option.
> > >
> > > ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
> > >
> > > > ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > > > F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > > > 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> > > > F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> > > > note here that the last '>' is first character on the next line.
> > >
> > > output using the --csv option:
> > > ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
> > > <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
> > > NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
> > > <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
> > > NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
> > >
> > > much better :-)
> > >
> > > ❯ cat print-fields.awk
> > > {
> > >     print "<" $0 ">"
> > >     printf("NF = %s ", NF)
> > >     for (i = 1; i <= NF; i++) {
> > >         printf("<%s>", $i)
> > >     }
> > >     print ""
> > > }
> > >
> > > from section 4.7.1:
> > > BEGIN {
> > >     fp[0] = "([^,]+)|(\"[^\"]+\")"
> > >     fp[1] = "([^,]*)|(\"[^\"]+\")"
> > >     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
> > >     FPAT = fp[fpat+0]
> > > }
> > >
> > > kind regards,
> > >
> > > cph1968