From: Manuel Collado
Subject: Re: manual section 4.7.1
Date: Wed, 5 Apr 2023 10:00:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

RS = "\r?\n"

I.e., the trailing \r at the end of the CSV input record is not related at all to the FPAT regex. Instead, it is what the default RS = "\n" leaves at the end of each record.

HTH.
Regards.

On 5/4/23 at 8:23, cph1968@proton.me wrote:

hi all,

for your perusal this is the dump of the first two lines of the file (8MB)
records are terminated by 0x0d 0x0a

❯ head -n 2 TSCAINV_022023.csv| hexdump -C
00000000  49 44 2c 43 41 53 52 4e  2c 63 61 73 72 65 67 6e  |ID,CASRN,casregn|
00000010  6f 2c 55 49 44 2c 45 58  50 2c 43 68 65 6d 4e 61  |o,UID,EXP,ChemNa|
00000020  6d 65 2c 44 45 46 2c 55  56 43 42 2c 46 4c 41 47  |me,DEF,UVCB,FLAG|
00000030  2c 41 43 54 49 56 49 54  59 0d 0a 31 2c 35 30 2d  |,ACTIVITY..1,50-|
00000040  30 30 2d 30 2c 35 30 30  30 30 2c 2c 2c 46 6f 72  |00-0,50000,,,For|
00000050  6d 61 6c 64 65 68 79 64  65 2c 2c 2c 2c 41 43 54  |maldehyde,,,,ACT|
00000060  49 56 45 0d 0a                                    |IVE..|
00000065

kind regards,
cph1968

Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, April 4th, 2023 at 17:48, Andrew J. Schorr <aschorr@telemetry-investments.com> wrote:

Hi Ed,

The CSV file format defines the record terminator as CR LF, so the new --csv option does in fact strip CRs.

Regards,
Andy

On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:

Are you sure in the FPAT output you're not just seeing the expected effects of there being a CR in your data? The `--csv` output is the one that looks wrong to me if you have `CR`s at the end of each line, unless `--csv` is documented to strip `CR`s from the output.

Please provide the input file you used as it's hard to tell what's going on from just the output. Also pipe the output to `cat -v` or `od -c` or similar so we can see where the CRs are in the output but my best guess right now is that `FPAT` is retaining the CRs as expected while `--csv` is stripping them (which may or may not be expected - I'm not familiar with that option).

Ed.

On 4/4/2023 5:12 AM, cph1968@proton.me wrote:

the regex fp[2] in section 4.7.1 (below) don't quite cut it if the CSV file records end in both CR and NL [0H0D 0H0A]. I believe this is a common feature of Windows files.
A simple fix is however to use the gawk --csv option.

❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE

note here that the last '>' is first character on the next line.

output using the --csv option:

❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
<1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>

much better :-)

❯ cat print-fields.awk
{
    print "<" $0 ">"
    printf("NF = %s ", NF)
    for (i = 1; i <= NF; i++) {
        printf("<%s>", $i)
    }
    print ""
}

from section 4.7.1:

BEGIN {
    fp[0] = "([^,]+)|(\"[^\"]+\")"
    fp[1] = "([^,]*)|(\"[^\"]+\")"
    fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
    FPAT = fp[fpat+0]
}

kind regards,
cph1968

-- Manuel Collado - http://mcollado.z15.es