bug-gawk

Re: manual section 4.7.1


From: Manuel Collado
Subject: Re: manual section 4.7.1
Date: Wed, 5 Apr 2023 10:00:44 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

FPAT parsing is applied to an input record that has already been read using the current RS value. To use FPAT to process CSV data with CRLF line endings, you should set

   RS = "\r?\n"

That is, the trailing \r at the end of each CSV input record has nothing to do with the FPAT regex. It is simply what the default RS = "\n" leaves in the record.

HTH. Regards.

On 5/4/23 at 8:23, cph1968@proton.me wrote:
Hi all,
For your perusal, this is a dump of the first two lines of the file (8 MB).
Records are terminated by 0x0d 0x0a.

❯ head -n 2 TSCAINV_022023.csv| hexdump -C
00000000  49 44 2c 43 41 53 52 4e  2c 63 61 73 72 65 67 6e  |ID,CASRN,casregn|
00000010  6f 2c 55 49 44 2c 45 58  50 2c 43 68 65 6d 4e 61  |o,UID,EXP,ChemNa|
00000020  6d 65 2c 44 45 46 2c 55  56 43 42 2c 46 4c 41 47  |me,DEF,UVCB,FLAG|
00000030  2c 41 43 54 49 56 49 54  59 0d 0a 31 2c 35 30 2d  |,ACTIVITY..1,50-|
00000040  30 30 2d 30 2c 35 30 30  30 30 2c 2c 2c 46 6f 72  |00-0,50000,,,For|
00000050  6d 61 6c 64 65 68 79 64  65 2c 2c 2c 2c 41 43 54  |maldehyde,,,,ACT|
00000060  49 56 45 0d 0a                                    |IVE..|
00000065



kind regards,
cph1968


------- Original Message -------
On Tuesday, April 4th, 2023 at 17:48, Andrew J. Schorr 
<aschorr@telemetry-investments.com> wrote:


Hi Ed,


The CSV file format defines the record terminator as CR LF, so the new --csv
option does in fact strip CRs.


Regards,
Andy


On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:


Are you sure in the FPAT output you're not just seeing the expected
effects of there being a CR in your data? The `--csv` output is the
one that looks wrong to me if you have `CR`s at the end of each
line, unless `--csv` is documented to strip `CR`s from the output.


Please provide the input file you used as it's hard to tell what's
going on from just the output. Also pipe the output to `cat -v` or
`od -c` or similar so we can see where the CRs are in the output but
my best guess right now is that `FPAT` is retaining the CRs as
expected while `--csv` is stripping them (which may or may not be
expected - I'm not familiar with that option).


Ed.


On 4/4/2023 5:12 AM, cph1968@proton.me wrote:


The regex fp[2] in section 4.7.1 (below) doesn't quite cut it if the CSV file
records end in both CR and NL [0x0D 0x0A]. I believe this is a common feature
of Windows files.
A simple fix, however, is to use the gawk --csv option.


❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk


ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
Note here that the last '>' is the first character on the next line.


output using the --csv option:
❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
<ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
<1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>


much better :-)


❯ cat print-fields.awk
{
    # Show the whole record, then each field, in angle brackets
    print "<" $0 ">"
    printf("NF = %s ", NF)
    for (i = 1; i <= NF; i++) {
        printf("<%s>", $i)
    }
    print ""
}


from section 4.7.1:
BEGIN {
    fp[0] = "([^,]+)|(\"[^\"]+\")"
    fp[1] = "([^,]*)|(\"[^\"]+\")"
    fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
    FPAT = fp[fpat+0]
}


kind regards,


cph1968

--
Manuel Collado - http://mcollado.z15.es


