Re: nested field separators



Ed Morton wrote:
<snip>
Thanks for the RFC tip. I've modified my script to handle "escaped quotes" as defined there as a contiguous pair of double quotes, to convert tabs in the original file to spaces and to only use a comma as the separator by default, instead of a comma followed by a space as in the OP. So, the script is now:

Some cleanup, a bug fix, and some clarification:

----------------
$ cat decsv.awk
# This script, decsv.awk, converts CSV files that conform to
# the RFC http://tools.ietf.org/html/rfc4180 to a tab-separated
# fields format by default, leaving the result in $0. To
# use this as a preprocessor for your own script, just invoke
# it as:
# awk -f decsv.awk -f myscript.awk csvfile

BEGIN {
# decsv_co = the string that represents a "comma" in CSV.
# FS = the eventual Field Separator you want to use.
# decsv_fs = the string to replace a FS within fields with.
# decsv_rs = the string to replace a RS within fields with.
# So, to process a file with a separator of a comma followed
# by a space instead of just a comma would be:
# awk -v decsv_co=",[[:space:]]" ...
decsv_co=(decsv_co ~ /^$/ ? "," : decsv_co)
FS=(FS ~ /^ $/ ? "\t" : FS)
decsv_fs=(decsv_fs ~ /^$/ ? " " : decsv_fs)
decsv_rs=(decsv_rs ~ /^$/ ? " " : decsv_rs)
}

function decsv_qq2dQ() {
# Converts every letter d to dD and every pair of contiguous
# double quotes to dQ (which due to the preceding
# modification cannot exist in the text otherwise).
gsub( /d/,"dD")
gsub(/""/,"dQ")
}

function decsv_dQ2q() {
# Converts every dQ to a single double quote (since escaping
# it is no longer necessary) and every dD back to d.
gsub(/dQ/,"\"")
gsub(/dD/, "d")
}

function decsv_FS2fs() {
# Converts all FS strings to the decsv_fs string.
gsub(FS,decsv_fs)
}

function decsv_RS2rs( inStr,quoteCnt) {
# Converts RS strings inside of quotes to the decsv_rs string.
inStr = $0
quoteCnt = gsub(/"/,"",inStr)
while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
$0 = $0 decsv_rs inStr
quoteCnt += gsub(/"/,"",inStr)
FNR--
NR--
}
}

function decsv_co2FS( inStr,quoteCnt,subStr) {
# Converts decsv_co separators outside of quotes to the
# new FS string and discards the now-redundant pairs of
# enclosing double quotes.
inStr = $0
$0 = ""
quoteCnt = 0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(decsv_co,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(decsv_co,FS,inStr)
$0 = $0 inStr
}

function decsv() {
decsv_qq2dQ()
decsv_FS2fs()
decsv_RS2rs()
decsv_co2FS()
decsv_dQ2q()
}

{ decsv() }
------------------

1) I fixed a bug above where I wasn't decrementing FNR appropriately inside RS2rs.
2) I cleaned up the global name space so that the above function names and global variables (co, fs, and rs) would be unlikely to clash with a subsequent script by prefixing them all with "decsv_".
3) I got rid of the "; print" at the end of the script so you don't need an intermediate file to process your CSV file since the conversion above stores each record from the csvfile as you'd expect in $0. So all you need to do is:

awk -f decsv.awk -f myscript.awk csvfile

where "myscript.awk" contains whatever script you really want to run on your file based on it being formated as FS-separated fields rather than CSV. If you just want the file printed, then you just need to put "1" or "{ print }" in "myscript.awk" (or at the end of "decsv.awk").

For example:
------
$ cat csvfile
1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
3,123,4""56,789,numbers with escaped quote
4,,"", ,empty fields
5,abc,"def
ghi",234,embedded newline
6,abc,"def "" ghi",789,quoted fields

------
$ cat field4num.awk
# print records that contain a number in the 4th field
$4 ~ /[[:digit:]]/ {
printf "NR=%d, FNR=%d, NF=%d, $0=<%s>\n",NR,FNR,NF,$0
for (i=1;i<=NF;i++) {
printf "\t%d:<%s>\n",i,$i
}
}

------
$ awk -f decsv.awk -f field4num.awk csvfile
NR=3, FNR=3, NF=5, $0=<3 123 4"56 789 numbers with escaped quote>
1:<3>
2:<123>
3:<4"56>
4:<789>
5:<numbers with escaped quote>
NR=5, FNR=5, NF=5, $0=<5 abc def ghi 234 embedded newline>
1:<5>
2:<abc>
3:<def ghi>
4:<234>
5:<embedded newline>
NR=6, FNR=6, NF=5, $0=<6 abc def " ghi 789 quoted fields>
1:<6>
2:<abc>
3:<def " ghi>
4:<789>
5:<quoted fields>
------

Regards,

Ed.

.



Relevant Pages

  • Re: How to pass string in command line argument.
    ... > I want to pass a command argument to perl script with double quotes (" ... > just a straight string is geeting passed ...
    (perl.beginners)
  • Re: basic help to run an application from WSH
    ... The Run method requires at least one argument, which is a string. ... string values must be enclosed in quotes. ... The Run method will execute on c:\notepad.exe. ... I'm just learning vbscript for wsh and here is my first script, ...
    (microsoft.public.scripting.vbscript)
  • Re: How to pass string in command line argument.
    ... Perl Pra wrote: ... i have perl script that searches given string (the string should be passed ... The string should be sent to the script with double quotes attached to it ...
    (perl.beginners)
  • Re: Update existing values incrementally w/UPDATE SQL
    ... I think it would be safe to say, use the Dim statement any time you are ... Dim basically tells vba that you want to initiate a new variable. ... As far as quotes go, this was one of the trickiest ... Consider vba's interpretation of a string: ...
    (microsoft.public.access.modulesdaovba)
  • Re: Update existing values incrementally w/UPDATE SQL
    ... and in the book; Access 2007 VBA Programmer's Reference. ... As far as quotes go, this was one of the trickiest ... Consider vba's interpretation of a string: ... when an SQL is processed (I use an SQL example because it is the most ...
    (microsoft.public.access.modulesdaovba)