Re: nested field separators
- From: Ed Morton <morton@xxxxxxxxxxxxxx>
- Date: Thu, 12 Jul 2007 10:56:37 -0500
Ed Morton wrote:
<snip>
Thanks for the RFC tip. I've modified my script to handle "escaped quotes" as defined there as a contiguous pair of double quotes, to convert tabs in the original file to spaces and to only use a comma as the separator by default, instead of a comma followed by a space as in the OP. So, the script is now:
Some cleanup, a bug fix, and some clarification:
----------------# This script, decsv.awk, converts CSV files that conform to
$ cat decsv.awk
# the RFC http://tools.ietf.org/html/rfc4180 to a tab-separated
# fields format by default, leaving the result in $0. To
# use this as a preprocessor for your own script, just invoke
# it as:
# awk -f decsv.awk -f myscript.awk csvfile
BEGIN {
# decsv_co = the string that represents a "comma" in CSV.
# FS = the eventual Field Separator you want to use.
# decsv_fs = the string to replace a FS within fields with.
# decsv_rs = the string to replace a RS within fields with.
# So, to process a file with a separator of a comma followed
# by a space instead of just a comma would be:
# awk -v decsv_co=",[[:space:]]" ...
decsv_co=(decsv_co ~ /^$/ ? "," : decsv_co)
FS=(FS ~ /^ $/ ? "\t" : FS)
decsv_fs=(decsv_fs ~ /^$/ ? " " : decsv_fs)
decsv_rs=(decsv_rs ~ /^$/ ? " " : decsv_rs)
}
function decsv_qq2dQ() {
# Converts every letter d to dD and every pair of contiguous
# double quotes to dQ (which due to the preceding
# modification cannot exist in the text otherwise).
gsub( /d/,"dD")
gsub(/""/,"dQ")
}
function decsv_dQ2q() {
# Converts every dQ to a single double quote (since escaping
# it is no longer necessary) and every dD back to d.
gsub(/dQ/,"\"")
gsub(/dD/, "d")
}
function decsv_FS2fs() {
# Converts all FS strings to the decsv_fs string.
gsub(FS,decsv_fs)
}
function decsv_RS2rs( inStr,quoteCnt) {
# Converts RS strings inside of quotes to the decsv_rs string.
inStr = $0
quoteCnt = gsub(/"/,"",inStr)
while ( (quoteCnt%2) && ((getline inStr) > 0) ) {
$0 = $0 decsv_rs inStr
quoteCnt += gsub(/"/,"",inStr)
FNR--
NR--
}
}
function decsv_co2FS( inStr,quoteCnt,subStr) {
# Converts decsv_co separators outside of quotes to the
# new FS string and discards the now-redundant pairs of
# enclosing double quotes.
inStr = $0
$0 = ""
quoteCnt = 0
while (match(inStr,/[^"]*\"/)) {
subStr = substr(inStr,1,RLENGTH-1)
if ((++quoteCnt)%2) {
gsub(decsv_co,FS,subStr)
}
$0 = $0 subStr
inStr = substr(inStr,RLENGTH+1)
}
gsub(decsv_co,FS,inStr)
$0 = $0 inStr
}
function decsv() {
decsv_qq2dQ()
decsv_FS2fs()
decsv_RS2rs()
decsv_co2FS()
decsv_dQ2q()
}
{ decsv() }
------------------
1) I fixed a bug above where I wasn't decrementing FNR appropriately inside RS2rs.
2) I cleaned up the global name space so that the above function names and global variables (co, fs, and rs) would be unlikely to clash with a subsequent script by prefixing them all with "decsv_".
3) I got rid of the "; print" at the end of the script so you don't need an intermediate file to process your CSV file since the conversion above stores each record from the csvfile as you'd expect in $0. So all you need to do is:
awk -f decsv.awk -f myscript.awk csvfile
where "myscript.awk" contains whatever script you really want to run on your file based on it being formated as FS-separated fields rather than CSV. If you just want the file printed, then you just need to put "1" or "{ print }" in "myscript.awk" (or at the end of "decsv.awk").
For example:
------
$ cat csvfile
1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
3,123,4""56,789,numbers with escaped quote
4,,"", ,empty fields
5,abc,"def
ghi",234,embedded newline
6,abc,"def "" ghi",789,quoted fields
------
$ cat field4num.awk
# print records that contain a number in the 4th field
$4 ~ /[[:digit:]]/ {
printf "NR=%d, FNR=%d, NF=%d, $0=<%s>\n",NR,FNR,NF,$0
for (i=1;i<=NF;i++) {
printf "\t%d:<%s>\n",i,$i
}
}
------
$ awk -f decsv.awk -f field4num.awk csvfile
NR=3, FNR=3, NF=5, $0=<3 123 4"56 789 numbers with escaped quote>
1:<3>
2:<123>
3:<4"56>
4:<789>
5:<numbers with escaped quote>
NR=5, FNR=5, NF=5, $0=<5 abc def ghi 234 embedded newline>
1:<5>
2:<abc>
3:<def ghi>
4:<234>
5:<embedded newline>
NR=6, FNR=6, NF=5, $0=<6 abc def " ghi 789 quoted fields>
1:<6>
2:<abc>
3:<def " ghi>
4:<789>
5:<quoted fields>
------
Regards,
Ed.
.
- References:
- nested field separators
- From: Sashi
- Re: nested field separators
- From: Ed Morton
- Re: nested field separators
- From: Ed Morton
- Re: nested field separators
- From: Sashi
- nested field separators
- Prev by Date: Re: Is there a 64 bit version von gawk for windows?
- Next by Date: Re: Is there a 64 bit version von gawk for windows?
- Previous by thread: Re: nested field separators
- Next by thread: Can one awk script call another?
- Index(es):
Relevant Pages
|