Re: Using a regexp as field separator does not work!
- From: Ed Morton <morton@xxxxxxxxxxxxxxxxxx>
- Date: Thu, 10 Jul 2008 20:32:46 -0500
On 7/10/2008 12:13 PM, Dave B wrote:
Ed Morton wrote:
Yes, I know that...but I'm not sure how awk determines it has encountered a
"field separator" if the regex '| *' is used as FS. What part of the
alternation is used? It seems that it's effectively treated like ' *', but
the corner case where "nothingness" matches (which ' *' allows, and which
would make it behave much like FS='') never happens. These seem to be
equivalent:
$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
But I don't know why.
I'm not sure I understand the question. An FS of a null character is another
special case, just like an FS of a single blank character is a special case. A
null character appearing as part of an FS that's an RE isn't treated the same as
a null character that IS an FS, just like a single blank character appearing as
part of an FS that's an RE isn't treated the same as a blank character that IS
an FS.
Agreed, although strictly speaking we don't have "null characters", but
rather "null" or "empty" regexes here.
So, when you write FS='| *' you're saying the FS is either nothing at all OR a
sequence of zero or more blanks. Yes, that doesn't make sense, so you can
obviously optimize it to ' *', but there are plenty of REs people write that
could be optimized, and awk doesn't try to analyze them or warn you about any
of them, other than a useless backslash.
Ok, I'll try to explain better. Suppose we have FS='x|y', and the input is
fooxbarybaz
We know that, when awk encounters 'x', FS matches, so awk decides that that
'x' is a field separator. The same happens when awk gets to 'y', later.
Every time, awk (actually, awk's regex engine) has used a certain part of
the alternation in the FS regex to try a match and decide if a piece of
input was to be considered a field separator (of course, this is a simple
regex, but it can be more complex, with each part matching longer strings,
or with a different structure. Also, I'm using a regex of the form x|y
because it's similar to the case at hand).
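To make the 'x|y' case concrete, here is a minimal sketch (the shell variable name out is only for illustration, it is not from the thread). Either branch of the alternation can end a field, so the record splits into three fields:

```shell
# Split on either branch of the alternation x|y, then print NF and each field.
out=$(echo 'fooxbarybaz' | awk -v FS='x|y' '{ print NF; for (i = 1; i <= NF; i++) print $i }')
echo "$out"
```

This prints 3, followed by foo, bar and baz on separate lines.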
Now, if FS='| *' (again an alternation), and the input is
abc
in theory, awk should immediately find a match for FS, since the part to the
left of "|" is an empty regex, which matches at the beginning of the string,
at the end, and between any two characters. And, the part to the right of
the "|" (" *") also allows matching an empty string, although awk should
not get that far, since the part to the left of the "|" already matches. But
this does not happen. Also, as each character is examined, awk should find a
match for the empty regex between any two characters, but again that doesn't
happen. But awk DOES know how to do that, because if you do
a="abc"; gsub(/| */,"X",a)
you correctly get
XaXbXcX
So, my doubt was: why isn't awk matching the null regex (using either part
of the alternation appearing in FS)? I guess the answer is: because FS is
special and does not work that way, unless FS is explicitly set to '' (GNU
awk only). Ok. But then, how does it work? Why does awk choose to treat a
nonsense FS like '| *' as if it were ' *'? What's the logic behind that?
Hope this was clearer.
Yes, it's clearer. I think the inclusion of a leading "|" is a red herring, since
the null string it would match is just a subset of what " *" can match. So you'd
expect exactly the same behavior from " *" and "| *", whether in gsub() or in
FS, and that is what you get.
The problem I think I see, though, is that I'd expect this:
$ echo "abc" | awk '{gsub(/ */,"X")}1'
XaXbXcX
to produce the same output as this:
$ echo "abc" | awk '{gsub(//,"X")}1'
XaXbXcX
which it does, or this:
$ echo "abc" | awk 'BEGIN{FS=" *";OFS="X"}{$1=$1}1'
abc
which it doesn't, and I can't think of any reason for that.
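For anyone who wants to compare the two behaviours on their own awk, a small sketch follows. The shell variable names subs and fields are mine, and the field count is printed rather than assumed, since how an FS that can match the empty string gets split is exactly the open question here and may differ between awk implementations.

```shell
# gsub() with a regex that can match the empty string: report the number of
# replacements along with the modified record.
subs=$(echo 'abc' | awk '{ n = gsub(/ */, "X"); print n, $0 }')
echo "gsub: $subs"

# The same regex used as FS: print NF and the rebuilt record, whatever they
# turn out to be on this awk.
fields=$(echo 'abc' | awk 'BEGIN { FS = " *"; OFS = "X" } { $1 = $1; print NF, $0 }')
echo "FS:   $fields"
```

Going by the output shown above, the gsub() line should report 4 replacements and XaXbXcX, while the FS line leaves the record as the single field abc.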
Regards,
Ed.