Re: changing a field without recompiling the record
- From: pk <pk@xxxxxxxxxx>
- Date: Mon, 01 Dec 2008 09:50:44 +0100
On Saturday 29 November 2008 19:49, Aharon Robbins wrote:
Well, I don't know the inner workings of the algorithm that parses the
line into fields according to FS, but I suppose it involves some kind of
loop over the line, each time checking whether FS matches and, if so,
taking what's before the match as the field and what matched as the
separator (roughly, at least). If that's so (which of course could be
wrong), all the information needed is already available as part of the
process, and all that would be needed is assigning those offsets into a
special array, so that, say, OFF[1]...OFF[NF] would contain the offsets in
the line where $1...$NF start. After that, one could easily pull out the
parts of $0 that matched FS based on OFF[n] and length($n). Or awk could
directly assign the parts that matched FS to a special array, e.g.
SEP[1]...SEP[NF-1].
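To make the idea concrete, here is a rough sketch of such a loop. It is written in Python rather than awk, and it only illustrates the proposal: it is not how gawk actually splits records, the function name split_with_offsets is made up, and FS is assumed to be a regexp that never matches the empty string (the special single-space FS is not handled).

```python
import re

def split_with_offsets(record, fs):
    """Split record on the regexp fs, awk-style.

    Returns (fields, off, sep) where off[i] is the 1-based offset in
    record at which field i+1 starts, and sep[i] is the text that
    matched fs between field i+1 and field i+2.
    """
    if record == "":
        return [], [], []          # awk: an empty record has NF == 0
    fields, off, sep = [], [], []
    start = 0
    for m in re.finditer(fs, record):
        if m.end() == m.start():
            continue               # skip empty matches (assumed not to occur)
        fields.append(record[start:m.start()])
        off.append(start + 1)      # awk offsets count from 1
        sep.append(m.group())
        start = m.end()
    # whatever follows the last separator is the last field
    fields.append(record[start:])
    off.append(start + 1)
    return fields, off, sep

fields, off, sep = split_with_offsets("a, b;c", "[,;] *")
# Interleaving fields and separators gives back the original record.
rebuilt = "".join(f + s for f, s in zip(fields, sep + [""]))
```

With both arrays available, the record can be put back together exactly as it was read, which is the point of the proposal.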
"all that would be needed..." :-)
You'd be surprised how expensive these little things are, and how they
add up in CPU time when you have large numbers of records.
When gawk first got RS as a regexp and the RT variable, I did the simple
thing and cleared and then set RT on each record. This turned out to be
very expensive, especially as most of the time the value was the same:
"\n". When I made the code a little smarter to only set RT if it changed,
I/O speed improved considerably.
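The "only set RT if it changed" optimization might look roughly like this. The sketch is in Python and is purely illustrative: the class Reader and the counter rt_updates are invented for the example, RS is assumed never to match the empty string, and none of this is taken from the gawk sources.

```python
import re

class Reader:
    """Reads records terminated by a regexp RS and tracks RT."""

    def __init__(self, rs):
        self.rs = re.compile(rs)
        self.RT = ""
        self.rt_updates = 0        # how many times RT was really rewritten

    def records(self, text):
        pos = 0
        while pos < len(text):
            m = self.rs.search(text, pos)
            if m is None or not m.group():
                # no terminator left: the rest of the input is the last record
                record, term, pos = text[pos:], "", len(text)
            else:
                record, term, pos = text[pos:m.start()], m.group(), m.end()
            # the optimization: touch RT only when its value changes
            if term != self.RT:
                self.RT = term
                self.rt_updates += 1
            yield record

r = Reader("\n")
recs = list(r.records("a\nb\nc\n"))
```

For ordinary newline-terminated input, RT is assigned once for the first record and then left alone for every record after it.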
RT is a really cool feature, I use it all the time. Many many thanks for
introducing it.
Furthermore, for efficiency, gawk does not parse the record as soon as
it reads it. Instead, it only parses the record up to the largest field
that is accessed, and only parses the record fully when it's needed
(such as for the value of NF or $NF).
This works well, but the code that manages it isn't simple.
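A toy model of this kind of lazy parsing, in Python, could look like the following. It is only a sketch of the behavior described above, not gawk's code: the class LazyRecord and its members are hypothetical, and FS is assumed to be a regexp that never matches the empty string.

```python
import re

class LazyRecord:
    """Splits a record into fields only as far as they are asked for."""

    def __init__(self, text, fs):
        self.text = text
        self.fs = re.compile(fs)
        self.fields = []           # fields parsed so far
        self.pos = 0               # where the next field starts
        self.done = (text == "")   # an empty record has no fields

    def _parse_next(self):
        m = self.fs.search(self.text, self.pos)
        if m is None or not m.group():
            # no further separator: the rest of the record is the last field
            self.fields.append(self.text[self.pos:])
            self.done = True
        else:
            self.fields.append(self.text[self.pos:m.start()])
            self.pos = m.end()

    def field(self, n):
        """Return $n, parsing only up to field n."""
        while len(self.fields) < n and not self.done:
            self._parse_next()
        return self.fields[n - 1] if n <= len(self.fields) else ""

    @property
    def nf(self):
        """NF needs the whole record, so this forces a full parse."""
        while not self.done:
            self._parse_next()
        return len(self.fields)

rec = LazyRecord("a:b:c:d", ":")
second = rec.field(2)              # parses only the first two fields
```

Even in this small form there are already two code paths (partial and full parsing) sharing state, which hints at why the real thing isn't simple.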
Setting these arrays unconditionally would require fully parsing the
record every time, and setting them only when referenced would introduce a
lot more complexity than I really want to deal with or that I really feel
is necessary. Particularly for something that would not be used often.
Ok, I admit my remark was probably naive, but I didn't want to sound *that*
naive :-)
First of all, I never suggested setting the arrays unconditionally for
each record. As I see it, this would be an optional feature, probably
neither requested nor needed 95% of the time. That means that 95% of the
time nothing changes for the end user (or even 100%, if he doesn't want
the feature).
And even if the user wants it (with a command-line switch, for example), of
course it can be implemented in the most optimized way possible. Above you
said:
"for efficiency, gawk does not parse the record as soon as it reads it.
Instead, it only parses the record up to the largest field that is
accessed, and only parses the record fully when it's needed (such as for
the value of NF or $NF)."
Good. My bet is that if the user does not need to access a field, or does
not need to know NF, he does not need to access that special array either.
So parsing the record and filling the array could be done at the same time,
i.e. only when absolutely necessary.
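What I have in mind is something like the following sketch. It is Python, not awk, and certainly not gawk's implementation: the class Record, the keep_seps flag (standing in for the command-line switch) and the seps list (standing in for SEP) are all invented for illustration, and FS is assumed never to match the empty string.

```python
import re

class Record:
    """Parses fields on demand; records separators only if asked to."""

    def __init__(self, text, fs, keep_seps=False):
        self.fields = []
        self.seps = [] if keep_seps else None   # None: feature switched off
        self._scan = self._scanner(text, re.compile(fs))

    def _scanner(self, text, fs):
        # Generator: each next() call parses exactly one more field.
        pos = 0
        for m in fs.finditer(text):
            if not m.group():
                continue                        # ignore empty matches
            self.fields.append(text[pos:m.start()])
            if self.seps is not None:
                self.seps.append(m.group())     # stored in the same pass
            pos = m.end()
            yield
        self.fields.append(text[pos:])          # last field, no separator
        yield

    def field(self, n):
        while len(self.fields) < n:
            try:
                next(self._scan)
            except StopIteration:
                return ""                       # fewer than n fields
        return self.fields[n - 1]

rec = Record("a, b;c", "[,;] *", keep_seps=True)
second = rec.field(2)    # fills fields and seps only as far as $2
```

The separator following a field is stored at the moment that field is parsed, so nothing is computed for fields that are never touched, and with the switch off the only extra cost per field is one comparison.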
However, since I don't know the actual inner workings of gawk, I have to
agree with you that, even with all possible optimizations, implementing the
feature could still be too complex or expensive compared with its
real-life usefulness and user demand.
Thanks for your answers!