Re: More on BEGINFILE / ENDFILE
- From: Manuel Collado <m.collado@xxxxxxxxxxxxxx>
- Date: Fri, 30 Jan 2009 23:28:29 +0100
Aharon Robbins wrote:
In article <gl9evo$60e$1@xxxxxxxxxxxxxxxxxx>,
Manuel Collado <m.collado@xxxxxxxxxxxxxx> wrote:
[...]
Hummm... you said that input errors are always fatal, and not reported through an unredirected getline. Right?
Please also consider that ENDFILE is an appropriate place to catch errors while reading records.
What kind of errors show up while reading records that are catchable?
Can you give me an explicit example, because I'm not understanding you.
I wonder if the following code could be used to test for readability of a set of files, and report failures:
    BEGINFILE {
        if (ERRNO) {
            print "error opening " FILENAME ": " ERRNO
            ERRNO = ""
        }
    }
That works now; you should then add a nextfile to skip the bad file.
    ENDFILE {
        if (ERRNO) {
            print "error reading " FILENAME " after line " FNR ": " ERRNO
        }
    }
I don't think anything would ever get here. If the file cannot be opened
or is a directory, you catch it in the BEGINFILE block.
If the file was already successfully opened, and if, say, an NFS file
went away and read() returned -1, then getline would return -1, instead
of falling into the ENDFILE block.
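The "catch it in BEGINFILE and skip the file" idea discussed above can be sketched as follows. This is only a minimal illustration, assuming a gawk build that implements BEGINFILE as discussed in this thread (ERRNO non-empty when the open failed, nextfile allowed in that case); the count array and the messages are made up for the example.

```awk
# Sketch only: assumes gawk with BEGINFILE support.
# Usage (hypothetical script name): gawk -f skipbad.awk file1 file2 ...

BEGINFILE {
    # ERRNO is non-empty here when the file could not be opened
    # (missing file, no permission, directory, ...).
    if (ERRNO != "") {
        printf "skipping %s: %s\n", FILENAME, ERRNO > "/dev/stderr"
        nextfile        # move on to the next command-line file
    }
}

# Ordinary main-loop processing; only reached for files that opened.
{ count[FILENAME]++ }

END {
    for (f in count)
        printf "%s: %d records\n", f, count[f]
}
```

Without the nextfile, the failed open would still be treated as a fatal error once the BEGINFILE rule finishes.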
And the normal case is to read the file by the input loop, with no explicit getline.
It's not clear what you're asking. A read error has two possible outcomes
in gawk:
1. If reading via getline from a command-line file, an error is returned.
2. If reading via the main input loop, the error is fatal.
You seem to be suggesting that in case 2, the error not be fatal, but
instead go into the ENDFILE block with ERRNO set.
Yes. This is exactly what I was suggesting.
Given the current structure of the code, this might be doable. I have
to think about this some more as to whether it's the right thing to do.
It is something I hadn't thought about.
Same by using getline alone:
    BEGIN {
        while ((getline x) != 0) {
            if ((getline x) < 0) {
                print "error reading " FILENAME " after line " FNR ": " ERRNO
            }
        }
    }
(Your loop, BTW, checks the file's readability for every record read;
not very efficient.)
Well, a file is fully readable if every record is readable.
Yes, but your code actually reads two records per iteration... :-)
Ooops! My mistake. It was late night when I wrote it. My neurons slipped :-(
I think you want:
    BEGIN {
        while ((val = (getline x)) != 0) {
            if (val < 0)
                print "error reading " FILENAME " after line " FNR ": " ERRNO
        }
    }
Of course.
Anyway, in practice, it is hard to have a case where a file is readable
part way through the processing and then suddenly becomes unreadable.
Well, if AWK gets true Unicode support sometime in the future, we could have errors in the middle of a UTF-8 file, like "invalid byte sequence".
I have to wonder if trying to catch it isn't a case of diminishing returns.
I'm not a salesman :-)
Seriously, forcing every user to take responsibility for error handling will diminish returns. But allowing it for users who explicitly request it, instead of forbidding it, would be a good thing.
The gawk manual shows you how to do this by simply looping through ARGV,
using a redirected getline to test for readability and then removing
the bad element from ARGV.
This is cumbersome.
Not any more so than your double getline loop... :-)
Ditto.
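For reference, the ARGV approach mentioned above looks roughly like this. It is a sketch in the spirit of the gawk manual's example, not a verbatim copy; the variable name junk and the exact test for variable assignments are assumptions of this sketch.

```awk
# Sketch only: pre-check command-line files in BEGIN and drop the
# unreadable ones from ARGV so the main loop never sees them.

BEGIN {
    for (i = 1; i < ARGC; i++) {
        # Leave var=value assignments and standard input alone.
        if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=/ || ARGV[i] == "-")
            continue

        # A redirected getline returns -1 if the file cannot be read.
        if ((getline junk < ARGV[i]) < 0) {
            printf "cannot read %s, skipping\n", ARGV[i] > "/dev/stderr"
            delete ARGV[i]
        } else
            close(ARGV[i])      # reopen from the start in the main loop
    }
}

# Normal processing of the remaining files goes here.
{ print FILENAME ": " $0 }
```

Note that this only tests whether the first record can be read; it says nothing about errors later in the file, which is the case under discussion here.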
If input errors are not flagged as fatal, the above examples will process all arguments, even if some of them are unreadable.
I understand, but I think the current BEGINFILE semantics give you
the hooks to handle things adequately.
In most cases yes, but not always. In particular the XML extension of xgawk really needs a place to report errors in the middle of a file, without stopping further processing of other files. I see ENDFILE as an excellent point for handling these errors (XML parsing and encoding conversion errors).
Do these errors show up as a result of calls to read? Are they currently
fatal errors?
In xgawk, reading input in XML mode is handled by feeding input text chunks to the expat parser, which in turn delivers "records" in the form of XML SAX events. So XML files that are not well-formed generate faulty "records" in the middle of the file, and these are considered non-fatal. The current action is to set ERRNO, automatically ignore the rest of the file, and proceed to the next input file.
This means that the error notification is mixed with the next valid record, by which time FILENAME and the other special values related to the faulty record have already been updated to refer to the new current record (no problem if the faulty file is the last one).
We could create a special XMLERROR event to notify non-fatal errors at the same level as normal records, but the addition of the ENDFILE feature opens another possibility for reporting non-fatal errors and automatically continuing with the next input file. And this new possibility can also be used to report errors in regular text files, not only XML ones.
In other words, with the ENDFILE patch in place, can you not make use of
it as you want?
Yes. It can be used.
What other changes are needed in the gawk internals to give
you what you're looking for?
None w.r.t. XML processing. But I've always been surprised by the fact that input errors are so strictly flagged as fatal. I guess that this early design decision stems from the fact that there is no adequate place to notify user code of them, other than the return code of getline. And handling faulty records read via getline as non-fatal while faulty records read by the main loop are fatal is certainly an inconsistency.
But the addition of the ENDFILE block creates a new place for reporting main input read errors. Awk scripts with an ENDFILE rule can potentially catch all input read errors in a uniform way.
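A script written along those lines might look like the sketch below. It assumes the behavior proposed in this thread, namely that a read error in the main input loop is not fatal when an ENDFILE rule is present and instead reaches ENDFILE with ERRNO set; the bad array and the message texts are purely illustrative.

```awk
# Sketch only: collect open and read failures per file and report
# them all at the end, instead of dying on the first one.

BEGINFILE {
    if (ERRNO != "") {                  # open failed
        bad[FILENAME] = "cannot open: " ERRNO
        nextfile
    }
}

# ... normal per-record rules go here ...

ENDFILE {
    # Assumed: ERRNO is set here if reading the file failed part way.
    # The guard keeps an earlier open failure from being overwritten.
    if (ERRNO != "" && !(FILENAME in bad))
        bad[FILENAME] = "read error after record " FNR ": " ERRNO
}

END {
    failures = 0
    for (f in bad) {
        printf "%s: %s\n", f, bad[f] > "/dev/stderr"
        failures++
    }
    exit (failures > 0)     # non-zero exit status if any file failed
}
```

The point of the design is that both kinds of failure end up in one table, so the caller sees every bad file and all readable files are still processed.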
All said, I agree that adding new features to gawk must be done very carefully. Perhaps tawk can be used as a reference model in this particular case.
Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado