Re: More on BEGINFILE / ENDFILE



In article <glvuuu$ghk$1@xxxxxxxxxxxxxxxxxx>,
Manuel Collado <m.collado@xxxxxxxxxxxxxx> wrote:
It's not clear what you're asking. A read error has two possible outcomes
in gawk:

1. If reading via getline from a command-line file, an error is returned.
2. If reading via the main input loop, the error is fatal.

You seem to be suggesting that in case 2, the error not be fatal, but
instead go into the ENDFILE block with ERRNO set.

Yes. This is exactly what I was suggesting.

OK. The diff below should do this. It is relative to the BEGINFILE
patch. I will be updating that patch on http://www.skeeve.com shortly.

Anyway, in practice, it is hard to have a case where a file is readable
part way through the processing and then suddenly becomes unreadable.

Well, if AWK gets true Unicode support sometime in the future, we could
have errors in the middle of a UTF-8 file, like "invalid byte sequence".

I don't even want to think about this. This is the job iconv is meant to do.

Do these errors show up as a result of calls to read? Are they currently
fatal errors?

In xgawk, reading input in XML mode is handled by feeding chunks of input
text to the expat parser, which in turn delivers "records" in the form
of XML SAX events. So XML files that are not well-formed generate faulty
"records" in the middle of the file, and these are treated as non-fatal. The
current action is to set ERRNO, automatically ignore the rest of the
file, and proceed to the next input file.

This means that the error notification arrives mixed with the next valid
record, by which time FILENAME and other special values related to the faulty
record have already been updated to refer to the new record (no problem
if the faulty file is the last one).

We could create a special XMLERROR event to report non-fatal errors at
the same level as normal records, but the addition of the ENDFILE
feature opens another possibility: reporting non-fatal errors there and
automatically continuing with the next input file. This new
possibility can also be used to report errors in regular text files,
not only XML ones.
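
The "report the error at end of file, then move on to the next file"
behaviour described above can be sketched outside of gawk. This is only an
illustration in plain C, not xgawk or gawk code: scan_files() and
struct file_status are hypothetical names, and the status slot merely
stands in for what an ENDFILE block would see through ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-file summary: bytes consumed and the saved error
 * code (0 if the file was read to the end without trouble). */
struct file_status {
    long bytes;
    int  errcode;
};

/* Read every named file to the end. A failure to open or to read a file
 * is recorded in that file's status slot, and processing continues with
 * the next file instead of aborting. Returns the number of files that
 * ended with an error. */
static int scan_files(const char *const *names, size_t n,
                      struct file_status *st)
{
    char buf[4096];
    int failed = 0;

    assert(names != NULL && st != NULL);

    for (size_t i = 0; i < n; i++) {
        ssize_t got;
        int fd;

        st[i].bytes = 0;
        st[i].errcode = 0;

        fd = open(names[i], O_RDONLY);
        if (fd == -1) {
            st[i].errcode = errno;          /* could not open at all */
        } else {
            while ((got = read(fd, buf, sizeof buf)) > 0)
                st[i].bytes += got;
            if (got == -1)
                st[i].errcode = errno;      /* failed part way through */
            close(fd);
        }
        if (st[i].errcode != 0)
            failed++;
    }
    return failed;
}
```

The point of the sketch is only that the error is attached to the file
that caused it, before anything belonging to the next file is touched.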

I think the diff below gives you what you want, as long as your version
of get_a_record puts an appropriate value into the *errcode variable.
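
As a rough illustration of that contract (a simplified sketch, not the
real get_a_record, which takes an IOBUF): read_chunk() and input_failed()
below are hypothetical names. The reader returns EOF both at end of input
and on a read error, and tells the two apart only through *errcode, which
the caller must initialise to 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for a record reader. Returns the number of bytes
 * read, or EOF at end of input or on a read error. On error, and only if
 * errcode is non-NULL, *errcode receives the saved errno value. */
static int read_chunk(int fd, char *buf, size_t len, int *errcode)
{
    ssize_t n;

    assert(buf != NULL);

    n = read(fd, buf, len);
    if (n == -1) {
        if (errcode != NULL)
            *errcode = errno;   /* report the cause; do not abort here */
        return EOF;
    }
    return n == 0 ? EOF : (int) n;
}

/* Caller-side check, mirroring the "errcode > 0" test in the diff:
 * EOF with a positive error code means the input ended abnormally. */
static int input_failed(int cnt, int errcode)
{
    return cnt == EOF && errcode > 0;
}
```

A caller would then decide what to do with the error: treat it as fatal,
or save it and carry on to the end-of-file handling.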

Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Thanks again for the feedback.

Arnold
----------------------------------------------------------------------------------
--- io.c.save 2008-12-25 08:59:30.000000000 +0200
+++ io.c 2009-02-03 22:29:31.000000000 +0200
@@ -353,7 +353,8 @@
fname = arg->stptr;
errno = 0;
curfile = iop_open(fname, binmode("r"), &mybuf, & isdir, FALSE);
- update_ERRNO();
+ if (! do_traditional)
+ update_ERRNO();

/* This is a kludge. */
unref(FILENAME_node->var_value);
@@ -442,17 +443,25 @@
char *begin;
register int cnt;
int retval = 0;
+ int errcode = 0;

if (at_eof(iop) && no_data_left(iop))
cnt = EOF;
else if ((iop->flag & IOP_CLOSED) != 0)
cnt = EOF;
else
- cnt = get_a_record(&begin, iop, NULL);
+ cnt = get_a_record(&begin, iop, & errcode);

if (cnt == EOF) {
cnt = 0;
retval = 1;
+ if (errcode > 0) {
+ if (do_traditional)
+ fatal(_("error reading input file `%s': %s"),
+ iop->name, strerror(errcode));
+ else
+ update_ERRNO_saved(errcode);
+ }
} else {
NR += 1;
FNR += 1;
@@ -959,10 +968,12 @@
lintwarn(_("close: `%.*s' is not an open file, pipe or co-process"),
(int) tmp->stlen, tmp->stptr);

- /* update ERRNO manually, using errno = ENOENT is a stretch. */
- cp = _("close of redirection that was never opened");
- unref(ERRNO_node->var_value);
- ERRNO_node->var_value = make_string(cp, strlen(cp));
+ if (! do_traditional) {
+ /* update ERRNO manually, using errno = ENOENT is a stretch. */
+ cp = _("close of redirection that was never opened");
+ unref(ERRNO_node->var_value);
+ ERRNO_node->var_value = make_string(cp, strlen(cp));
+ }

free_temp(tmp);
return tmp_number((AWKNUM) -1.0);
@@ -3037,13 +3048,10 @@
iop->flag |= IOP_AT_EOF;
return EOF;
} else if (iop->count == -1) {
- if (! do_traditional && errcode != NULL) {
+ iop->flag |= IOP_AT_EOF;
+ if (errcode != NULL)
*errcode = errno;
- iop->flag |= IOP_AT_EOF;
- return EOF;
- } else
- fatal(_("error reading input file `%s': %s"),
- iop->name, strerror(errno));
+ return EOF;
} else {
iop->dataend = iop->buf + iop->count;
iop->off = iop->buf;
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL