Re: Finding pattern more than once per record




"Jim Dornbos" <usenet@xxxxxxxx> wrote in message news:3cjub1hall8uh764a521hbqelrqirpna7s@xxxxxxxxxx
> I'm working on win2k and xp pro boxes. I've updated to gawk 3.1.3.
>
> We're doing variable data printing - where a customer supplies us with an image
> library of files in tif and eps format, and a database (in comma delimited
> format) that references those files (along with containing the rest of the
> details for the printed piece). I can't fix the front end - of making sure that
> they don't reference an image from the database until it's in the library, but I
> do need to fix the back end so we don't start printing their job while we are
> missing images.
>
> I created a file that contains the list of images in the library using the DOS
> command dir /b >>library.txt. (currently about 2500 files, but nothing says it
> won't be multiples of that next time around ). So library.txt has the file name
> and extension alone, 1 per line, no other details, in the form of : test13.eps.
>
> I've been trying to come up with an awk solution to extract image names from the
> comma delimited database and put them into another file with the record number.
> Currently there are several hundred fields and approx 100 of them may contain a
> filename. Current project has 8000 records, but again nothing to say that the
> next job won't have 50,000 records. (Actually - the current job is done, and I'm
> trying to devise a way to avoid problems for next time around.)
>
> My test input database looks like:
> "test13a.eps","test13b.tif","test13c.EpS","test13d.TIF"
> "test14a.eps","test14b.tif","test14c.EpS","test14d.TIF"
> "test15a.ePS","test15b.tif","test15c.EpS","test15d.TIF"
>
> My current awk program looks like:
> BEGIN { FS = ","
> OFS = ","
> }
>
> MAIN
> match( tolower($0), /[a-z0-9]+.eps/) { print NR " " substr($0, RSTART, RLENGTH)
> ;
> $0 = substr($0, (RSTART + RLENGTH)) }
>
> match( tolower($0), /[a-z0-9]+.tif/) { print NR " " substr($0, RSTART, RLENGTH)
> }
>
> My current output looks like:
> 13 test13a.eps
> 13 test13b.tif
> 14 test14a.eps
> 14 test14b.tif
> 15 test15a.ePS
> 15 test15b.tif
>
> The ideal output should be:
> 13 test13a.eps
> 13 test13b.tif
> 13 test13c.EpS
> 13 test13d.TIF
> 14 test14a.eps
> and so on.
>
> The $0 = substr($0, (RSTART + RLENGTH)) is my first attempt at having it rewrite
> $0 from the last successful match and carry on matching - but that's not
> working. What is the method for having awk continue looking thru the current
> record to find more occurrences of the pattern?
>
> Another issue that I need to resolve - I noticed a couple of the filenames have
> characters other than [a-z0-9] in them - specifically a "+" and "$". Is there a
> better way to specify the match pattern for the first part of the filename?
>
> Thanks for any help you can offer. Didn't mean to write a book here, but hate to
> waste folks time by not supplying the details you need to give a helpful answer.
>

All this seems far too much like hard work.
Being lazy, I'd split the job into two parts:
first, write _every_ field and its record number on its own line,
then, second, select which of these lines corresponds to a file.

The first job, then, will involve stepping through each _field_
(as you wanted, above) and might look something like this:

BEGIN { FS = "," }
{
    for (i = 1; i <= NF; i++)
        print NR, $i
}

Awk sets NF to the number of fields in the current record,
so the fields are numbered from 1 to NF (the first field is
$1, the last is $NF) and the for-loop visits each in turn.

So, if the input looked like:
"test13a.eps","rubbish","test13c.EpS","test13d.TIF"
"test14a.eps","nonsense","test$14c.EpS","test14d.TIF"
"test15a.ePS","drivel","test15c.EpS","test15d.TIF"


This first stage gives us output something like:
1 "test13a.eps"
1 "rubbish"
1 "test13c.EpS"
1 "test13d.TIF"
2 "test14a.eps"
2 "nonsense"
2 "test$14c.EpS"
2 "test14d.TIF"
3 "test15a.ePS"
3 "drivel"
3 "test15c.EpS"
3 "test15d.TIF"

And now we can feed that to a second awk program which
just picks out credible-looking filenames, which you say
end in either .tif or .eps where these suffixes may be
in upper-, lower-, or mixed-case. Something like:

BEGIN { IGNORECASE = 1 }
/\.(eps|tif)\"$/ { print }

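so that we print each line that ends ($) with . followed by
either tif or eps and then a closing " (the \ in front of the "
is harmless, though not strictly needed inside a regex). Note the
\ in front of the . as well: without it, . matches any character.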
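
IGNORECASE is a gawk extension, which is fine on gawk 3.1.3. If the
script ever has to run under a different awk, one possible sketch of
the same test uses tolower() instead. The command line here is in
POSIX shell quoting purely so it can be tried as is (on Windows the
awk part would more likely live in a file run with gawk -f), and the
sample lines are a few of the stage-one lines shown above:

```shell
# Stage two without IGNORECASE: match against a lower-cased copy of the
# line, but print the original so the case of the filename is preserved.
stage2=$(printf '%s\n' \
    '1 "test13a.eps"' \
    '1 "rubbish"' \
    '1 "test13c.EpS"' \
    '2 "test$14c.EpS"' |
  awk 'tolower($0) ~ /\.(eps|tif)"$/ { print }')
printf '%s\n' "$stage2"
```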

Now, in your version you check the first part of the name
as well, which must be letters, numbers, + or $. If you do
not need to check this, then don't. If you do, then:

BEGIN { IGNORECASE = 1 }
$2 ~ /^\"[$+[:alnum:]]+\.(eps|tif)\"$/ { print }

but this is harder to read so leave it out if not needed.

So we now have, after passing through both programs:
1 "test13a.eps"
1 "test13c.EpS"
1 "test13d.TIF"
2 "test14a.eps"
2 "test$14c.EpS"
2 "test14d.TIF"
3 "test15a.ePS"
3 "test15c.EpS"
3 "test15d.TIF"
and removing the quotes can be left to a third program.
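
If three separate programs feels like too many, the stages can
also be folded into a single pass. What follows is only a sketch,
assuming that no field contains an embedded comma, and using
tolower() rather than IGNORECASE so that it is not tied to gawk.
The variable name "name" is just for illustration, and the shell
wrapper is there only so the example can be run as it stands:

```shell
# One pass: split on commas, strip the quotes from a copy of each field,
# and print the record number plus the name when it looks like an image.
onepass=$(printf '%s\n' \
    '"test13a.eps","rubbish","test13c.EpS","test13d.TIF"' \
    '"test14a.eps","nonsense","test$14c.EpS","test14d.TIF"' |
  awk -F, '{
    for (i = 1; i <= NF; i++) {
      name = $i
      gsub(/"/, "", name)                  # remove the surrounding quotes
      if (tolower(name) ~ /\.(eps|tif)$/)  # keep image-looking fields only
        print NR, name
    }
  }')
printf '%s\n' "$onepass"
```

Working on a copy of the field (name) leaves $i and $0 untouched,
so the record is not rebuilt behind your back.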

Some important thoughts: how sure are you about the input
csv format? Csv is the invention of the devil and can be
hard to get right. That said, so far as I can tell, you
just need to be sure that no filename includes commas.
Windows does not guarantee that (a comma is a legal character
in a Windows filename), so it is worth checking the library.

But you may well have spaces in filenames, so using spaces
to separate fields (as above) may not be a good choice.
Better would be to stick to commas, so add OFS = "," to
the first program, and FS = "," to the second one.
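
As a rough sketch of that change, here are both programs joined by
a pipe, with a made-up filename containing a space ("my pic.eps")
to show that it survives. The second program tests $2 rather than
the whole line, and uses tolower() in place of IGNORECASE only so
that the example does not depend on gawk:

```shell
# Program one writes "record,field" lines; program two splits them on the
# comma again and keeps the lines whose second field looks like an image.
commasep=$(printf '%s\n' \
    '"my pic.eps","rubbish"' \
    '"nonsense","test14d.TIF"' |
  awk 'BEGIN { FS = ","; OFS = "," }
       { for (i = 1; i <= NF; i++) print NR, $i }' |
  awk 'BEGIN { FS = "," }
       tolower($2) ~ /\.(eps|tif)"$/ { print }')
printf '%s\n' "$commasep"
```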

--
John.




