Re: Finding pattern more than once per record
- From: "John L" <jl@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Mon, 27 Jun 2005 08:16:39 +0100
"Jim Dornbos" <usenet@xxxxxxxx> wrote in message news:3cjub1hall8uh764a521hbqelrqirpna7s@xxxxxxxxxx
> I'm working on win2k and xp pro boxes. I've updated to gawk 3.1.3.
>
> We're doing variable data printing - where a customer supplies us with an image
> library of files in tif and eps format, and a database (in comma delimited
> format) that references those files (along with containing the rest of the
> details for the printed piece). I can't fix the front end - of making sure that
> they don't reference an image from the database until it's in the library, but I
> do need to fix the back end so we don't start printing their job while we are
> missing images.
>
> I created a file that contains the list of images in the library using the DOS
> command dir /b >>library.txt. (currently about 2500 files, but nothing says it
> won't be multiples of that next time around ). So library.txt has the file name
> and extension alone, 1 per line, no other details, in the form of : test13.eps.
>
> I've been trying to come up with an awk solution to extract image names from the
> comma delimited database and put them into another file with the record number.
> Currently there are several hundred fields and approx 100 of them may contain a
> filename. Current project has 8000 records, but again nothing to say that the
> next job won't have 50,000 records. (Actually - the current job is done, and I'm
> trying to devise a way to avoid problems for next time around.)
>
> My test input database looks like:
> "test13a.eps","test13b.tif","test13c.EpS","test13d.TIF"
> "test14a.eps","test14b.tif","test14c.EpS","test14d.TIF"
> "test15a.ePS","test15b.tif","test15c.EpS","test15d.TIF"
>
> My current awk program looks like:
> BEGIN { FS = ","
> OFS = ","
> }
>
> MAIN
> match( tolower($0), /[a-z0-9]+.eps/) { print NR " " substr($0, RSTART, RLENGTH)
> ;
> $0 = substr($0, (RSTART + RLENGTH)) }
>
> match( tolower($0), /[a-z0-9]+.tif/) { print NR " " substr($0, RSTART, RLENGTH)
> }
>
> My current output looks like:
> 13 test13a.eps
> 13 test13b.tif
> 14 test14a.eps
> 14 test14b.tif
> 15 test15a.ePS
> 15 test15b.tif
>
> The ideal output should be:
> 13 test13a.eps
> 13 test13b.tif
> 13 test13c.EpS
> 13 test13d.TIF
> 14 test14a.eps
> and so on.
>
> The $0 = substr($0, (RSTART + RLENGTH)) is my first attempt at having it rewrite
> $0 from the last successful match and carry on matching - but that's not
> working. What is the method for having awk continue looking thru the current
> record to find more occurrences of the pattern?
>
> Another issue that I need to resolve - I noticed a couple of the filenames have
> characters other than [a-z0-9] in them - specifically a "+" and "$". Is there a
> better way to specify the match pattern for the first part of the filename?
>
> Thanks for any help you can offer. Didn't mean to write a book here, but hate to
> waste folks time by not supplying the details you need to give a helpful answer.
>
All this seems far too much like hard work.
Being lazy, I'd split the job into two parts:
first, write _every_ field and its record number on its own line,
then, second, select which of these lines corresponds to a file.
The first job, then, will involve stepping through each _field_
(as you wanted, above) and might look something like this:
BEGIN { FS = "," }
{
    for (i = 1; i <= NF; i++)
        print NR, $i
}
The for-loop steps through each field, since awk sets NF to
the number of fields in each record, so fields are numbered
from 1 to NF (the first field is $1, the last is $NF).
So, if the input looked like:
"test13a.eps","rubbish","test13c.EpS","test13d.TIF"
"test14a.eps","nonsense","test$14c.EpS","test14d.TIF"
"test15a.ePS","drivel","test15c.EpS","test15d.TIF"
This first stage gives us output something like:
1 "test13a.eps"
1 "rubbish"
1 "test13c.EpS"
1 "test13d.TIF"
2 "test14a.eps"
2 "nonsense"
2 "test$14c.EpS"
2 "test14d.TIF"
3 "test15a.ePS"
3 "drivel"
3 "test15c.EpS"
3 "test15d.TIF"
And now we can feed that to a second awk program which
just picks out credible-looking filenames, which you say
end in either .tif or .eps where these suffixes may be
in upper-, lower-, or mixed-case. Something like:
BEGIN { IGNORECASE = 1 }
/\.(eps|tif)\"$/ { print }
so that we print each line with . followed by either tif or eps
followed by " which must be at the end of the line ($). The \
before the " is harmless, but is not strictly needed inside /.../.
Now, in your version you check the first part of the name
as well, which must be letters, numbers, + or $. If you do
not need to check this, then don't. If you do, then:
BEGIN { IGNORECASE = 1 }
$2 ~ /^\"[$+[:alnum:]]*\.(eps|tif)\"$/ { print }
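Here $2 is the quoted name, the second space-separated field
of each line written by the first program. This is harder to
read, so leave it out if not needed.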
So we now have, after passing through both programs:
1 "test13a.eps"
1 "test13c.EpS"
1 "test13d.TIF"
2 "test14a.eps"
2 "test$14c.EpS"
2 "test14d.TIF"
3 "test15a.ePS"
3 "test15c.EpS"
3 "test15d.TIF"
and removing the quotes can be left to a third program.
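For what it's worth, that third program could be as small as
the sketch below. It assumes a POSIX-style shell, used here only
to feed in a few sample lines (the variable names stage2 and
result are made up for the illustration). On Windows you would
more likely put the awk part in a file and run it with awk -f.

```sh
# A few sample lines in the shape the second program prints:
# record number, space, quoted filename.
stage2='1 "test13a.eps"
2 "test$14c.EpS"
3 "test15d.TIF"'

# Third stage: delete every double quote on the line, then print it.
result=$(printf '%s\n' "$stage2" | awk '{ gsub(/"/, ""); print }')

printf '%s\n' "$result"
```

gsub removes every double quote on the line, not just the outer
pair. That should be safe here, since Windows does not permit a
double quote inside a filename.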
Some important thoughts: how sure are you about the input
csv format? CSV is the invention of the devil and can be
hard to get right. That said, so far as I can tell, you
just need to be sure that no filename includes a comma.
Windows does allow commas in filenames, so check for that.
But you may well have spaces in filenames, so using spaces
to separate fields (as above) may not be a good choice.
Better would be to stick to commas, so add OFS = "," to
the first program, and FS = "," to the second one.
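With those two changes the pair of programs might look like
the sketch below. Again it assumes a POSIX-style shell just to
supply sample input, and the names input, stage1 and result are
invented for the example. I have also swapped IGNORECASE for
tolower(), which is an assumption on my part that you might one
day run this on an awk other than gawk.

```sh
# Two sample records in the same shape as the test database.
input='"test13a.eps","rubbish","test13c.EpS"
"test14a.eps","nonsense","test$14c.EpS"'

# Stage one: one field per line, record number first,
# with a comma (OFS) between the number and the field.
stage1=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = ","; OFS = "," } { for (i = 1; i <= NF; i++) print NR, $i }')

# Stage two: split on commas and keep the lines whose second field
# ends in .eps or .tif followed by the closing quote, in any case.
result=$(printf '%s\n' "$stage1" |
  awk 'BEGIN { FS = "," } tolower($2) ~ /\.(eps|tif)"$/')

printf '%s\n' "$result"
```

A filename containing spaces now stays in one piece as $2,
which is the point of switching the separator.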
--
John.