Re: *really* shortest match in awk - possible?



Hello netlanders,

On Thu, 22 Nov 2007 23:50:19 +0000, Steffen Schuler wrote:

<snip>
On Wed, 21 Nov 2007 19:18:33 +0100, Tomasz Chmielewski wrote:

Yes, everything is possible in awk, but not for mere mortals, right?

I have an mbox file (see below for an example) with advertisements
attached to the end of every message. I would like to remove these ads.
Ads are placed between a line of ------ and a line of ______ characters.

I tried using something like:

awk '/^----/,/^____/{next}{print}'

but it eats up a bit too much, and one could argue whether it's really
the shortest match.

Here is an example mbox (note that the length of ------ and ______ can
vary):


From - Sun Sep 18 12:55:25 2005
(...)
Some text I want to keep
Some text I want to keep

-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development.....
________________________________________
this text should stay
email@address
https://address/should/stay


From - Sun Sep 18 12:58:18 2005
(...)
2 Some text I want to keep 2
2 Some text I want to keep 2

A cool diagram which needs to stay:
-------------------------------------
|This will be gone, too if I use    |
|awk '/^----/,/^____/{next}{print}' |
|because of "------" above          |
-------------------------------------

2 Some text I want to keep 2
2 Some text I want to keep 2


--------------------
SF.Net email is sponsored by:
some other
advertisement
_______________________________________________
this text should stay
email@address
https://address/should/stay


From - Thu Sep 22 16:00:04 2005
(...)
3 Some text I want to keep 3
3 Some text I want to keep 3

-----------------------------------
SF.Net email is sponsored by:
_________________________________________
this text should stay
email@address
https://address/should/stay
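
To see why the range pattern in the question removes too much, here is a
small sketch on a made-up sample (the sample text is my own, not taken from
the mbox above). An awk range pattern opens at the first line matching
/^----/ and stays open until a line matching /^____/, so the top border of
a diagram can open the range and everything up to the ad's closing line is
skipped.

```shell
#!/bin/sh
# Made-up sample: a boxed diagram that should stay, followed by an ad.
sample='keep 1
-----
|box|
-----
keep 2
-----
ad text
_____
keep 3'

# The range attempt from the question, unchanged.
range_out=$(printf '%s\n' "$sample" | awk '/^----/,/^____/{next}{print}')
printf '%s\n' "$range_out"
```

Only "keep 1" and "keep 3" survive here; the diagram and "keep 2" are lost
along with the ad.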

A working POSIX awk script, coded according to Janis's ideas:

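/^_+$/ { pr(a, e-1)               # ad ends: print every block but the last
         del(a)
         e = 0
         next }
/^-+$/ { ++e; j = 1 }             # possible ad start: open a new block
e      { a[e, j++] = $0; next }   # inside a block: buffer the line
1                                 # outside any block: print the line
END    { pr(a, e) }               # no closing line seen: print what is left

function pr(a, e,    i, j) {
    for (i = 1; i <= e; ++i) {
        for (j = 1; (i,j) in a; ++j)
            print a[i,j]
    }
}

function del(a,    k) {
    for (k in a)
        delete a[k]
}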
<snip>

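A shorter, simpler POSIX awk script that also saves space and time follows.
It buffers only the lines since the most recent ------ line and prints
everything else immediately: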

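/^-+$/      { pr(); del(); addl(); next }   # possible ad start: flush, buffer anew
/^_+$/ && i { del(); next }                 # ad ends: drop the buffered lines
!i          { print; next }                 # not buffering: print at once
            { addl() }                      # buffering: hold the line back
END         { pr() }                        # no closing line seen: print the rest

function addl() {
    a[++i] = $0
}

function pr() {
    for (i = 1; i in a; ++i)
        print a[i]
}

function del() {
    for (i in a)
        delete a[i]
    i = 0
}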
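
As a quick check, here is a minimal sketch that runs the shorter script on
a small made-up sample. The sample text and the strip_ads wrapper name are
my own assumptions, not from the thread. It assumes the separators sit on
lines of their own, as the /^-+$/ and /^_+$/ patterns require.

```shell
#!/bin/sh
# strip_ads: hypothetical wrapper name; the awk program is the shorter script.
strip_ads() {
    awk '
    /^-+$/      { pr(); del(); addl(); next }   # possible ad start: flush, buffer anew
    /^_+$/ && i { del(); next }                 # ad ends: drop the buffered lines
    !i          { print; next }                 # not buffering: print at once
                { addl() }                      # buffering: hold the line back
    END         { pr() }                        # no closing line seen: print the rest

    function addl() { a[++i] = $0 }
    function pr()   { for (i = 1; (i in a); ++i) print a[i] }
    function del()  { for (i in a) delete a[i]; i = 0 }
    '
}

# Made-up sample: a boxed diagram that must stay, then an ad that must go.
sample='keep 1
-----
|box|
-----
keep 2
---
ad text
___
keep 3'

result=$(printf '%s\n' "$sample" | strip_ads)
printf '%s\n' "$result"
```

On this sample the diagram and both of its borders are kept, and only the
lines from "---" through "___" disappear.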

Enjoy awk,

Steffen "goedel" Schuler