The macosxhints Forums


tas 09-26-2003 09:46 AM

sed question
 
I am trying to write a sed-script. Here's my problem: the file that I want to edit is all in uppercase letters. I want to convert them into lowercase UNLESS they are preceded by a *. What I've tried so far (successfully) is
Code:

y/ABC.../abc.../g
s/\*a/A/g
s/\*b/B/g

and so on, but certainly there is a more elegant way of doing this--I'm just too dumb to find it? sed is a wonderful tool--you can spend hours and hours playing around with it, avoiding serious work...

Thanks everybody!

chriscaldes 09-26-2003 09:59 AM

for the UPPER/LOWER conversion - you want to use tr probably

sh-2.05a$ echo HELLO | tr "[:upper:]" "[:lower:]"
hello
sh-2.05a$

Then you just have to figure out how to isolate the lines that are preceded by *.

Are these files you are trying to rename - or are they lines in a file?

so you may not need/want sed after all - but you could probably do it with sed....

Leo_de_Wit 09-26-2003 10:11 AM

Re: sed question
 
Hi tas!
You can count me in on the sed-fanatics! :cool:
I even coded that b*stard for an Atari ST...

What you want can be done in several ways.
A simple one:
Code:

/^\*/b
y/ABC.../abc.../

Note the first line will make the script skip any line starting with a star.
I also removed the g modifier, since the y command doesn't take one (it always translates every matching character on the line).
A sed one-liner (that's what you were looking for ;) )
Code:

/^\*/!y/ABC.../abc.../

This says: do the y-translation for all lines NOT star-ting with a star (pun intended :p ).

Leo

tas 09-26-2003 10:18 AM

Thanks guys--that was fast!! Unfortunately, the *s are not at the beginning of lines. It's like so:
AKJH *GHJGKJ KJHLK *OIUOIU
should become
akjh Ghjgkj kjhlk Oiuoiu
throughout the file. And I don't want lines starting with * skipped 'cause it's a multi-function script and all the other conversions must still be applied. Keep up the good work!

Leo_de_Wit 09-26-2003 10:26 AM

Oh, but you can skip to a position, by following the b(ranch) command with a label.
(Default label is end of script).
And the one-liner still holds!
If the star can be anywhere on the line, it's even easier: leave out the ^ (start of line).

Leo

tas 09-26-2003 11:09 AM

Again, thanks Leo, this is fun! Maybe I'm obtuse, but wouldn't your one-liner
Code:

/\*/!y/ABC.../abc.../
skip all LINES containing a *? Doesn't work here?
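
A quick way to check this is a sketch like the following. It assumes a POSIX sh and a sed that accepts the standard address!command form; the helper name lower_lines_without_star is made up for the illustration. An address such as /\*/ selects whole lines, so the negated y command is skipped for every line that contains a star anywhere:

```shell
lower_lines_without_star() {
    # Lowercase only those lines that contain no "*" at all.
    LC_ALL=C sed '/\*/!y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
}

printf 'AKJH GHJ\nAKJH *GHJ\n' | lower_lines_without_star
# prints:
# akjh ghj
# AKJH *GHJ
```

So the one-liner leaves any line with a star completely untouched, which is not what the sample input needs.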

jbc 09-26-2003 01:27 PM

Trying to get ready for work, so haven't tested this. Here's a modified example from "sed & awk".

Basically it changes everything to lowercase, continues on if the line doesn't contain "*" followed by a lowercase letter, and otherwise loops through the line, replacing each "*[a-z]" with its uppercase counterpart.
Code:

y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
:begin
/\*[a-z]/! b
h
s/.*\(\*[a-z]\).*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n\(.*\)\*[a-z]\(.*\)/\2\1\3/
t begin

You'll need to run this as a script file using sed's -f option.

Double-check it. I make mistakes when I'm trying to hurry!

Brad

tas 09-26-2003 03:49 PM

Thanks, jbc. That one's too complicated for me, and since I don't understand the syntax, I don't understand the error I get either: "sed: file command2 line 1: Extra characters after command." But if I understand, your script is changing everything to lowercase and then back again, so I might as well do it via 26 s/\*x/X/g commands?

jbc 09-26-2003 11:13 PM

Hmm. Got home from work and tried it; worked just fine for me. You need to save the text above as a text file, say "sedscript", then call it to process the file you're working with, say "Uppercase.txt". Typically you would redirect the output to another file.
Code:

sed -f /Users/YourUserName/Documents/sedscript \
/Users/YourUserName/Documents/Uppercase.txt > \
/Users/YourUserName/Documents/finaloutput.txt

Your 26 global replacements will work, of course. The script above may be slightly more efficient. You could use awk or perl as well, but sed is small and fun!

The script is not really as awful as it looks. Here's a commented version to hopefully clarify the syntax for you.
Code:

# Change the line to all lowercase to start.
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/

# Label for start of loop
:begin

# If line doesn't contain "*" followed by a lowercase letter,
# no change necessary, so branch to end of script.
/\*[a-z]/! b

# If line DOES contain "*" followed by a lowercase letter,
# need to change it, so place a copy of the line in sed's "hold" area
# (sort of like using a variable).
h

# Replace pattern space with the last "*[a-z]" occurrence in line
# (the leading .* is greedy, so it swallows any earlier occurrences).
# Entire pattern space is now "*lowercaseletter".
s/.*\(\*[a-z]\).*/\1/

# Uppercase the pattern space, so entire pattern space is now
# "*" followed by correct UPPERCASE letter
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

# Append hold space to pattern space, separated by newline (\n).
# Pattern space is now "*UPPERCASELETTER\n(original line)"
G

# Match uppercased version before \n, the part of the line before the last "*[a-z]",
# and the part of the line after the last "*[a-z]".
# Replace the entire line with the matched portions in the correct order.
# This basically replaces the last "*[a-z]" in the line with "*[A-Z]".
s/\(.*\)\n\(.*\)\*[a-z]\(.*\)/\2\1\3/

# Return to top of loop in case line contains more "*[a-z]" elements.
b begin

The original "t" I used at the end was unnecessary; "b" works just fine. The "hold" command in sed takes some getting used to, but makes some fairly elaborate transformations within each line possible.

Btw, I'm using GNU sed, which supports some things not found in the Apple-supplied sed. May be why you get an error when I don't. I highly recommend gsed; you can install it via fink, darwinports, or just compile it yourself.
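
Note that the commented script turns *g into *G but keeps the star. A minimal sketch of a variant that also drops the star, as in tas's sample line, might look like this. It is not the script as posted: the capture groups are moved so that only the letter is kept, and the wrapper name star_caps is hypothetical. It assumes a sed that lets \n in a regular expression match the newline added by G, as POSIX specifies:

```shell
star_caps() {
    LC_ALL=C sed '
# Lowercase the whole line first.
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
:begin
# No "*" followed by a lowercase letter left: branch to end of script.
/\*[a-z]/!b
# Save the line, then reduce the pattern space to the letter after the last star.
h
s/.*\*\([a-z]\).*/\1/
# Uppercase that single letter.
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
# Pattern space becomes "LETTER\n(saved line)".
G
# Put the uppercase letter where the last "*x" was, dropping the star.
s/\(.\)\n\(.*\)\*[a-z]\(.*\)/\2\1\3/
b begin
'
}

printf 'AKJH *GHJGKJ KJHLK *OIUOIU\n' | star_caps
# prints: akjh Ghjgkj kjhlk Oiuoiu
```

Each pass through the loop handles the last remaining "*x", so the stars are consumed from right to left until none are left.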

tas 09-27-2003 03:44 AM

jbc,
thanks so much--I feel very embarrassed that you put so much effort into this, and I learned a lot from your example and your explanations! I think I found out what the error message was: I had written and edited the script in BBEdit, which inserted some end-of-line characters that sed didn't like. After removing them in pico, the script now works--that is, it runs without producing errors (I had to insert an extra space in ": begin" though). However, it doesn't replace *a with A, it just gobbles all characters preceded by a *. I decided to go the inelegant route after all, inserting s/\*a/A/g for every letter. It's not a very long script, and I tried it: it takes only 2 or 3 seconds on files of ~300k, so I guess there's no significant gain of speed in writing it differently. Maybe, just for the heck of it, I'll try and see if I can do the same thing in awk and perl...
Anyway, thanks a lot, I really appreciate your help, and I enjoyed writing this script--it was my first encounter with sed, and if I don't count the fact that recounting those darn //\\\*./\ is a pain in the back, it really was a lot of fun.
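
The letter-by-letter route does not have to be typed out by hand. As a rough sketch (the function names build_star_script and star_caps_bruteforce are invented here, and a POSIX sh with printf, tr and sed is assumed), the 27-line script can be generated in a loop and handed to sed:

```shell
build_star_script() {
    # First command: lowercase everything.
    printf 'y/%s/%s/\n' ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
    # Then one substitution per letter: s/\*a/A/g, s/\*b/B/g, ...
    for u in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
        l=$(printf '%s' "$u" | LC_ALL=C tr 'A-Z' 'a-z')
        printf 's/\\*%s/%s/g\n' "$l" "$u"
    done
}

star_caps_bruteforce() {
    # Pass the generated commands to sed as its script argument.
    LC_ALL=C sed "$(build_star_script)"
}

printf 'AKJH *GHJGKJ KJHLK *OIUOIU\n' | star_caps_bruteforce
# prints: akjh Ghjgkj kjhlk Oiuoiu
```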

Mikey-San 09-27-2003 06:15 AM

I don't have much to add to this thread, but staring at it made my head explode.

"s/\//\?/ -- Regular expressions: Like watching your keyboard have a seizure." - Me

jbc 09-27-2003 12:45 PM

Quote:

Originally posted by Mikey-San
I don't have much to add to this thread, but staring at it made my head explode.
LOL..yep, that pretty much sums up reading sed scripts...

tas- no worries, just thought explaining the script a bit might help you if you're learning sed, because as Mikey-San points out, it's not easy reading. Although something is terribly amiss; the script should *not* be gobbling characters. Ah well, you have a working script, so all is fine.

Btw, if you're using a recent version of BBEdit on OS X, you might want to set "Default Line Breaks" to "Unix" in the "Text Files: Saving" pref panel and make sure "Translate Line Breaks" is turned on in the "Text Files: Opening" pref panel. This means that every file has return line endings while you edit it, but gets saved with linefeed line endings that sed and other Unix commands will like. Can be a real time-saver.
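
If a script has already been saved with classic Mac line breaks (carriage returns), it can also be repaired from the shell. A minimal sketch, assuming the file uses bare carriage returns rather than CR+LF pairs; cr_to_lf and the file names in the comment are only illustrative:

```shell
cr_to_lf() {
    # Translate every carriage return into a linefeed.
    tr '\r' '\n'
}

# Hypothetical usage on a saved script:
#   cr_to_lf < sedscript.mac > sedscript
printf 'y/ABC/abc/\rs/x/y/\r' | cr_to_lf
# prints:
# y/ABC/abc/
# s/x/y/
```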

pmccann 09-28-2003 04:27 AM

Phew: them be honkin' great substitutions. Does sed not allow character classes into the transliteration --err, "y"-- operator?

I guess what I'm getting at is that here's a perl implementation of what I think you're after:

Code:

perl -pe 'tr/A-Z/a-z/; s/\*([a-z])/uc $1/eg' filename

I just used someone's early description of the problem in the thread above; nothing fancy. What does the sed translation of this look like? (Well, obviously something like that given above, but does it really need to be so "extensive"?)

Cheers,
Paul

tas 09-28-2003 05:04 AM

Paul--
great you're here, I was still laughing from reading your great rant in the other post where you sed (sorry for the pun) perl was the Swiss chainsaw; wonderful! As luck would have it, I was just trying to rewrite my script in perl. This is my first encounter with perl, tried to pick it up a couple of times, but I prefer a hands-on approach, so reading all those manuals and tutorials did not do me any good. I tried to use s2p to convert my script into perl, but the result was a ludicrous output that translated my one-liner into 2 solid pages of code. So I decided to learn something about perl by writing the script from scratch. Just to get me started: I guess I've now understood what "open(FILEHANDLE,"filename");" does and what "while(<FILEHANDLE>){}" means. But how can I make a perlscript accept a filename from STDIN as an argument?
Oh, and yes, after what I tried, the "y/.../.../" thingy in sed does seem to demand complete sets, so you can't use [a-z] vel sim.

tas 09-28-2003 05:46 AM

That was one I could have answered myself with a bit of research: it's the @ARGV variable... Stay tuned, more stupid questions to follow!

tas 09-28-2003 10:08 AM

That was fun! After I figured out how to "open" a file for reading and writing (maybe someone out there is as dumb as I am), which is done via
Code:

while (<>) {

things weren't too difficult. Most lines could just be copied from the sed script without too much editing. The perl script looks more elegant: instead of having the chunky y/ABC.../abc.../, I now simply have
Code:

$_ = lc $_;

To sum it up: the difference between perl and sed is that in sed, you type a lot of (back)slashes and few semicolons, while in perl you type a lot of semicolons and few (back)slashes. It all depends on what kind of seizure you'd rather have... Thanks for everybody's help!!!

pmccann 09-28-2003 11:39 AM

Good to hear you're giving perl a go: feel free to ask loads of questions, even if you're going to answer most of them yourself!

Yep, "<>" is the "magic" filehandle, which allows you to read all the files given in the arguments to the script line by line. So something like

perl -e 'while (<>){print}' *.txt > bigtext

would be a dumb way to write the content of all files ending in .txt into one big file called "bigtext".

Alternatively you can just iterate through @ARGV yourself to do the same thing via something like

#!/usr/bin/perl -w
use strict;
open OUT,">bigtext" or die "cannot open outfile: $!";
foreach my $thing (@ARGV){
    open IN,"$thing" or die "no such file?: $!";
    while (<IN>){
        print OUT;
        # well, here's where you'd really do something interesting
    }
    close IN;
}

Indeed this is pretty much equivalent to the above one-liner, but --from memory-- lacks a little bit of the magic. Well, really just the fact that <> will fall back to STDIN if you neglect to include any files on the command line, while the second script will do exactly what you ask for: nothing.

Cheers,
Paul

jbc 09-28-2003 01:55 PM

Quote:

Originally posted by pmccann
Good to hear you're giving perl a go: feel free to ask loads of questions, even if you're going to answer most of them yourself!
Paul-

I'm also giving perl a go (only in chapter 3 of the llama book <sigh>), and since this seems to be becoming a sed-compared-to-perl thread, thought this might be an appropriate place to ask this.

I'm trying to write a script that will simply strip html MIME parts from an email message or an mbox (mainly to deal with certain gnarly multipart html spams). I've about got it worked out using sed (seemed a likely candidate for this) in a shell script, but I'm curious if perl would work for this, since I'm already running spamassassin (or will a script start up a new perl process?).

The script's requirements are pretty simple as I understand them, although I'm no MIME expert. Basically it needs to find each line that starts with "Content-Type: text/html" (case insensitive) and is preceded by a line that begins with "--", then delete the "Content-Type: text/html" line down to and including the next line that matches the "--.*" line that is above it. The script needs to accept an email message from standard input and write the modified message to standard output.

Sed lacks a way to store the first "--" line to use as a range-matching criterion for the end of the range; have to put it in a shell variable. I'm guessing perl has a more elegant way to deal with this.

And since you're the resident perl expert, I thought I would ask if this is something that is easy to accomplish in perl (at least if you're past chapter 3 of the llama book!).

Thanks for any input!
Brad

pmccann 09-28-2003 09:25 PM

Hmm, good task for a Monday morning. Try the following one liner:

perl -0 -pe 's/^(--[^\n]*)\nContent\-Type:\s*text\/html.*$1//msig'

It should be reasonably robust (within the criteria you specified above: were I doing this for myself I'd certainly wimp out and pull in a MIME parsing module rather than handrolling something. Still, if all you're interested in is obliterating the things and they all take that form then maybe a regexp based solution is enough.)

Explanation? The -0 flag causes the whole file to be sucked in as one big fat lump (into $_). Strictly speaking, a bare -0 sets the input record separator to the null character, so this holds for any message that contains no null bytes; -0777 is the explicit "slurp" switch. This shouldn't be a problem unless you're receiving html messages that are hundreds of megabytes in size. In that case you've got *real* problems! [[Anyone else noticed that if you keep typing in a text box in Safari and the text touches the bottom of the window that the scroll bar gets decorated in a pattern that looks something like a barcode? Weird!]] The "-p" flag causes each chunk of the file to be read in turn --in this case there's only one, because of the -0 -- and a "print" statement to be exuded each time. The flags at the end of the substitution are as follows: s to allow a "." to match a newline, i for case insensitive, g for global match (so that *any* text/html attachments should be wiped, not just the first) and m for multiline matching (so that "^" can match at the start of any line inside the lump, not just at the very beginning; it's also a reflex whenever I use g and s. "msg, Mmmmm".) Beyond that it shouldn't be too bad to understand: match a section of the message that begins with a line "--someotherstuff" and continues with a line containing the sought-for content type. Capture the first line into $1, and then continue the match down to where that same string reoccurs. Delete this whole piece.

Let me know if this is not working: I tried it on a couple of simple things, but haven't tested extensively. Should be enough to get you going however.

Cheers,
Paul

jbc 09-28-2003 11:52 PM

Thanks, Paul. Definitely gives me something to work with. Actually looks sort of familiar from playing with Privoxy's pcre rules.

Had to put your one liner in a shell script to get it to work as a standalone script. Is there a benefit to using a .pl file instead? Kept giving me syntax errors when I tried this.

So far the perl version seems much faster than my sed shell script, which was getting clunkier by the minute the more I tried to trap exceptions. Pleasant surprise!

I'm testing with a small mbox as stdin to check it thoroughly. At present the perl line deletes from the first match to the end of the file for some reason, rather than handling each match separately. Needs to be a non-greedy match for differing boundary lines in the file; I'll sort through that one. Other than that, I've only needed to make one small change to the substitution so that a closing mime boundary gets left in place:

perl -0 -pe 's/^(--[^\n]*)\nContent\-Type:\s*text\/html.*$1/$1\n/msig'

Thanks for the head start! Think perl is more adaptable to this than my sed script. Poked around on the net for a few days trying to find something to delete mime parts; the ones I found didn't work consistently, or were huge and did way more than I needed, and all the discussions I found about "rolling your own" gave no indication of how to go about it!

Thanks again!
Brad

mervTormel 09-29-2003 12:07 AM

Quote:

Originally posted by jbc
...Had to put your one liner in a shell script to get it to work as a standalone script. Is there a benefit to using a .pl file instead? Kept giving me syntax errors when I tried this...
of course, by now, you know, if we could see the syntax error(s) in context, we might be able to extrapolate and speculate your issues :D

i can never understand user reticence to actually post the actual error text at this point, which is the single most important factor in diagnosing the issue at this juncture.

jbc, can you humor us and illuminate us, psychologically?

jbc 09-29-2003 01:44 AM

Sorry, merv...long day. And it's just an "I'm a perl dummy" error pretty much. Since it's a one-liner, there's not much context.
Code:

Line 1:  String found where operator expected near \
"pe 's/^(--[^\n]*)\nContent\-Type:\s*text\/html.*$1/$1\n/msig'"
Line 1:  syntax error near \
"pe 's/^(--[^\n]*)\nContent\-Type:\s*text\/html.*$1/$1\n/msig'"

I think some of the parameters in Paul's example are command line switches for perl, so aren't being digested well. Just tried dropping the "perl" at the beginning of the line until I can get into my books to figure out how to specify these within the script correctly. Gives the same errors in both cases.

Maybe I should clarify that ultimately the script will be called from an MTA with the line "transport_filter \path\to\perlscript". An email message is sent to the script as stdin by the MTA, and it then replaces the original message with stdout from the script. So it has to be a standalone script of some sort, not a terminal command.

jbc 09-29-2003 02:38 AM

Dug out the llama; I'd forgotten about being able to specify option switches on the shebang line. Works fine now except for a "Can't emulate -e on #! line" error. Tracking that one down...

[edit: Uh, duh. It's getting late...deleted -e option....pl file is fine]

pmccann 09-29-2003 03:18 AM

Just kill the "-e" bit: you're right, that's just a command line flag that says "the stuff between the quotes is the *e*ntire script". Definitely not necessary for a script stored in a file.

In other news: you're right, more extensive (but not extensive enough!) testing shows that my purported solution needs some work. In particular I'd stupidly assumed that each MIME section had its own unique ID, rather than each *message* having such an ID. Oops. I'll have another go tonight and try to wrestle this thing to the ground.

Cheers,
Paul

jbc 09-29-2003 04:07 AM

Paul-

Think I found a perl finesse that was causing the problem...I'm not sure I understand it yet, but you probably will.

The original line you posted deleted everything from the first match to the end of the input. Looked ahead in the llama book and found "non-greedy" quantifiers, but ".*?" caused the script to delete only two lines for each match.

Then while reading about "memory parentheses", I noticed mention of "back references" ("\1") vs "memory variables" ("$1"). Tried changing the first occurrence of the memory variable to a back reference, and it works perfectly!
Code:

#!/opt/local/bin/perl -0 -p
s/^(--[^\n]*)\nContent\-Type:\s*text\/html.*\1/$1\n/msig

Truly impressive! You don't want to know how ugly the sed version was getting (don't want more exploding heads). One line of perl does everything and faster!

Need a few tweaks to be sure the script is working with whole lines, and it will be ideal for my needs. Thanks very much for getting me pointed in the right direction!

As far as mime parts, it seems to be the case that each part begins and ends with the same boundary identifier, although a multipart message may have different boundary identifiers for different parts. Basically the "part" consists of the starting boundary, the type/encoding/etc headers, the content, and the ending boundary, as near as I can tell. Sounds as if you had it right.

pmccann 09-29-2003 04:13 AM

Good timing: I'd been fooling around for about fifteen minutes trying to nail down the details, and was just about to post what I imagine (will check in a moment) is pretty much exactly what you've got: that is...

perl -p -0 -e 's/^(--[^\n]*)\nContent\-Type:\s*text\/html(.*?)(?=\1)//msig' filename

(((Whir, whir, whir...))) OK, very similar to yours: mine uses a little bit of fancy regexpness -- the "positive lookahead operator", which is the (?=...) piece of the above puzzle. That is, the substitution "peeks forward" and only matches when there's a copy of the MIME boundary hanging around on the end, but doesn't "consume" the boundary. Same end result as what you've done in substituting it back in, but perhaps a little more elegant. And almost certainly less efficient, but it's still pretty much instantaneous, so who cares?

Nice work, by the way, in hunting down the problems!

Cheers,
Paul

jbc 09-29-2003 04:33 AM

Ah, the "non-greedy" quantifiers need to be in parentheses to work here! Must've missed that somehow. Will definitely use them, since "non-greediness" is critical to not mangling the mail with this.

One more good pointer...thanks, Paul.

Brad

jbc 09-29-2003 05:03 AM

Paul, one final note...your last version was correct. My script failed in some cases where the boundary was shared between two parts that were to be removed, presumably because the boundary I put back in was not getting matched as the start of a new section.

The lookahead operator seems to solve this problem.

It's 2 AM...I'm off to bed.

