I searched for "multithreading in unix shell scripts..."


eValuone
08-20-2009, 07:07 AM
...and didn't get any hits.

I am on a Mac Pro (dual quad-core) with 12GB of RAM.

My Activity Monitor shows 16 bars.

I am running a Unix script (nested aliases) which processes an access log that is tens of thousands of lines long. Part of that process compares each line to another file with thousands of entries to compare against.

When this script runs, it uses only one processor in a gsed loop, and it takes a long time.

If I multithread that gsed loop, is that what is needed to spread the processing across all available processors?

I am assuming that will speed things up by a factor of approximately 8 to 16.

Thank you,

eValuone.

EatsWithFingers
08-20-2009, 07:19 AM
If you are able to split the processing task into 8 or 16 chunks manually, then you can run each one in the background. This should result in the processes running concurrently and being assigned to separate cores. You can run a shell command in the background by adding ' &' at the end (without quotes).
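
For example, a minimal sketch in sh syntax (process_chunk is a stand-in name for whatever filter you run on each piece):

split -l 20000 access.log chunk.      # cut the log into ~20000-line pieces: chunk.aa, chunk.ab, ...
for f in chunk.*
do
  ./process_chunk "$f" > "$f.out" &   # launch one background job per piece
done
wait                                  # pause until every background job has finished
cat chunk.*.out > combined.out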

eValuone
08-20-2009, 07:28 AM
Thanks!

eValuone.

SirDice
08-20-2009, 08:26 AM
I am running a Unix script (nested aliases) which processes an access log that is tens of thousands of lines long. Part of that process compares each line to another file with thousands of entries to compare against.
Don't use a shell script for that; use Perl. Perl will make mincemeat of those log files in a very short time. These days you can do a lot more with Perl, but it was originally written to do just that.

eValuone
08-20-2009, 03:01 PM
SirDice,

I am sure that you are correct!

Trouble is, I am likely not as proficient in Perl as you are.

I have used Perl to do some recursive stuff, and some low level bit-twiddling.

I am not a programmer!

But I have been able to get Unix to do some great things for me, and very quickly. What I like best is that it is reliable.

I am learning Perl autodidactically, over a period of years.

Here is what I did with Unix to solve my problem in an hour or so:

"Somebody's Access Log Processing Description

#
# Problem is that our access-log post-processing is taking too long due to our assiduous resolution of unresolved hits; the URLs we have resolved manually so far take us down to city names in about half of all unresolved-hit cases.
# Our resolution files are becoming huge: URLs are base 256 and four digits! A huge number of URLs, but fewer than 256^4 (approximately 4.3 billion).
#
# Date of document creation: 20Aug2009.
#
# Extract days for latest period for processing:
#
m2u < logfile | grep -v '[[12][0-9]/Feb' | grep -v '[[0123][0-9]/Mar' | grep -v '[[0123][0-9]/Apr' | grep -v '[[0123][0-9]/May' | grep -v '[[0123][0-9]/Jun' | grep -v '[[0123][0-9]/Jul' | grep -v '[[01][0-9]/Aug' | less > tmp.txt
#
# Split tmp.txt for multiprocessing/multithreading on a dual quad-core Mac Pro
#
wc -l tmp.txt
# [for example: 129870 lines]
#
# Divide by 8 [for example, rounded up: 16234]. Use the latter as the line count for split.
#
split -16234 tmp.txt
# Results of split are eight files: xaa...xah.
#
# Use al8threads command file. It is Mac txt, so we have to m2u <al8threads.txt >al8threads
#
# Content of al8threads command file looks like this:
# alxaa xaa &
# alxab xab &
# alxac xac &
# alxad xad &
# alxae xae &
# alxaf xaf &
# alxag xag &
# alxah xah &
#
# Each alias processes a logfile 'split' segment and generates a logxa?.xls file
#
# Old way takes about 10 minutes to run. This multiprocessing approach takes 1.5 minutes.
#
# To run al8threads, do this:
tcsh -s <al8threads
#
# Next we concatenate all multiprocess logxa?.xls files:
#
cat logxaa.xls logxab.xls logxac.xls logxad.xls logxae.xls logxaf.xls logxag.xls logxah.xls >log.xls
#
# Just one more step before we open the file as tmp.txt in OpenOffice, first as text and then as a spreadsheet.
#
# Ditch all hits which are of lesser concern. For example, we mark all search-engine crawler hits as ZZZ, and all Yahoo slurps are marked Yahoo at the start of their line. We end up with only about 10% of hits as meaningful to us:
#
m2u < log.xls | grep -v 'deBaron' | grep -v 'Prodigous' | grep -v 'ZZZ' | grep -v '[Ss]ines' | grep -v 'TPR' | grep -v '^[Yy]ahoo' | less > tmp.txt
#
# That's it! YKW - 20Aug2009.
#"

It isn't Perl elegance, but it works!

I will take a look at my Perl manuals and see if some'at is more obvious than my fumbling approach.

Thanks SirDice,

eValuone.

dmacks
08-20-2009, 03:48 PM
| grep -v '[[12][0-9]/Feb' | grep -v '[[0123][0-9]/Mar' | grep -v '[[0123][0-9]/Apr' | grep -v '[[0123][0-9]/May' | grep -v '[[0123][0-9]/Jun' | grep -v '[[0123][0-9]/Jul' | grep -v '[[01][0-9]/Aug'

Ouch! Can you combine those into a single grep (a regex listing the months as alternatives) instead of re-re-re-scanning all that data along the pipeline?

| less

Does piping everything through a pager accomplish anything functional? Every stage of a pipeline ties up a lot of buffering.
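
Something like this single pass might do it (a sketch only: it assumes the leading '[[' in your original patterns was just matching the literal '[' before the date, and it drops the pager, so check it against your real log format first):

m2u < logfile | grep -vE '([12][0-9]/Feb|[0-3][0-9]/(Mar|Apr|May|Jun|Jul)|[01][0-9]/Aug)' > tmp.txt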

eValuone
08-20-2009, 04:16 PM
What you see is something which evolved.

What worked for Feb seemed like a good idea to repeat for Mar, etc.

So, can I write a command-line script which will do a whole year, written once and reused all year? I'll look at regex, etc.

Tx,

eValuone.

SirDice
08-21-2009, 01:37 AM
It's not that hard to program in Perl; I'm not a programmer either :D

Perl has a grep function, and it can do regexps too. It should be relatively easy to convert your shell script to Perl.
Once that's done, I'm quite sure you'll get a speedup even greater than the 8-16x you're aiming for.
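
For instance, the final grep -v chain from your script could become one Perl pass (a sketch only; the patterns are just the ones you quoted above):

m2u < log.xls | perl -ne 'print unless /deBaron|Prodigous|ZZZ|[Ss]ines|TPR|^[Yy]ahoo/' > tmp.txt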

skolapper
09-03-2009, 06:55 AM
houghi <> writes:

> I would NEVER change the PATH in a script. You have no idea what
> people will do.

So what if you do? You only affect your script and the scripts it calls... unless it's sourced, or unless you exec a new shell.
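
A quick way to see the difference (a minimal sh sketch; pathtest.sh is just a throwaway name):

# pathtest.sh
PATH=/tmp/fakebin:$PATH
echo "inside: $PATH"

sh pathtest.sh       # child shell: the parent's PATH is untouched afterwards
. pathtest.sh        # sourced into the current shell: now the change sticks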