The macosxhints Forums


Hal Itosis 12-02-2006 11:26 PM

searching for a select group of control chars
 
I'm trying to write a shell script that finds files which
contain any so-called "control" characters... sometimes
also referred to as 'low ASCII'.

HOWEVER, I want to exclude tabs and newlines... obviously? ;)
And therein lies the rub. (I've written a script... but it's too long).

The script employs tr and "strings" of character ranges such as
\000-\010\013-\037\177 for the bad guys, and \011\012\040-\176
for the good guys. (Those are octal numbers btw). It also employs
grep to a great extent. Anyway... my script works okay, but that's
not the question.

I keep thinking there must be a huge shortcut I'm missing. Apparently,
there is a bracketed "character class" defined for regular expressions:

[[:cntrl:]]

And -- would that it were so -- I could simply go somewhere like:
grep -e [[:cntrl:]] $filePath (or grep -r -e [[:cntrl:]] $folderPath)

Unfortunately, that "[[:cntrl:]]" range includes tabs and newlines. So
-if I were to use it- I'd need some way to extract results, such that only
true offenders (files matching \000-\010\013-\037\177) remain. And,
this has been driving me crazy for the past few days.

Is there a way to do this in a couple of lines of code? Or, does it really
get trickier the more you think about it? (I'm talking bash here... since
perl just looks like Martian syntax to me).

:confused:

I'd like to think some use of the caret ^ in an expression to exclude
the tab/newline pair could be crafted. Or... perhaps a combination of
grep and tr (done more cleverly than I have managed) could reduce
the code -- by a hundred lines or so!

:)

Thoughts? (seems to me someone must have already invented this wheel).

TIA, -HI-

[There's also a bug? I've observed, where grep -e $'\000' matches practically
everything. So, it's no good for locating null chars... but that's another story].

hayne 12-03-2006 12:00 AM

Are you possibly searching for non-printable characters?
If so, you could do it very simply with:

grep '[^[:print:]]'

hayne 12-03-2006 12:12 AM

I'm not at all sure how to use octal numbers in grep regular expressions - my attempts seem to show that they aren't supported in grep (at least not in the version supplied with OS X)

But what you were trying to do is easy enough (and seems to work) in Perl:

perl -ne 'print if /[\000-\010]|[\013-\037]|\177/'

The above prints any line that contains a character in the range 000-010 or that contains a character in the range 013-037 or that contains the character 177
(all numbers in octal)

Hal Itosis 12-03-2006 02:46 AM

Thanks for the help hayne... I will check out grep '[^[:print:]]' momentarily.
:::
Okay, it's returning too much. It's including ANY file with tabs and/or newlines.
But that might be easy to tweak. (Tho, it does seem to pretty much put me
in the same boat as [[:cntrl:]] did.)
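
A quick way to see what is over-matching, as a rough sketch (the helper name count_cntrl_lines is made up for illustration): grep works line by line and never sees the newline that ends each line, so in practice it is the tab that trips [[:cntrl:]] and, by the same logic, [^[:print:]].

```shell
# count_cntrl_lines: count the lines on stdin that contain at least one
# character from the [[:cntrl:]] class.
count_cntrl_lines() {
    # grep -c prints the number of matching lines; "|| true" hides grep's
    # exit status 1, which only means "no line matched".
    LC_ALL=C grep -c '[[:cntrl:]]' || true
}

# printf 'one\ntwo\n' | count_cntrl_lines    # newlines only
# printf 'a\tb\n'     | count_cntrl_lines    # one tab
```

The first call should report 0 and the second 1, which suggests that only the tab needs special treatment when grep does the matching.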



Quote:

Originally Posted by hayne (Post 338898)
I'm not at all sure how to use octal numbers in grep regular expressions - my attempts seem to show that they aren't supported in grep (at least not in the version supplied with OS X)

As per the last line in my post... the "$" symbol seems needed here and
there for certain Unix statements. For example, I recently picked up that
declaring IFS in a script can be done like so:
Code:

IFS=$'\040'$'\011'$'\012'
# space # tab # newline #

Anyway, for grep the same seems true (but i don't think grep can "range"
like tr or your perl snippet). So we need to specify each and every char:
Code:

grep -e $'\001' -e $'\002' -e $'\003' -e $'\004' -e $'\005'
... and so on, and so on. (That's one benefit of writing a script I guess).


Quote:

Originally Posted by hayne (Post 338898)
But what you were trying to do is easy enough (and seems to work) in Perl:
Code:

perl -ne 'print if /[\000-\010]|[\013-\037]|\177/'
The above prints any line that contains a character in the range 000-010 or that contains a character in the range 013-037 or that contains the character 177 (all numbers in octal)

Okay. Tho I'm sure you know, we don't really want to have such items
"printed" on screen. ;) But yes, that part is the nucleus on which such
a script would be written. (It's how i used 'tr')

Does that line also do the searching part, like grep -r ? It just seems that
-- by the time I wrote the whole thing -- it had 2 grep functions (with that
long long list of all those -e $'\001', etc.), and it seemed like too much.

I'm going to work on tweaking the '[^[:print:]]' expression tho... thanks.

hayne 12-03-2006 07:38 AM

1 Attachment(s)
Quote:

Originally Posted by Hal Itosis (Post 338911)
As per the last line in my post... the "$" symbol seems needed here and
there for certain Unix statements. For example, I recently picked up that
declaring IFS in a script can be done like so:
Code:

IFS=$'\040'$'\011'$'\012'
# space # tab # newline #

Anyway, for grep the same seems true (but i don't think grep can "range"
like tr or your perl snippet).

grep can indeed handle ranges in its regular expressions - for example:
Code:

grep '[a-e]' foo
will show every line in the file "foo" that has a character in the range from 'a' to 'e'.
But grep's regular expressions don't seem to provide any way to handle octal (or hex) numbers.

Quote:

So we need to specify each and every char:
Code:

grep -e $'\001' -e $'\002' -e $'\003' -e $'\004' -e $'\005'
... and so on, and so on.
Note that the $'\xyz' thing is a feature of Bash - it is not a general Unix thing and it is not related to 'grep' or any other particular utility.
You are typing your command lines in Bash and hence they are interpreted by Bash and Bash decides what is to be passed to the other commands (like 'grep') that you are invoking.
So 'grep' receives the raw numerical values.

This is an "aside", but here's the source code for a C program that does nothing more than print out the command-line arguments that it receives:
Code:

// showArgs:
// This program displays the command-line arguments that it receives.

#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    for (i = 0; i < argc; i++)
    {
        printf("argv[%d] is \"%s\"\n", i, argv[i]);
    }
    return 0;
}

If you have Apple's developer tools installed, you can compile this program by saving the above source code in a file "showArgs.c" and then issuing the command 'make showArgs'. Otherwise I attach a zipped copy of the 'showArgs' executable below.

Look at what I get when I invoke 'showArgs' using the $'\xyz' construction in Bash:
Code:

% showArgs $'\141' $'\142' $'\143'
argv[0] is "showArgs"
argv[1] is "a"
argv[2] is "b"
argv[3] is "c"

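And look at what happens if I put double-quotes around the stuff so that Bash sends it as one command-line argument to 'showArgs' (inside double quotes, Bash does not interpret the $'...' construction, so it is passed along literally):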
Code:

% showArgs "$'\141' $'\142' $'\143'"
argv[0] is "showArgs"
argv[1] is "$'\141' $'\142' $'\143'"

The above program doesn't make any attempt to deal with control characters etc, so they will usually print out as blank:
Code:

% showArgs -e $'\001' -e $'\002' -e $'\003'
argv[0] is "showArgs"
argv[1] is "-e"
argv[2] is ""
argv[3] is "-e"
argv[4] is ""
argv[5] is "-e"
argv[6] is ""
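
For anyone without the developer tools, roughly the same experiment can be run with a small shell function instead of the C program. This is only a sketch: the name show_args is invented here, and argv[0] is hard-coded because a shell function has no program name of its own.

```shell
# show_args: print each argument it receives, one per line,
# in the same format as the C program above.
show_args() {
    printf 'argv[0] is "%s"\n' "show_args"   # hard-coded stand-in for the program name
    i=1
    for arg in "$@"; do
        printf 'argv[%d] is "%s"\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# In Bash, try:  show_args $'\141' $'\142' $'\143'
```

Called that way from Bash, it should report "a", "b" and "c" as argv[1] to argv[3], because Bash has already converted the octal escapes before the function sees them.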


Quote:

Does that line also do the searching part, like grep -r ?
No.
If you want your script to search through all files under a certain folder, you will need to implement that yourself with 'find'.
It might be better to implement your script to act on one file (or a list of files) and then to invoke your script via 'find' in the manner illustrated in the "iterating" section of this Unix FAQ

I.e. I am recommending that you have a "master" script that invokes several smaller scripts, each of which does one specific job.
Note that in general, the Unix philosophy is to have each tool do one specific job and then combine the tools to accomplish your final goal. Having a script that acts on a list of files is a more generally useful tool than one that embeds a search over folders.
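
As a rough illustration of that division of labour (not the FAQ's own code): let 'find' produce the file list and hand it in batches to a tool that only knows how to check files. The function name run_on_files and the sample command in the comment are placeholders.

```shell
# run_on_files DIR COMMAND [ARGS...]
# Hand every regular file under DIR to COMMAND, many files per invocation.
run_on_files() {
    dir=$1
    shift
    # "{} +" makes find append as many filenames as fit on one command line,
    # so names with spaces or newlines are passed through untouched.
    find "$dir" -type f -exec "$@" {} +
}

# e.g.  run_on_files "$HOME/Documents" grep -l 'some text'
```

The checking tool stays usable on its own with an explicit list of files, and the folder traversal lives in one place.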

Hal Itosis 12-03-2006 06:55 PM

Quote:

Originally Posted by hayne (Post 338927)
grep can indeed handle ranges in its regular expressions - for example:
Code:

grep '[a-e]' foo
will show every line in the file "foo" that has a character in the range from 'a' to 'e'.
But grep's regular expressions don't seem to provide any way to handle octal (or hex) numbers.

Yes... ranges it does (silly me, too much late-night programming).
And... I just had a fundamental breakthrough. Try this on for size:
Code:


grep -E [$'\001'-$'\010'$'\013'-$'\037'$'\177'] -a -r -l \
-m 1 --exclude=.DS_Store "$PWD" | sed '/\/Contents\//d'

Even works without specifying "-E". (Don't know how I missed it before).
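
A slightly more defensive variant of the same idea, offered as a sketch rather than a drop-in replacement: the bracket expression in the command above is unquoted, so the shell could in principle treat it as a filename glob. Here the pattern is built once with printf (which understands octal escapes), kept in quotes, and LC_ALL=C is set so that the ranges are compared as plain byte values. The names bad_chars and list_ctrl_files are made up, and NUL (\000) is left out because it cannot be passed in a command-line argument.

```shell
# Raw bytes \001-\010, \013-\037 and \177 (octal) inside a bracket expression.
bad_chars=$(printf '[\001-\010\013-\037\177]')

# list_ctrl_files FILE...: print the names of files containing any of those bytes.
list_ctrl_files() {
    # -a: treat binary files as text; -l: print only the names of matching files
    LC_ALL=C grep -a -l -e "$bad_chars" -- "$@"
}

# e.g.  list_ctrl_files notes.txt report.txt
```

Tabs and newlines are not in the set, so a file containing only those is not listed.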

2 problems remain:
• still can't get proper results with $'\000'. So if an otherwise "clean" file
has but a null char here and there, it will not get listed. I wonder why this
is the case. Maybe grep doesn't see null as a "char", but perhaps something
that represents 'no char'. (Naw, I still say it's a bug).

• if the -r option is omitted (to search just the cwd), then .dot files aren't
seen, and something like "$PWD"/* is needed just to see normal files.
So you are quite right about splitting tasks. A proper find command to feed
grep would be much better than relying on grep's default search behaviors.
Tasks that need delegating seem to be: find stuff, test stuff and display stuff.
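
On the first problem: this is probably not a grep bug. Command-line arguments are NUL-terminated C strings, so $'\000' reaches grep as an empty pattern, and an empty pattern matches every line. One possible workaround, sketched below on the assumption that tr and wc are acceptable helpers, is to count the NUL bytes directly. The name has_nul is hypothetical.

```shell
# has_nul FILE: succeed (exit status 0) if FILE contains at least one NUL byte.
has_nul() {
    # tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
    # $(( ... )) strips the padding some wc implementations add.
    nul_count=$(( $(LC_ALL=C tr -cd '\000' < "$1" | wc -c) ))
    [ "$nul_count" -gt 0 ]
}

# e.g.  if has_nul somefile; then echo "somefile contains NUL bytes"; fi
```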



Quote:

Originally Posted by hayne (Post 338927)
Note that the $'\xyz' thing is a feature of Bash - it is not a general Unix thing and it is not related to 'grep' or any other particular utility.
You are typing your command lines in Bash and hence they are interpreted by Bash and Bash decides what is to be passed to the other commands (like 'grep') that you are invoking.
So 'grep' receives the raw numerical values.

Okay thanks. This stuff is slowly starting to sink in.

From experimenting with scripts, I've noticed the shell can sometimes be a
friend and sometimes a pain. Takes time to "appreciate" all the subtleties.

;)

Thanks for all the feedback. I don't have Developer Tools installed, but I
did take several C and C++ courses 'once-upon-a-time'. I will save your
showArgs example for future study.

Cheers.

Hal Itosis 12-04-2006 08:01 PM

Quote:

Originally Posted by hayne (Post 338927)
But grep's regular expressions don't seem to provide any way to handle octal (or hex) numbers.

Note that the $'\xyz' thing is a feature of Bash - it is not a general Unix thing and it is not related to 'grep' or any other particular utility. You are typing your command lines in Bash and hence they are interpreted by Bash and Bash decides what is to be passed to the other commands (like 'grep') that you are invoking. So 'grep' receives the raw numerical values.

All right. Those seem to be MAJOR stumbling blocks here.

So... where are these character '[[:class:]]' items defined?
I mean physically. Are they hiding inside the grep binary,
or is there some library where they reside in plain text?
For example, I found this one which seems to be for perl:
/System/Library/Perl/5.8.6/unicore/lib/gc_sc/Cntrl.pl

AND... can we not define our own [[:class:]] ranges?
That would seem to be the only way out of this quandary.

BTW... with all the unbelievably complex operations that BRE/ERE
regexs are capable of, I just can't believe there isn't some simple
way of telling grep to search in files for any of 31 basic chars.

What am I missing here? ( [[:cntrl:]][^\\t\\n] won't work).

:mad:

hayne 12-04-2006 09:46 PM

Quote:

Originally Posted by Hal Itosis (Post 339323)
So... where are these character '[[:class:]]' items defined?
I mean physically. Are they hiding inside the grep binary,
or is there some library where they reside in plain text?

I looked at the source code for 'grep' and I see that the [:cntrl:] character class is defined via the C function 'iscntrl' (see 'man 3 iscntrl')

Quote:

AND... can we not define our own [[:class:]] ranges?
That would seem to be the only way out of this quandary.
I don't think there is any way to customize 'grep' like that.
The easiest way would be to use Perl - either to write the whole script, or to write some helper-scripts that would do the sort of regular-expression searching that you want. I showed you above how easy it is with Perl.

Quote:

What am I missing here? ( [[:cntrl:]][^\\t\\n] won't work).
That would match lines that had a control character immediately followed by something that was not a tab and not a newline. Not remotely what you want.

Hal Itosis 12-04-2006 10:32 PM

Quote:

The easiest way would be to use Perl - either to write the whole script, or
to write some helper-scripts that would do the sort of regular-expression
searching that you want. I showed you above how easy it is with Perl.
That still sounds like reinventing the wheel.
And perl isn't needed for comparing a few
ASCII chars. A bash script could do that.

But "ease" aside... is any of it actually necessary?

99.999% of the work is already done in grep.
All I want is to exclude tabs and newlines from
the original [[:cntrl:]] class. There must be a
trick that would take one or two lines of code.

You are suggesting I write a script with a 'while-read'
loop (or something) that scans every line until a match
is found, and then save that filename to a list? I guess
I could do that, but I'm looking for an even easier way.

;)

hayne 12-04-2006 11:26 PM

Quote:

Originally Posted by Hal Itosis (Post 339367)
That still sounds like reinventing the wheel.
And perl isn't needed for comparing a few
ASCII chars. A bash script could do that.

But "ease" aside... is any of it actually necessary?

99.999% of the work is already done in grep.
All I want is to exclude tabs and newlines from
the original [[:cntrl:]] class. There must be a
trick that would take one or two lines of code.

I don't think so. 'grep' is not that sophisticated a tool.
If you can't get 'grep' to recognize ranges of characters using octal numbers then you are likely going to have to enumerate each of the characters you want to match individually (either with "-e" or with a vertical bar (|) and using the "-E" option).

Quote:

You are suggesting I write a script with a 'while-read'
loop (or something) that scans every line until a match
is found, and then save that filename to a list?
I don't understand where you are getting that.
I showed you above (a few posts ago) how to write a one-line Perl script that does the same as the 'grep' command you were trying to do.
Just use that in place of the grep command.
Or elaborate on that Perl command to do other stuff too.

Hal Itosis 12-05-2006 01:00 AM

Quote:

Originally Posted by hayne (Post 339382)
Quote:


> You are suggesting I write a script with a 'while-read'
> loop (or something) that scans every line until a match
> is found, and then save that filename to a list?

I don't understand where you are getting that. I showed you above (a few posts ago) how to write a one-line Perl script that does the same as the 'grep' command you were trying to do. Just use that in place of the grep command.

Or elaborate on that Perl command to do other stuff too.

Well, most of the grep commands I posted were very abbreviated, to just
indicate the precise place where problems existed. Post #6 included the -l
(list) and -m1 (move on after first match) options... which more closely
resemble what I want the program to do: i.e., identify and list those files
which contain control chars [other than tab & newline].

Again: 'printing out' these control chars is neither practical nor pertinent...
the real work is scanning inside every file to determine which ones to list.

No doubt perl can handle that. Are we concluding here that grep cannot?
I mean "[[:cntrl:]][^\\t\\n]" was not right... but couldn't something **like**
that be created to filter results. (PS -- I have googled around, but 90% of the
hits are just man pages with almost no code examples or tutorials).
Hmm, here's an Apple page I just found: Character Classes and Groups

[Sorry if I sound snippy. My frustration here is not with you; you're great.]

:o

hayne 12-05-2006 03:49 AM

You already know how to do it with 'grep' but listing each individual control character separately. What you've been looking for is a more concise way to do the 'grep' - and this is what I'm not sure exists, given the apparent limitations of 'grep'.

It seems that you have been planning to use 'grep -l' to output the filenames of files that contain control characters. So all I was suggesting is that if you had some other command than 'grep' that would do that, you could use it instead (since you've been having trouble with 'grep').
And hence my suggestion is that you should write a helper script that would accomplish this operation: given a list of filenames, output the names of those files that contain a control character.

Here's a Perl script that reproduces the functionality of 'grep -l'. You could use this Perl script to accomplish the above task.
Code:

#!/usr/bin/perl

# grepl:
# This script reproduces the functionality of 'grep -l'
# The first command-line argument should be the pattern (regex) to look for.
# The remaining command-line arguments are the filenames to look in.
#
# This is useful since there are some patterns that are difficult with 'grep'.
# E.g.:  grepl '[\000-\010]|[\013-\037]|\177' file1 file2 file3
#        to search for files with control characters
# Cameron Hayne (macdev@hayne.net)  December 2006

use warnings;
use strict;

die "Usage: grepl pattern file1 [file2 ...]\n" unless scalar(@ARGV) >= 2;
my $pattern = shift @ARGV;

while (<>)
{
    if (/$pattern/o)
    {
        print "$ARGV\n";  # output name of current file
        seek(ARGV, 0, 2); # go on to next file
    }
}

Save this in a file named "grepl" and make that file executable.
Then you can use it like:
grepl '[\000-\010]|[\013-\037]|\177' file1 file2 file3

Hal Itosis 12-08-2006 01:44 AM

Quote:

Originally Posted by hayne (Post 339426)
#!/usr/bin/perl

One day I'll pick up a book on perl, so I can follow along. For now, I came
up with a workaround using stuff I'm more familiar with. (Once again, it's a
real downer that grep doesn't grok octal ranges. Or... that those character
class goobers can't be more easily massaged to fit the bill).

What I did was this: before piping files to grep -E [[:cntrl:]], I use "tr" to
convert all their tabs and newlines into spaces!!! Thus... any "[:cntrl:]"
items left are the ones I was looking for. (There is a noticeable 'speed-hit'
produced by this workaround, so I'm still a bit miffed about the whole thing).

:)

Here is a slightly stripped-down version of the final product... but it includes
a simple xtrace switch, for future development or study. (The function call
wasn't really necessary here... but it's nice to know that it can be rigged-up
this way, if/when we do need to).

Anyway, the code below does the whole thing. Just give it a folder to scan,
else it defaults to the cwd:
Code:


#!/bin/bash -
PATH='/bin:/usr/bin'
export PATH
IFS=$'\040'$'\011'$'\012'
REC='-maxdepth 1'
TAB='\011'
fileList=

while getopts rtx theOpt
do
        case $theOpt in
        r)      REC=            ;; # search within folder recursively
        t)      TAB=            ;; # test run with tabs marked as bad
        x)      set -x          ;; # xtrace for debugging / observing
        esac
done
shift $(( OPTIND-1 ))

function findCharsInFile ()
{
        cat "$1" | tr "${TAB}\012" "${TAB:+\040}\040" |
        grep -E "[[:cntrl:]]" -a -l -m 1 --label="$1"
}

if [ -d "$1" ] || [ $# -eq 0 ]
then
        folderPath="${1:-"${PWD}"}"
        fileList=$(find -x "${folderPath%/}" $REC -not -name '.DS_Store' \
                -not -path '*Contents/*' -type f -print0 | xargs -0 -n 1 |
                while IFS= read -r filePath
                do
                        findCharsInFile "$filePath"
                done)

        if [ "$fileList" ]; then printf '%s\n' "$fileList"; exit 0; fi
fi
exit 1

(c)EF.2006.DEC.08

-HI-

hayne 12-08-2006 03:30 AM

You don't need to understand Perl to be able to use the 'grepl' script that I supplied above. You should treat it like a newly-discovered command that does the same thing as 'grep -l' except that it works with Perl-compatible regular expressions (including octal number ranges).
Here's your script from post #13 with my 'grepl' script being used instead of the 'findCharsInFile' function:
Code:

#!/bin/bash -
PATH='/bin:/usr/bin'
export PATH
IFS=$'\040'$'\011'$'\012'
REC='-maxdepth 1'
TAB='\011'
fileList=

while getopts rtx theOpt
do
        case $theOpt in
        r)      REC=            ;; # search within folder recursively
        t)      TAB=            ;; # test run with tabs marked as bad
        x)      set -x          ;; # xtrace for debugging / observing
        esac
done
shift $(( OPTIND-1 ))

if [ -d "$1" ] || [ $# -eq 0 ]
then
        folderPath="${1:-"${PWD}"}"
        fileList=$(find -x "${folderPath%/}" $REC -not -name '.DS_Store' \
                -not -path '*Contents/*' -type f -print0 |
                xargs -0 ~/Dev/Perl/grepl '[\000-\010]|[\013-\037]|\177' )

        if [ "$fileList" ]; then printf '%s\n' "$fileList"; exit 0; fi
fi
exit 1

In my tests on a folder with 177 files in it, 6 of which contain one of the control characters being looked for, this version of the script is 10 times faster than yours. (0.5 s instead of 5.0 s)

But I suspect that most of the speedup comes from being able to treat multiple files at once (passing all the files to 'grepl' on the command-line) instead of one file at a time as when using your 'findCharsInFile' function.
When you use that function on one file at a time, you are starting up at least 3 processes for each file.
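
The difference is easy to see with xargs itself. As a toy illustration (echo stands in for the real checker, and the function names are invented): with "-n 1" the command is started once per filename, while without it the names are packed into as few invocations as possible.

```shell
# Each function reads NUL-separated names on stdin. echo prints one line per
# invocation, so the number of output lines equals the number of processes started.
one_process_per_name() {
    xargs -0 -n 1 echo
}

batched() {
    xargs -0 echo
}

# printf 'a\0b\0c\0' | one_process_per_name | wc -l    # three invocations
# printf 'a\0b\0c\0' | batched | wc -l                 # a single invocation
```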

Hal Itosis 12-08-2006 01:42 PM

Quote:

Originally Posted by hayne (Post 340449)
You don't need to understand Perl to be able to use the 'grepl' script that I supplied above.

I knew that.

Quote:

Originally Posted by hayne (Post 340449)
In my tests on a folder with 177 files in it, 6 of which contain one of the control characters being looked for, this version of the script is 10 times faster than yours.

I knew that too.

If I want to share my script with a friend (or if a stranger wants to copy it from here), all they need do is name it whatever they like, and put it wherever they will. Having to depend on another file that needs a special name and to be put in a special place adds complexity I wish to avoid.

And the speed issue... yes, even my post #6 method (the one that doesn't detect null chars) runs -- like yours -- ten times faster. I know, I know. Don't remind me.

;) :D

Speaking of speed (and/or efficiency): grep's "-m1" option will stop scanning a file at the first match, and move ahead directly on to the next file. I can't tell if your perl code will do likewise?



Quote:

Originally Posted by hayne (Post 340449)
I looked at the source code for 'grep' and I see that the [:cntrl:] character class is defined via the C function 'iscntrl'

Is there a link for that? I would like to see the source code too. {EDIT} oops nevermind... their ftp server finally let me through, so i think i can download it.{/EDIT}

--

Now for something _*new*_ that's confusing me...

On my Mac, there is a grep documentation folder with this file:
/Library/Documentation/Commands/grep/grep_4.html
(file is dated March 2005). And on that page is mentioned
a "-P" option:

grep -P --perl-regexp
Interpret the pattern as a Perl regular expression

Well... THAT sounds like what the doctor ordered!
Hey it's even on the man page and in the help text.
(Try pasting these commands at home):
$ man grep | grep -C1 '\-P'
$ grep --help | grep -C1 '\-P'

But, if i try anything like:
$ grep -P whatever
the command gets refused...
grep: The -P option is not supported

I don't understand, because...
$ grep --version
grep (GNU grep) 2.5.1

and, looking at this page:
http://directory.fsf.org/grep.html
suggests that v2.5.1 is the "latest"

Why is a -P (--perl-regexp) option mentioned in a 2005 file
put on my HD by Apple... and here it is nigh on to 2007 but
the -P option (in what seems to be the latest version of GNU
grep) doesn't function any better than this run-on sentence?

Guess I'm just having a bad month. :mad:

hayne 12-08-2006 02:49 PM

Quote:

Originally Posted by Hal Itosis (Post 340568)
If I want to share my script with a friend (or if a stranger wants to copy it from here), all they need do is name it whatever they like, and put it wherever they will. Having to depend on another file that needs a special name and to be put in a special place adds complexity I wish to avoid.

If that's the issue, then you could incorporate the Perl code into your script:
Code:

#!/bin/bash -
PATH='/bin:/usr/bin'
export PATH
IFS=$'\040'$'\011'$'\012'
REC='-maxdepth 1'
TAB='\011'
fileList=

grepl_perl_code=$(cat <<'EOT'
use warnings;
use strict;

die "Usage: grepl pattern file1 [file2 ...]\n" unless scalar(@ARGV) >= 2;
my $pattern = shift @ARGV;

while (<>)
{
    if (/$pattern/o)
    { 
        print "$ARGV\n";  # output name of current file
        seek(ARGV, 0, 2); # go on to next file
    }
}
EOT
)

while getopts rtx theOpt
do
        case $theOpt in
        r)      REC=            ;; # search within folder recursively
        t)      TAB=            ;; # test run with tabs marked as bad
        x)      set -x          ;; # xtrace for debugging / observing
        esac
done
shift $(( OPTIND-1 ))

if [ -d "$1" ] || [ $# -eq 0 ]
then
        folderPath="${1:-"${PWD}"}"
        fileList=$(find -x "${folderPath%/}" $REC -not -name '.DS_Store' \
                -not -path '*Contents/*' -type f -print0 |
                xargs -0 perl -e "$grepl_perl_code" '[\000-\010]|[\013-\037]|\177' )

        if [ "$fileList" ]; then printf '%s\n' "$fileList"; exit 0; fi
fi
exit 1

Quote:

Speaking of speed (and/or efficiency): grep's "-m1" option will stop scanning a file at the first match, and move ahead directly on to the next file. I can't tell if your perl code will do likewise?
Yes it does. That is the purpose of the line "seek(ARGV, 0, 2);"

By the way, you don't need the "-m 1" option to 'grep' when you are using the "-l" option since (as it says on the 'grep' man page) with the "-l" option the search will stop on the first match.

Quote:

Is there a link for that? I would like to see the source code too.
http://www.opensource.apple.com/darw...grep/src/dfa.c
(Search for ":cntrl:" and then look back to see where is_cntrl is defined)

hayne 12-08-2006 03:05 PM

Quote:

Originally Posted by Hal Itosis (Post 340568)
On my Mac, there is a grep documentation folder with this file:
/Library/Documentation/Commands/grep/grep_4.html
(file is dated March 2005). And on that page is mentioned
a "-P" option:

grep -P --perl-regexp
Interpret the pattern as a Perl regular expression

Well... THAT sounds like what the doctor ordered!
Hey it's even on the man page and in the help text.
(Try pasting these commands at home):
$ man grep | grep -C1 '\-P'
$ grep --help | grep -C1 '\-P'

But, if i try anything like:
$ grep -P whatever
the command gets refused...
grep: The -P option is not supported

I don't understand, because...
$ grep --version
grep (GNU grep) 2.5.1

and, looking at this page:
http://directory.fsf.org/grep.html
suggests that v2.5.1 is the "latest"

Why is a -P (--perl-regexp) option mentioned in a 2005 file
put on my HD by Apple... and here it is nigh on to 2007 but
the -P option (in what seems to be the latest version of GNU
grep) doesn't function any better than this run-on sentence?

The support for Perl-compatible regular expressions (via the "-P" option) is in the 'grep' source code, but it appears that the necessary software library for this is not present. The code in the file "search.c" (where that "not supported" message is coming from) has various sections conditionally compiled (ifdef HAVE_LIBPCRE) and so it would seem that this PCRE library did not exist at the time that 'grep' was compiled (this is detected by the usual Gnu 'configure' script) and thus the version of 'grep' that we have does not include support for Perl-compatible regular expressions.

It is a bug that the help page doesn't use 'ifdef HAVE_LIBPCRE' to conditionally include the description of the "-P" option.
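
A script that would like to use "-P" where it exists can probe for it at run time instead of trusting the help text. A minimal sketch (the function name grep_has_pcre is an assumption, not part of grep):

```shell
# grep_has_pcre: succeed if this grep accepts -P and can actually match with it.
grep_has_pcre() {
    # A grep built without the PCRE library prints an error and exits non-zero;
    # both the match output and the error message are discarded here.
    printf 'a\n' | grep -P 'a' >/dev/null 2>&1
}

# e.g.  if grep_has_pcre; then echo "grep -P is usable"; else echo "fall back to perl or tr"; fi
```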

Hal Itosis 12-08-2006 03:08 PM

Quote:

Originally Posted by hayne (Post 340595)
you could incorporate the Perl code into your script

Thank you!


Quote:

Originally Posted by hayne (Post 340595)
Yes it does. That is the purpose of the line "seek(ARGV, 0, 2);"

Thanks again!!


Quote:

Originally Posted by hayne (Post 340595)
the "-l" option the search will stop on the first match.

Three thanks!!!


Quote:

Originally Posted by hayne (Post 340595)
Search for ":cntrl:" and then look back to see where is_cntrl is defined

Yah, those classes I guess make things better when moving across 'locales'.

Just wish grep could do something **basic** like read byte-values. Strange,
I guess sed doesn't do it either? Getting right down to it... "they" should have
already invented the [:class:] I'm trying to construct here, which would have
obviated this entire thread!!!

:eek: :D

I suppose grep -P shall remain a mystery?
Ah ha... sneaky there with the split posts.
Excellent -- your help is much appreciated.

Hal Itosis 12-09-2006 09:52 PM

it's been a while (<>)
 
If the folder(s) I'm scanning contains any file(s)
whose name begins or ends with a space, then
perl -for some reason- sends out a warning...

Code:


: No such file or directory at -e line 10, <>

If I "use diagnostics;" perl further elaborates with...
Code:

        line 1449 (#1)
    (S inplace) The implicit opening of a file through use of the <>
    filehandle, either implicitly under the -n or -p command-line
    switches, or explicitly, failed for the indicated reason.  Usually this
    is because you don't have read permission for a file which you named on
    the command line.

When this occurs, sometimes the "control char test"
results are still correct... but, often they are not.

Needless to say, this ain't no "permissions" issue.
And thanks to the efforts of xargs -0, my function
based on grep gets it right every time. (Only files
whose name has newline chars present a problem).

Is this a known quirk of the "while (<>)" syntax?
Is there another way to phrase it, so perl won't
stumble on such basic bumps? Or am I wrong?

Well... something is wrong. And removing the
warnings and/or strict pragmas doesn't change
much: the messages *still* appear... and the
results are (sometimes) wrong.

Any ideas?

hayne 12-10-2006 03:48 AM

I'm not quite sure what is happening but it would seem to be caused by the space characters in your filenames (or is it newline characters in the filenames? - you refer to both these cases) when they are passed to the Perl script as arguments. I.e. I don't think it is the Perl script itself that is having problems but rather that it isn't receiving the filenames properly.

In your previous version of the script, you are passing each file one at a time to your Bash function, so this probably sidesteps the issue.

Usually the use of "-print0" and "-0" with xargs avoids issues with spaces in filenames, so I don't really understand why it is causing an issue here.
One thing you could do to troubleshoot this issue is to add some code to the beginning of the Perl script to print out the filenames to check that they got received properly.

E.g. you could add the following just before the 'while (<>)' statement:
Code:

print join('|', @ARGV), "\n";
That would print out all the filenames that were received, separated by vertical bars (|) but all on one line, so you could see if embedded spaces and newlines were received correctly.

It is possible that Perl's '<>' is not correctly handling the opening of files that have spaces or newlines in their names, but that seems unlikely.
If you find this to be the problem (i.e. if the filenames are correctly received as shown by the above print statement) then we could change the Perl script to open the files manually (without the '<>' magic) and sidestep the problem.

Hal Itosis 12-10-2006 08:06 PM

Quote:

Originally Posted by hayne (Post 340920)
I'm not quite sure what is happening but it would seem to be caused by the space characters in your filenames (or is it newline characters in the filenames? - you refer to both these cases) when they are passed to the Perl script as arguments. I.e. I don't think it is the Perl script itself that is having problems but rather that it isn't receiving the filenames properly.

Actually, I refer to your script in post #16... and it handles newlines in
filenames just fine. (It was my script I said had trouble with newlines).

But -if you add a space to the end of some filename- and then run your
script on its parent... perhaps you'll see for yourself what I'm saying.

I'll try the 'join' test and see what results. Thanks.
I wish Perl was easier to understand. Its syntax is so
"overloaded", but I guess that's also why it's powerful.

:
:

Test complete; filenames are being received correctly;
error messages print first. Is it a compile-time issue?
(No, I guess not... "while (<>)" is simply messing up).

Those perl man pages are amazing -yet- I have a very
hard time following along. How can this be fixed?

hayne 12-10-2006 11:38 PM

One obvious solution to the problem of filenames with spaces at beginning or end is to rename those files so as to avoid this nastiness.
You could do that with the script 'rename_without_endspace' that is available from my Perl script page.

If you don't want to rename your files to avoid whitespace at the ends of the filenames, you could change your Bash script to use the following revised version of my Perl code:
Code:

grepl_perl_code=$(cat <<'EOT'
use warnings;
use strict;
use Fcntl;

die "Usage: grepl pattern file1 [file2 ...]\n" unless scalar(@ARGV) >= 2;
my $pattern = shift @ARGV;

foreach my $filename (@ARGV)
{
    sysopen(FILE, $filename, O_RDONLY)
            or die "Unable to open file \"$filename\": $!\n";
    while (<FILE>)
    {
        if (/$pattern/o)
        { 
            print "$filename\n";  # output name of current file
            seek(FILE, 0, 2); # go on to next file
        }
    }
    close(FILE);
}
EOT
)

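This revised version of the Perl code makes explicit the looping over command-line arguments and opening of the files that was previously done by Perl's magic "<>" operator. By using 'sysopen', I sidestep any problems with strange characters (like leading or trailing spaces) in the filenames. (The "<>" operator opens each name with Perl's two-argument 'open', which strips leading and trailing whitespace from the filename and gives special meaning to characters such as "<", ">" and "|".)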

Hal Itosis 12-11-2006 12:49 AM

That dood it!
Thanks dude.
:)


All times are GMT -5.
