searching for a select group of control chars
I'm trying to write a shell script that finds files which
contain any so-called "control" characters... sometimes also referred to as 'low ASCII'. HOWEVER, I want to exclude tabs and newlines... obviously? ;) And therein lies the rub. (I've written a script... but it's too long). The script employs tr and "strings" of character ranges such as \000-\010\013-\037\177 for the bad guys, and \011\012\040-\176 for the good guys. (Those are octal numbers btw). It also employs grep to a great extent. Anyway... my script works okay, but that's not the question. I keep thinking there must be a huge shortcut I'm missing. Apparently, there is a bracketed "character class" defined for regular expressions: [[:cntrl:]] And -- would that it were so -- I could simply go somewhere like: grep -e [[:cntrl:]] $filePath (or grep -r -e [[:cntrl:]] $folderPath) Unfortunately, that "[[:cntrl:]]" range includes tabs and newlines. So -if I were to use it- I'd need some way to extract results, such that only true offenders (files matching \000-\010\013-\037\177) remain. And, this has been driving me crazy for the past few days. Is there a way to do this in a couple of lines of code? Or, does it really get trickier the more you think about it? (I'm talking bash here... since perl just looks like Martian syntax to me). :confused: I'd like to think some use of the caret ^ in an expression to exclude the tab/newline pair could be crafted. Or... perhaps a combination of grep and tr (done more cleverly than I have managed) could reduce the code -- by a hundred lines or so! :) Thoughts? (seems to me someone must have already invented this wheel). TIA, -HI- [There's also a bug? I've observed, where grep -e $'\000' matches practically everything. So, it's no good for locating null chars... but that's another story]. |
Are you possibly searching for non-printable characters?
If so, you could do it very simply with: grep '[^[:print:]]' |
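A quick demo of why '[^[:print:]]' over-shoots for this thread's purpose: tab is not in [[:print:]], so a file whose only oddity is a tab still gets flagged. (Throwaway /tmp paths, just for illustration.)

```shell
# Two scratch files: one with only a tab, one with a real offender (ESC, octal 033)
printf 'just a tab\there\n'       > /tmp/tabfile
printf 'has an esc \033 char\n'   > /tmp/bellfile

grep -l '[^[:print:]]' /tmp/tabfile /tmp/bellfile
# both filenames come back, even though tabfile has no "true offender"
```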
I'm not at all sure how to use octal numbers in grep regular expressions - my attempts seem to show that they aren't supported in grep (at least not in the version supplied with OS X)
But what you were trying to do is easy enough (and seems to work) in Perl: perl -ne 'print if /[\000-\010]|[\013-\037]|\177/' The above prints any line that contains a character in the range 000-010 or that contains a character in the range 013-037 or that contains the character 177 (all numbers in octal) |
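To see it in action (scratch files, with paths made up for the demo): a literal ESC (octal 033) falls in the second range, while tab and newline are deliberately outside all three:

```shell
printf 'clean line with a \ttab\n'        > /tmp/clean
printf 'dirty line with an \033 escape\n' > /tmp/dirty

perl -ne 'print if /[\000-\010]|[\013-\037]|\177/' /tmp/clean /tmp/dirty
# only the /tmp/dirty line is printed
```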
Thanks for the help hayne... I will check out grep '[^[:print:]]' momentarily.
::: Okay, it's returning too much. It's including ANY file with tabs and/or newlines. But that might be easy to tweak. (Tho, it does seem to pretty much put me in the same boat as [[:cntrl:]] did.) Quote:
there for certain Unix statements. For example, I recently picked up that declaring IFS in a script can be done like so: Code:
IFS=$'\040'$'\011'$'\012'
like tr or your perl snippet). So we need to specify each and every char: Code:
grep -e $'\001' -e $'\002' -e $'\003' -e $'\004' -e $'\005'
Quote:
"printed" on screen. ;) But yes, that part is the nucleus on which such a script would be written. (It's how i used 'tr') Does that line also do the searching part, like grep -r ? It just seems that -- by the time I wrote the whole thing -- it had 2 grep functions (with that long long list of all those -e $'\001', etc.), and it seemed like too much. I'm going to work on tweaking the '[^[:print:]]' expression tho... thanks. |
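One shortcut that may be worth a try (a sketch; not tested on every grep build, and locale-sensitive, hence the LC_ALL=C): let bash's $'...' quoting expand the octal escapes into raw bytes, so grep receives a bracket expression of literal control characters. (NUL still can't ride along, since bash arguments can't contain it.)

```shell
# bash expands the octal escapes; grep just sees raw bytes inside [...]
ctrl=$'\001-\010\013-\037\177'

printf 'bad \033 line\n' > /tmp/r_bad
printf 'tab\tonly\n'     > /tmp/r_ok

LC_ALL=C grep -l "[$ctrl]" /tmp/r_bad /tmp/r_ok
# lists only /tmp/r_bad
```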
Quote:
Code:
grep '[a-e]' foo
But grep's regular expressions don't seem to provide any way to handle octal (or hex) numbers. Quote:
You are typing your command lines in Bash and hence they are interpreted by Bash and Bash decides what is to be passed to the other commands (like 'grep') that you are invoking. So 'grep' receives the raw numerical values. This is an "aside", but here's the source code for a C program that does nothing more than print out the command-line arguments that it receives: Code:
// showArgs:
Look at what I get when I invoke 'showArgs' using the $'\xyz' construction in Bash: Code:
% showArgs $'\141' $'\142' $'\143'
Code:
% showArgs "$'\141' $'\142' $'\143'"
Code:
% showArgs -e $'\001' -e $'\002' -e $'\003'
Quote:
If you want your script to search through all files under a certain folder, you will need to implement that yourself with 'find'. It might be better to implement your script to act on one file (or a list of files) and then to invoke your script via 'find' in the manner illustrated in the "iterating" section of this Unix FAQ I.e. I am recommending that you have a "master" script that invokes several smaller scripts, each of which do one specific job. Note that in general, the Unix philosophy is to have each tool do one specific job and then combine the tools to accomplish your final goal. Having a script that acts on a list of files is a more generally useful tool than one that embeds a search over folders. |
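The find-driven layout being suggested can be wired up like this (a sketch; the per-file test here is a stand-in grep, where your own script would go):

```shell
# 'find' walks the folders (dot-files included); the -exec command tests each file.
# Swap the grep for whatever per-file checker you end up writing.
find "${folderPath:-.}" -type f -exec grep -l '[^[:print:]]' {} +
```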
Quote:
And... I just had a fundamental breakthrough. Try this on for size: Code:
2 problems remain:
• still can't get proper results with $'\000'. So if an otherwise "clean" file has but a null char here and there, it will not get listed. I wonder why this is the case. Maybe grep doesn't see null as a "char", but perhaps something that represents 'no char'. (Naw, I still say it's a bug).
• if the -r option is omitted (to search just the cwd), then .dot files aren't seen, and something like "$PWD"/* is needed just to see normal files.
So you are quite right about splitting tasks. A proper find command to feed grep would be much better than relying on grep's default search behaviors. Tasks that need delegating seem to be: find stuff, test stuff and display stuff. Quote:
From experimenting with scripts, I've noticed the shell can sometimes be a friend and sometimes a pain. Takes time to "appreciate" all the subtleties. ;) Thanks for all the feedback. I don't have Developer Tools installed, but I did take several C and C++ courses 'once-upon-a-time'. I will save your showArgs example for future study. Cheers. |
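On the $'\000' puzzle: this one is explainable without blaming grep. Bash command-line arguments are C strings, so a NUL byte terminates them; $'\000' therefore reaches grep as an empty pattern, and an empty pattern matches every line. A quick check:

```shell
# The NUL never reaches grep: bash arguments are C strings, so \000 truncates
s=$'\000'
echo "${#s}"    # prints 0 -- the string is empty

# grep therefore gets an EMPTY pattern, which matches any line at all
printf 'perfectly clean\n' | grep -c $'\000'    # prints 1
```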
Quote:
So... where are these character '[[:class:]]' items defined? I mean physically. Are they hiding inside the grep binary, or is there some library where they reside in plain text? For example, I found this one which seems to be for perl: /System/Library/Perl/5.8.6/unicore/lib/gc_sc/Cntrl.pl AND... can we not define our own [[:class:]] ranges? That would seem to be the only way out of this quandary. BTW... with all the unbelievably complex operations that BRE/ERE regexs are capable of, I just can't believe there isn't some simple way of telling grep to search in files for any of 31 basic chars. What am I missing here? ( [[:cntrl:]][^\\t\\n] won't work). :mad: |
Quote:
Quote:
The easiest way would be to use Perl - either to write the whole script, or to write some helper-scripts that would do the sort of regular-expression searching that you want. I showed you above how easy it is with Perl. Quote:
|
Quote:
And perl isn't needed for comparing a few ASCII chars. A bash script could do that. But "ease" aside... is any of it actually necessary? 99.999% of the work is already done in grep. All I want is to exclude tabs and newlines from the original [[:cntrl:]] class. There must be a trick that would take one or two lines of code. You are suggesting I write a script with a 'while-read' loop (or something) that scans every line until a match is found, and then save that filename to a list? I guess I could do that, but I'm looking for an even easier way. ;) |
Quote:
If you can't get 'grep' to recognize ranges of characters using octal numbers then you are likely going to have to enumerate each of the characters you want to match individually (either with "-e" or with a vertical bar (|) and using the "-E" option). Quote:
I showed you above (a few posts ago) how to write a one-line Perl script that does the same as the 'grep' command you were trying to do. Just use that in place of the grep command. Or elaborate on that Perl command to do other stuff too. |
Quote:
indicate the precise place where problems existed. Post #6 included the -l (list) and -m1 (move on after first match) options... which more closely resemble what I want the program to do: i.e., identify and list those files which contain control chars [other than tab & newline]. Again: 'printing out' these control chars is neither practical nor pertinent... the real work is scanning inside every file to determine which ones to list. No doubt perl can handle that. Are we concluding here that grep cannot? I mean "[[:cntrl:]][^\\t\\n]" was not right... but couldn't something **like** that be created to filter results. (PS -- I have googled around, but 90% of the hits are just man pages with almost no code examples or tutorials). Hmm, here's an Apple page I just found: Character Classes and Groups [Sorry if I sound snippy. My frustration here is not with you; you're great.] :o |
You already know how to do it with 'grep' by listing each individual control character separately. What you've been looking for is a more concise way to do the 'grep' - and this is what I'm not sure exists, given the apparent limitations of 'grep'.
It seems that you have been planning to use 'grep -l' to output the filenames of files that contain control characters. So all I was suggesting is that if you had some other command than 'grep' that would do that, you could use it instead (since you've been having trouble with 'grep'). And hence my suggestion is that you should write a helper script that would accomplish this operation: given a list of filenames, output the names of those files that contain a control character. Here's a Perl script that reproduces the functionality of 'grep -l'. You could use this Perl script to accomplish the above task. Code:
#!/usr/bin/perl
Then you can use it like:
grepl '[\000-\010]|[\013-\037]|\177' file1 file2 file3 |
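The body of the 'grepl' script didn't survive the page extraction. As a hedged sketch of the behavior described (print each matching filename once, then move on, like grep -l), the same thing fits on one line; 'close ARGV' is Perl's idiom for abandoning the current file inside a while(<>) loop. (file1/file2 are placeholders.)

```shell
# Sketch: a grep -l workalike using the thread's octal ranges.
# On the first hit, print the filename and skip the rest of that file.
perl -ne 'if (/[\000-\010]|[\013-\037]|\177/) { print "$ARGV\n"; close ARGV }' file1 file2
```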
Quote:
up with a workaround using stuff I'm more familiar with. (Once again, it's a real downer that grep doesn't grok octal ranges. Or... that those character class goobers can't be more easily massaged to fit the bill).

What I did was this: before piping files to grep -E [[:cntrl:]], I use "tr" to convert all their tabs and newlines into spaces!!! Thus... any "[:cntrl:]" items left are the ones I was looking for. (There is a noticeable 'speed-hit' produced by this workaround, so I'm still a bit miffed about the whole thing). :)

Here is a slightly stripped-down version of the final product... but it includes a simple xtrace switch, for future development or study. (The function call wasn't really necessary here... but it's nice to know that it can be rigged-up this way, if/when we do need to). Anyway, the code below does the whole thing. Just give it a folder to scan, else it defaults to the cwd: Code:
-HI- |
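The posted script itself was lost to the page extraction, but the tr preprocessing trick it describes can be sketched in a few lines (the folder argument and function name are assumptions, not the poster's originals):

```shell
#!/bin/bash
# Sketch of the workaround: blank out tab/newline, then let [[:cntrl:]] catch the rest.
hasOffender() {
  tr '\011\012' '  ' < "$1" | grep -q '[[:cntrl:]]'
}

for f in "${1:-.}"/*; do
  [ -f "$f" ] && hasOffender "$f" && printf '%s\n' "$f"
done
```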
You don't need to understand Perl to be able to use the 'grepl' script that I supplied above. You should treat it like a newly-discovered command that does the same thing as 'grep -l' except that it works with Perl-compatible regular expressions (including octal number ranges).
Here's your script from post #13 with my 'grepl' script being used instead of the 'findCharsInFile' function: Code:
#!/bin/bash -
But I suspect that most of the speedup comes from being able to treat multiple files at once (passing all the files to 'grepl' on the command-line) instead of one file at a time as when using your 'findCharsInFile' function. When you use that function on one file at a time, you are starting up at least 3 processes for each file. |
Quote:
Quote:
If I want to share my script with a friend (or if a stranger wants to copy it from here), all they need do is name it whatever they like, and put it wherever they will. Having to depend on another file that needs a special name and to be put in a special place adds complexity I wish to avoid. And the speed issue... yes, even my post #6 method (the one that doesn't detect null chars) runs -- like yours -- ten times faster. I know, I know. Don't remind me. ;) :D Speaking of speed (and/or efficiency): grep's "-m1" option will stop scanning a file at the first match, and move ahead directly on to the next file. I can't tell if your perl code will do likewise? Quote:
-- Now for something _*new*_ that's confusing me... On my Mac, there is a grep documentation folder with this file: /Library/Documentation/Commands/grep/grep_4.html (file is dated March 2005). And on that page is mentioned a "-P" option: grep -P --perl-regexp Interpret the pattern as a Perl regular expression

Well... THAT sounds like what the doctor ordered! Hey, it's even on the man page and in the help text. (Try pasting these commands at home): $ man grep | grep -C1 '\-P' $ grep --help | grep -C1 '\-P'

But, if I try anything like: $ grep -P whatever the command gets refused... grep: The -P option is not supported

I don't understand, because... $ grep --version grep (GNU grep) 2.5.1 and, looking at this page: http://directory.fsf.org/grep.html suggests that v2.5.1 is the "latest"

Why is a -P (--perl-regexp) option mentioned in a 2005 file put on my HD by Apple... and here it is nigh on to 2007 but the -P option (in what seems to be the latest version of GNU grep) doesn't function any better than this run-on sentence? Guess I'm just having a bad month. :mad: |
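Rather than trusting the man page, the build can be probed at runtime (a small sketch):

```shell
# Ask the binary itself whether -P was compiled in (PCRE support)
if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
  echo "grep -P works here"
else
  echo "grep -P not supported by this build"
fi
```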
Quote:
Code:
#!/bin/bash -
Quote:
By the way, you don't need the "-m 1" option to 'grep' when you are using the "-l" option since (as it says on the 'grep' man page) with the "-l" option the search will stop on the first match. Quote:
(Search for ":cntrl:" and then look back to see where is_cntrl is defined) |
Quote:
It is a bug that the help page doesn't use 'ifdef HAVE_LIBPCRE' to conditionally include the description of the "-P" option. |
Quote:
Quote:
Quote:
Quote:
Just wish grep could do something **basic** like read byte-values. Strange, I guess sed doesn't do it either? Getting right down to it... "they" should have already invented the [:class:] I'm trying to construct here, which would have obviated this entire thread!!! :eek: :D I suppose grep -P shall remain a mystery? Ah ha... sneaky there with the split posts. Excellent -- your help is much appreciated. |
it's been a while (<>)
If the folder(s) I'm scanning contains any file(s)
whose name begins or ends with a space, then perl -for some reason- sends out a warning... Code:
Code:
line 1449 (#1)
results are still correct... but, often they are not. Needless to say, this ain't no "permissions" issue. And thanks to the efforts of xargs -0, my function based on grep gets it right every time. (Only files whose name has newline chars present a problem). Is this a known quirk of the "while (<>)" syntax? Is there another way to phrase it, so perl won't stumble on such basic bumps? Or am I wrong? Well... something is wrong. And removing the warnings and/or strict pragmas doesn't change much: the messages *still* appear... and the results are (sometimes) wrong. Any ideas? |
I'm not quite sure what is happening but it would seem to be caused by the space characters in your filenames (or is it newline characters in the filenames? - you refer to both these cases) when they are passed to the Perl script as arguments. I.e. I don't think it is the Perl script itself that is having problems but rather that it isn't receiving the filenames properly.
In your previous version of the script, you are passing each file one at a time to your Bash function, so this probably sidesteps the issue. Usually the use of "-print0" and "-0" with xargs avoids issues with spaces in filenames, so I don't really understand why it is causing an issue here. One thing you could do to troubleshoot this issue is to add some code to the beginning of the Perl script to print out the filenames to check that they got received properly. E.g. you could add the following just before the 'while (<>)' statement: Code:
print join('|', @ARGV), "\n";
It is possible that Perl's '<>' is not correctly handling the opening of files that have spaces or newlines in their names, but that seems unlikely. If you find this to be the problem (i.e. if the filenames are correctly received as shown by the above print statement) then we could change the Perl script to open the files manually (without the '<>' magic) and sidestep the problem. |
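As a side check, the -print0/-0 handshake itself can be exercised in isolation (scratch names with deliberate spaces, invented for the demo):

```shell
mkdir -p '/tmp/spacey demo'
printf 'bad \033 byte\n' > '/tmp/spacey demo/ leading and trailing '

# NUL delimiters keep the awkward filename in one piece end to end
find '/tmp/spacey demo' -type f -print0 | xargs -0 grep -l '[^[:print:]]'
# prints the full name, leading and trailing spaces included
```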
Quote:
filenames just fine. (It was my script I said had trouble with newlines). But -if you add a space to the end of some filename- and then run your script on its parent... perhaps you'll see for yourself what I'm saying. I'll try the 'join' test and see what results. Thanks. I wish Perl was easier to understand. Its syntax is so "overloaded", but I guess that's also why it's powerful. : : Test complete; filenames are being received correctly; error messages print first. Is it a compile-time issue? (No, I guess not... "while (<>)" is simply messing up). Those perl man pages are amazing -yet- I have a very hard time following along. How can this be fixed? |
One obvious solution to the problem of filenames with spaces at beginning or end is to rename those files so as to avoid this nastiness.
You could do that with the script 'rename_without_endspace' that is available from my Perl script page. If you don't want to rename your files to avoid whitespace at the ends of the filenames, you could change your Bash script to use the following revised version of my Perl code: Code:
grepl_perl_code=$(cat <<'EOT' |
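For the rename route, a bare-bones pass over one directory might look like this (a sketch; this hypothetical loop is not the linked rename_without_endspace script):

```shell
# Trim leading/trailing spaces from filenames in the current directory.
for f in ./*; do
  base=${f##*/}
  trimmed=$(printf '%s' "$base" | sed 's/^ *//; s/ *$//')
  # skip if nothing to trim, the name would vanish, or the target already exists
  if [ -n "$trimmed" ] && [ "$base" != "$trimmed" ] && [ ! -e "./$trimmed" ]; then
    mv -- "$f" "./$trimmed"
  fi
done
```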
That dood it!
Thanks dude. :) |
Powered by vBulletin® Version 3.8.7
Copyright ©2000 - 2014, vBulletin Solutions, Inc.
Site design © IDG Consumer & SMB; individuals retain copyright of their postings
but consent to the possible use of their material in other areas of IDG Consumer & SMB.