View Full Version : TextWrangler - select entire line while finding


EchoIdent
07-02-2009, 06:48 PM
I have a .txt document with 1.5 million lines of text. Each line looks something like this:

"text" "text text text text text text text" "April 20, 2007" "text"

I need to find all the lines with dates in 2006, 2007 and 2008 and delete them, so that I'm left with only the ones from this year. When I run a find for ", 2007" it finds all the instances (20,000 of them), but I don't know how to delete the WHOLE line, as opposed to just the ", 2007" part. I tried using wildcards like the * and the ., but it doesn't seem to want to find the whole line. When I run:

*, 2007*

in MS Word, it would find the whole line, but in TextWrangler (and grep in general) it finds only one character to the left and one to the right.

Any suggestions? I'm more than open to using something other than TextWrangler if it'll help!

Any help is appreciated!

macosnoob
07-02-2009, 09:23 PM
If I'm understanding the question, here's a TextWrangler solution: Text > Process Lines Containing ... > [check] Delete matched lines. Putting "2007" in at the Process step will delete all lines containing "2007". Ditto for 2006, 2008, etc. When you've finished, you'll be left with a document containing only the lines that do not contain any of the years you've specified.

Or are you trying to extract the lines with "2007" and then do something else with them? Text > Process Lines Containing ... > [check] Copy to new document -- will pull out all the lines with "2007" and deposit them in a separate document for further processing.
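
For anyone following along without TextWrangler, roughly the same two operations can be sketched in the terminal with grep. This is only an illustration: the file names and the three sample lines are made up, and the scratch directory is there so nothing real gets touched.

```shell
# Work in a scratch directory (hypothetical; use your own paths).
workdir=$(mktemp -d)

# Made-up stand-in for the real document.
printf '%s\n' \
  '"a" "text" "April 20, 2007" "text"' \
  '"b" "text" "March 3, 2009" "text"' \
  '"c" "text" "July 9, 2007" "text"' > "$workdir/sample.txt"

# Like "Delete matched lines": -v keeps every line that does NOT contain 2007.
grep -v '2007' "$workdir/sample.txt" > "$workdir/without_2007.txt"

# Like "Copy to new document": pull the 2007 lines into their own file.
grep '2007' "$workdir/sample.txt" > "$workdir/only_2007.txt"
```

Either way the original file is left alone, which is worth having with 1.5 million lines.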

roncross@cox.net
07-02-2009, 09:57 PM
Open up the terminal for this one.

To delete the lines containing 2006, 2007, or 2008, for example:

sed '/200[678]/d' name_of_file_you_want_to_remove_years > new_file_with_years_removed

This should delete all lines that contain 2006, 2007, 2008.

You will want to redirect the results to a new file and that is what >new_file_with_years_removed does.
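
One thing to watch: /200[678]/ matches those digits anywhere on the line, not only in the date field. A slightly tighter variant, assuming every date is written like "April 20, 2007" with the year right before the closing quote, might look like the sketch below. The sample lines and file names are invented for the example.

```shell
# Scratch directory and made-up sample data (both hypothetical).
workdir=$(mktemp -d)
printf '%s\n' \
  '"a" "some text" "April 20, 2007" "x"' \
  '"b" "mentions 2007 in the text" "May 1, 2009" "y"' \
  '"c" "more text" "June 3, 2008" "z"' > "$workdir/sample.txt"

# Dry run: count the lines that would be deleted before changing anything.
to_delete=$(grep -c ', 200[678]"' "$workdir/sample.txt")
echo "lines that would be removed: $to_delete"

# Delete only lines where ", 2006", ", 2007" or ", 2008" is followed by a quote.
sed '/, 200[678]"/d' "$workdir/sample.txt" > "$workdir/sample_2009_only.txt"
```

Here the second sample line survives even though "2007" appears in its text, because its date field is from 2009.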

EchoIdent
07-02-2009, 11:15 PM
Thanks both of you!

I looked everywhere on how to do this. But I couldn't think of what to search and I never got anywhere even after like 2 days of google searches. Thank you very much, worked like a charm!

hayne
07-02-2009, 11:39 PM
When I run:

*, 2007*

in MS Word, it would find the whole line, but in TextWrangler (and grep in general) it finds only one character to the left and one to the right.

I'm not sure why it would find one character to the left and right, but what you have shown is not a correct regular expression for use with 'grep'.
The * in regular expressions does not mean "any character", it means "0 or more of the preceding character" - i.e. it is a quantifier, not a wildcard.
So "7*" means 0 or more 7's.
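
A quick way to see the difference in the terminal is sketched below. The sample line is made up, and -o (print only the matched part) is an option of GNU and BSD grep rather than something every grep is guaranteed to have.

```shell
# Made-up sample line in the same shape as the original file.
line='"text" "April 20, 2007" "text"'

# "7*" is "zero or more 7s", so this only matches ", 200" plus any 7s after it.
quantifier_match=$(printf '%s\n' "$line" | grep -o ', 2007*')

# "." is "any character", so ".*" on both sides stretches to the whole line.
whole_line_match=$(printf '%s\n' "$line" | grep -o '.*, 2007.*')

# "0*" may match no zeros at all: 27, 207 and 2007 all fit ^20*7$.
zero_or_more_count=$(printf '%s\n' 27 207 2007 | grep -c '^20*7$')

echo "quantifier only: $quantifier_match"
echo "whole line:      $whole_line_match"
echo "lines matching ^20*7\$: $zero_or_more_count"
```

The pair to remember is "." for "any character" and "*" for "repeat the previous thing", so ".*" together is the closest thing to the Word-style wildcard.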

If you want to learn regular expressions, google for something like:
regular expression tutorial

By the way, 'grep' = global regular expression print

EchoIdent
07-03-2009, 01:54 AM
Hmm... that makes sense. I read that when I was looking up how to do this and I tried

.*, 2007.*

Which should work since it's saying to find any number of any characters except line breaks... but it just doesn't work in this one file. I just tried again in a new file that I typed up quickly to test this and it works there, but it just won't work in the one with 1.5 million lines... strange. It's written in Farsi and encoded in UTF-8, not ANSI, maybe that could be why?

Hmm... anyway, thanks for all the help guys.