Old 01-14-2006, 02:17 AM   #1
technorev
Prospect
 
Join Date: Jan 2006
Posts: 4
I've been reading the posts about the ability to use Terminal and a downloadable script to eliminate duplicate loops. This is very attractive, as I have a PowerBook G4 and it's almost maxed out. I tried clicking the "download here" link, but a page comes up saying the site is suspended or some such. Does anyone have the script and, if so, would you be willing to share it? I'd love to reclaim my HD space.

Thanks
Technorev
Old 01-14-2006, 03:44 AM   #2
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
Please tell us what thread or article you are referring to.
__________________
hayne.net/macosx.html
Old 01-15-2006, 07:36 PM   #3
technorev
Prospect
 
Join Date: Jan 2006
Posts: 4
Here is the link

Thank you for asking. I'm new to this forum, but I think it wasn't a thread, but another part of this site. Please find here the address http://www.macosxhints.com/article.p...40126045617781.
I hope you can help me.

Thanks
Technorev

Last edited by hayne; 01-15-2006 at 11:07 PM. Reason: fixed link
Old 01-18-2006, 03:51 AM   #4
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
Quote:
Originally Posted by technorev
it wasn't a thread, but another part of this site. Please find here the address http://www.macosxhints.com/article.p...40126045617781.

I see the problem. That article (on the main macosxhints site) links to a script on an external site, but that external site seems to no longer be operating.
There is nothing we can do about that. Maybe someone who has saved a copy of the script will see this thread and respond.

[edit]
I have sent an email to the author of that article (Xeo) asking for an updated link for the script. Check back in a while to see if Xeo has added a comment to that article supplying a new link.
[/edit]
__________________
hayne.net/macosx.html

Last edited by hayne; 01-18-2006 at 03:59 AM.
Old 01-18-2006, 11:09 AM   #5
bedouin
All Star
 
Join Date: Aug 2004
Posts: 759
Actually, I just installed iLife '06 and the World Music Jam Pack and have a ton of duplicate loops I wish I could get rid of. Any ideas? The above hint isn't really relevant. Prior to installing iLife '06 I had iLife '04 installed, and also the four downloadable Jam Packs from .Mac.
Old 01-19-2006, 12:46 AM   #6
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
The author of the "hint" article referred to above has replied and that article has been updated to link to a copy of the script on the macosxhints server.
So give it a try.
But note (as I think was mentioned in the article) that the script just removes files by name - the ones that the author noticed were dupes - it doesn't search for duplicates or verify that the ones it is removing are duplicates.
So be sure to have a backup of your loops before running the script in case it removes something that you only have one copy of.
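Making that backup can be as simple as copying the loops folder before you run anything. Here is a minimal sketch; it uses a stand-in folder name purely for the demo, so on a real system you would point it at your actual loops folder (e.g. "$HOME/Library/Application Support/GarageBand"):

```shell
# Stand-in folder for the demo; on a real system set LOOPS to your
# actual loops folder instead.
LOOPS=loops-demo
mkdir -p "$LOOPS"
printf 'fake loop data' > "$LOOPS/beat.aif"

# cp -Rp copies the folder recursively, preserving timestamps/permissions.
# (On OS X, "ditto" also preserves resource forks and may be preferable.)
cp -Rp "$LOOPS" loops-backup

# Confirm the copy actually contains the file
SAVED=$(test -f loops-backup/beat.aif && echo yes)
echo "$SAVED"
```

If the script does delete a loop you only had one copy of, you can then restore it from the backup folder.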
__________________
hayne.net/macosx.html
Old 01-20-2006, 01:34 AM   #7
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
The script referred to in the above "hint" uses hard-coded lists of duplicate files. The original author (Xeo) somehow figured out which files were duplicates and made lists of them, which are then incorporated into the script.

But that means that it doesn't help you if you have other loops that are not among those cataloged by Xeo.

As a first step towards solving the general problem of removing duplicate loops, I wrote a script that will search for duplicates and list them automatically. I.e. it should be able to reproduce Xeo's lists.

I supply the script below.
To use it you would need to do the usual things for running a script - see this Unix FAQ.
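For example, the usual steps look something like this. The tiny script created here is only a hypothetical stand-in so the commands have something to act on; with the real downloaded file you would skip that step and just chmod and run it:

```shell
# Stand-in script (hypothetical) just to illustrate the steps;
# substitute the real "findDupeFiles" file you downloaded.
cat > findDupeFiles <<'EOF'
#!/bin/sh
echo "script ran with $# argument(s)"
EOF

# Mark the file executable, then run it from the current folder
chmod +x findDupeFiles
OUT=$(./findDupeFiles '.aif|.aiff' "$HOME/Documents")
echo "$OUT"
```

The `chmod +x` step is what the "bad interpreter"/"permission denied" class of errors usually traces back to (along with line endings, discussed further down the thread).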
If the script file is called "findDupeFiles" and it is in your current folder then you could run it on the two folders "/Documents/Apple Loops for Soundtrack" and "/Library/Application Support/GarageBand" as follows:
Code:
./findDupeFiles '.aif|.aiff' "/Documents/Apple Loops for Soundtrack" "/Library/Application Support/GarageBand"
The first argument ('.aif|.aiff') specifies that you want to look at files with either a ".aif" or ".aiff" suffix. You need to have this argument inside quotes since the vertical bar (|) is a special character for the shell.
You need to have the folder paths (the other two arguments) in quotes because the paths contain spaces.

Note that this command will typically take several minutes to finish and you won't see any output until just before the end.

The output (in the Terminal window) from the above command would list all the duplicates it found in those folders (and sub-folders). Each set of duplicates is separated from the next in the output by a line like this:
-----------------------

The script is actually very general, so it could be used to search for duplicates of any type of file - e.g. MP3 files.
What it does is compare the files based on the "MD5 digest", which is a sequence of characters computed from the content of the file. It is possible but extremely unlikely that two files with different content would have the same MD5 digest. The file comparison does not look at the file names at all - just the content of the files.
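You can see this content-only comparison for yourself in Terminal; here is a small sketch (the `.aif` filenames are just placeholders, and the digest command is `md5 -q` on OS X, `md5sum` on most other Unixes):

```shell
# Two files with the same content but different names, plus one that differs
printf 'hello loop\n' > a.aif
printf 'hello loop\n' > b.aif
printf 'something else\n' > c.aif

# Compute the digests (use "md5 -q" on OS X instead of md5sum/awk)
MD5_A=$(md5sum a.aif | awk '{print $1}')
MD5_B=$(md5sum b.aif | awk '{print $1}')
MD5_C=$(md5sum c.aif | awk '{print $1}')

# a.aif and b.aif get identical digests even though the names differ
test "$MD5_A" = "$MD5_B" && echo "a and b are duplicates"
```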

Code:
#!/usr/bin/perl
use strict;
use warnings;

# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first 
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line, 
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net)  January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------


use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

# The HFS+ filesystem used on OS X has resource forks as well as data forks
# By default this script checks the resource forks of files with duplicate data
# and issues a message if the resource forks are different.
# If you don't want to do this (e.g. on some other Unix system)
# then set the 'checkRsrc' variable to 0
my $checkRsrc = 1;  # whether to check the resource forks

my $matchSomeSuffix; # reference to a subroutine for matching suffixes
if (defined($ARGV[0]))
{
    # the list of desired suffixes is supplied in $ARGV[0]
    # separated by vertical bars - e.g. ".mp3|.aiff"
    # Note that if $ARGV[0] is '', then all files will be looked at
    
    my @suffixes = split(/\|/, $ARGV[0]);
    if (scalar(@suffixes) > 0)
    {
        # create an efficient matching subroutine using the Friedl technique
        my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);

        $matchSomeSuffix = eval "sub {$matchExpr}";
    }
    shift @ARGV;
}

# if no dirs supplied as command-line args, we search the current directory
my @searchDirs = @ARGV ? @ARGV : ".";

# verify that these are in fact directories
foreach my $dir (@searchDirs)
{
    die "\"$dir\" is not a directory\n" unless -d "$dir";
}

my %filesByDataLength; # global variable holding hash of arrays of fileInfo's

# calcMd5: returns the MD5 digest of the given file
sub calcMd5($)
{
    my ($filename) = @_;

    if (-d $filename)
    {
        # doing MD5 on a directory is not supported
        return "unsupported"; # we need to return something
    }

    # We use 'sysopen' instead of just 'open' in order to be able to handle
    # filenames with leading whitespace or leading "-"
    # The usual trick to protect against leading whitespace or "-" is to do
    # $filename =~ s#^(\s)#./$1#; open(FILE, "< $filename\0")
    # but that fails if the filename is something like "- foo"
    # (i.e. if there is an initial "-" followed by whitespace)

    sysopen(FILE, $filename, O_RDONLY)
         or die "Unable to open file \"$filename\": $!\n";
    binmode(FILE); # just in case we're on Windows!
    my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close(FILE);
    return $md5;
}

# hashByMd5: passed a ref to an array of fileInfo's
#            Returns a ref to a hash by md5 of the fileInfo's
sub hashByMd5($)
{
    my ($fileInfoListRef) = @_;

    my %filesByMd5;
    foreach my $fileInfo (@{$fileInfoListRef})
    {
        my $dirname = $fileInfo->{dirname};
        my $filename = $fileInfo->{filename};

        my $md5 = calcMd5("$dirname/$filename");
        push(@{$filesByMd5{$md5}}, $fileInfo);
    }
    
    return \%filesByMd5;
}

# checkFile: invoked from the 'find' routine on each file or directory in turn
sub checkFile()
{
    return unless -f $_; # only interested in files, not directories

    my $filename = $_;
    my $dirname = $File::Find::dir;

    return if $filename =~ /^\._/; # ignore files whose names start with "._"

    if (defined($matchSomeSuffix))
    {
        return unless &$matchSomeSuffix;
    }

    my $statInfo = stat($filename)
              or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
    my $size = $statInfo->size;

    my $fileInfo = {
        'dirname'  => $dirname,
        'filename' => $filename,
        };

    push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN:
{
    # traverse the directories and collate the files by data length
    # in the global variable %filesByDataLength
    find(\&checkFile, @searchDirs);
    
    my $numDupes = 0;
    my $numDupeBytes = 0;
    # process the files by size, starting with the largest
    foreach my $size (sort {$b<=>$a} keys %filesByDataLength)
    {
        my $numSameSize = scalar(@{$filesByDataLength{$size}});
        next unless $numSameSize > 1;

        #print "size: $size numSameSize: $numSameSize\n";
        my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
        my %filesByMd5 = %{$filesByMd5Ref};
        foreach my $md5 (keys %filesByMd5)
        {
            my @sameMd5List = @{$filesByMd5{$md5}};
            my $numSameMd5 = scalar(@sameMd5List);
            next unless $numSameMd5 > 1;
            
            # for each set of dupes, print the full path to the files
            my $rsrcMd5;
            foreach my $fileInfo (@sameMd5List)
            {
                my $dirname = $fileInfo->{dirname};
                my $filename = $fileInfo->{filename};
                my $filepath = "$dirname/$filename";
                print "$filepath\n";
                
                if ($checkRsrc)
                {
                    my $rsrcFilepath = "$filepath/..namedfork/rsrc";
                    if (!defined($rsrcMd5))
                    {
                        $rsrcMd5 = calcMd5($rsrcFilepath);
                    }
                    elsif ($rsrcMd5 ne calcMd5($rsrcFilepath))
                    {
                        print "Resource fork differs\n";
                    }
                }
            }
            print "----------\n";
            
            $numDupes += ($numSameMd5 - 1);
            $numDupeBytes += ($size * ($numSameMd5 - 1));
        }
    }
    
    my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
    print "Number of duplicate files: $numDupes\n";
    print "Megabytes duplicated: $numDupeMegabytes\n";
}
__________________
hayne.net/macosx.html

Last edited by hayne; 03-02-2006 at 08:22 PM. Reason: use sysopen instead of open in order to handle filenames that start with "- "; made faster by first collating by size; now reports if resource forks differ
Old 02-28-2006, 02:33 PM   #8
John Pilgrim
Registered User
 
Join Date: Feb 2006
Posts: 1
Issues solved re script execution

Thanks Cameron,
That solved it! I had copied and pasted the script into BBEdit, but neglected to set the line breaks to Unix (from the default traditional Macintosh) before chmod'ing and executing.
Thanks again!
John
PS: Sorry for the dupe emails...earthlink webmail was misbehaving.

>On 27-Feb-06, at 9:09 PM, John wrote:
>
>> I found your findDupeFiles perl script at http://
>> forums.macosxhints.com/showthread.php?p=264200
>> and had a problem with it in that I consistently get an error:
>>
>> john-pilgrims-powerbook:/Library/Audio/Apple Loops johnpilgrim$ ./
>> findDupeFiles.pl '.aif|.aiff' "/Apple Loops for GarageBand" "/Apple
>> Loops for Soundtrack Pro"
>> use: bad interpreter: No such file or directory
>> john-pilgrims-powerbook:/Library/Audio/Apple Loops johnpilgrim$ ./findDupeFiles.pl
>> '.aif|.aiff'
>> use: bad interpreter: No such file or directory
>
>> I'm not familiar enough with Perl to debug it myself. The comparison directories
>> exist, /usr/bin/perl exists, the script had been chmod'ed, but I don't know what
>> the "bad interpreter" error means but I consistently get it, no matter
>> if I specify the search directories or not.
>
>> Thanks in advance for your assistance,
>> John
>
>Hi John
>
>Please ask these sorts of questions on the forums - e.g. in the
>thread that you refer to. That way future readers can benefit from
>the answer.
>But I believe your problem is that the script file is not using the
>correct (Unix-style) line endings. This issue is discussed in the
>Unix FAQ that I referred to in that post.
>
>--
>Cameron Hayne
>macdev@hayne.net
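For anyone else who hits that "bad interpreter" error: a script saved with old-Mac (CR) line endings can be fixed from Terminal without re-saving it in an editor. A minimal sketch (the filenames here are made up for the demo):

```shell
# Simulate a script saved with old-Mac (CR) line endings -- running it
# directly fails because the kernel sees "/bin/sh\r" as the interpreter
printf '#!/bin/sh\recho ok\r' > broken.sh

# tr converts the CR characters to Unix LF line endings
tr '\r' '\n' < broken.sh > fixed.sh
chmod +x fixed.sh
RESULT=$(./fixed.sh)
echo "$RESULT"
```

In BBEdit the equivalent is choosing "Unix (LF)" line breaks in the save options, as John did above.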
Old 03-02-2006, 08:30 PM   #9
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
I modified the above script in response to comments made on the article about this script on the main macosxhints site (http://www.macosxhints.com/article.p...06030205235028)

- it now collates files by data-fork size and only uses MD5 to distinguish files that have the same data size. This makes it run about twice as fast as before in my tests on the AIFF loop folders

- it now reports "Resource fork differs" when the resource forks of duplicate data files differ

- it now handles unusual filenames better (e.g. "- foo" with an initial "-" followed by whitespace)
__________________
hayne.net/macosx.html
Old 05-16-2008, 03:14 PM   #10
a11en
Triple-A Player
 
Join Date: Nov 2005
Posts: 71
So, I'm confused... is the post linked above the corrected version of the script, or is the script in this thread the corrected version? [please note the required answer isn't "yes" or "no" unless the above sentence is parsed]

Also - I found that "~/Allen's\ WorkFiles/AFM/Data/" gets reported as not a directory. I'm not super great in the Unix world... but why wouldn't it be a directory? My thought is that either the ~ or the "\ " is crapping out the script?

Thanks!!
-Allen
Old 05-16-2008, 04:21 PM   #11
trevor
Moderator
 
Join Date: Jun 2003
Location: Boulder, CO USA
Posts: 19,853
Quote:
Also- I found that "~/Allen's\ WorkFiles/AFM/Data/" this is not a directory. I'm not super great in the unix world... but why wouldn't it be a directory? My thought is that either the ~ or the "\ " is crapping out the script?

Assuming that the path to the directory in question is / > Users > YourUsername > Allen's WorkFiles > AFM > Data, then the way to refer to that directory would be ~/Allen\'s\ WorkFiles/AFM/Data/

In other words, the ' character (the single quote or apostrophe character) needs to be escaped with a \ as well. Your version above does not escape the single quote, so your computer interprets it differently than you expect.

There's no problem with the \ and ~ characters that you do have, assuming my guess as to the path is correct.
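To see the difference concretely, here is a small sketch you can paste into Terminal (it creates a throwaway folder with the same awkward name just for the demo):

```shell
# A folder whose name contains an apostrophe and a space (made here for the demo)
mkdir -p "Allen's WorkFiles/AFM/Data"

# Unquoted, both special characters need a backslash:
test -d Allen\'s\ WorkFiles/AFM/Data && echo "escaped form works"

# Double quotes are often simpler -- they protect both characters.
# (Note: ~ is NOT expanded inside quotes, so write $HOME instead of ~.)
FOUND=$(test -d "Allen's WorkFiles/AFM/Data" && echo "quoted form works")
echo "$FOUND"
```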

Trevor

Last edited by trevor; 05-16-2008 at 04:24 PM.
Old 05-16-2008, 05:05 PM   #12
hayne
Site Admin
 
Join Date: Jan 2002
Location: Montreal
Posts: 32,473
Quote:
Originally Posted by a11en
So, I'm confused... is the post linked above the corrected version of the script, or is the script in this thread above the corrected verison of the script?

The script supplied in this forums thread is the better version.
Alternatively, you can get it from my web site:
http://hayne.net/MacDev/Perl/

Quote:
Also- I found that "~/Allen's\ WorkFiles/AFM/Data/" this is not a directory. I'm not super great in the unix world... but why wouldn't it be a directory? My thought is that either the ~ or the "\ " is crapping out the script?

Trevor has explained how to handle the quote and the space in that folder name - but an alternative (that would make things easier) would be just to rename that folder to have a more Unix-friendly name - e.g.: AllenWorkFiles
__________________
hayne.net/macosx.html
Old 05-16-2008, 06:00 PM   #13
a11en
Triple-A Player
 
Join Date: Nov 2005
Posts: 71
Thanks guys! I'll try this as soon as I get home. I agree about the more Unix-friendly directories... actually, the problem with the directory path was created by a SnagPath contextual-menu plugin. Guess they forgot to escape (\') all ' characters.

Thank you for your help!!
-Allen