Rdfind – redundant data find

Introduction

Rdfind is a program that finds duplicate files. It is useful for compressing backup directories or just finding duplicate files. It compares files based on their content, NOT on their file names.

When I want to change some file, I am often nervous about breaking something and therefore copy all the old files to some directory named app_2006xxxx or whatever. The same happens when I switch computer systems and am afraid of losing my old stuff. This makes all my files exist in numerous places, and I never feel like cleaning up. This is where rdfind comes in handy. It will find those files and report them to you. Optionally, it can erase them or replace them with links (hard or symbolic). Rdfind is a command line tool, so there is no GUI.

Install

Rdfind is written in C++ and should compile under any *nix. It currently runs under Mandriva, Fedora Core, Mac OS X and Windows (under Cygwin).

Note that rdfind is licensed under the GPL v2 and comes with no warranty. See the license for details.

You can install rdfind in several ways. There are precompiled packages for Debian and Ubuntu. If you are on a Mac, you have to build from source, which is quite easy, especially if you use MacPorts.

Getting the source code

The packages are signed with the keys below:

Newer key, ID 0xAB0234EB: rdfind0xAB0234EB.asc

Older (expired) key, ID 0x509CCB46: rdfind0x509CCB46.asc

Version       File                 Signature file           Sha1 checksum                             Comment
rdfind 1.2.4  rdfind-1.2.4.tar.gz  rdfind-1.2.4.tar.gz.asc  27fff523036ee4a5d69c5e646d27db75d8e3f1d1
rdfind 1.2.3  rdfind-1.2.3.tar.gz  rdfind-1.2.3.tar.gz.asc  de508d5d1c5aa438111ad8a7180126eac184ccc7
rdfind 1.2.2  rdfind-1.2.2.tar.gz  rdfind-1.2.2.tar.gz.asc  df929767112e740e5a0a2e528da681d2e9aad6ab


Note to self: export pkg=rdfind-1.2.4.tar.gz; sha1sum $pkg; gpg -u 0xAB0234EB --clearsign $pkg;
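A downloaded tarball can be checked against the Sha1 checksum column in the table above. The following is a minimal Python sketch of that check, not something shipped with rdfind: the function names sha1_of_file and matches_published_checksum are made up for illustration, and the sha1sum command line tool does the same job.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read piece by piece so large tarballs need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare a file's SHA-1 with a published value (case and spaces ignored)."""
    return sha1_of_file(path) == published_hex.strip().lower()
```

For example, matches_published_checksum("rdfind-1.2.4.tar.gz", "27fff523036ee4a5d69c5e646d27db75d8e3f1d1") should return True for an intact download of that release. A checksum only detects corruption; verifying the .asc signature with gpg is still needed to check who made the file.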

Installing via Debian package repository

Salvatore Ansani has kindly generated packages for Debian, for the amd64 and i386 architectures and for the stable, testing, unstable and experimental versions of Debian. Thanks!

Add the following to /etc/apt/sources.list:

# BINARY REPOSITORY #
deb http://ansani.it/debian/ X contrib
#
# SOURCES REPOSITORY #
deb-src http://ansani.it/debian/ X contrib

(substitute X with one of stable, testing, unstable, experimental depending on what you run)

Then run (as root):

apt-get update && apt-get install rdfind

and you should have it installed. Beware that this repository also contains other software (like k3b etc.), so be careful if you do not want to mess with your installed software. See http://ansani.it/my-debian-repository/ for what is in there.

Installing from a deb (for Debian and Ubuntu)

Download the .deb package matching your operating system and architecture from the table below. If you want to verify the signature, also download the signature file (.deb.asc) and verify it with gpg --verify *.deb.asc. You first have to import the key (links above).

Install the selected file with dpkg -i rdfind_1.2.3-1_*.deb

Version       Operating system               Architecture  File                             Signature
Rdfind 1.2.3  Debian lenny                   amd64         rdfind_1.2.3-1_amd64.deb         rdfind_1.2.3-1_amd64.deb.asc
Rdfind 1.2.3  Debian lenny                   i386          rdfind_1.2.3-1_i386.deb (1)      rdfind_1.2.3-1_i386.deb.asc
Rdfind 1.2.3  Debian lenny                   armel         rdfind_1.2.3-1_armel.deb         rdfind_1.2.3-1_armel.deb.asc
Rdfind 1.2.3  Ubuntu hardy heron (8.04)      i386          ubuntu_rdfind_1.2.3-1_i386.deb   ubuntu_rdfind_1.2.3-1_i386.deb.asc
Rdfind 1.2.3  Ubuntu hardy heron (8.04) (2)  amd64         ubuntu_rdfind_1.2.3-1_amd64.deb  ubuntu_rdfind_1.2.3-1_amd64.deb.asc

(1) Note: I failed to get this one to compile properly. It will be updated when I understand how to do it. I will try with:

setarch i386 dpkg-buildpackage -rfakeroot -ai386 -k0xAB0234EB

Please use Salvatore's repository instead if you need a deb for this architecture.

(2) Seems to work for 8.10 Intrepid Ibex as well.

(Note to self: packages are created with dpkg-buildpackage -rfakeroot -k0xAB0234EB -aarmel. This command won't work for me, so I compiled on an ARM host to get the armel package above. There seems to be a bug in dpkg-buildpackage for now, causing this behaviour.)

(note to self: packages signed with gpg -u 0xAB0234EB --clearsign *.deb)

Please note that I am new to creating deb packages, so there may be errors.

Installing with RPM

Here are precompiled packages, up to 1.2.2. As I do not use rpm anymore, these are a bit old.

rdfind-1.2.2-1.i386.rpm (compiled on Debian, with rpmbuild. Tested on Fedora Core 3)
rdfind-1.2.2-1.i586.rpm (compiled on Debian, with rpmbuild. Tested on Fedora Core 3)
rdfind-1.2.1-0.1.20060mdk.i586.rpm (compiled on Mandriva 2006. Tested on a Fedora Core 3 and Mandriva 2006)
rdfind-1.2.2-1.src.rpm

Packages are signed with this key

Import the key (as root) with rpm --import rdfind-distribution-key.gpg

Verify the package with rpm -K rdfind-1.2.1-0.1.20060mdk.i586.rpm

Installing from source (generic)

Installing from source requires the nettle library. Note that nettle is available as a package in Ubuntu, Debian and Mac OS X (via MacPorts), so it might be easier to install it that way.

Here is how to get and install nettle if you do not use one of the methods above:

wget ftp://ftp.lysator.liu.se/pub/security/lsh/nettle-1.14.tar.gz -nc
wget ftp://ftp.lysator.liu.se/pub/security/lsh/nettle-1.14.tar.gz.asc -nc
wget ftp://ftp.lysator.liu.se/pub/security/lsh/distribution-key.gpg -nc
gpg --fast-import distribution-key.gpg                   # omit if you do not want to verify
gpg --verify nettle-1.14.tar.gz.asc nettle-1.14.tar.gz   # omit if you do not want to verify
tar -xzvf nettle-1.14.tar.gz
cd nettle-1.14
./configure
make
su # Only if you have root privileges. See note below.
make install
exit

If you install nettle as non-root, you must create a link so that rdfind later can do #include "nettle/nettle_header_files.h" correctly. For instance, run the following in the directory that contains nettle-1.14:

ln -s nettle-1.14 nettle

The next step is to build rdfind. (If you are on Mac OS X and use MacPorts, you can use this portfile and build using MacPorts instead of doing it manually. Hopefully I can get the portfile into MacPorts some day. See the instructions on the MacPorts documentation site on how to use a local portfile.)

Download the source code using one of the links in the table above under "Getting the source code".

Build rdfind:

[pauls@localhost tmp]$ bunzip2 rdfind-1.2.1.tar.bz2
[pauls@localhost tmp]$ tar xf rdfind-1.2.1.tar
[pauls@localhost tmp]$ cd rdfind-1.2.1/
[pauls@localhost rdfind-1.2.1]$ ./configure    # see note below
[pauls@localhost rdfind-1.2.1]$ make
[pauls@localhost rdfind-1.2.1]$ su             # if you have root privileges
[pauls@localhost rdfind-1.2.1]$ make install

Note that if nettle is not installed in a standard place, you might need to pass LDFLAGS=-L../path/to/nettle/library CPPFLAGS=-I../path/to/nettle_headerfiles to configure.

Usage

The syntax is

rdfind [options] directory_or_file_1 [directory_or_file_2] [directory_or_file_3]

Without options, a results file will be created in the current directory. For full options, see the man pages in the table below.

Version  Man page
1.2.3    man
1.2.2    man



Examples

Basic example, taken from a *nix environment:
Look for duplicate files in directory /home/pauls/bilder:

[pauls@localhost ~]$ rdfind /home/pauls/bilder/
Now scanning "/home/pauls/bilder", found 3301 files.
Now have 3301 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size...removed 3 files
Total size is 2861229059 bytes or 3 Gib
Now sorting on size:removed 3176 files due to unique sizes.122 files left.
Now eliminating candidates based on first bytes:removed 8 files.114 files left.
Now eliminating candidates based on last bytes:removed 12 files.102 files left.
Now eliminating candidates based on md5 checksum:removed 2 files.100 files left.
It seems like you have 100 files that are not unique
Totally, 24 Mib can be reduced.
Now making results file results.txt
[pauls@localhost ~]$                  

The output shows that there are 100 files that are not unique. Let us examine them by looking at the newly created results.txt:

[pauls@localhost ~]$ cat results.txt
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURENCE 960 3 4872 2056 5948858 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg.gtmp.jpg
DUPTYPE_WITHIN_SAME_TREE -960 3 4872 2056 5932098 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg
.
(intermediate rows removed)
.
DUPTYPE_FIRST_OCCURENCE 1042 2 7904558 2056 6209685 1 /home/pauls/bilder/digitalkamera/skridskotur040103/skridskotur040103 014.avi
DUPTYPE_WITHIN_SAME_TREE -1042 3 7904558 2056 327923 1 /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/skridskotur040103 014.avi
# end of file

Consider the last two rows. They say that the file skridskotur040103 014.avi exists both in /home/pauls/bilder/digitalkamera/skridskotur040103/ and /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/. I can now remove the one I consider a duplicate by hand if I want to.
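For longer results files, reading the rows by eye gets tedious. Here is a small Python sketch that groups the rows into sets of duplicates. It assumes the column layout shown in the header comment of the example above (duptype id depth size device inode priority name) and that each DUPTYPE_FIRST_OCCURENCE row starts a new set, which is inferred from the sample and not from any format specification. The names Row and parse_results are made up for illustration.

```python
from collections import namedtuple

Row = namedtuple("Row", "duptype id depth size device inode priority name")


def parse_results(lines):
    """Group rows of a results file into lists: [first occurrence, duplicate, ...]."""
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # The file name comes last and may contain spaces, so split only 7 times.
        fields = line.split(" ", 7)
        if len(fields) != 8:
            continue  # row does not match the assumed layout
        row = Row(fields[0], *map(int, fields[1:7]), fields[7])
        if row.duptype == "DUPTYPE_FIRST_OCCURENCE" or not groups:
            groups.append([row])
        else:
            groups[-1].append(row)
    return groups
```

Typical use would be to open results.txt, pass the file object to parse_results, and print the name of every row in each group. Nothing is deleted by this sketch; it only reads the file.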

Algorithm

Rdfind uses the following algorithm. If N is the number of files to search through, the effort required is in the worst case O(N log(N)). Because it sorts files on inodes prior to disk reading, it is quite fast. It also only reads from disk when needed.

  1. Loop over each argument on the command line. Assign each argument a priority number, in increasing order.

  2. For each argument, list the directory contents recursively and add them to the file list. Assign a directory depth number, starting at 0 for every argument.

  3. If the input argument is a file, add it to the file list.

  4. Loop over the list, and find out the sizes of all files.

  5. If flag "-removeident true": Remove items from the list which are already added, based on the combination of inode and device number.

  6. Sort files on size. Remove files from the list which have unique sizes.

  7. Sort on device and inode (speeds up file reading). Read a few bytes from the beginning of each file (first bytes).

  8. Remove files from the list that have the same size but different first bytes.

  9. Sort on device and inode (speeds up file reading). Read a few bytes from the end of each file (last bytes).

  10. Remove files from the list that have the same size but different last bytes.

  11. Sort on device and inode (speeds up file reading). Perform a checksum calculation for each file.

  12. Only keep files on the list with the same size and checksum. These are duplicates.

  13. Sort the list on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.

  14. If flag "-makeresultsfile true", then print the results file (default). Exit. (?)

  15. If flag "-deleteduplicates true", then delete (unlink) duplicate files. Exit.

  16. If flag "-makesymlinks true", then replace duplicates with a symbolic link to the original. Exit.

  17. If flag "-makehardlinks true", then replace duplicates with a hard link to the original. Exit.
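Steps 1 to 13 can be sketched in a few lines of Python. This is only an illustration of the idea of filtering with cheap checks first, not rdfind's actual C++ code. The function names, the choice of 64 bytes for the first and last bytes, and dropping empty files (as seen in the example run above) are assumptions. MD5 is used because the example output mentions an md5 checksum. The sketch only reports duplicates and never deletes or links anything.

```python
import hashlib
import os
from collections import defaultdict


def _list_files(root):
    """Yield (path, depth) for a file argument or, recursively, a directory."""
    if os.path.isfile(root):
        yield root, 0
        return
    for dirpath, _dirs, names in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        for name in names:
            yield os.path.join(dirpath, name), depth


def _read_part(entry, nbytes, from_end):
    """Read nbytes from the beginning or the end of a file."""
    with open(entry["path"], "rb") as handle:
        if from_end:
            handle.seek(max(0, entry["size"] - nbytes))
        return handle.read(nbytes)


def _md5(path):
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _groups(files, key):
    """Bucket files by key and keep only buckets with more than one file."""
    buckets = defaultdict(list)
    for entry in files:
        buckets[key(entry)].append(entry)
    return [group for group in buckets.values() if len(group) > 1]


def find_duplicates(roots, nbytes=64):
    """Return lists of paths with equal content; the first is the 'original'."""
    seen = {}
    for priority, root in enumerate(roots, start=1):       # steps 1-3
        for path, depth in _list_files(root):
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)                              # step 4
            # Step 5: a (device, inode) pair already listed is the same file.
            seen.setdefault((st.st_dev, st.st_ino), {
                "path": path, "size": st.st_size, "dev": st.st_dev,
                "ino": st.st_ino, "priority": priority, "depth": depth})
    files = [e for e in seen.values() if e["size"] > 0]
    stages = [                                              # steps 6-12
        lambda e: e["size"],
        lambda e: (e["size"], _read_part(e, nbytes, False)),
        lambda e: (e["size"], _read_part(e, nbytes, True)),
        lambda e: (e["size"], _md5(e["path"])),
    ]
    groups = []
    for key in stages:
        # Reading in device and inode order mirrors steps 7, 9 and 11.
        files.sort(key=lambda e: (e["dev"], e["ino"]))
        groups = _groups(files, key)
        files = [e for group in groups for e in group]
    # Step 13: lowest priority number, then smallest depth, comes first.
    return [[e["path"] for e in
             sorted(g, key=lambda e: (e["priority"], e["depth"]))]
            for g in groups]
```

Calling find_duplicates(["/home/pauls/bilder"]) would return one list per set of duplicates. Each stage only reads files that survived the previous one, which is why most files are never read at all when their size is unique.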

Alternatives and benchmark

There are some interesting alternatives.

Duff: http://duff.sourceforge.net/ by Camilla Berglund.

Fslint: http://www.pixelbeat.org/fslint/ by Pádraig Brady

A search on "finding duplicate files" will give you lots of matches.

Here is a small benchmark. Times are obtained from "elapsed time" in the time command. Each command has been repeated several times in a row, and the result from each run is shown in the tables below. The test computer is a 3 GHz Pentium 4 with 1 GB RAM and a Maxtor SATA disk with 8 MB cache, running Mandriva 2006.

Program       Command line
duff 0.4      time ./duff -rP dir >slask.txt
Fslint 2.14   time ./findup dir >slask.txt
Rdfind 1.1.2  time rdfind dir

Test case 1: Directory with 3301 files (2782 MB jpegs) in a directory structure, from which 100 files (24 MB) are redundant.

Run  duff 0.4  Fslint 2.14  Rdfind 1.1.2
1    0:01.55   0:02.59      0:00.49
2    0:01.61   0:02.66      0:00.50
3    0:01.58   0:02.58      0:00.49

Test case 2: Directory with 35871 files (5325 MB) in a directory structure, from which 10889 files (233 MB) are redundant.

Run  duff 0.4  Fslint 2.14  Rdfind 1.1.2
1    3:24.90   1:26.37      0:29.37
2    0:46.48   1:16.36      0:07.81
3    0:46.20   1:15.38      0:06.24
4    0:45.31   0:53.20      0:06.17

Note: units are minutes:seconds

Author

Rdfind is written by Paul Sundvall. If you find this software useful, please drop me an email! The address is x@y.z where x=rdfind, y=paulsundvall, z=net.

Help with creating files for rpm/deb building is especially needed. Suggestions and comments are very welcome.