Data Deduplication with Linux

Petros Koutoupis

Issue #207, July 2011

Lessfs offers a flexible solution to utilize data deduplication on affordable commodity hardware.

In recent years, the storage industry has been busy providing some of the most advanced features to its customers, including data deduplication. Data deduplication is a unique data compression technique used to eliminate redundant data and decrease the total capacities consumed on an enabled storage volume. A volume can refer to a disk device, a partition or a grouped set of disk devices all represented as single device. During the process of deduplication, redundant data is deleted, leaving a single copy of the data to be stored on the storage volume.

One ideal use-case scenario is when multiple copies of a large e-mail message are distributed and stored on a mail server. An e-mail message the size of just a couple megabytes does not seem too bad, but if it were sent and forwarded to more than 100 recipients—that's more than 200MB of copies of the same file(s).

Another great example is in the arena of host virtualization. In recent years, virtualization has been the hottest trend in server administration. If you are deploying multiple virtual guests across a network that may share the same common operating system image, data deduplication significantly can reduce the total size of capacity consumed to a single copy and, in turn, reference the differences when and where needed.

Again, the primary focus of this technology is to identify large sections of data that can include entire files or large sections of files, which are identical, and store only one copy of it. Other benefits include reduced costs for additional capacities of storage equipment, which, in turn, can be used to increase volume sizes or protect large numbers of existing volumes (such as RAID, archivals and so on). Using less storage equipment also leads to a reduced cost in energy, space and cooling.

Two types of data deduplication exist: post-process and inline deduplication. Each has its advantages and disadvantages. To summarize, post-process deduplication occurs after the data has been written to the storage volume in a separate process. While you are not losing performance in computing the necessary deduplication, multiple copies of a single file will be written multiple times, until post-process deduplication has completed, and this may become problematic if the available capacity becomes low. During inline deduplication, less storage is required, because all deduplication is handled in real time as the data is written to the storage volume, although you will notice a degradation in performance as the process attempts to identify redundant copies of the data coming in.

Storage technology manufacturers have been providing the technology as part of their proprietary and external storage solutions, but with Linux, it also is possible to use the same technology on commodity and very affordable hardware. The solutions provided by these storage technology manufacturers are in some cases available only on the physical device level (that is, the block level) and are able to work only with redundant streams of data blocks as opposed to individual files, because the logic is unable to recognize separate files over the most commonly used protocols, such as SCSI, Serial Attached SCSI (SAS), Fibre Channel, InfiniBand and even Serial ATA (SATA). This is referred to as a chunking method. The filesystem I cover here is Lessfs, a block-level-based deduplication and FUSE-enabled Linux filesystem.

Lessfs

Lessfs is a high-performance inline data deduplication filesystem written for Linux and is currently licensed under the GNU General Public License version 3. It also supports LZO, QuickLZ and BZip compression (among a couple others), and data encryption. At the time of this writing, the latest stable version is 1.3.3.1, which can be downloaded from the SourceForge project page: sourceforge.net/projects/lessfs/files/lessfs.

Before installing the lessfs package, make sure you install all known dependencies for it. Some, if not most, of these dependencies may be available in your distribution's package repositories. You will need to install a few manually though, including mhash, tokyocabinet and fuse (if not already installed).

Your distribution may have the libraries for mhash2 either available or installed, but lessfs still requires mhash. This also can be downloaded from SourceForge: sourceforge.net/projects/mhash/files/mhash. At the time of this writing, the latest stable build is 0.9.9.9. Download, build and install the package:

$ tar xvzf mhash-0.9.9.9.tar.gz
$ cd mhash-0.9.9.9/
$ ./configure
$ make
$ sudo make install

Lessfs also requires tokyocabinet (1978th.net/tokyocabinet), as it is the main database on which it relies. The latest stable build is 1-4.47. To build tokyocabinet, you need to have zlib1g-dev and libbz2-dev already installed, which usually are provided by most, if not all, mainstream Linux distributions.

Download, build and install the package using the same configure, make and sudo make install commands from earlier. On 32-bit systems, you need to append --enable-off64 to the configure command. Failure to use --enable-off64 limits the databases to a 2GB file size.

If it is not already installed or if you want to use the latest and greatest stable build of FUSE, download it from SourceForge: sourceforge.net/projects/fuse. At the time of this writing, the latest stable build is 2.8.5. Download, build and install the package using the same configure, make and sudo make install commands from earlier.

After resolving all the more obscure dependencies, you're ready to build and install the lessfs package. Download, build and install the package using the same configure, make and sudo make install commands from earlier.

Now you're ready to go, but before you can do anything, some preparation is needed. In the lessfs source directory, there is a subdirectory called etc/, and in it is a configuration file. Copy the configuration file to the system's /etc directory path:

$ sudo cp etc/lessfs.cfg /etc/

This file defines the location of the databases among a few other details (which I discuss later in this article, but for now let's concentrate on getting the filesystem up and running). You will need to create the directory path for the file data (default is /data/dta) and also for the metadata (default is /data/mta) for all file I/O operations sent to/from the lessfs filesystem. Create the directory paths:

$ sudo mkdir -p /data/{dta,mta}

Initialize the databases in the directory paths with the mklessfs command:

$ sudo mklessfs -c /etc/lessfs.cfg

The -c option is used to specify the path and name of the configuration file. A man page does not exist for the command, but you still can invoke the on-line menu with the -h command option.

Now that the databases have been initialized, you're ready to mount a lessfs-enabled filesystem. In the following example, let's mount it to the /mnt path:

$ sudo lessfs /etc/lessfs.cfg /mnt

When mounted, the filesystem assumes the total capacity of the filesystem to which it is being mounted. In my case, it is the filesystem on /dev/sda1:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031812   2541028  55% /mnt

$ df -t ext4
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sda1           5871080   3031812   2541028  55% /

Currently, you should see nothing but a hidden .lessfs subdirectory when listing the contents of the newly mounted lessfs volume:

$ ls -a /mnt/
.  ..  .lessfs

Once mounted, the lessfs volume can be unmounted like any other volume:

$ sudo umount /mnt

Let's put the volume to the test. Writing file data to a lessfs volume is no different from what it would be to any other filesystem. In the example below, I'm using the dd command to write approximately 100MB of all zeros to /mnt/test.dat:

$ sudo dd if=/dev/zero of=/mnt/test.dat bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.05418 s, 20.7 MB/s

Seeing how the filesystem is designed to eliminate all redundant copies of data and being that a file filled with nothing but zeros qualifies as a prime example of this, you can observe that only 48KB of capacity was consumed, and that may just be nothing more than the necessary data synchronized to the databases:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031860   2540980  55% /mnt

If you list a detailed listing of that same file in the lessfs-enabled directory, it appears that all 100MB have been written. Utilizing its embedded logic, lessfs reconstructs all data on the fly when additional read and write operations are initiated to the file(s):

$ ls -l
total 102400
-rw-r--r-- 1 root root 104857600 2011-02-26 13:57 test.dat

Now, let's work with something a bit more complex—something containing a lot of random data. For this example, I decided to download the latest stable release candidate of the Linux kernel source from www.kernel.org, but before I did, I listed the total capacity consumed available on the lessfs volume as a reference point:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031896   2540944  55% /mnt

$ sudo wget http://www.kernel.org/pub/linux/kernel/v2.6/
↪testing/linux-2.6.38-rc6.tar.bz2

Listing the contents, you can see that the package is approximately 75MB:

$ ls -l linux-2.6.38-rc6.tar.bz2 
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 
 ↪linux-2.6.38-rc6.tar.bz2

Listing the capacity used to store the Linux kernel source archive yields a difference of roughly 75MB:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3106440   2466400  56% /mnt

Now, let's create a copy of the archived kernel source:

$ sudo cp linux-2.6.38-rc6.tar.bz2 linux-2.6.38-rc6.tar.bz2-bak

$ ls -l linux-2.6.38-rc6.tar.bz2*
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 
 ↪linux-2.6.38-rc6.tar.bz2
-rw-r--r-- 1 root root 74783787 2011-02-26 14:43 
 ↪linux-2.6.38-rc6.tar.bz2-bak

By having a redundant copy of the same file, an additional 44KB is consumed—not nearly as much as an additional 75MB:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3106484   2466356  56% /mnt

And, because the databases contain the actual file and metadata, if an accidental or intentional system reboot occurred, or if for whatever reason you need to unmount the filesystem, the physical data will not be lost. All you need to do is invoke the same mount command and everything is restored:

$ sudo umount /mnt/
$ sudo lessfs /etc/lessfs.cfg /mnt
$ ls
linux-2.6.38-rc6.tar.bz2  linux-2.6.38-rc6.tar.bz2-bak

In the situation when a system suffers from an accidental reboot, possibly due to power loss, as of version 1.0.4, lessfs supports transactions, which eliminates the need for an fsck after a crash.

Shifting focus back to lessfs preparation, note that the lessfs volume's options can be defined by the user when mounting. For instance, you can define the desired options for big_write, max_read and max_write. The big_write improves throughput when used for backup purposes, and both max_read and max_write must be defined to use it. The max_read and max_write options always must be equal to one another and define the block size for lessfs to use: 4, 8, 16, 32, 64 and 128KB.

The definition of a block size can be used to tune the filesystem. For example, a larger block size, such as 128KB (131072), offers faster performance but, unfortunately, at the cost of less deduplication (remember from earlier that lessfs uses block-level deduplication). All other options are FUSE-generic options defined in the FUSE documentation. An example of the use of supported mount options can be found in the lessfs man page:

$ man 1 lessfs

The following example is given to mount lessfs with a 128KB block size:

$ sudo lessfs /etc/lessfs.cfg /fuse -o negative_timeout=0,\
        entry_timeout=0,attr_timeout=0,use_ino,\
        readdir_ino, default_permissions,allow_other,big_writes,\
        max_read=131072,max_write=131072

Additional configurable options for the database exist in your lessfs.cfg file (the same file you copied over to the /etc directory path earlier). The block size can be defined here as well as even the method of additional data compression to use on the deduplicated data and more. Below is an excerpt of what the configuration file contains. In order to define a new value for various options clearly, just uncomment the option desired and, in turn, comment everything else:

BLKSIZE=131072
#BLKSIZE=65536
#BLKSIZE=32768
#BLKSIZE=16384
#BLKSIZE=4096
#COMPRESSION=none
COMPRESSION=qlz
#COMPRESSION=lzo
#COMPRESSION=bzip
#COMPRESSION=deflate
#COMPRESSION=disabled

This excerpt defines the default block size to 128KB and the default compression method to QuickLZ. If the defaults are not to your liking, in this file you also can define the commit to disk intervals (default is 30 seconds) or a new path for your databases, but make sure to initialize the databases before use; otherwise, you'll get an error when you try to mount the lessfs filesystem.

Summary

Now, Linux is not limited to a single data deduplication solution. There also is SDFS, a file-level deduplication filesystem that also runs on the FUSE module. SDFS is a freely available cross-platform solution (Linux and Windows) made available by the Opendedup Project. On its official Web site, the project highlights the filesystem's scalability (it can dedup a petabyte or more of data); speed, performing deduplication/reduplication at a line speed of 290MB/s and higher; support for VMware while also mentioning its usage in Xen and KVM; flexibility in storage, as deduplicated data can be stored locally, on the network across multiple nodes (NFS/CIFS and iSCSI), or in the cloud; inline and batch mode deduplication (a method of post-process deduplication); and file and folder snapshot support. The project seems to be pushing itself as an enterprise-class solution, and with features like these, Opendedup means business.

It is also not surprising that since 2008, data deduplication has been a requested feature for Btrfs, the next-generation Linux filesystem. Although that also may be in response to Sun Microsystem's (now Oracle's) development of data deduplication into its advanced ZFS filesystem. Unfortunately, at this point in time, it is unknown if and when Btrfs will introduce data deduplication support, although it already contains support for various types of data compression (such as zlib and LZO).

Currently, the lessfs2 release is under development, and it is supposed to introduce snapshot support, fast inode cloning, new databases (including hamsterdb and possibly BerkeleyDB) apart from tokyocabinet, self-healing RAID (to repair corrupted chunks) and more.

As you can see, with a little time and effort, it is relatively simple to utilize the recent trend of data deduplication to reduce the total capacity consumed on a storage volume by removing all redundant copies of data. I recommend its usage in not only server administration but even for personal use, primarily because with implementations such as lessfs, even if there isn't too much redundant data, the additional data compression will help reduce the total size of the file when it is eventually written to disk. It is also worth mentioning that the lessfs-enabled volume does not need to remain local to the host system, but it also can be exported across a network via NFS to even iSCSI and utilized by other devices within that same network, providing a more flexible solution.

Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.