Enhanced Duplicate File Searching With TotalStorage Productivity Center

Abstract

This Technote looks at how TotalStorage Productivity Center (TPC) searches for and reports on duplicate files in an environment, and what configuration changes can be made to customize its focus when necessary. With an understanding of how TPC looks for duplicate files, it is possible to create targeted profiles that collect large numbers of file names but are restricted to specific servers (perhaps file and print servers) and to specific file types.
Details on the procedure to create the profiles and jobs to identify duplicate files are documented in the IBM Redbooks publication TotalStorage Productivity Center Advanced Topics, SG24-7348.

Contents

How does TPC identify a duplicate file?
There are a number of misconceptions about how TPC identifies duplicate files. By default, TPC will not find every instance of a duplicate file across your environment. TPC performs duplicate file processing only on the filenames that are stored in its repository. By default, these filenames are collected by the following profiles:

  • TPCUser.Largest Files
  • TPCUser.Largest Orphans
  • TPCUser.Most at Risk
  • TPCUser.Most Obsolete
By default, these profiles each collect 20 filenames per client: the largest files, the largest orphans, the most at risk, and the most obsolete. Therefore, when TPC performs duplicate file spotting, it looks at only a small number of filenames per machine. Any duplicates found are, by definition, among the top 20 largest files, largest orphans, most at risk, or most obsolete files. In many, if not most, situations this level of duplicate spotting is too limited to be of any great use, and it will not reveal the true extent of a duplicate file problem. Conversely, if TPC attempted duplicate file matching for all files in your environment, and there were ten million of them to look through (not an unrealistic figure these days), the TPC repository would clearly become huge, even if only 1% of your files were duplicates.
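
To put rough numbers on this, the following Python sketch works out the upper bound on filenames that the default profiles hold. It reads the text as 20 filenames per profile per client. The client count of 500 is a hypothetical figure chosen for illustration, and the ten million total is the figure quoted above. The function names are illustrative and are not part of TPC.

```python
# Default profiles named in this Technote.
DEFAULT_PROFILES = (
    "TPCUser.Largest Files",
    "TPCUser.Largest Orphans",
    "TPCUser.Most at Risk",
    "TPCUser.Most Obsolete",
)
# Assumption: 20 filenames per profile per client, as the text is read here.
FILENAMES_PER_PROFILE = 20


def max_default_filenames(client_count: int) -> int:
    """Upper bound on stored filenames; a file may appear in several profiles."""
    return len(DEFAULT_PROFILES) * FILENAMES_PER_PROFILE * client_count


def coverage_fraction(client_count: int, total_files: int) -> float:
    """Largest possible share of all files that the default profiles can see."""
    return max_default_filenames(client_count) / total_files


if __name__ == "__main__":
    clients = 500               # hypothetical environment size
    total_files = 10_000_000    # figure quoted in the text
    print(max_default_filenames(clients))
    print(f"{coverage_fraction(clients, total_files):.2%}")
```

Under these assumptions the default configuration examines at most 80 filenames per client, which is a very small share of a ten-million-file environment.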

A balanced approach is the solution. Configure TPC to collect more filenames, but target its focus at specific filesystems and file types where the problem is suspected or most likely to exist. For example, collecting all the filenames on the C: drives of all your Windows machines is only going to tell you what you already know, which is that the same file names exist on all of them. Storing all these file names in TPC for this reason represents poor reporting value for the effort involved in collecting, storing, and managing the file name data. There would be greater value in collecting the file names of, for example, all Microsoft® Office file types (*.doc, *.xls, *.ppt) and perhaps media files (*.mp3, *.avi, and so forth) in users' file and print directories. Reporting at this level could spot, for example, spreadsheets that were created by one person, e-mailed to many recipients, and then detached multiple times, or video files and jokes from external sources that staff share via e-mail and detach.
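
The effect of targeting by file type can be illustrated outside TPC with a short Python sketch. It does not show how a TPC profile is configured. It only shows how the wildcard patterns named above narrow a list of file names. The case-insensitive comparison and the function names are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# File types named in the text; extend or trim to suit the environment.
TARGET_PATTERNS = ("*.doc", "*.xls", "*.ppt", "*.mp3", "*.avi")


def is_targeted(filename: str, patterns=TARGET_PATTERNS) -> bool:
    """Return True if the file name matches any pattern.

    Assumption: matching ignores case, so BUDGET.XLS counts as *.xls.
    """
    lowered = filename.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in patterns)


def select_targets(filenames, patterns=TARGET_PATTERNS):
    """Keep only the file names worth storing for duplicate processing."""
    return [name for name in filenames if is_targeted(name, patterns)]


if __name__ == "__main__":
    sample = ["Budget.XLS", "kernel32.dll", "joke.avi", "notes.txt"]
    print(select_targets(sample))
```

Operating system files such as kernel32.dll fall out of the list, which mirrors the point that collecting every file name on every C: drive adds little reporting value.
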
Note that TPC matches duplicates by file name and size only. It does not open or examine the content of any files or perform any kind of checksum processing.
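
A minimal Python sketch of name-and-size matching follows. It is not TPC code. The FileRecord type, its fields, and the exact (case-sensitive) name comparison are assumptions made for illustration, because the text does not say how TPC normalizes names.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    """Hypothetical record of one collected file name."""
    host: str
    path: str
    size: int  # size in bytes


def base_name(path: str) -> str:
    """Strip directories from a path, accepting / or \\ separators."""
    return path.replace("\\", "/").rsplit("/", 1)[-1]


def find_duplicates(records):
    """Group records by (file name, size) and keep groups with more than one file.

    File content is never read, so two different files that happen to share
    a name and a size are reported as duplicates.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(base_name(record.path), record.size)].append(record)
    return {key: found for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    records = [
        FileRecord("srv1", "/home/a/budget.xls", 1000),
        FileRecord("srv2", "C:\\Users\\b\\budget.xls", 1000),
        FileRecord("srv3", "/home/c/budget.xls", 2048),
    ]
    for (name, size), found in find_duplicates(records).items():
        print(name, size, [r.host for r in found])
```

Because no checksum is involved, a match on name and size is a strong hint of duplication and not proof of it.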

Following are the steps to configure TPC to identify duplicate files:
  1. Create a profile to target file names for duplicate processing.
  2. Create a targeted scan job.
  3. Generate a targeted duplicate file report.
  4. Define a scripted action.

Special Notices

This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.

Profile

Publish Date
27 June 2007


Author(s)

IBM Form Number
TIPS0648