Enhanced Duplicate File Searching With TotalStorage Productivity Center
Published 27 June 2007
Authors: Mary Lovelace
This Technote looks at how TotalStorage Productivity Center (TPC) searches for and reports on duplicate files in an environment, and what configuration changes can be made to customize its focus when necessary. With an understanding of how TPC looks for duplicate files, it is possible to create targeted profiles that collect large numbers of file names but are limited to specific servers (for example, file and print servers) and to specific file types.
Details on the procedure to create the profiles and jobs to identify duplicate files are documented in the IBM Redbooks publication TotalStorage Productivity Center Advanced Topics, SG24-7348.
How does TPC identify a duplicate file?
There are a number of misconceptions about how TPC identifies duplicate files. By default, TPC does not find every instance of a duplicate file across your environment. TPC performs duplicate file processing only on the file names that are stored in its repository. By default, these file names are collected by the following profiles:
- TPCUser.Largest Files
- TPCUser.Largest Orphans
- TPCUser.Most at Risk
- TPCUser.Most Obsolete
Because duplicate processing sees only the file names gathered by these profiles, the default reports cover a small part of the environment, while collecting every file name everywhere would be costly to store and manage. A balanced approach is the solution. Configure TPC to look for more file names, but target its focus at the specific filesystems and file types where the problem is suspected or most likely to exist. For example, collecting all the file names on the C: drives of all your Windows machines is only going to tell you what you already know, which is that the same file names exist on all of them. Storing all these file names in TPC for this reason represents poor reporting value for the effort involved in collecting, storing, and managing the file name data. There would be greater value in collecting the file names of, for example, all Microsoft® Office file types (*.doc, *.xls, *.ppt) and perhaps media files (*.mp3, *.avi, and so forth) in users' file and print directories. Reporting at this level could spot, for example, spreadsheets that were created by one person, e-mailed to many, and then detached multiple times, or video files and jokes from external sources that staff share via e-mail and detach.
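The idea of targeting specific file types can be illustrated outside TPC with a short Python sketch. This is not TPC's profile mechanism or any TPC API; the names TARGET_PATTERNS, matches_target, and select_targets are hypothetical, and the pattern list simply mirrors the file types mentioned above. Matching is assumed to be case-insensitive.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern list mirroring the file types named in the text.
TARGET_PATTERNS = ["*.doc", "*.xls", "*.ppt", "*.mp3", "*.avi"]

def matches_target(filename, patterns=TARGET_PATTERNS):
    """Return True if the file name matches any targeted pattern.

    Both sides are lowercased so the comparison is case-insensitive
    (an assumption made for this sketch).
    """
    name = filename.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)

def select_targets(filenames, patterns=TARGET_PATTERNS):
    """Keep only the file names worth collecting for duplicate processing."""
    return [f for f in filenames if matches_target(f, patterns)]
```

Applied to a list such as "Budget.XLS", "notes.txt", "clip.avi", and "kernel32.dll", the filter would keep only the spreadsheet and the video file, which is the kind of narrowing a targeted profile is meant to achieve.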
Note that TPC matches duplicates by file name and size only. It does not open or examine the content of any files or perform any kind of checksum processing.
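As a rough illustration of matching by name and size only, the following Python sketch groups file records on the pair (name, size) and reports any group with more than one member. It is not TPC's implementation; the function find_duplicates and the (path, name, size) record layout are assumptions made for this example, and names are compared exactly as given.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group file records by (name, size) and return the duplicated groups.

    records: iterable of (path, name, size_in_bytes) tuples (hypothetical layout).
    No file content is read and no checksum is computed, so two different
    files that share a name and a size are reported as duplicates.
    """
    groups = defaultdict(list)
    for path, name, size in records:
        groups[(name, size)].append(path)
    # Keep only keys seen at more than one location.
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

One consequence of this approach is that a renamed copy of a file is not detected, while unrelated files that happen to share a name and size are grouped together.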
Following are the steps to configure TPC to identify duplicate files:
- Create a profile to target file names for duplicate processing.
- Create a targeted scan job.
- Generate a targeted duplicate file report.
- Define a scripted action.
This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.