4
Nov
by Chris
Here is a simple translet for finding duplicates.
function T-FindDuplicates{
    param ($inxml)
    begin{
        . PSlib:\xml\invoke-transform.ps1
        [xml]$xslt = @"
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" />
<xsl:key name="file-checksums" match="file" use="@Checksum" />
<xsl:template match="file">
<xsl:copy>
<xsl:attribute name="Duplicate">
<xsl:value-of select="count(key('file-checksums', @Checksum)) > 1" />
</xsl:attribute>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"@
    }
    process{
        if ($_ -is [xml]){
            [xml](invoke-transform -inxml $_ -inxsl $xslt)
        }
    }
    end{
        if ($inxml -is [xml]){
            [xml](invoke-transform -inxml $inxml -inxsl $xslt)
        }
    }
}
As you can see in the code, it adds a Duplicate attribute with a value of true or false, depending on whether another file has the same @Checksum.
It can be used like this:
PS> . .\T-GetDirAsXml.ps1
PS> . .\T-AddChecksum.ps1
PS> . .\T-FindDuplicates.ps1
PS> Get-DirAsXml | T-AddChecksum | T-FindDuplicates
It might produce:
<root Name="root" Root="True" Date="2008/11/03 01:35:14">
<folder Name="test" Base="D:\powershell\blog\test" Parent="D:\powershell\blog">
<folder Name="test2">
<file Duplicate="true" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
<file Duplicate="false" Name="test.txt" Checksum="CE217706948A41613FFA00C46B64D48A514D3D80758C8334EE00D6B0786AE47F" />
<file Duplicate="true" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
<file Duplicate="true" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
</folder>
<file Duplicate="true" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
<file Duplicate="false" Name="test.txt" Checksum="0D7439F5894B4E8EFEC8FB409635D0D8EA7A450E902F6B30B335907B5867DF16" />
<file Duplicate="true" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
<file Duplicate="true" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
</folder>
</root>
Here is the code
T-FindDuplicates.zip (745 b)
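If you want to try the key lookup without the rest of the pipeline, here is a minimal sketch that runs a cut-down version of the stylesheet directly through .NET's XslCompiledTransform. The invoke-transform.ps1 helper isn't shown in this post, so this is an assumption about one way to do the transform, not that helper's actual implementation. The sample input and the variable names are made up for illustration.

```powershell
# Hypothetical sample input: two files share a checksum, one does not.
[xml]$sample = @'
<root Name="root" Root="True">
  <file Name="a.txt" Checksum="AAA" />
  <file Name="b.txt" Checksum="AAA" />
  <file Name="c.txt" Checksum="BBB" />
</root>
'@

# Same key-based lookup as the translet, reduced to the essentials.
$xsltText = @'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:key name="file-checksums" match="file" use="@Checksum" />
  <xsl:template match="file">
    <xsl:copy>
      <xsl:attribute name="Duplicate">
        <xsl:value-of select="count(key('file-checksums', @Checksum)) &gt; 1" />
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
'@

# Compile the stylesheet with the .NET XSLT processor.
$transform = New-Object System.Xml.Xsl.XslCompiledTransform
$reader = [System.Xml.XmlReader]::Create((New-Object System.IO.StringReader $xsltText))
$transform.Load($reader)

# Run the transform into a string, then parse the result back into [xml].
$writer = New-Object System.IO.StringWriter
$transform.Transform($sample, $null, $writer)
[xml]$result = $writer.ToString()

# List each file name with its Duplicate flag.
$result.SelectNodes('//file') | ForEach-Object {
    '{0}: {1}' -f $_.GetAttribute('Name'), $_.GetAttribute('Duplicate')
}
```

With this sample, a.txt and b.txt should come back flagged true and c.txt false, because the key groups file elements by their @Checksum value and the count is taken per group.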
All of the files in folder test2 are copies of the files in test, except for test.txt. As you can see, a bare @Duplicate indicator doesn't tell you which file a given file is a duplicate of, so this translet is only useful if you have very few duplicate files.
What you do when you find a duplicate is up to you and depends very much on the downstream application. One thing you could do is put a list of the duplicate files into an attribute, like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" />
<xsl:key name="file-checksums" match="file" use="@Checksum" />
<xsl:template match="file">
<xsl:copy>
<xsl:if test="count(key('file-checksums', @Checksum)) > 1">
<xsl:attribute name="Duplicate">true</xsl:attribute>
<xsl:attribute name="Duplicates">
<xsl:for-each select="key('file-checksums', @Checksum)">
<xsl:call-template name="get-path" />
<xsl:value-of select="'
'" />
</xsl:for-each>
</xsl:attribute>
</xsl:if>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template name="get-path">
<xsl:for-each select="ancestor-or-self::*[not(@Root)]">
<xsl:value-of select="@Parent" />
<xsl:text>\</xsl:text>
<xsl:value-of select="@Name" />
</xsl:for-each>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This will produce:
<root Name="root" Root="True" Date="2008/11/03 01:35:14">
<folder Name="test" Base="D:\powershell\blog\test" Parent="D:\powershell\blog">
<folder Name="test2">
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.ps1 D:\powershell\blog\test\test.ps1" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
<file Name="test.txt" Checksum="CE217706948A41613FFA00C46B64D48A514D3D80758C8334EE00D6B0786AE47F" />
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.zip D:\powershell\blog\test\test.zip" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\tmp.xml D:\powershell\blog\test\tmp.xml" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
</folder>
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.ps1 D:\powershell\blog\test\test.ps1" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
<file Name="test.txt" Checksum="0D7439F5894B4E8EFEC8FB409635D0D8EA7A450E902F6B30B335907B5867DF16" />
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.zip D:\powershell\blog\test\test.zip" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
<file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\tmp.xml D:\powershell\blog\test\tmp.xml" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
</folder>
</root>
Here is the code
T-FindDuplicatesInfoTip.zip (927 b)
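As an example of what a downstream step might do with this output, here is a rough sketch that reports each duplicated checksum once, together with its list of paths. The input fragment is hypothetical (shortened paths, made-up checksums) and only mimics the shape of the output above. The Duplicates value is kept as a single string rather than split, since splitting on whitespace would break paths that contain spaces.

```powershell
# Hypothetical fragment shaped like the translet output (paths shortened).
[xml]$result = @'
<root Name="root" Root="True">
  <file Duplicate="true" Duplicates="D:\a\x.txt D:\b\x.txt" Name="x.txt" Checksum="AAA" />
  <file Name="y.txt" Checksum="BBB" />
  <file Duplicate="true" Duplicates="D:\a\x.txt D:\b\x.txt" Name="x.txt" Checksum="AAA" />
</root>
'@

# Keep only the flagged files, then report each checksum once with its paths.
$report = $result.SelectNodes("//file[@Duplicate='true']") |
    Group-Object { $_.GetAttribute('Checksum') } |
    ForEach-Object {
        New-Object PSObject -Property @{
            Checksum = $_.Name                                        # the grouping key
            Count    = $_.Count                                       # files sharing it
            Paths    = $_.Group[0].GetAttribute('Duplicates').Trim()  # same list on every member
        }
    }

$report | Format-List Checksum, Count, Paths
```

Every file in a group carries the same Duplicates list, so reading it from the first member of each group is enough.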
