Here is a simple translet for finding duplicates.

function T-FindDuplicates{
    param ($inxml)
    begin{
        . PSlib:\xml\invoke-transform.ps1
        [xml]$xslt = @"
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes" />
    <xsl:key name="file-checksums" match="file" use="@Checksum" />
    <xsl:template match="file">
        <xsl:copy>
            <xsl:attribute name="Duplicate">
                <xsl:value-of select="count(key('file-checksums', @Checksum)) &gt; 1" />
            </xsl:attribute>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
"@
    }
    process{
        if ($_ -is [xml]){
            [xml](invoke-transform -inxml $_ -inxsl $xslt)
        }
    }
    end{
        if ($inxml -is [xml]){
           [xml](invoke-transform -inxml $inxml -inxsl $xslt)
        }
    }
}

As you can see in the code, it adds a Duplicate attribute to every file element. The value is true when at least one other file has the same @Checksum and false otherwise.
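If xsl:key is unfamiliar, the same idea can be pictured in plain PowerShell: count how often each checksum occurs, then flag every file whose checksum occurs more than once. This is only a rough sketch to illustrate the lookup, not a replacement for the translet. The sample document and its shortened checksums are made up, and SetAttribute appends the new attribute at the end rather than at the front as the stylesheet does.

```powershell
# Made-up sample input; real input would come from T-AddChecksum.
[xml]$doc = @'
<root>
  <file Name="a.ps1" Checksum="AAAA" />
  <file Name="b.txt" Checksum="BBBB" />
  <file Name="c.ps1" Checksum="AAAA" />
</root>
'@

# Pass 1: count how many file elements share each checksum.
# This plays the role of xsl:key together with count(key(...)).
$counts = @{}
foreach ($f in $doc.SelectNodes('//file')) {
    $sum = $f.GetAttribute('Checksum')
    if ($counts.ContainsKey($sum)) { $counts[$sum]++ } else { $counts[$sum] = 1 }
}

# Pass 2: write the true/false flag onto every file element.
foreach ($f in $doc.SelectNodes('//file')) {
    $isDup = $counts[$f.GetAttribute('Checksum')] -gt 1
    $f.SetAttribute('Duplicate', $isDup.ToString().ToLower())
}

$doc.OuterXml
```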

The translet can be used like this:

PS> . .\T-GetDirAsXml.ps1
PS> . .\T-AddChecksum.ps1
PS> . .\T-FindDuplicates.ps1
PS> Get-DirAsXml | T-AddChecksum | T-FindDuplicates

It might produce:

<root Name="root" Root="True" Date="2008/11/03 01:35:14">
    <folder Name="test" Base="D:\powershell\blog\test" Parent="D:\powershell\blog">
        <folder Name="test2">
            <file Duplicate="true" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
            <file Duplicate="false" Name="test.txt" Checksum="CE217706948A41613FFA00C46B64D48A514D3D80758C8334EE00D6B0786AE47F" />
            <file Duplicate="true" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
            <file Duplicate="true" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
        </folder>
        <file Duplicate="true" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
        <file Duplicate="false" Name="test.txt" Checksum="0D7439F5894B4E8EFEC8FB409635D0D8EA7A450E902F6B30B335907B5867DF16" />
        <file Duplicate="true" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
        <file Duplicate="true" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
    </folder>
</root>

Here is the code

T-FindDuplicates.zip (745 b)

All of the files in folder test2 are copies of the files in test, except for test.txt. As you can see, an @Duplicate indicator on its own doesn't tell you which file a given file is a duplicate of, so this translet is only useful if you have very few duplicate files.
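One way to work around that on the PowerShell side is to group the flagged files by checksum, so that files with the same content end up together. The following is a minimal sketch, assuming the translet's output is already in an [xml] variable. The sample document and its shortened checksums are invented for illustration.

```powershell
# Invented sample shaped like the translet's output.
[xml]$result = @'
<root Name="root" Root="True">
  <folder Name="test">
    <folder Name="test2">
      <file Duplicate="true" Name="test.ps1" Checksum="AAAA" />
      <file Duplicate="false" Name="test.txt" Checksum="BBBB" />
      <file Duplicate="true" Name="test.zip" Checksum="CCCC" />
    </folder>
    <file Duplicate="true" Name="test.ps1" Checksum="AAAA" />
    <file Duplicate="false" Name="test.txt" Checksum="DDDD" />
    <file Duplicate="true" Name="test.zip" Checksum="CCCC" />
  </folder>
</root>
'@

# Keep only the flagged files and group them by their checksum.
$groups = $result.SelectNodes('//file[@Duplicate="true"]') |
    Group-Object { $_.GetAttribute('Checksum') }

# One line per checksum, listing the file names that share it.
foreach ($g in $groups) {
    $names = $g.Group | ForEach-Object { $_.GetAttribute('Name') }
    '{0}: {1}' -f $g.Name, ($names -join ', ')
}
```

This still only gives file names, not full paths, so it does not tell you where each copy lives.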

What you do when you find a duplicate is up to you, and it depends very much on the downstream application. One thing you could do is put a list of the duplicate files into an attribute, like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes" />
    <xsl:key name="file-checksums" match="file" use="@Checksum" />
    <xsl:template match="file">
        <xsl:copy>
            <xsl:if test="count(key('file-checksums', @Checksum)) &gt; 1">
                <xsl:attribute name="Duplicate">true</xsl:attribute>
                <xsl:attribute name="Duplicates">
                    <xsl:for-each select="key('file-checksums', @Checksum)">
                        <xsl:call-template name="get-path" />
                        <xsl:value-of select="'&#xA;'" />
                    </xsl:for-each>
                </xsl:attribute>
            </xsl:if>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template name="get-path">
        <xsl:for-each select="ancestor-or-self::*[not(@Root)]">
            <xsl:value-of select="@Parent" />
            <xsl:text>\</xsl:text>
            <xsl:value-of select="@Name" />
        </xsl:for-each>
    </xsl:template>
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

This will produce:

<root Name="root" Root="True" Date="2008/11/03 01:35:14">
    <folder Name="test" Base="D:\powershell\blog\test" Parent="D:\powershell\blog">
        <folder Name="test2">
            <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.ps1 D:\powershell\blog\test\test.ps1" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
            <file Name="test.txt" Checksum="CE217706948A41613FFA00C46B64D48A514D3D80758C8334EE00D6B0786AE47F" />
            <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.zip D:\powershell\blog\test\test.zip" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
            <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\tmp.xml D:\powershell\blog\test\tmp.xml" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
        </folder>
        <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.ps1 D:\powershell\blog\test\test.ps1" Name="test.ps1" Checksum="C47313D06C6AADA288AF6D61E03EFD7FA7C52DD73AB097E9D556535D330798D3" />
        <file Name="test.txt" Checksum="0D7439F5894B4E8EFEC8FB409635D0D8EA7A450E902F6B30B335907B5867DF16" />
        <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.zip D:\powershell\blog\test\test.zip" Name="test.zip" Checksum="7F2CCA02F17FF0E9458C0777C659D6D00B80F1C9D2921AEC971AE9A82D296AA5" />
        <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\tmp.xml D:\powershell\blog\test\tmp.xml" Name="tmp.xml" Checksum="1351245F9834D0406C42DD5AF622FCA691A9A36F440A7C88F389927800292303" />
    </folder>
</root>

Here is the code

T-FindDuplicatesInfoTip.zip (927 b)
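A downstream script could read the Duplicates attribute back into a list of paths. The sketch below assumes the paths arrive separated by whitespace, as in the output above, so it would break on paths that contain spaces. The sample document and its shortened checksums are made up, and the $report variable is only there for illustration.

```powershell
# Made-up sample shaped like the second translet's output.
[xml]$result = @'
<root Name="root" Root="True">
  <folder Name="test" Parent="D:\powershell\blog">
    <file Duplicate="true" Duplicates="D:\powershell\blog\test\test2\test.ps1 D:\powershell\blog\test\test.ps1" Name="test.ps1" Checksum="AAAA" />
    <file Name="test.txt" Checksum="BBBB" />
  </folder>
</root>
'@

# Only files that carry a Duplicates attribute are of interest here.
$report = foreach ($f in $result.SelectNodes('//file[@Duplicates]')) {
    # Split on runs of whitespace and drop any empty entries.
    $paths = @($f.GetAttribute('Duplicates') -split '\s+' | Where-Object { $_ })
    '{0} has {1} copies: {2}' -f $f.GetAttribute('Name'), $paths.Count, ($paths -join '; ')
}

$report
```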
