Tome's Land of IT

IT Notes from the Powertoe – Tome Tanasovski

Category Archives: Mathematics

Median and Mode in a Measure-Object Proxy Function or How to Add Properties to the Return Object in a Proxy Function

I was poking around Khan Academy for something to do. Because I’ve been living and breathing data, I thought it only appropriate to run through the statistics lessons up there. While I’m no slouch at statistics, I figured it couldn’t hurt to listen to lesson 1: Mean, Median, and Mode. I started to think about how to perform these calculations in PowerShell. Mean required no thought at all:

Mean

$data = (0,1,1,3,5)
($data |Measure-Object -Average).Average

Median and Mode required some thought.  I quickly mocked up the following which worked like a charm:

Median

$data = (0,1,1,3,5)
$data = $data |sort
if ($data.count%2) {
    #odd
    $medianvalue = $data[[math]::Floor($data.count/2)]
}
else {
    #even
    $MedianValue = ($data[$data.Count/2],$data[$data.count/2-1] |measure -Average).average
}    
$MedianValue

Mode

$data = (0,1,1,3,5)

$i=0
$modevalue = @()
foreach ($group in ($data |group |sort -Descending count)) {
    if ($group.count -ge $i) {
        $i = $group.count
        $modevalue += $group.Group[0]  # the original value; $group.Name is always a string
    }
    else {
        break
    }
}
$modevalue

This is all fine and dandy, but working on this made me think of a great talk that Kirk Munro did at TEC 2012 about proxy functions. This is a topic that I’ve been dying to play with, but I have not had the desire beyond curiosity. This, however, was a perfect occasion. I decided to extend Measure-Object to include a -Median and a -Mode parameter.

I’m not going to dig into how to do proxy functions. If you’d like a step-by-step guide, I’d suggest reading Shay Levy’s blog post on Hey Scripting Guy! It’s really the best there is on the subject. However, after you read that article, you will likely scratch your head as I did when thinking about how to perform your own calculation on the objects in the pipeline, and then modify the return object to include new properties.

Perform your own calculation on the objects in the pipeline within the proxy function

The first problem to solve was easy in my opinion. I wanted to collect all of the objects passed to the Process block, and then do my calculations on this collected list in the End block. I initialize an array called $data in Begin. Within the Process block, you can simply add $_ to that $data list. Actually, in the case of Measure-Object you also need to be mindful of whether someone used the Property parameter. If they did, you need to ensure that you are collecting the values of the specified property for the objects in the pipeline rather than the objects themselves. Here are the relevant snippets, with ellipses (…) indicating the missing code. You will be able to see the full code at the end of this article:

    begin
    {
        try {
            # Initialize my $data array
            $data = @()
...

    process
    {
        try {
            if ($Property) {
                $data += $_.($property)
            } else {
                $data += $_
            }
...

With the above code, you can now access $data in the End block. However, this is not enough. For my proxy function to feel like a single function, it needs to return the new data along with the object that the original command normally returns.
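The collect-then-compute pattern can be tried on its own, outside of a proxy function. The following is a minimal sketch, not the proxy itself: the function name Get-PipelineValue is hypothetical, and it simplifies -Property to a single string, whereas Measure-Object accepts an array of property names.

```powershell
# Hypothetical function name; it only demonstrates the collection pattern
function Get-PipelineValue {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline = $true)]
        [psobject] $InputObject,

        # Simplified to one property name for this sketch
        [Parameter(Position = 0)]
        [string] $Property
    )

    begin {
        # Runs once, before the first pipeline object arrives
        $data = @()
    }

    process {
        # Runs once per pipeline object
        if ($Property) {
            $data += $InputObject.$Property
        } else {
            $data += $InputObject
        }
    }

    end {
        # Runs once, after the last object; the full list is available here
        $data
    }
}

$lengths = 'a', 'bbb', 'cc' | Get-PipelineValue -Property Length
$lengths -join ','    # 1,3,2
```

Anything that needs the whole data set at once, such as a median or a mode, would go in the end block where this sketch simply emits $data.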

Modify the return object of the original function

At first glance it looks like you could call Add-Member on $steppablePipeline.End(). This will not work: the End() method does not actually return anything at all, which I think is a bit counter-intuitive. Unfortunately, the only way I have found to solve this problem is to call the original cmdlet on the collected data, and then call Add-Member on its return value. Shay gives a subtle hint about this in his article when he tells us that we must use the module-qualified name (ModuleName\CmdletName) in order to call the original, non-proxied command. The only thing you need to be careful about is calling it with the original parameters. This can be done by using $pscmdlet.MyInvocation.BoundParameters, but you need to be sure to exclude the InputObject and Property parameters. The InputObject should come from the $data variable you have populated. The Property parameter needs to be excluded because you have already flattened the data down to the value of the property in the Process block, as described in the previous section. The following code illustrates how all of this can be accomplished in your End block:

$params = @{}
foreach ($key in ($pscmdlet.MyInvocation.BoundParameters.Keys |?{($_ -ne 'inputobject') -and ($_ -ne 'Property')})) {
     $params.($key) = $pscmdlet.MyInvocation.BoundParameters.($key)
}
$return = $data |Microsoft.PowerShell.Utility\Measure-Object @params
$return |add-member noteproperty -Name SomeName -Value SomeValue
$return
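The Add-Member step can be tried in isolation before it is wired into a proxy. This is a small sketch under stated assumptions: the property name Median and its hard-coded value of 1 are placeholders for illustration only; in the real function the value would be calculated from $data.

```powershell
# Sample data from earlier in the article
$data = 0, 1, 1, 3, 5

# The module-qualified name bypasses any proxy function named Measure-Object
$result = $data | Microsoft.PowerShell.Utility\Measure-Object -Average

# Attach an extra property; the value is hard-coded here purely for illustration
$result | Add-Member -MemberType NoteProperty -Name Median -Value 1

# The added property now travels with the original measurement object
$result | Select-Object Count, Average, Median
```

Because the note property is attached to the same object that Measure-Object returned, callers still see Count, Average and the other standard properties exactly where they expect them.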

Here is the final version of my code that extends Measure-Object to include -Median and -Mode. The one decision that makes it feel less like part of the original cmdlet is that I do not add the Median and Mode properties to the return object unless the respective parameters are specified. I did this deliberately to avoid any performance impact when the Median or Mode switches are not used. It’s also debatable whether Measure-Object should return one of its normal properties when the switch for that property was not used, but that’s not something I’m here to debate.

function Measure-Object {
    [CmdletBinding(DefaultParameterSetName='GenericMeasure', HelpUri='http://go.microsoft.com/fwlink/?LinkID=113349', RemotingCapability='None')]
    param(
        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        ${Average},

        [Parameter(ValueFromPipeline=$true)]
        [psobject]
        ${InputObject},

        [Parameter(Position=0)]
        [ValidateNotNullOrEmpty()]
        [string[]]
        ${Property},

        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        ${Sum},

        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        ${Maximum},

        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        ${Minimum},

        # Add my two parameters
        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        $Mode,

        [Parameter(ParameterSetName='GenericMeasure')]
        [switch]
        $Median,
        # Parameters added

        [Parameter(ParameterSetName='TextMeasure')]
        [switch]
        ${Line},

        [Parameter(ParameterSetName='TextMeasure')]
        [switch]
        ${Word},

        [Parameter(ParameterSetName='TextMeasure')]
        [switch]
        ${Character},

        [Parameter(ParameterSetName='TextMeasure')]
        [switch]
        ${IgnoreWhiteSpace})

    begin
    {
        try {
            # Initialize my $data array
            $data = @()
            # $data array initialized

            $outBuffer = $null
            if ($PSBoundParameters.TryGetValue('OutBuffer', [ref]$outBuffer))
            {
                $PSBoundParameters['OutBuffer'] = 1
            }
            $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand('Measure-Object', [System.Management.Automation.CommandTypes]::Cmdlet)

            # Remove my parameters if they are used so that errors are not thrown when passed to the Measure-Object function
            if ($PSBoundParameters['Mode']) {
                $PSBoundParameters.Remove('Mode') |Out-Null            
            }

            if ($PSBoundparameters['Median']) {
                $PSBoundParameters.Remove('Median') |Out-Null            
            }
            #Parameters removed

            $scriptCmd = {& $wrappedCmd @PSBoundParameters }
            $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
            $steppablePipeline.Begin($PSCmdlet)
        } catch {
            throw
        }
    }

    process
    {
        try {
            # If one of my parameters is used, populate $data with the objects        
            if ($Median -or $Mode) {
                if ($Property) {
                    # The next line ensures that I'm populating the array with the values I should be measuring
                    # if the -Property parameter is used
                    $data += $_.($property)
                } else {
                    $data += $_
                }
            }
            # $data populated
            else {
                $steppablePipeline.Process($_)
            }
        } catch {
            throw
        }
    }

    end
    {
        try {
            # If my parameters are used, calculate and add the property to the return
            if ($Median -or $Mode) {
                # Grab all of the parameters except for InputObject
                $params = @{}
                foreach ($key in ($pscmdlet.MyInvocation.BoundParameters.Keys |?{($_ -ne 'inputobject') -and ($_ -ne 'Property')})) {
                    $params.($key) = $pscmdlet.MyInvocation.BoundParameters.($key)
                }
                # Call the original Measure-Object on the data so that I can add-Member my
                # properties to this later
                $return = $data |Microsoft.PowerShell.Utility\Measure-Object @params
                if ($Median) {
                    $data = $data |sort
                    if ($data.count%2) {
                        #odd
                        $medianvalue = $data[[math]::Floor($data.count/2)]
                    }
                    else {
                        #even
                        $MedianValue = ($data[$data.Count/2],$data[$data.count/2-1] |measure -Average).average
                    }    
                    $return |Add-Member Noteproperty -Name Median -Value $MedianValue
                }
                if ($Mode) {
                    $i=0
                    $modevalue = @()
                    foreach ($group in ($data |group |sort -Descending count)) {
                        if ($group.count -ge $i) {
                            $i = $group.count
                            $modevalue += $group.Group[0]  # the original value; $group.Name is always a string
                        }
                        else {
                            break
                        }
                    }
                    if ($modevalue.Count -gt 1) {
                        $return |Add-Member Noteproperty -Name Mode -Value $modevalue
                    } else {
                        $return |Add-Member Noteproperty -Name Mode -Value $modevalue[0]
                    }
                }
                $return
            }
            else {
                $steppablePipeline.End()
            }
        } catch {
            throw
        }
    }
    <#
    .ForwardHelpTargetName Measure-Object
    .ForwardHelpCategory Cmdlet
    #>

}

Next Steps

The only thing remaining is to consider whether or not I should even use $wrappedCmd at all. Part of me thinks it might be best to drop it completely and create a function that just processes InputObject so that I can build it into a collection to be used later. Part of me says this is not worth thinking about right now. The latter has won. Good night.

How many combinations of unique pairs can you create out of two data sets

I’m going to take a moment to stray from the world of PowerShell to bring us down to some mathematics. I had a lot of trouble finding an exact solution to my problem, so I thought I would share it.

Problem

I have two arrays:

$i=@('a','b','c')
$j=@('1','2','3')

How many unique pairs can I create from these two data sets? Furthermore, what if the count of items in set $i is greater than the count of items in set $j?

$i=@('a','b','c','d')

I’m not looking to create the algorithm to generate these pairs. I’m just looking to understand how many possible combinations there are, so that I know whether I need to explore a matching optimization technique like a genetic algorithm or whether I can brute-force the problem by trying every combination of pairs.

I should note that these are obviously hypothetical data sets for the purpose of bringing the problem down to its root. The real data sets are objects that when compared with each other produce an integer value that provides a relative score of how compatible the pair is. You can think of this as a match.com-like problem.  The real-world problem also has larger data sets of 25 and 20 objects.

Solution to the first part of the problem

The first problem seems fairly straightforward. The resulting pairs look like this:

a  b  c

a1 b2 c3
a1 b3 c2
a2 b3 c1
a2 b1 c3
a3 b1 c2
a3 b2 c1

There are six possible solutions. As you can see, I leave the $i array in a static order and then try every permutation of $j to generate the unique pairs. Rearranging $i as well would not matter, because it would not create any new sets of pairs. Therefore the answer to problem #1 is j!, where j is $j.count. If you’re rusty on your math symbols, that’s the factorial of j, which counts every possible permutation. For example, $j.count equals 3, so the total number of possible solutions is 3x2x1, or 6.

Solution to the second part of the problem

This is the tricky one. The answer involves finding all of the possible ways of choosing $j.count letters out of $i. For example, the grid from solution #1 is still valid:

a  b  c

a1 b2 c3
a1 b3 c2
a2 b3 c1
a2 b1 c3
a3 b1 c2
a3 b2 c1

However, we also know that we can do all of the above permutations of $j with:

a b c
a b d
a c d
b c d

That means I have the 6 possible permutations of $j to try with each of the four combinations of $i. I can see the 24 possible combinations, but how do I work that out mathematically?

So the question is how I can work out that $i has only 4 possible ways of being cut up if I am looking for sets of 3 (the value of $j.count). In other words, how many subsets of a specific size can I create out of any set? This, of course, has been figured out; people have been counting possible card combinations forever. It is a typical dealing problem: how many combinations of 5 cards can be dealt from a deck of 52 cards? The solution is the binomial coefficient.

To summarize:

n!/(k!(n-k)!)

where n is the total number of items in the set and k is the size of each grouping. This means that in my problem, where I had 4 letters in $i and I want to see how many ways I can choose 3 of them to match up with my 3 different numbers, I can get the answer with the following:

4!/(3!(4-3)!) = (4x3x2x1)/((3x2x1)x1) = 24/6 = 4

Now if I multiply this number by the total permutations of $j, I get the total number of combinations:

4x3! = 24

The final formula to determine all of the possible unique pairs can be reduced down as follows. It assumes that n > k.

(n!/(k!(n-k)!)) * k!
n!k!/(k!(n-k)!)
n!/(n-k)!

In the case where n -eq k, (n-k)! is 0! = 1, so the answer is just n!
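To sanity-check the formula n!/(n-k)! against the small examples, it can be computed directly in PowerShell. This is a sketch only: Get-PairingCount is a hypothetical helper name, and it assumes a PowerShell version that provides the [bigint] type accelerator so that large results stay exact.

```powershell
# Hypothetical helper: counts the ways to pair every item of a k-item set
# with a distinct item of an n-item set, i.e. n!/(n-k)!
function Get-PairingCount {
    param(
        [int] $n,   # size of the larger set
        [int] $k    # size of the smaller set
    )

    if ($k -lt 0 -or $k -gt $n) {
        throw "Expected 0 <= k <= n, got n=$n and k=$k"
    }

    # Multiply (n-k+1) * ... * n rather than dividing two huge factorials;
    # [bigint] keeps the result exact for large inputs
    $result = [bigint]1
    for ($x = $n - $k + 1; $x -le $n; $x++) {
        $result = $result * [bigint]$x
    }
    $result
}

Get-PairingCount -n 3 -k 3    # 6
Get-PairingCount -n 4 -k 3    # 24
Get-PairingCount -n 30 -k 29  # equals 30!
```

Multiplying only the top k factors gives the same value as the reduced formula and also covers the n -eq k case, where the loop runs from 1 to n.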

Summary

Fortunately, this gave me an equation for how many possible combinations I would need to test in my script. I happened to have 30 in one set and 29 in the other. That would mean 30!/(30-29)! = 30!, or 265,252,859,812,191,058,636,308,480,000,000, or 265 nonillion possible combinations. That’s definitely way too many to try all of them. Case closed. The genetic algorithm will greatly help me find the best possible combinations without having to try them all. We’ll save that discussion for another day.
