Using hashtables – Why?

I was encouraged to write something on the subject of using hash tables, so today I will begin a few articles that will hopefully help someone else who’s struggled with them, and hopefully will come to appreciate their utility as I have.

I’m going to start with talking about why you should want to use hash tables, particularly since Powershell frequently provides cmdlets that make it very easy to avoid them. The short answer is efficiency. A solution using hash table accumulators will nearly always out-perform a solution using Powershells built in cmdlets like group-object and measure-object, but the savings aren’t always apparent.

For a demonstation, I’m going to count how many there are of each file extension type in the windows directory structure. Here’s two small scripts, one using a hash table, and one using Powershell’s group-object cmdlet that will both accomplish that, with timers to measure how long it takes:

First, the hash table:

$t = measure-command{
$ht = @{}
gci c:\windows -recurse |? {!$_.psiscontainer} |%{
$ht[$_.extension] ++
“hashtable time is $($t.totalseconds) seconds”

And then group-object:

$t = measure-command {
$a = gci c:\windows -recurse |? {!$_.psiscontainer} | select extension | group extension | select name,count
“group-object time is $($t.totalseconds) seconds.”

Running both of them in a new ps session (if you run these on your own system, do not trust the results of the first test, disk caching will skew the results of the subsequent tests compared to the first one. Run one set to pre-load the cache, then run a full set for comparison).

hashtable time is 26.8952759 seconds
group-object time is 32.1055125 seconds.

So, we see that the hash table method is faster, but not substantially so and possibly not worth the time to do the extra coding. But time-to-execute is not all there is to efficiency. Besided the CPU cycles required, which we’ve roughly compared using time-to-execute, there is also memory utilization to consider. Let’s see if we can measure that. Here’s a couple of more test scripts, doing the same hashtable and group-object accumulations of the file types in the windows directory structure. These calculate the memory usage by checking the working set size of the powershell sessions process before and after running each one:

$ht_m1 = (get-process -id $pid).ws
$ht = @{}
gci c:\windows -recurse |? {!$_.psiscontainer} |%{
$ht[$_.extension] ++
$ht_m2 = (get-process -id $pid).ws
“hashtable memory usage is $(($ht_m2 -$ht_m1) /1mb) MB”

$go_m1 = (get-process -id $pid).ws
$a = gci c:\windows -recurse |? {!$_.psiscontainer} | select extension | group extension | select name,count
$go_m2 = (get-process -id $pid).ws
“get-object memory usage is $(($go_m2 – $go_m1) /1mb) MB”

And running each of these in a fresh PS session yields:

hashtable memory usage is 27.046875 MB
group-object memory usage is 156.27734375 MB

That’s for 1401 files found, and running on a system with plenty of free physical memory. If you try to scale that to several or tens of thousands of files you’ll soon find that group-object will quickly use up all of your available memory, and your script will be taking a long time to finish as it runs out of available memory and begins paging to the swap file to find enough room to work, where the hashtable solution will finish much quicker, being intrinsically faster to start with, and requiring much less memory to work in.

For small batches of work, cmdlets like group-object and measure object work fine. If you need to scale it to large numbers of objects, you’ll find that using hash tables as accumulators enables you to work at scales that are simply impractical using those cmdlets.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s