2013 Scripting Games event 5 notes

I haven’t blogged for a while, but I find myself with a little extra time. My test bed became a victim of the weather, so I can’t properly test an entry for the last event. I’ve decided to invest that time in writing this blog entry instead.

This post is related to my Event 5 entry, but it applies to any situation where you’re going to be reading and extracting data from possibly large text files, and it challenges some “conventional wisdom” about the most efficient way to read data using Get-Content.

The “conventional wisdom” is that Get-Content -ReadCount 0 is the fastest, most efficient way to read data using Get-Content. One of the celebrity judges made this observation about one of the entries:

“It also uses Get-Content with the parameter -ReadCount 0, which causes the cmdlet to return a string array in one chunk. This increases performance by approximately 6000%, and scanning large log files will take seconds, not minutes.”

You might be able to get that kind of performance increase with the right kind of file (lots and lots of very small records), but real-world testing on IIS logs produced different results.
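
Before getting to the numbers, here’s a quick way to see what -ReadCount actually changes about Get-Content’s output. This isn’t part of the timed tests, and the path is just a placeholder:

$path = 'C:\TestLogs\u_ex130422.log'   # any text file works for this check

# Default ReadCount (1): one pipeline object per line
(Get-Content $path | Measure-Object).Count

# -ReadCount 0: a single pipeline object that is an array of every line
Get-Content $path -ReadCount 0 | ForEach-Object { $_.Count }

# -ReadCount 1200: one pipeline object per chunk of up to 1200 lines
(Get-Content $path -ReadCount 1200 | Measure-Object).Count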

This is my test setup. I’m using a single log file that’s about 250 MB and contains about 500K records.


&{
$path = 'C:\TestLogs\u_ex130422.log'
'Readcount 1'
gps -id $pid   # process stats (memory) before the read
$t = (Measure-Command {gc $path -r 1}).TotalMilliseconds
gps -id $pid   # process stats (memory) after the read
"Elapsed time $t ms `n"}

&{
$path = 'C:\TestLogs\u_ex130422.log'
'Readcount 0'
gps -id $pid
$t = (Measure-Command {gc $path -r 0}).TotalMilliseconds
gps -id $pid
"Elapsed time $t ms `n"}

&{
$path = 'C:\TestLogs\u_ex130422.log'
'Readcount 1200'
gps -id $pid
$t = (Measure-Command {gc $path -r 1200 }).TotalMilliseconds
gps -id $pid
"Elapsed time $t ms `n"}

Each of those script blocks was then run in a separate, fresh PS instance. Here are the results:

Readcount 1

Handles NPM(K)  PM(K)  WS(K) VM(M) CPU(s)   Id ProcessName
------- ------  -----  ----- ----- ------   -- -----------
    379     33  98760 104372   633   1.70 8096 powershell
    377     33  94784 102180   632  12.11 8096 powershell
Elapsed time 10509.9201 ms

Readcount 0

Handles NPM(K)  PM(K)  WS(K) VM(M) CPU(s)   Id ProcessName
------- ------  -----  ----- ----- ------   -- -----------
    398     34  99540 104728   634   1.65 6516 powershell
    373     35 672492 667200  1166   5.60 6516 powershell
Elapsed time 3950.5904 ms

Readcount 1200

Handles NPM(K)  PM(K)  WS(K) VM(M) CPU(s)   Id ProcessName
------- ------  -----  ----- ----- ------   -- -----------
    392     31  97012 104164   628   1.67 7972 powershell
    352     31 107696 115308   628   4.46 7972 powershell
Elapsed time 2775.5476 ms

Using ReadCount 0 is certainly faster than ReadCount 1, but not by anywhere near 6000%. It also has a dramatic effect on the memory footprint of the process.

The last option -ReadCount 1200 shows what happens when you tune ReadCount to match the type of file being processed.

You avoid the massive memory footprint required to read large files with -ReadCount 0, and you reduce the number of objects Get-Content returns (and that you have to iterate through) by a factor of whatever that ReadCount setting is.

The end result is that with the right ReadCount setting, you can actually read the file data faster than with ReadCount 0, while maintaining virtually the same (minimal) memory footprint as ReadCount 1.
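
To make that concrete, here’s a minimal sketch of reading a log in chunks and pulling an IP address out of each line. This isn’t my Event 5 entry; the path, the ReadCount value, and the regex are just placeholders:

# Process the log as a stream of chunks: each object ForEach-Object receives
# is a string[] of up to 1200 lines, not a single line.
$path = 'C:\TestLogs\u_ex130422.log'
$ipPattern = '\b(?:\d{1,3}\.){3}\d{1,3}\b'   # crude IPv4 match, placeholder

$ips = Get-Content $path -ReadCount 1200 | ForEach-Object {
    foreach ($line in $_) {
        if ($line -match $ipPattern) { $Matches[0] }   # emit the first IP-looking token
    }
}

# Example follow-up: the ten most frequent addresses
$ips | Group-Object | Sort-Object Count -Descending | Select-Object Name, Count -First 10

The important part is the inner foreach over $_: treat each pipeline object as a batch and keep the per-line work as cheap as possible.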

My entry for Advanced Event 5 uses Get-Content with this ReadCount setting, and it’s able to extract the IP addresses from the log files at an average rate of a little over 25,000 records per second on my laptop. It’s #1074 if you’d like to try it out. I appreciate any comments you’d care to leave, but to save you some typing, I’ve already gotten a proper dressing down about naming that filter :).


2 responses to “2013 Scripting Games event 5 notes”

  1. Hi, starting from this post I tried to determine the best ReadCount value. I took a 14 MB IIS 6.0 log file (70K lines) and had the following script chew through it:

    $path = 'C:\ex130129.log'
    $container = @()
    0..5000 | % {
        "Readcount $PSItem"
        $t = (Measure-Command {gc $path -r $PSItem}).TotalMilliseconds
        $HT = [pscustomobject]@{
            'readcount' = $PSItem
            'duration'  = $t
        }
        $container += $HT
    }
    "Ten worst"
    $container | sort -Descending duration | select -First 10 | ft
    "Ten Best"
    $container | sort duration | select -First 10 | ft

    Here are the results I got:

    Ten worst
    readcount   duration
            1  1192,8004
            2   543,1072
            3   436,2255
         4552    386,453
         3749   354,5503
         2926   350,6685
         1979     346,13
          807   342,4731
            4   339,5118
         1559   316,0544

    Ten best
    readcount   duration
          740   105,1161
         1275   105,3452
          711   105,8606
          737   106,2841
          758   106,2906
          773   106,3076
          747   106,3411
          676    106,597
          681   106,6071
         3037   106,7535

    It looks to me like some of the results are random, probably due to disk head position. Using "fsutil fsinfo ntfsinfo c:" I get standard NTFS default values:

    Bytes Per Sector : 512
    Bytes Per Cluster : 4096
    Bytes Per FileRecord Segment : 1024

    Now I wonder if there’s any relation between cluster size and the ReadCount value to use… any idea? Why did you choose 1200 as the ReadCount value? What do you mean by “ReadCount to match the type of file being processed”?

    Interesting subject!
    Carlo

  2. Hi Carlo!
    I’ve done pretty much those same tests, and was also frustrated by seemingly random results. I don’t think the problem is with the disk itself, but with the fact that you’re having to share it with other processes that your test can’t control. Any other process that uses the disk while a test is running is going to add time to that test and skew its results.

    I came to the conclusion that there’s really not much you can do about that in practical terms, and changed my testing criteria to just looking at the memory utilization and page faults a given ReadCount value produced.

    As far as the comment about setting the readcount to match the type of file being processed, that’s meant as a general rule for using this method for parsing large files outside the context of just IIS log files.

    The objective was to make the best possible use of a limited memory space, keeping as much of the CPU time as possible dedicated to producing results. Time spent on memory management is time not being spent on getting those IP addresses.

    Once you’ve made the adjustment to the idea that you’re processing a stream of arrays and gotten the code right to do that efficiently, then it’s just a matter of figuring out how big you can make those arrays without making the system do extra work getting them to fit into memory.

    The content of log files can vary widely. Some will have relatively small records that may only record an event type and a timestamp. Others may have much larger records that record a lot of diagnostic information in each record. The number of records it takes to hit that “sweet spot” where it’s taking in as much data as possible on each read without needing to involve memory management to make it fit will be much different for each one.
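
    Roughly, the kind of check I mean looks something like this. It’s only a sketch; the path and the candidate ReadCount values are placeholders, and each value really wants a fresh PowerShell instance so earlier reads don’t skew the later numbers:

    # Compare elapsed time, working set growth, and page faults for a few
    # candidate ReadCount values. Win32_Process exposes a cumulative PageFaults count.
    $path = 'C:\ex130129.log'
    foreach ($rc in 200, 700, 1200, 5000) {
        $before = Get-WmiObject Win32_Process -Filter "ProcessId=$pid"
        $ms = (Measure-Command {gc $path -r $rc | Out-Null}).TotalMilliseconds
        $after = Get-WmiObject Win32_Process -Filter "ProcessId=$pid"
        [pscustomobject]@{
            ReadCount  = $rc
            Millisecs  = [math]::Round($ms, 1)
            WSGrowthKB = [int](($after.WorkingSetSize - $before.WorkingSetSize) / 1KB)
            PageFaults = $after.PageFaults - $before.PageFaults
        }
    }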
