Named captures in regular expressions

Two short bits of script for parsing the output of netstat -n into PowerShell objects (psobject). One uses a regex with unnamed captures (and a little extra commenting), and the other uses named captures.


#parse Protocol, LocalAddress, LocalPort, ForeignAddress, ForeignPort, State 
# ex     TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED
#Matches  1           2       3             4          5          6


$regex = [regex]'(TCP|UDP)\s+([\d.]+):(\d+)\s+([\d.]+):(\d+)\s+(\w+)'
#Matches             1          2       3        4       5      6

$data |% {
    if ($_ -match $regex){
        new-object psobject -property @{
            Protocol = $matches[1]
            LocalAddress = $matches[2]
            LocalPort = $matches[3]
            ForeignAddress = $matches[4]
            ForeignPort = $matches[5]
            State = $matches[6]
        }
    }
}


#parse Protocol, LocalAddress, LocalPort, ForeignAddress, ForeignPort, State from data line

# ex    TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED

$regex = [regex]'(?<Protocol>TCP|UDP)\s+(?<LocalAddress>[\d.]+):(?<LocalPort>\d+)\s+(?<ForeignAddress>[\d.]+):(?<ForeignPort>\d+)\s+(?<State>\w+)'

$data |% {
    if ($_ -match $regex){
        new-object psobject -property @{
            Protocol = $matches['Protocol']
            LocalAddress = $matches['LocalAddress']
            LocalPort = $matches['LocalPort']
            ForeignAddress = $matches['ForeignAddress']
            ForeignPort = $matches['ForeignPort']
            State = $matches['State']
        }
    }
}
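
The snippets above assume that $data already holds the netstat output, one line per element. A minimal sketch for experimenting without a live netstat call follows. The sample lines and the $result variable are made up for illustration, and on a real system $data would presumably come from something like `$data = netstat -n`. The UDP row is included to show that a line without a numeric foreign address and a state is simply skipped by this pattern.

```powershell
# Sample lines standing in for real netstat output (addresses are invented).
$data = @(
    'Active Connections'
    '  Proto  Local Address          Foreign Address        State'
    '  TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED'
    '  TCP    10.106.11.32:54310     10.106.31.20:443       TIME_WAIT'
    '  UDP    0.0.0.0:123            *:*'
)

$regex = [regex]'(?<Protocol>TCP|UDP)\s+(?<LocalAddress>[\d.]+):(?<LocalPort>\d+)\s+(?<ForeignAddress>[\d.]+):(?<ForeignPort>\d+)\s+(?<State>\w+)'

# Collect one object per matching line; non-matching lines emit nothing.
$result = $data | ForEach-Object {
    if ($_ -match $regex) {
        New-Object psobject -Property @{
            Protocol       = $matches['Protocol']
            LocalAddress   = $matches['LocalAddress']
            LocalPort      = $matches['LocalPort']
            ForeignAddress = $matches['ForeignAddress']
            ForeignPort    = $matches['ForeignPort']
            State          = $matches['State']
        }
    }
}

# Pick the column order explicitly, since a plain hashtable does not guarantee it.
$result | Format-Table Protocol, LocalAddress, LocalPort, ForeignAddress, ForeignPort, State
```

Only the two TCP rows should come back as objects; the header lines and the UDP row do not satisfy the pattern.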


#UPDATE: Third form added, using the IgnorePatternWhitespace option and a here-string to handle the line breaks needed because of the added line length.

$regex = [regex]@'
(?x)                       # ignore pattern whitespace option
(?<Protocol>TCP|UDP)\s+    # first field as "Protocol"
(?<LocalAddress>[\d.]+):   # second field IP as "LocalAddress"
(?<LocalPort>\d+)\s+       # second field port as "LocalPort"
(?<ForeignAddress>[\d.]+): # third field IP as "ForeignAddress"
(?<ForeignPort>\d+)\s+     # third field port as "ForeignPort"
(?<State>\w+)              # State as "State"
'@

$data |% {
    if ($_ -match $regex){
        new-object psobject -property @{
            Protocol = $matches['Protocol']
            LocalAddress = $matches['LocalAddress']
            LocalPort = $matches['LocalPort']
            ForeignAddress = $matches['ForeignAddress']
            ForeignPort = $matches['ForeignPort']
            State = $matches['State']
        }
    }
}
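
The inline (?x) is one way to switch on IgnorePatternWhitespace. As an alternative sketch, the option can be passed to the Regex constructor, with the captures read from the Match object's Groups collection rather than from $matches. The variable names ($pattern, $options, $line, $m, $state, $localPort) and the sample line are illustrative assumptions, not part of the original scripts.

```powershell
# Same commented pattern as above, minus the inline (?x).
$pattern = @'
(?<Protocol>TCP|UDP)\s+    # first field as "Protocol"
(?<LocalAddress>[\d.]+):   # second field IP as "LocalAddress"
(?<LocalPort>\d+)\s+       # second field port as "LocalPort"
(?<ForeignAddress>[\d.]+): # third field IP as "ForeignAddress"
(?<ForeignPort>\d+)\s+     # third field port as "ForeignPort"
(?<State>\w+)              # State as "State"
'@

# Pass the option to the constructor instead of embedding it in the pattern.
$options = [System.Text.RegularExpressions.RegexOptions]::IgnorePatternWhitespace
$regex   = New-Object System.Text.RegularExpressions.Regex -ArgumentList $pattern, $options

$line = '  TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED'

# Use the Regex object's own Match method so the constructor option applies.
$m = $regex.Match($line)
if ($m.Success) {
    $state     = $m.Groups['State'].Value
    $localPort = $m.Groups['LocalPort'].Value
}
```

This sketch calls the Regex object's Match method directly, so the option set in the constructor is the one in effect. The inline (?x) form travels with the pattern text itself, which is what lets the scripts above use it with -match.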

Update: There are a few “corner cases” where I think named captures in regexes are beneficial, but generally I don’t like them because for me they tend to obfuscate the active code of the regex. As a coding convention for documentation purposes I think most of the time external comments like I’ve used here are more effective than using named groups inside the regex.


10 responses to “Named captures in regular expressions”

  1. I like what you did with the commenting for the regex in the first example as it makes it easier for someone to understand what is happening in the code. Plus compared to the second example with the named groups, there is a significant decrease in the amount of regex code being used and it does feel cleaner than using the named groups.

    Maybe I am just spoiled on using named captures when it comes time to display the info using the $matches['']. 🙂

    I think I will give your commenting method a shot in upcoming scripts so I can get a good feel for the commenting method and see which one I like best. The only time I see this maybe having issues is if your regex has to wrap into the next line due to its length (I hope that never becomes the case).

    Very good examples on differences!

    • Line wraps in general are a bad thing. Broken lines obfuscate code – the SG judges deduct for using line continuations, and having a line console-wrap is effectively the same thing when you’re reading it.

      That’s actually part of why I don’t like using the named captures as a convention. If you’re going to use a name, it should be meaningful. Frequently that means being more verbose. The more verbosity you add to the regex itself, the more likely it is that it will console-wrap. Part of what makes that second example ugly is that the regex itself line wraps because of the added length of the capture names.

  2. that commenting method is great. very easy to read.

    although for netstat i think you should use the win32 api 😉

  3. Thanks. Anyone who codes much Powershell gets used to working with array indexes, so it’s intuitive what it means. Mapping the index number to the example data string describes exactly what the regex is doing with the data. I couldn’t do that reliably with named captures. A lot of the time the capture name will be longer than what’s actually being captured from the data and then your map goes sideways.

  4. If by broken lines you mean a string that is broken up using a “`” with no logical reason behind it, then I would agree with you. But if a line break is done using a natural continuation such as a “|” or a “{“, then it actually can help the readability of the script/function in the ISE. The “`” can also help out as long as it is used properly with cmdlets that use several parameters.

    As long as the above is used for line breaks, then I really don’t see a deduction in points on scripts submitted. At least in my personal opinion.

    • Absolutely agree about the normal breaks that are allowed at ‘|’ and ‘}’. I think a command whose parameter list and arguments need to employ a line continuation probably should have been splatted instead.
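
A rough illustration of the splatting idea mentioned above. Select-String is only a stand-in command chosen for this sketch (it does not appear in the original post), and $data, $selectArgs and $established are hypothetical names. The parameters go into a hashtable, one per line, and the hashtable is passed with @ instead of $, so no backtick continuations are needed.

```powershell
# Sample lines standing in for netstat output (addresses are invented).
$data = @(
    '  TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED'
    '  TCP    10.106.11.32:54310     10.106.31.20:443       TIME_WAIT'
)

# Backtick-continued form, shown commented out for comparison:
# $data | Select-String -Pattern 'ESTABLISHED' `
#                       -SimpleMatch `
#                       -CaseSensitive

# Splatted form: each parameter gets its own line inside the hashtable.
$selectArgs = @{
    Pattern       = 'ESTABLISHED'
    SimpleMatch   = $true      # switch parameters are set with $true
    CaseSensitive = $true
}
$established = $data | Select-String @selectArgs
```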

  5. Thanks a lot. Very useful. Another advantage of using named captures is that they make the PSObject conversion much easier:
    $data | % { if ($_ -match $regex) { $Matches.Remove(0); new-object PSObject -property $Matches } }
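
A slightly expanded sketch of that one-liner on a single made-up line, to show what it relies on: after a successful -match with only named groups, $Matches holds the named captures plus the whole match under the integer key 0. Removing that key leaves a hashtable whose keys are exactly the property names. The $line and $obj names are assumptions for illustration.

```powershell
$regex = '(?<Protocol>TCP|UDP)\s+(?<LocalAddress>[\d.]+):(?<LocalPort>\d+)\s+(?<ForeignAddress>[\d.]+):(?<ForeignPort>\d+)\s+(?<State>\w+)'
$line  = '  TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED'

if ($line -match $regex) {
    # Drop the full-match entry (key 0) so only the named captures remain.
    $Matches.Remove(0)

    # The remaining keys become the property names of the new object.
    $obj = New-Object PSObject -Property $Matches
}

# A plain hashtable does not guarantee property order, so select explicitly.
$obj | Select-Object Protocol, LocalAddress, LocalPort, ForeignAddress, ForeignPort, State
```

The trade-off is that the property names are now fixed by the capture names in the regex, so renaming a group changes the shape of the output object.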

  6. Rob:
    Excellent article! I like the comment method, but I do tend to lean toward named captures. For me, Regex has zero readability to start with, so cluttering up the regex is not a big deal. I know that either way when I go back to review I’m going to have to research the Regex elements.

    The third option is nice, too! Not sure which one to use moving forward now, but luckily I don’t do Regex too often.

  7. Nice examples!

    One thing that I’ll point out that I notice just about every time I see a named capture example: you don’t have to use the named index for named captures like you do with the numbered index. You can treat the named capture just like any other property on an object.

    For example, suppose your named capture was “Name”. I see everyone accessing that capture with $Matches["Name"], but I find it a lot easier to type and to read to just access it using $Matches.Name.

    If I had to use named indexing to use named capture, then I’d probably only use it in special instances too. Being able to call it as an object property, though, makes it so much more accessible for me, so I mostly use that method.

    Just my two cents.
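
A short sketch of the two access styles side by side, on a made-up sample line. The variable names ($line, $byIndex, $byProperty) are illustrative only. $Matches is a hashtable, and PowerShell lets hashtable keys be read with property syntax.

```powershell
$line = '  TCP    10.106.11.32:54301     10.106.31.171:49909    ESTABLISHED'

if ($line -match '(?<Protocol>TCP|UDP)\s+(?<LocalAddress>[\d.]+):(?<LocalPort>\d+)') {
    # Indexer style, as used in the post.
    $byIndex = $Matches['LocalPort']

    # Property style, as suggested in the comment above.
    $byProperty = $Matches.LocalPort
}
```

Both variables should end up holding the same captured string.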
