Spelling corrector, in vanilla PowerShell

April 24, 2007


Peter Norvig, director of research at Google, posted a Python sample that tackles a statistical language processing problem: spelling correction.

Google corrects a search for "speling" in about 0.1 seconds. The 21-line Python sample is just as quick. For comparison, here is a Haskell implementation.

PowerShell Version

Version 1 of the PowerShell script runs just as quickly, although it has three times as many lines. The extra length comes from breaking the training and enumeration routines out into functions so I could better understand the process.

Slow Performance

The training text file contains about 5,500 lines and 100,000 words. Here is my first pass at reading the file, splitting the lines, capturing the individual words, and building a container of unique words.

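function train($text)
{
   # Join all lines into one string so the text can be split in a single pass
   $text = [string]::join(" ", $text)
   # Begin block creates the hashtable, process block adds each word ($_) as a key, end block emits it
   [regex]::split($text.ToLower(), '\W+') | ForEach {$h = @{}} {$h[$_] = ''} {$h}
}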

Unfortunately, this approach took over 8 seconds to process the file. The join, regex::split, and ToLower calls combined took under a second. Piping to ForEach consumed the rest of the time.

Here is the revised sub-second version.

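function train($text)
{
  # Join all lines into one string so the text can be split in a single pass
  $text = [string]::join(" ", $text)
  $h = @{}
  # Iterate with the foreach statement instead of piping to the ForEach cmdlet
  ForEach ($word in [regex]::split($text.ToLower(), '\W+') )
  {
    # Each unique word becomes a hashtable key; the value is unused
    $h[$word] = ''
  }
  $h
}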

The first approach used the ForEach-Object cmdlet (through its ForEach alias) and the second used the foreach loop statement. In PowerShell version 1.0, the loop is significantly faster. Thanks to Bruce Payette, a founding member of the PowerShell team and author of PowerShell in Action, for reviewing the code and making suggestions.
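To check the difference on your own machine, a rough timing sketch along these lines may help. It is not the script from this post: the function names Get-WordsViaCmdlet and Get-WordsViaLoop and the generated word list are made up for illustration, and the timings will vary with the PowerShell version and hardware.

```powershell
# Hypothetical stand-in for the training words: 100,000 strings, 5,000 distinct.
$words = foreach ($i in 1..100000) { "word$($i % 5000)" }

# Pipeline version: ForEach-Object with begin, process, and end script blocks.
function Get-WordsViaCmdlet($words)
{
    $words | ForEach-Object { $h = @{} } { $h[$_] = '' } { $h }
}

# Loop version: the foreach statement walks the array without a pipeline.
function Get-WordsViaLoop($words)
{
    $h = @{}
    foreach ($word in $words) { $h[$word] = '' }
    $h
}

# Measure-Command discards the output and returns a TimeSpan for each script block.
$cmdletTime = Measure-Command { Get-WordsViaCmdlet $words }
$loopTime   = Measure-Command { Get-WordsViaLoop $words }

"ForEach-Object cmdlet: {0:N0} ms" -f $cmdletTime.TotalMilliseconds
"foreach statement:     {0:N0} ms" -f $loopTime.TotalMilliseconds
```

Both functions build the same hashtable of unique words, so any gap between the two timings comes from how the words are iterated rather than from the work done per word.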

Future Posts

In a follow-up post, I will use another file containing about a million words and update the train function to handle it.

Download here


{ 1 comment }

Kalani 04.24.07 at 9:44 pm

Excellent.


Creative Commons License

© 2007-2012, Doug Finke