Peter Norvig, director of research at Google, posted a Python sample demonstrating a statistical language processing problem, spelling correction.
Google corrects a search for "speling" in about 0.1 seconds. The 21-line Python sample is just as quick. For comparison, here is a Haskell implementation.
PowerShell Version
Version 1 of the PowerShell script performs just as quickly, although it is three times as long. That is because I broke the training and enumeration routines out into functions so I could better understand the process.
Slow Performance
The training text file contains about 5,500 lines and 100,000 words. Here is my first pass at processing it: join the lines, split them into individual words and collect the unique words in a hashtable.
function train($text)
{
    $text = [string]::join(" ", $text)
    # The three script blocks are begin, process and end; $_ is the current word in the pipeline.
    [regex]::split($text.ToLower(), '\W+') | ForEach {$h = @{}} {$h[$_] = ''} {$h}
}
Unfortunately, this approach took over 8 seconds to process the file. The join, regex::split and ToLower calls combined took less than a second. Piping to ForEach consumed nearly all of the time.
Here is the revised sub-second version.
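One way to see where the time goes is to wrap each stage in Measure-Command. What follows is only a minimal sketch, not the measurements from this post: the $lines array is synthetic data standing in for the training file, and the variable names are made up for illustration. Absolute timings will vary by machine and PowerShell version.

```powershell
# Synthetic stand-in for the training file (assumption: not the real data).
$lines = 1..2000 | ForEach-Object { "the quick brown fox number $_ jumps over the lazy dog" }

# Stage 1: join, lower-case and split into words.
$text = [string]::Join(' ', $lines)
$splitTime = Measure-Command { [regex]::Split($text.ToLower(), '\W+') }

# Stage 2: pipe the words into ForEach-Object to build the hashtable.
$words = [regex]::Split($text.ToLower(), '\W+')
$pipeTime = Measure-Command {
    $words | ForEach-Object { $h = @{} } { $h[$_] = '' } { $h }
}

'split: {0:N0} ms' -f $splitTime.TotalMilliseconds
'pipe:  {0:N0} ms' -f $pipeTime.TotalMilliseconds
```

Measure-Command returns a TimeSpan and discards the output of the script block, so it only reports how long the block took.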
function train($text)
{
    $text = [string]::join(" ", $text)
    $h = @{}
    ForEach ($word in [regex]::split($text.ToLower(), '\W+'))
    {
        $h[$word] = ''
    }
    $h
}
The first approach pipes the words to the ForEach cmdlet (an alias for ForEach-Object); the second uses the ForEach loop statement. In version 1.0, the loop is significantly faster. Thanks to Bruce Payette, a founding member of the PowerShell team and author of PowerShell In Action, for reviewing the code and making suggestions.
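To check that the two forms really build the same container, here is a small side-by-side sketch on a made-up sentence. The names $viaCmdlet and $viaLoop are illustrative and do not appear in the script. The cmdlet invokes its process script block once per pipeline item, while the loop statement iterates over the array in place, which is a plausible source of the overhead, though I have not profiled it.

```powershell
# Made-up sample text (assumption: any word list would do).
$words = [regex]::Split('The quick brown fox jumps over the lazy dog the end'.ToLower(), '\W+')

# Form 1: ForEach-Object cmdlet with begin, process and end script blocks.
$viaCmdlet = $words | ForEach-Object { $h = @{} } { $h[$_] = '' } { $h }

# Form 2: foreach loop statement.
$viaLoop = @{}
foreach ($word in $words) { $viaLoop[$word] = '' }

# Both hashtables hold one key per unique word.
$viaCmdlet.Count -eq $viaLoop.Count

# Looking up a candidate word is a key test.
$viaLoop.ContainsKey('fox')
$viaLoop.ContainsKey('speling')
```

Storing an empty string as the value means the hashtable is being used purely as a set of known words; only the keys matter.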
Future Posts
In a follow-up post, I will use another file containing about a million words and update the train function to handle it.
Download here


