Performance tuning Go
Unexpected lessons about the cost of I/O during a simple programming exercise on encoding/decoding.

Introduction
This post tells the short story of how I optimized the execution of a simple program, written for a coding challenge on the site runcode.ninja.
Short description of the task:
There is a text file which is given as an argument to your program. This text
file contains lines, each of which is an encoded English word. Recover them
and print them out to the standard output line by line. Hint: the UNIX
built-in dictionary may come in handy at "/usr/share/dict/american-english".
To attack the problem, I wrote a Go program which used the built-in encoding/base64 and os/exec packages to decode the lines and to call grep to search the file-based dictionary. It was not very difficult to figure out that the encoding in use was base64.
However, to make some lines valid base64, one or two = padding characters had to be appended to them. The code below takes care of adding these extra characters at the end of each line.
func decode(encodedStr string) string {
	decoded, err := base64.StdEncoding.DecodeString(encodedStr)
	// keep appending '=' until the string is valid base64
	// (at most two padding characters are ever needed)
	for err != nil {
		encodedStr += "="
		decoded, err = base64.StdEncoding.DecodeString(encodedStr)
	}
	return string(decoded)
}
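As a quick sanity check of the padding behavior, here is a minimal, self-contained sketch: "aGVsbG8" is "hello" encoded in base64 with its trailing = stripped, so DecodeString rejects it until decode restores the padding.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decode keeps appending '=' until the string becomes valid base64.
func decode(encodedStr string) string {
	decoded, err := base64.StdEncoding.DecodeString(encodedStr)
	for err != nil {
		encodedStr += "="
		decoded, err = base64.StdEncoding.DecodeString(encodedStr)
	}
	return string(decoded)
}

func main() {
	// "aGVsbG8" is "hello" with its '=' padding stripped;
	// decode repairs it and decodes successfully
	fmt.Println(decode("aGVsbG8")) // prints "hello"
}
```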
To test whether the result of a decode operation is a valid word, I wrote a helper function which takes a string as an argument and performs the call to grep via os/exec.
func dictLookup(word string) bool {
	dictLocation := "/usr/share/dict/american-english"
	// grep -w matches whole words; a non-zero exit code surfaces as err
	_, err := exec.Command("grep", "-w", word, dictLocation).Output()
	if err != nil {
		return false
	}
	return true
}
Finally, putting these pieces together, a function reads in the text file, iterates over the lines, and calls decode and dictLookup until a valid word comes out, then prints it to standard output. Below is the sample code.
scanner := bufio.NewScanner(file)
var line string
for scanner.Scan() {
	// some lines are encoded multiple times, so decode
	// repeatedly until a dictionary word appears
	line = decode(scanner.Text())
	for !(dictLookup(line)) {
		line = decode(line)
	}
	fmt.Println(line)
}
Initial results
The sample code worked well enough and running it on the test / sample data provided yielded correct output, so all seemed to be fine!
flrnks@t460:~/drop_the_bass (master) ▶ go run main.go input.txt
interpretation
sanctioned
lawn
electives
unifying
Then came the idea to test this code on both of my laptops, because it did not seem to run very quickly even though it only had to decode 5 lines. One of my machines is a ThinkPad T460 with an i5 and 16GB of RAM, while the other is a 15” MacBook Pro with an i9 CPU and 32GB of RAM. I initially developed the code on the ThinkPad and was quite surprised how much slower it executed on the MacBook. I would have expected the opposite, since the ThinkPad is already around 3-4 years old and has a less powerful CPU. Initial test results from both machines:
                 [MacBook]   [ThinkPad]
interpretation   285.76ms    32.61ms
lawn             425.63ms    59.31ms
unifying         1.10s       93.60ms
electives        1.20s       91.10ms
sanctioned       6.18s       141.28ms
Overall the MacBook took on average 9 seconds to finish, while the ThinkPad took around 0.5 to 1 second to finish. This was not normal, so I had to investigate! 👀 😄
Performance Tuning 1.0
Seeing the results and the difference in performance, I was quite interested in what could cause such a drop on the MacBook. My first idea was to introduce concurrency into the processing: instead of reading lines sequentially, lines are assigned to workers via channels and processed in parallel, with the workers sending the results back to the waiting main goroutine.

The above figure contains the basic idea of this concurrent processing model, and the snippet below shows the most important parts of the code:
// define the channels for distributing work and collecting the results
jobs := make(chan string)
results := make(chan string)

// use the WaitGroup for syncing up between the workers
wg := new(sync.WaitGroup)

// start up some workers that will block and wait
for w := 1; w <= 5; w++ {
	wg.Add(1)
	go workerFunc(jobs, results, wg)
}

// iterate over the file line by line and queue the lines up in the jobs channel
go func() {
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		jobs <- scanner.Text()
	}
	close(jobs)
}()

// in a parallel goroutine, wait for the WaitGroup to finish and close the results channel
go func() {
	wg.Wait()
	close(results)
}()

// print out the results from the results channel
for v := range results {
	fmt.Println(v)
}
This parallel processing noticeably improved the performance, but still did not eliminate the substantial difference between the two platforms.
Note: with the concurrent model the words appear on the standard output in a nondeterministic order, so the submission to the grading system might fail.
Performance Tuning 2.0
Next, I looked around on the internet (Stack Overflow in particular), where I got the idea to stop calling grep via the os/exec package and instead read the contents of the dictionary into memory, performing the lookups there. Essentially this trades memory footprint for speed. So I created a global map[string]bool dictionary, which is loaded once at the start of the program and used as often as needed by the various goroutines. This was perfectly fine because the worker goroutines only perform read operations on the map, so there is no issue with concurrent access to the global variable.
var wordDict = make(map[string]bool)

func loadDictionary() {
	dict, err := os.Open("/usr/share/dict/american-english")
	if err != nil {
		log.Fatal(err)
	}
	defer dict.Close()
	ds := bufio.NewScanner(dict)
	for ds.Scan() {
		wordDict[ds.Text()] = true
	}
}
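With the dictionary in memory, dictLookup collapses to a single map access (a minimal sketch: indexing a Go map with a missing key yields the zero value, which for bool is false, so no explicit presence check is needed):

```go
package main

import "fmt"

var wordDict = make(map[string]bool)

// dictLookup replaces the external grep call with an O(1) in-memory
// map access; a key that is absent simply yields false.
func dictLookup(word string) bool {
	return wordDict[word]
}

func main() {
	wordDict["lawn"] = true
	fmt.Println(dictLookup("lawn"), dictLookup("zzzz")) // prints "true false"
}
```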
This way the dictionary lookups can no longer be bottlenecked by the I/O system of the particular OS the program is running on. Executing the same timing test now yielded much improved results. It became clear that the issue on the MacBook was the slow execution of the external grep call from the Go program. Why that is, I am not sure, but the results speak for themselves:
                 [MacBook]   [ThinkPad]
interpretation   54.691µs    24.17µs
lawn             65.922µs    9.176µs
unifying         155.726µs   71.785µs
electives        113.074µs   47.478µs
sanctioned       286.94µs    464.20µs
Somehow the older and less powerful ThinkPad still seems considerably faster, but at least the difference is not so substantial anymore… 😌
Results
The picture below briefly summarizes the observed performance results, measured by execution time. To mitigate transient effects on execution time, 10 measurements were taken for each variant.

Explanation for the different variants (Seq vs. Con and Grep vs Map):
Seq: each line is decoded one after the other in sequence.
Con: each line is processed concurrently on a pool of workers.
Grep: dictionary lookup done via an exec call to grep.
Map: the dictionary is loaded into a string map in memory.
Quite frankly, the results speak for themselves. The most notable thing is that, compared to the most basic version (Seq-Grep), the biggest improvement is achieved not by using concurrency, but by eliminating the repeated calls to Grep.
This is not to say that enabling concurrency did not have an impact on the execution time: on average it decreased from 9 to 6 seconds, which is quite good already!
However, I/O latency seems to cost performance more than the lack of parallel processing does, at least at the input scale of this example. The difference was less pronounced when the tests were run on a file with 500 lines of encoded words (instead of just 5).
Conclusion
Never underestimate the power of I/O delay and the effect it can have on your program. Even on a very powerful machine, it can bog your performance down considerably! It may also further help your program’s performance to implement proper concurrent processing whenever possible.