Golang memory streamlining for high throughput APIs
Recently at work, I've been doing some greenfield work creating an endpoint in golang that can serve a large static dataset. Because of the large amount of data throughput, I kept running into OOM errors. I was pretty surprised by this considering all the praise that golang's GC gets, but after tinkering with the program I was able to rein in the memory usage.
Diagnosing
Initially, I found it quite hard to figure out what was happening. My code loads approximately 15GiB of data into memory, so I expected to see a large increase while that was loading, after which usage would remain relatively constant while running a benchmark. What I actually saw was that it would sometimes OOM while loading the dataset, and while running, memory usage would gradually climb until it crashed with an OOM.
Typically, it's recommended to use pprof to determine what's happening at runtime. I expected the memory profiling endpoints to be particularly helpful here, but they just showed the 15GiB chunks I was expecting and nothing else. The biggest help in diagnosing this problem was the CPU profile endpoint. In particular, I saw a 32-bit hash node combined with a malloc call taking up a large share of CPU time. From this, I gathered that a big part of the issue was the hashmaps I was using to extract the data and serialise it into JSON. That accounted for the behaviour I was seeing while running benchmarks, but I was still encountering issues while loading the data.
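For reference, wiring up those endpoints is just a blank import of net/http/pprof; a minimal sketch (the side port is arbitrary):

package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// serve the profiling endpoints on a side port while the
	// real API runs elsewhere
	go http.ListenAndServe("localhost:6060", nil)
	// ... rest of the server ...
}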
After reading up on how golang grows its slices, I discovered the loading problem was down to the fact I wasn't preallocating the slice capacity. As a slice gets appended to, golang has to keep allocating larger backing arrays and moving the expanding data into them. Because it would be extremely inefficient to grow the backing array by one element on each append, it instead grows exponentially. As a result, if the program got unlucky with the growth pattern, it could end up trying to allocate 32GiB of memory on my 24GiB system.
As a note, golang lets you allocate a slice with a given 'capacity' without actually exposing any zeroed elements through its length.
package main

import "log"

func main() {
	foo := make([]byte, 10)
	bar := make([]byte, 0, 10)

	log.Println("foo size", len(foo)) // 10
	log.Println("foo cap", cap(foo))  // 10
	log.Println("bar size", len(bar)) // 0
	log.Println("bar cap", cap(bar))  // 10

	bar = append(bar, 2) // this doesn't require a malloc call
	foo = append(foo, 2) // but this does
}
sync.Pool
One way I succeeded in reducing the memory used allocating hashmaps was to allocate less! sync.Pool gives you a useful way of reusing already-allocated structures. By initialising it with a constructor for the structure, it provides two thread-safe methods to use: func (p *Pool) Get() any and func (p *Pool) Put(x any).
package main

import "sync"

func main() {
	p := &sync.Pool{
		New: func() any {
			return make(map[string]string)
		},
	}

	// retrieve an item from the pool. sync.Pool will call New if none exists
	data := p.Get().(map[string]string)
	data["foo"] = "bar"

	// some json encoding/http response
	// ...

	// put it back when you're done with it
	p.Put(data)
}
You might notice one possible issue with this code, namely data sanitisation. If we're reusing maps and we don't overwrite every key, we might serve previously assigned key-value pairs. By iterating through each key and delete()ing it, we solve this issue. This has the added benefit of removing extra mallocs: the cleared map keeps its already-allocated buckets, so inserts into the reused map don't need to allocate again. Below I show an example wrapping the sync.Pool which returns a close function that should be called after finishing with the map.
package data

import "sync"

type Pool struct {
	pl *sync.Pool
}

func NewPool() *Pool {
	return &Pool{
		pl: &sync.Pool{
			New: func() any {
				return make(map[string]string)
			},
		},
	}
}

func (p *Pool) Get() (data map[string]string, close func()) {
	data = p.pl.Get().(map[string]string)
	close = func() {
		// clear any previously assigned keys before reuse
		for k := range data {
			delete(data, k)
		}
		p.pl.Put(data)
	}
	return
}
Note that with this method, we don't have to bother providing a Put method. Instead, we can just call defer close() after every call to our Get method.
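Usage then looks something like this (a hypothetical handler shape, in the same package):

func handle(pool *Pool) {
	m, close := pool.Get()
	defer close() // clears the map and returns it to the pool

	m["foo"] = "bar"
	// ... json encoding / http response ...
}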
Preventative methods with a soft memory limit
One of the features of pprof lets you see what each CPU is running at any given time. It normally identifies code by goroutine ID, or, more importantly for me in this case, shows when a GC cycle is running. I noticed that golang does not care if you're about to run out of memory: it will happily keep running with a load of reserved memory until it asks for more and then gets killed with an OOM. To resolve this, you can set GOMEMLIMIT to a target soft memory limit, and the GC will scale up the number of cycles it runs as usage gets closer to that limit. Setting this to 95% of the system's memory totally eliminated the OOM crashes. For me, this will become a necessary setting for any critical program.
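The limit can be set via the GOMEMLIMIT environment variable (e.g. GOMEMLIMIT=22GiB) or from code; a minimal sketch, with the value being illustrative for my setup:

package main

import "runtime/debug"

func main() {
	// soft limit of roughly 95% of a 24GiB machine; equivalent
	// to running with GOMEMLIMIT=22GiB in the environment
	debug.SetMemoryLimit(22 << 30) // bytes
	// ... rest of the program ...
}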
Quick wins
Don't use encoding/json
One easy optimisation was to switch out the standard library encoder for goccy/go-json. It's fully API compatible, so it's as easy as changing the import at the top of the source file. It's more efficient because it reuses byte buffers and avoids reflection by looking directly at the type pointer in the underlying golang implementation of interfaces/types.
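The swap looks like this; every json.* call site stays untouched:

// before:
// import "encoding/json"

// after: same json.* API, different implementation
import json "github.com/goccy/go-json"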
Encoding the json body buffer directly
json.Marshal is easy to use, but one downside is that the entire response is held in memory in a []byte variable. To avoid this, and instead write the encoding straight to the HTTP response, a json.Encoder can be used. Passing the http writer into the constructor lets you encode the response directly to the socket without having to store the full encoded bytes in memory.
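A minimal sketch of that pattern; the map here is a stand-in for whatever the endpoint actually serves:

package main

import (
	"log"
	"net/http"

	json "github.com/goccy/go-json" // or encoding/json
)

func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	data := map[string]string{"foo": "bar"} // stand-in payload
	// Encode streams straight to the socket; the full body is
	// never buffered in a []byte
	if err := json.NewEncoder(w).Encode(data); err != nil {
		log.Println("encode:", err)
	}
}

func main() {
	http.HandleFunc("/data", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}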
TODO
golang hashmaps have a lot of overhead. By its very nature, a hashmap requires a lot of hashing, memory allocation, and CPU cycles. In my case, where the map is basically only used for data serialisation, that's a lot of wasted work. One alternative that I couldn't get working was a struct with two parallel slices: one for headers and one for values. Lookup is obviously worse, but since we're iterating through every element anyway, that doesn't matter for our use case.
// This also has the added benefit of not having to reallocate the header map.
// Just pass the same array through each time.
type VecMap struct {
	headers []string
	values  []string
}
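For illustration, a hypothetical Set on this type could look like the below; lookup is a linear scan, which is fine when the access pattern is a full iteration anyway:

// Set linearly probes headers. When the same header slice is reused
// across rows, the common case is an in-place overwrite of values,
// so no allocation happens at all.
func (m *VecMap) Set(key, value string) {
	for i, h := range m.headers {
		if h == key {
			m.values[i] = value
			return
		}
	}
	m.headers = append(m.headers, key)
	m.values = append(m.values, value)
}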
For the life of me, I couldn't get the JSON serialisation to work faster. Using the -benchmem option in testing, I always had way better stats for allocating the data in the initial step, but with serialisation I was always behind. Looking at the goccy/go-json code didn't help much either: to get that boost in performance it works with the underlying implementation of golang's data types, so reading through it was very opaque. By this point I was getting good performance anyway, so this is on the backburner until something calls for it.
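For reference, a minimal sketch of the kind of benchmark this comparison relies on; run with go test -bench=. -benchmem, where allocs/op is the relevant column:

package data

import "testing"

func BenchmarkVecMapFill(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := VecMap{
			headers: make([]string, 0, 4),
			values:  make([]string, 0, 4),
		}
		m.Set("foo", "bar")
		m.Set("baz", "qux")
	}
}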