Reading and writing multiple records to a file with protobuf format using Go

Pandula Weerasooriya
6 min read · Aug 7, 2021

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data, developed by Google. They are language-neutral, platform-neutral, and extensible. Officially supported languages include Java, Python, Objective-C, and C++, with the newer proto3 language adding support for Dart, Go, Ruby, and C#.

This article consists of two parts. The first part is a guide on how to set up and work with proto3 using Go modules, as most tutorials on the internet are out of date now. The second part covers writing and reading multiple records to a file, which can be a tricky topic since protobuf is a binary data format, as opposed to textual formats like CSV, JSON, and XML. I assume that you have a beginner's knowledge of protocol buffers (https://developers.google.com/protocol-buffers/docs/overview) and a working knowledge of Go.

Setting up protocol buffers in Go

Installing protoc

As you may know, we write protobuf definitions in .proto files. Now, in order to generate the appropriate data classes and the source code, we need the protocol buffer compiler protoc.

You can navigate to this link and download the necessary zip file for your operating system. I'll be using WSL2 on Windows to run my code, so I have downloaded the protoc-3.17.3-linux-x86_64.zip file. (You can run uname -a on Linux to see your computer architecture.) Once downloaded, you can unzip it. The protoc binary is included in the bin folder; you'll have to copy it to the appropriate binary path for your OS, which on Linux would be /usr/local/bin. The archive also contains an include directory whose contents you can optionally copy to somewhere like /usr/local/include/; it provides some well-known types such as the Timestamp type. To test that everything works correctly, you can write the below snippet into a file called person.proto

protobuf definition for a simple person object
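The original snippet was embedded as an image; a minimal person.proto along these lines would work (the message fields here are illustrative):

```protobuf
syntax = "proto3";

package person;

// go_package defines the Go import path for the generated code.
option go_package = "./person";

message Person {
  string name  = 1;
  int32  age   = 2;
  string email = 3;
}
```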

and run the command,
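The exact command from the article was embedded as an image; based on the surrounding description, it would look something like:

```
protoc --go_out=. person.proto
```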

The --go_out option tells the compiler to output Go code and to use the current working directory as the root directory of the project. The go_package option defines the import path of the package which will contain all the generated code for this file; this works similarly to the Go packaging structure. Once you run this command, you'd see a folder called person created according to the go_package option we provided, and within that folder a file called person.pb.go which has the correct package definition of person and contains the code generated by the protoc compiler. Once you have successfully finished this step, you can remove the generated code and the .proto file.

So, now that we have correctly set up protoc, the next step is to set up our project structure. I'll be using Go modules in this project, as that is the recommended approach to structuring Go code now. You can cd into your project directory and type the command go mod init <yourGitRepoLink>, which will create a new go.mod file.

Now, we need a Go package that can work with the source code generated by protoc. You can run go get google.golang.org/protobuf to add the module (https://pkg.go.dev/google.golang.org/protobuf), and go install google.golang.org/protobuf/cmd/protoc-gen-go@latest to install the code-generator plugin that protoc invokes. The module contains a set of Go packages that form the runtime implementation of protobufs in Go. It provides the set of interfaces that define what a message is, as well as functionality to serialize messages in various formats (e.g., wire, JSON, and text).

Project Implementation

The project plan is to convert a small JSON file containing a list of transaction records fetched from a MongoDB database into a binary file using protobuf serialization. A snapshot of the data format is shown in the below image and the relevant data file can be downloaded from this link. The link to the project repository can be found here.

Document format

As our first step, we can write the transaction.proto file to convert the above JSON format to protobuf.
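The original proto definition was embedded as an image; a sketch of what it might look like is below. The field names are hypothetical and should mirror the actual JSON document format shown above:

```protobuf
syntax = "proto3";

package transaction;

option go_package = "./transaction";

// Field names are illustrative, not the article's actual schema.
message Transaction {
  string id         = 1;
  string account_id = 2;
  string symbol     = 3;
  double amount     = 4;
  int64  date       = 5; // nanoseconds since the epoch
}
```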

I have created a folder called protoFiles in the root directory and added the transaction.proto file. Once you have done that, you can run the command
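The command itself was embedded as an image; assuming the layout just described, it would be something like:

```
protoc --go_out=. protoFiles/transaction.proto
```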

to generate the code and the transaction package folder.

The second step is to write the code to read and unmarshal the JSON data into Go structs, as shown below.

Reading and unmarshalling the JSON lines file

We would also require some functions to convert the created JSON struct types to the appropriate proto struct types. These struct definitions are exported from the generated transaction.pb.go file.

Notice how we convert the time.Time objects to the equivalent nanoseconds since the epoch, since protocol buffers don't have a native scalar timestamp type (the well-known Timestamp type from the include directory is an alternative).

Now that we can read JSON lines from the file, unmarshal to JSON struct types and convert to the appropriate proto struct type, it’s time to write these serialized proto objects to a new file.

The problem we have here is that, since protobuf files are binary, it is difficult to define a boundary for each transaction. If it were only a single transaction, we could marshal the object, write it to a file, and then read it back and unmarshal it without an issue. For a textual file format, it's easy to define a boundary: CSV files use line endings, JSON files can use commas/brackets, and XML files can use closing tags.

Our strategy here is to first get the size of the transaction object (its number of bytes), encode that size into the output file using a fixed-size length prefix (typically 4 bytes), and then encode the transaction object into the file. We then repeat this process for each of the transactions that we have. Below is a snippet which does just that.

encodedData is a marshalled proto object, which is just a []byte slice

Now for the reading part: we first read the fixed-size length prefix (4 bytes) and extract the size, in other words the number of bytes we need to read to get the next transaction. We can then read that many bytes and unmarshal the transaction. We then advance the offset past those bytes and repeat the step until the end of the file is reached.

The full code for these two steps is given below.

We have finished the program, and now would be a good time to test it out! Below is the main.go file of the project, which reads the JSON file, writes the proto binary file, and reads it back.

One interesting thing to note here is that the written binary file is 1.7 MB as opposed to the 17 MB JSON file: a tenfold reduction in file size. However, it's worth noting that the protobuf format is not generally considered a persistent storage format (a columnar format such as Parquet is a better choice). Its main use case lies in establishing efficient communication between services/processes.

Protocol buffers are becoming more and more significant with the rise of microservice architectures and frameworks such as gRPC, as they can serve as a very efficient inter-process communication mechanism.

Hope this article will be useful in your journey of exploring protocol buffers!


Pandula Weerasooriya

A fullstack engineer who's passionate about building data-intensive products and distributed systems. My stack includes Golang, Rust, React, NodeJS and Python.