Serialization Security Bugs Explained

If you’re in information security, you’ve probably heard a lot about serialization bugs. They’re becoming increasingly common, and I want to give a basic overview of how they work and why they’re an issue.

The parsing problem

So much of security comes down to parsing. It’s the primary reason we need input validation, and the reason that software like antivirus and network protocol analyzers can have so many security issues.

The job of a parser is to take input from somewhere else and run it through your own software. That should frighten you. It’s like a CDC employee using the ‘open and lick’ method to test petri dish samples.

Bottom line: If you’re going to parse something, you have to get intimate with it.

And that brings us to serialization.

Serialization

Serialization is the process of capturing a data structure or an object’s state into a (serial) format that can be efficiently stored or transmitted for later consumption.

So you can take an object, capture its state, and then put it in memory, write it to disk, or send it over the network. Then at some point the object can be retrieved and consumed, restoring the object’s state.

Example

A basic example of serialization is to take the following PHP array:

$array = array("a" => 1, "b" => 2, "c" => array("a" => 1, "b" => 2));

And serialize it (PHP’s serialize() does this) into this:

a:3:{s:1:"a";i:1;s:1:"b";i:2;s:1:"c";a:2:{s:1:"a";i:1;s:1:"b";i:2;}}

At its core, serialization is a type of encoding.
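
To make the encoding concrete, here’s a minimal sketch of an encoder that reproduces the string above. It’s written in Python rather than PHP, the name php_serialize is made up for this post, and it only handles integers, strings, and dictionaries, so treat it as an illustration of the format and not as PHP’s actual implementation.

```python
def php_serialize(value):
    """Encode ints, strings, and dicts in the PHP-style format shown above.

    Hypothetical helper for illustration only; it covers a small subset
    of the real format.
    """
    if isinstance(value, bool):
        raise TypeError("booleans are left out of this sketch")
    if isinstance(value, int):
        return f"i:{value};"
    if isinstance(value, str):
        # The length prefix counts bytes, not characters.
        byte_length = len(value.encode("utf-8"))
        return f's:{byte_length}:"{value}";'
    if isinstance(value, dict):
        # An array is its element count followed by key/value pairs.
        body = "".join(php_serialize(k) + php_serialize(v) for k, v in value.items())
        return f"a:{len(value)}:{{{body}}}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


data = {"a": 1, "b": 2, "c": {"a": 1, "b": 2}}
encoded = php_serialize(data)
print(encoded)
# a:3:{s:1:"a";i:1;s:1:"b";i:2;s:1:"c";a:2:{s:1:"a";i:1;s:1:"b";i:2;}}
```

Every value carries a type tag and, for strings and arrays, a declared size. Whoever reads this back has to decide how much to trust those declarations.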

The crux

So this brings us to the core issue: deserialization requires parsing.

In order to go from that serialized format to usable data, some software package needs to unpack that content, figure it out, and then consume it.
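
Here’s a rough sketch of what that unpacking looks like for the format above. It’s in Python, parse and parse_value are names I made up, and it assumes ASCII input and only the integer, string, and array tags, so it’s nowhere near a real implementation.

```python
def parse_value(data, pos=0):
    """Parse one value starting at data[pos]; return (value, next_pos)."""
    tag = data[pos:pos + 2]
    if tag == "i:":
        end = data.index(";", pos)
        return int(data[pos + 2:end]), end + 1
    if tag == "s:":
        colon = data.index(":", pos + 2)
        length = int(data[pos + 2:colon])
        start, end = colon + 2, colon + 2 + length
        # The declared length comes from the sender, so check it against
        # what is actually there before trusting it.
        if length < 0 or data[colon + 1:start] != '"' or data[end:end + 2] != '";':
            raise ValueError("string length does not match the payload")
        return data[start:end], end + 2
    if tag == "a:":
        colon = data.index(":", pos + 2)
        count = int(data[pos + 2:colon])
        if count < 0 or data[colon + 1:colon + 2] != "{":
            raise ValueError("malformed array header")
        pos = colon + 2
        result = {}
        for _ in range(count):
            key, pos = parse_value(data, pos)
            if not isinstance(key, (int, str)):
                raise ValueError("array keys must be integers or strings")
            value, pos = parse_value(data, pos)
            result[key] = value
        if data[pos:pos + 1] != "}":
            raise ValueError("expected '}' at the end of the array")
        return result, pos + 1
    raise ValueError(f"unknown type tag at position {pos}")


def parse(data):
    """Parse a whole payload and reject anything left over."""
    value, pos = parse_value(data)
    if pos != len(data):
        raise ValueError("trailing data after the value")
    return value


serialized = 'a:3:{s:1:"a";i:1;s:1:"b";i:2;s:1:"c";a:2:{s:1:"a";i:1;s:1:"b";i:2;}}'
decoded = parse(serialized)
print(decoded)

# A payload that lies about its own string length should be rejected.
try:
    parse('s:5:"a";')
    tampered_rejected = False
except ValueError:
    tampered_rejected = True
print(tampered_rejected)
```

Even in this toy version, most of the lines are checks on things the sender claims: lengths, counts, delimiters. It still doesn’t limit nesting depth, so a deeply nested payload could exhaust the recursion limit. That’s the kind of gap real parsers have to close.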

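Unfortunately, this is precisely what parsers are so bad at. Many native serialization formats also let the incoming data say which types to rebuild, which means the sender gets a say in what code runs during unpacking. Getting this wrong can lead to all manner of security flaws, up to and including arbitrary code execution.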
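
A small demonstration of the code execution case, using Python’s pickle module as a stand-in for any native object serializer. The Payload class is hypothetical, and the call to eval on a harmless arithmetic string stands in for whatever an attacker would actually want to run.

```python
import pickle


class Payload:
    def __reduce__(self):
        # pickle records "call this function with these arguments" and
        # replays that call on load. eval("6 * 7") is a harmless stand-in
        # for something like os.system("...").
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The receiving side only sees bytes. Loading them runs the call above.
result = pickle.loads(blob)
print(result)  # 42, not a Payload object
```

The receiver never asked to evaluate anything. It asked to load an object, and the bytes decided what that meant.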

Summary

Parsing untrusted input is hard
Serialization takes data and encodes it into opaque formats for transfer and storage
To make use of that content, parsers must unpack and consume it
It’s extremely hard to do this correctly, and if you do it wrong it could mean code execution
Don’t deserialize untrusted data if you can avoid it
If you can’t avoid it, realize you’re asking your parsing software to lick some petri dishes labeled “SAMPLE UNKNOWN”, and look for ways to stop having to do it
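
As one example of narrowing what a payload can do, here’s a sketch that follows the “restricting globals” approach from the Python pickle documentation: an unpickler that refuses to look up any class or function by name. DataOnlyUnpickler, safe_loads, and Payload are names made up for this post, and this reduces the risk without making untrusted pickles safe.

```python
import io
import pickle


class DataOnlyUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the payload asks for a class or function by name.
        # Plain dicts, lists, strings, and numbers never reach this point.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_loads(blob):
    """Load a pickle, rejecting anything that references a global."""
    return DataOnlyUnpickler(io.BytesIO(blob)).load()


class Payload:
    def __reduce__(self):
        # Harmless stand-in for an attacker-chosen call.
        return (eval, ("6 * 7",))


plain = safe_loads(pickle.dumps({"a": 1, "b": [1, 2]}))
print(plain)

try:
    safe_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(blocked)
```

If plain data is all you need, a data-only format is usually the simpler way to get the same effect.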

This overall concept applies to most any language that uses serialization, but some languages (like Java) are in worse shape than others.

Notes

The Wikipedia article on serialization has a good list of language implementations.
Here’s a writeup on a Java serialization bug.