One of the most esoteric bugs i ever found in a PHP application

16.9.2022

I spent so much time tracking down this bug that I want to tell you the story of what happened.

A while ago, I wanted to see if the Doctrine PHPCR-ODM tests work fine with PHP 8.1. PHPCR-ODM supports two different storage layers, Doctrine DBAL and Jackrabbit. With PHP 8.1 and DBAL storage, some tests failed, but those same tests succeeeded with the Jackrabbit storage. That lead me to check if something might be off with the storage layer, but the extensive test suites of both storage layers is fine with PHP 8.1.

To get a closer look at the problem, I ran only the failing test. To my surprise, the failure did not happen when I ran the test in isolation. So something about test isolation, a previous test producing some leftover artifacts that break this test? It was not unlikely, given that the test suite does a lot of functional tests that includes the storage layer. But if that was the only problem, it should affect both storage layers, and it would also be surprising that the PHP version makes a difference. I kept circling back to investigate the storage layers, wondering if there might be some hidden bug in the DBAL storage layer that only happens with PHP 8.1, but found nothing.

At that point, I called for help on Twitter:

I got some ideas that I tried, but without luck. One good suggestion was to run the PHP 8.1 tests with the exact same versions of the dependencies as are used for PHP 8.0. I used the platform configuration to tell composer to only install the PHP 8.0 compatible dependencies for the PHP 8.1, but still had the same result. I also tried installing the lowest possible versions of the dependencies, hoping that would make the problem go away, and then I could compare the differences to find the cause for the failing tests. Those lowest versions also failed the same way.

In desperation, I hoped maybe there was some strange problem with phpunit that could have been fixed in the newest version. But the full test suite still failed with phpunit 9.5. I still had no indication what the relevant difference causing the problem actually was. At that point, I gave up for a couple of weeks.

The actual bug

Today, I took to the task again with a step-by-step debugger. I ran the test suite (remember, running the test in isolation does not show the bug) and set a breakpoint at the beginning of failing test. With the stack trace, I knew when things failed, but obviously that was due to bad state that must be caused by something before. I spent a couple of hours wrapping my head around what goes on in the 4000 lines UnitOfWork class. Finally, I realized that during the flush operation, the unit of work should insert 2 items into storage, but only inserts one. The process later assumes that the items have been inserted into storage, and that is where the exception is thrown. After a few more iterations, i pinpointed the problem to a specific bit of code that is used to break circular object graphs:

public function computeChangeSet(ClassMetadata $class, $document)
{
    ...
    $oid = spl_object_hash($document);
    if (in_array($oid, $this->changesetComputed)) {
        return;
    }
    ...
}

Somehow, the method decided that the spl_object_hash of the document was already in the list of hashes that have been processed. spl_object_hash, according to the documentation returns a unique identifier for the object. There were clearly 2 different objects, as I could see with the debugger. I could even see that the $oid was different.

At this point, some of you might guess where this is going... I fired up 3v4l.org and indeed managed to write a minimal reproducer of my problem:

$list = ['000000000000e2690000000000000000'];
var_dump(in_array('000000000000e2460000000000000000', $list));
var_dump(in_array('000000000000e2460000000000000000', $list, true));

And indeed, the results are surprising but finally explain what was going wrong:

bool(true)
bool(false)

In case you wonder, it boils down to

'000000000000e2690000000000000000' == '000000000000e2460000000000000000'; // true
'000000000000e2690000000000000000' === '000000000000e2460000000000000000'; // false

I did not dig to the bottom as to why these two values (e269 vs e246, surrounded by lots of zeros) are considered equal. But the problem seems to only occur with specific ids. I guess it is something about interpreting those strings as some sort of number.

The spl_object_hash uses the internal memory location of the object to build a hash. So when running the same tests in the same version of PHP, this is reproducible (luckily, otherwise it would have been even harder to find). But running a test in isolation or using different storage layers will lead to different objects being created, leading to different memory locations and therefore different hashes. As will a different minor version of PHP, it seems.

This means the bug was present all along, but only visible under very specific circumstances. Those circumstances just never happened to come together in the tests with older PHP versions. It might well have broken things for actual applications - probably only for specific situations.

I don't think this specific problem has security implications, but if the in_array check was involved in a security check, it might well be a bug that could allow an attack. The attacker would have to trick the system into producing problematic object hashes here, which might prove difficult, but who knows...

And the morale is

To put it in short words: Use strict comparison in PHP wherever applicable. Editors and code quality tools can remind you to do it. I just got a hard reminder as to why...

php debugging