Fuzzing Zcash with Kubernetes

Security is paramount to the success of Zcash and ECC, and as such, we invest significant resources into security. We have all network upgrade designs and implementations assessed by reputable external vendors, and we publish those results in full. We also make efforts internally to detect and prevent security issues. In this post, we detail one such effort undertaken in Q3 of 2020.

We set an aggressive Q3 goal for finding security bugs in zcash/zcash and decided that fuzzing should be part of the approach. We used Kubernetes and libFuzzer, along with a bit of trickery around our old build system, to start the search.

So far, the private cluster we put together hasn’t hit its scaling limits within our budget, and it has reached 500k real executions per second using preemptible VMs in GKE.

Prelink is dead and security killed it

Prior to the NU3 security assessment (which quite rightly highlighted an inflexible build system and a lack of fuzzing), some effort had already been put into integrating AFL into the zcash/zcash repository and also into a system to parallelize fuzzing in a Kubernetes cluster. 

As we were going into the assessment, I was looking to optimize fuzzer execution with AFL, to go from a baseline (with a trivial input case) of hundreds of exec/s/core to somewhere in the thousands range on a mid-level CPU. It’s important in fuzzing to let whatever test case generation you’re using have the best chance possible to find crashes and hangs by testing as large a number of test cases as possible per second.

While Trail of Bits (ToB) were working on this assessment, I realized that, unfortunately, `prelink` hasn’t kept pace with (ironically enough) security upgrades to the Linux loader, meaning that it no longer operates on binaries built for a modern Linux system. Its primary use case, as far as I’m aware, seemed to be for administrators to rebase a binary to a random address (although only once, when prelink is called) to implement a sort of ASLR. Linux now has that right in the dynamic loader, and the new sections used to support that functionality confuse prelink. Without prelink, AFL’s slowness compared with a direct call into a library function that runs the test case can’t be ignored. Since we’re obviously not going to pay for CPU time to dynamically link a binary again and again, and a simple method for precomputing that step no longer exists, some other solution was needed.

So, in line with the finding from the NU3 report, we moved to libFuzzer instead. Being only a library call for each input test case, it already comes pretty well optimized from the perspective of test case execution speed — although, of course, there are still plenty of dials in a libFuzzer binary to further optimize finding issues.

Using Clang/LLVM has been a real pleasure. It’s a great suite of tools that I hadn’t had a chance to work with much before. I was able to derive build integration for libFuzzer into zcash/zcash from the code already in place for AFL and derive a method for scaling it up massively from the existing parallelization framework I had been developing to use for AFL. The synchronization paradigm isn’t all that different.

Sidestepping an ‘Inflexible build system’

I should mention that the report highlighted the build system for zcash/zcash as inflexible. While it’s reliable and builds zcashd reproducibly (which we consider to be an important security feature in context), it is difficult to modify and maintain. It’s something that we inherited from Bitcoin but don’t have the people power to replace.

It didn’t seem like an appropriate trade-off, given limited resources, to replace the system, especially at the opportunity cost of other important efforts, but I was able to leverage existing inputs and have it emit libFuzzer binaries instead of the original zcashd.

Our integration sets the compiler (CC, CXX) to be a wrapper script, which analyzes the arguments passed to it, makes adjustments (such as PIE vs. PIC, enabling sanitizers and adjusting linking) and then calls the original compiler. This avoids having to make changes to, and become the de facto owner and maintainer of, the build system, at the cost of slightly increasing the inertia, and therefore the difficulty, if anyone decides to replace it in the future. I’m stacking more things on an inflexible system, but doing that is in step with our business plan.

This way we can:

  • Conditionally inject libFuzzer instrumentation into only the modules we require
  • Conditionally inject other sanitization instrumentation into only the modules we require
  • Avoid taking on ownership and maintenance of the whole build system
  • Remove the requirement for fuzzer authors to integrate with the linker

On that last point: This approach gives access to all functionality within the monolith without requiring a fuzzer author to decide what to link or build with. This sounds like a minor improvement, but it really comes into its own when offloading the effort of writing fuzzers to others — if it’s a step that can be avoided, let’s avoid it.
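The wrapper idea described above can be sketched roughly as follows. This is a minimal illustration and not our actual script: the `REAL_CXX` environment variable, the module paths in `FUZZED_MODULES`, and the specific PIE-to-PIC rewrite are all hypothetical choices made for the example. The only parts taken from Clang itself are the `-fsanitize=fuzzer-no-link` flag (instrument without linking libFuzzer's driver) and `-fsanitize=fuzzer` (link the driver in).

```python
#!/usr/bin/env python3
"""Sketch of a CC/CXX wrapper that rewrites compiler arguments."""
import os
import sys

# Hypothetical settings; the real wrapper's names and rules are not shown here.
REAL_COMPILER = os.environ.get("REAL_CXX", "clang++")
FUZZED_MODULES = ("src/primitives/", "src/script/")  # illustrative paths only
REPLACEMENTS = {"-fPIE": "-fPIC"}                    # illustrative adjustment
INSTRUMENT = "-fsanitize=fuzzer-no-link,address"     # compile-time instrumentation
LINK = "-fsanitize=fuzzer,address"                   # link libFuzzer's driver


def rewrite_args(args):
    """Return an adjusted copy of a compiler command line."""
    # Simplification: anything that is not -c/-E/-S is treated as a link step.
    is_link = not any(flag in args for flag in ("-c", "-E", "-S"))
    sources = [a for a in args if a.endswith((".c", ".cc", ".cpp"))]

    out = [REPLACEMENTS.get(arg, arg) for arg in args]

    if is_link:
        out.append(LINK)
    elif "-c" in args and any(m in s for s in sources for m in FUZZED_MODULES):
        # Only instrument the modules we actually want to fuzz.
        out.append(INSTRUMENT)
    return out


def main(argv):
    """Replace this process with the real compiler, using the rewritten arguments."""
    os.execvp(REAL_COMPILER, [REAL_COMPILER] + rewrite_args(argv[1:]))
```

A real wrapper would end by calling `main(sys.argv)` and would be selected with something like `CXX=/path/to/wrapper` when configuring the build. It would also need to handle details this sketch ignores, such as the program's own `main` conflicting with libFuzzer's at link time.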

Private parallelization is easy now

We have limited resources with which to secure a codebase that underpins a market capitalization far beyond our own company’s assets.

This presents interesting challenges for security. Almost every system we rely on has some kind of risk control applied, such that the remaining risk is understood. For instance, how secure are other people’s fuzzing clusters? How can we add monitoring and risk mitigation strategies to the underlying infrastructure? Security bugs in an open source cryptocurrency are a tricky thing to handle: who would be doing that, and are they aligned with our mission?

It’s also the case that OSS-Fuzz/ClusterFuzz are far more fully featured than we’re likely to need, and they were also conceived at a time when parallelizing common computing tasks in clusters was harder than it is now. It was extremely forward-thinking at the time to parallelize fuzzing, but thankfully, general-purpose parallelization has now caught up in the areas of shared storage, cluster management, parallel execution and job lifecycle management.

For those reasons, we have our own private parallelization for fuzzing. It’s based on Kubernetes and can be triggered trivially. It uses preemptible nodes, autoscaling clusters and redundant storage for persistence of corpus growth and sharing via a sync job.
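To give a feel for what a corpus sync job involves, here is a minimal sketch of a two-way merge between one worker's corpus directory and a shared one. It is an assumption-laden illustration rather than our actual job: the function name, the directory layout and the choice to key files by SHA-1 of their content are all hypothetical, picked because libFuzzer names the corpus files it writes after a hash of their contents, which makes deduplication straightforward.

```python
import hashlib
import shutil
from pathlib import Path


def sync_corpus(local, shared):
    """Two-way merge of corpus files; returns (pushed, pulled) counts."""
    local, shared = Path(local), Path(shared)
    local.mkdir(parents=True, exist_ok=True)
    shared.mkdir(parents=True, exist_ok=True)

    def index(directory):
        # Key every file by the SHA-1 of its content so duplicates collapse,
        # whatever the files happen to be called.
        return {
            hashlib.sha1(p.read_bytes()).hexdigest(): p
            for p in directory.iterdir()
            if p.is_file()
        }

    local_units, shared_units = index(local), index(shared)
    pushed = pulled = 0

    # Push inputs this worker discovered that the shared store lacks.
    for digest, path in local_units.items():
        if digest not in shared_units:
            shutil.copyfile(path, shared / digest)
            pushed += 1

    # Pull inputs discovered by other workers.
    for digest, path in shared_units.items():
        if digest not in local_units:
            shutil.copyfile(path, local / digest)
            pulled += 1

    return pushed, pulled
```

Because the merge only ever adds files, a worker on a preempted node loses at most the inputs it found since its last sync, and the shared directory keeps the accumulated corpus.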

Open source it?

I hope that we get to the stage where we open source the parallelization piece, but given that it’s re-using general-purpose clustering, it’s not very large or complex and it probably doesn’t provide all that much value for anyone looking to do anything beyond fuzzing zcashd. Time and effort will have to be put into making it useful to others, and frankly, the entire exercise has been one of seeing how much we can do with how little.

Can we offload security test coverage?

Once we have demonstrated internally that the fuzzing platform we have is viable, we can start adding fuzzers to the codebase in places that make sense, with the expectation that we can run them at scale soon after they’re committed and tested.

If we could get to a place where anyone adding new C++ code thinks of security tests as just a part of the testing they need to add, then a lot of the heavy lifting of harnessing components could be pushed out to developers, for security to then expand on later. That would be a bit of a culture shift that we need to create. Given that there’s a lot of interest and skill in security on the ECC team, I thought they’d appreciate a challenge to write and tune fuzzers. (Tuning here means providing CLI options that speed the fuzzer along, such as a seed corpus and a dictionary, though libFuzzer has many more options.) I’d like to run that externally at some point, but like anything it takes overhead, so we’re running a small trial internally first.
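As a rough idea of what tuning looks like in practice, the sketch below assembles a libFuzzer command line from a handful of commonly used options. The helper itself and the binary, corpus and dictionary paths in the usage note are hypothetical. The flags are standard libFuzzer ones: `-dict=`, `-max_len=`, `-timeout=` and `-max_total_time=`.

```python
def libfuzzer_argv(binary, corpus_dirs, dictionary=None, max_len=None,
                   timeout=None, max_total_time=None):
    """Build an argv list for a libFuzzer binary from optional tuning knobs."""
    argv = [binary]
    if dictionary is not None:
        # Tokens the mutator can splice into inputs (magic bytes, keywords).
        argv.append(f"-dict={dictionary}")
    if max_len is not None:
        # Cap input size; smaller inputs usually mean more executions per second.
        argv.append(f"-max_len={max_len}")
    if timeout is not None:
        # Seconds before a single input is reported as a hang.
        argv.append(f"-timeout={timeout}")
    if max_total_time is not None:
        # Stop after this many seconds, which suits time-boxed cluster jobs.
        argv.append(f"-max_total_time={max_total_time}")
    # The first corpus directory is where libFuzzer writes new inputs;
    # any further directories are read as additional seeds.
    argv.extend(corpus_dirs)
    return argv
```

For example, `libfuzzer_argv("./fuzz_target", ["corpus/", "seeds/"], dictionary="target.dict", max_len=4096)` yields a list that can be passed straight to `subprocess.run`.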

Clang support

Support for Clang/LLVM hasn’t been a part of zcashd for very long, but the macOS build of zcashd has been using it for a little while. It has been playing second fiddle to the leader, GCC, but (in part inspired by this work and entirely on their own) the core engineering team decided to switch to Clang in zcash/zcash master. Outstanding!

Taylor Hornby contributed significantly to this work and article.