内容简介:by William Wang, UCLAAs part of my internship this past winter and spring, I’ve been prototyping a tool called
by William Wang, UCLA
OpenSSL is one of the most popular cryptographic libraries out there; even if you aren’t using C/C++, chances are your programming language’s biggest libraries use OpenSSL bindings as well. It’s also notoriously easy to mess up due to the design of its low-level API. Yet many of these mistakes fall into easily identifiable patterns, which raises the possibility of automated detection.
As part of my internship this past winter and spring, I’ve been prototyping a tool called Anselm , which lets developers describe and search for patterns of bad behavior. Anselm is an LLVM pass, meaning that it operates on an intermediate representation of code between source and compilation. Anselm’s primary advantage over static analysis is that it can operate on any programming language that compiles to LLVM bitcode, or any closed-source machine code that can be converted backwards. Anselm can target any arbitrary sequence of function calls, but its original purpose was to inspect OpenSSL usage so let’s start there.
OpenSSL
The design of OpenSSL makes it difficult to understand and work with for beginners. It has a variety of inconsistent naming conventions across its library and offers several, arguably too many, options and modes for each primitive. For example, due to the evolution of the library there exist both high level (EVP) and low level methods that can be used to accomplish the same task (e.g. DSA signatures or EC signing operations). To make this worse, their documentation can be inconsistent and difficult to read.
In addition to being difficult to use, other design choices make the library dangerous to use. The API inconsistently returns error codes, pointers (with and without ownership), and demonstrates other surprising behavior. Without rigorously checking error codes or defending against null pointers, unexpected program behavior and process termination can occur.
So what types of errors can Anselm detect? It depends on what the developer specifies, but that could be anything from mismanaging the OpenSSL error queue to reusing initialization vectors. It’s also important to remember that these are heuristics, and misidentifying both good and bad behavior is always possible. Now, let’s get into how the tool works.
Function Calls
While the primary motivation of this project was to target OpenSSL, the library itself doesn’t actually matter. One can view OpenSSL usage as a sequence of API calls, such as EVP_EncryptUpdate
and EVP_EncryptFinal_ex
. But we could easily replace those names with anything else, and the idea remains the same. Hence bad behavior is a pattern of any function calls (not just OpenSSL’s) which we would like to detect.
My main approach was to search through all possible paths of execution in a function, looking for bad sequences of API calls. Throughout this post, I’ll be using OpenSSL’s symmetric encryption functions
in my examples. Let’s consider EVP_EncryptUpdate
, which encrypts a block of data, and EVP_EncryptFinal_ex
, which pads the plaintext before a final encryption. Naturally, they should not be called out of order:
EVP_EncryptFinal_ex(ctx, ciphertext + len, &len); ... EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, plaintext_len);
This should also be flagged, since the bad sequence remains a possibility:
EVP_EncryptFinal_ex(ctx, ciphertext + len, &len); ... if (condition) { EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, plaintext_len); }
I worked with LLVM BasicBlocks , which represent a list of instructions that always execute together. BasicBlocks can have multiple successors, each reflecting different paths of execution. A function, then, is a directed graph of many BasicBlocks. There is a single root node, and any leaf node represents an end of execution.
Finding all possible executions amounts to performing a depth-first search (DFS) starting from the root node. However, notice that the graph can contain cycles; this is analogous to a loop in code. If we performed a blind DFS, we could get stuck in an infinite loop. On the other hand, ignoring previously visited nodes can lead to missed behavior. I settled this by limiting the length of any path, after which Anselm stops exploring further (this can be customized).
One issue remains, which is that performing DFS over an entire codebase can be very time-consuming. Even if our exact pattern is simple, it still needs to be matched against all possible paths generated by the search. To solve this, I first prune the graph of any BasicBlock that does not contain any relevant API calls. This is done by pointing any irrelevant node’s predecessors to each of its successors, removing the middleman.
In practice, this dramatically reduces the complexity of a graph for faster path-finding: entire if statements and while loops can be eliminated without any consequences! It also makes any path limit much more reasonable.
Matching Values
Although solely examining function calls is a good start, we can do better. Consider OpenSSL contexts, which are created by EVP_CIPHER_CTX_new
and must be initialized with algorithm, key, etc. before actual use. In the following situation, we want every context to be initialized by EVP_EncryptInit_ex
:
EVP_CIPHER_CTX *ctx1 = EVP_CIPHER_CTX_new(); EVP_CIPHER_CTX *ctx2 = EVP_CIPHER_CTX_new(); EVP_EncryptInit_ex(ctx1, EVP_aes_256_cbc(), NULL, key, iv);
EVP_EncryptInit_ex
always follows EVP_CIPHER_CTX_new
, yet ctx2
is obviously not initialized properly. A more precise pattern would be, “Every context returned from EVP_CIPHER_CTX_new
should later be initialized in EVP_CIPHER_CTX_new
.”
I addressed this by matching arguments and return values — checking whether they pointed to the same LLVM Value object in memory. Contexts are a prime situation to match values, but we can use the same technique to detect repeated IVs:
EVP_EncryptInit_ex(ctx1, EVP_aes_256_cbc(), NULL, key1, iv); EVP_EncryptInit_ex(ctx2, EVP_aes_256_cbc(), NULL, key2, iv);
Internally, Anselm uses regex capture groups to perform this analysis; every execution path is a string of function calls and Value pointers, while a bad behavior is defined by some regex pattern.
Pattern Format
Towards the end of my internship, I also defined a format for developers to specify bad behaviors, which Anselm translates into a (somewhat messy) regex pattern. Every line begins with a function call, followed by its return value and arguments. If you don’t care about a value, use an underscore. Otherwise, define a token which you can use elsewhere. Hence a rule banning repeat IVs would look like this:
EVP_EncryptInit_ex _ _ _ _ _ iv EVP_EncryptInit_ex _ _ _ _ _ iv
Since the iv
token is reused, Anselm constrains its search to only match functions which contain the same Value pointer at that argument position.
I also defined a syntax to perform negative lookaheads, which tells Anselm to look for the absence of specific function calls. For example, if I wanted to prevent any context from being used before initialized, I would prepend an exclamation mark like so:
EVP_CIPHER_CTX_new ctx ! EVP_EncryptInit_ex _ ctx _ _ _ _ EVP_EncryptUpdate _ ctx _ _ _ _
In English, this pattern identifies any calls to EVP_CIPHER_CTX_new
and EVP_EncryptUpdate
that do not have EVP_EncryptInit_ex
sandwiched in between.
Final Notes
With its current set of tools, Anselm is capable of interpreting a wide range of function call patterns and searching for them in LLVM bitcode. Of course, it’s still a prototype and there are improvements to be made, but the main ideas are there and I’m proud of how the project turned out. Thanks to Trail of Bits for supporting these types of internships — it was a lot of fun!
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。