Remapping Python Opcodes

Category: IT Technology · Published: 4 years ago

In my previous blog post, I talked about compiled Python (.pyc) files that I couldn’t manage to decompile. Because of this, my ability to audit the source code of Druva inSync was limited, and I felt compelled to dig deeper. What I found was that all of the operation codes (opcodes) had been shuffled around. This is a known technique for obfuscating Python bytecode. I’ll take you step by step through how I fixed these opcodes to successfully recover the source code. This technique is not limited to Druva and can generally be used to remap any Python opcodes.

Let’s first look at the problem.

It won’t decompile

In the installation directory of Druva inSync, you’ll notice that there is a Python27.dll and library.zip. When inSync.exe boots up, these files are immediately read. I’ve filtered the procmon output for brevity. This behavior is indicative of an application built with py2exe.

When you unzip library.zip, it contains a bunch of compiled Python modules, both custom and standard. Shown below is a subset of the whole.

.pyc files can often be decompiled pretty quickly by tools like uncompyle6. For example, struct.pyc is part of the Python standard library, and it decompiles just fine: it contains only a few imports.

Figure: Decompiling normal struct.pyc

Now, doing the same against the struct.pyc packaged in library.zip, here’s the output:

Figure: Decompiling Druva struct.pyc

Unknown magic number 62216.

But why?

If you were to look at a .pyc file in a hex editor, the magic number is in the first 2 bytes of the file, and it will be different depending on the Python version that compiled it. For example, here is 62216:

Figure: Druva magic number

62216 is not a documented magic number. But 62211 is close to it, and that corresponds to version 2.7.

What’s strange is that the python27.dll distributed in the Druva inSync installation is version 2.7, hence the DLL name. Yet, its magic number is different.

Figure: Druva python27.dll

This looks nearly identical to the normal 2.7.15150.1013 DLL.

Figure: Normal python27.dll

The size and modified date are a little different, but the version is the same. If the version is the same, why is the magic number not 62211?

Digging into the OpCodes

My next idea was to load up a Python interpreter using the Druva inSync libraries. I did this by first dropping the Druva python27.dll into c:\Python27. I also had to ensure that the search path pointed to the Python modules distributed with Druva inSync.

Figure: Python interpreter with Druva libraries loaded

At this point, I could load the ‘opcode’ module to view the map of opcodes.

Figure: Druva opcode map

Below is the normal Python 2.7 opcode map:

Figure: Normal opcode map

Notice that the opcodes are completely different. For example, ‘CALL_FUNCTION’ maps to 131 normally, and its opcode is 111 in the Druva distribution. As far as I can tell, this is true for every operation.
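With both maps in hand, one way to make them usable is to join them on operation name, which yields a table from obfuscated opcode to normal opcode. The following is a minimal sketch written for Python 3 for convenience. Only the CALL_FUNCTION pair (111 and 131) comes from this post; 100 and 83 are the stock Python 2.7 numbers for LOAD_CONST and RETURN_VALUE, and the Druva-side values 50 and 60 are made-up placeholders, as are the names `build_translation`, `NORMAL_OPMAP` and `DRUVA_OPMAP`.

```python
# Standard Python 2.7 opcode numbers for a few operations.
NORMAL_OPMAP = {"CALL_FUNCTION": 131, "LOAD_CONST": 100, "RETURN_VALUE": 83}

# Obfuscated numbers. CALL_FUNCTION = 111 is from the post; the other two
# values are HYPOTHETICAL placeholders, not Druva's real numbers.
DRUVA_OPMAP = {"CALL_FUNCTION": 111, "LOAD_CONST": 50, "RETURN_VALUE": 60}


def build_translation(obfuscated_opmap, normal_opmap):
    """Return {obfuscated opcode: normal opcode}, joined on operation name."""
    missing = set(obfuscated_opmap) - set(normal_opmap)
    if missing:
        raise KeyError("no normal opcode for: %s" % sorted(missing))
    return {num: normal_opmap[name] for name, num in obfuscated_opmap.items()}


DRUVA_TO_NORMAL = build_translation(DRUVA_OPMAP, NORMAL_OPMAP)
print(DRUVA_TO_NORMAL[111])
```

In practice both dictionaries would be dumped from `opcode.opmap`, once under each interpreter, rather than typed in by hand.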

Remapping the OpCodes

In order to decompile these obfuscated .pyc files, the opcodes need to be remapped back to the original Python 2.7 mapping. Easy enough, right? It’s slightly more complicated than it appears on the surface. In order to accomplish this, one needs to understand the .pyc file format. Let’s take a look at that.

Structure of a .pyc

Let’s turn to the code to make sense of the .pyc file structure. We are looking at the py_compile module’s compile function. This function will convert a .py file into a .pyc.

Starting at line 106, timestamp is first populated with the .py file’s last modification time (st_mtime). And on line 111, the source code is read into codestring.

Next, the source code is compiled into a code object using the builtin compile function. Since the .py likely contains a sequence of statements, the ‘exec’ mode is specified.
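As a small illustration of that step, the snippet below compiles a source string in ‘exec’ mode and runs the resulting code object. It is a sketch run under Python 3 (the builtin behaves the same way for this purpose); the source string and the file name passed to `compile` are just examples.

```python
source = "greeting = 'Hello ' + 'world!'\n"

# 'exec' mode accepts a whole module body (a sequence of statements).
code_obj = compile(source, "hello.py", "exec")

namespace = {}
exec(code_obj, namespace)
print(type(code_obj).__name__, namespace["greeting"])
```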

Assuming no errors occur, the new filename is created (cfile). If basic optimizations were turned on via the -O flag (__debug__ tells us this), the extension will be ‘.pyo’, otherwise it will be ‘.pyc’.
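The naming rule can be sketched in a couple of lines. The helper name `compiled_name` is hypothetical; note also that this mirrors Python 2.7 behavior, since Python 3.5 and later no longer produce .pyo files.

```python
def compiled_name(py_path, debug=__debug__):
    """Mirror the Python 2.7 rule: append 'c' normally, 'o' when run with -O."""
    return py_path + ("c" if debug else "o")


print(compiled_name("hello.py", debug=True))
print(compiled_name("hello.py", debug=False))
```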

Finally, the file is written to. The first 4 bytes will be the magic string value, returned by imp.get_magic(). Next, the timestamp is written. And finally, the code object is serialized using the marshal module.
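Put together, the resulting byte layout looks roughly like the sketch below. It runs under Python 3 and only illustrates the order and width of the three fields: the marshalled payload it produces is in the host interpreter's format, so a real Python 2.7 would not be able to load it. The function name `build_pyc27_layout` and the timestamp value are illustrative assumptions.

```python
import marshal
import struct

# Magic number 62211 as a little-endian short, followed by CR LF.
MAGIC_27 = struct.pack("<H", 62211) + b"\r\n"


def build_pyc27_layout(code_obj, mtime):
    """Concatenate magic string, 4-byte mtime and marshalled code object."""
    return MAGIC_27 + struct.pack("<I", int(mtime)) + marshal.dumps(code_obj)


blob = build_pyc27_layout(compile("x = 1", "demo.py", "exec"), 1000000000)
print(blob[:4].hex(), blob[4:8].hex(), len(blob) - 8, "payload bytes")
```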

Let’s look at an example by compiling our own module.

Example: Hello world

Here’s our friend, hello.py. It’s just a print statement.

If we compile it, it spits out a hello.pyc file.

Here is a hexdump of hello.pyc:

If we were to load this up, we can actually parse out the individual components. First we read the file, and store the contents in bytes:

The magic string is 03f30d0a; however, the magic number is only the first two bytes, 03f3. It’s always followed by 0d0a.

If we unpack these two bytes as a little-endian unsigned short, the magic number is 62211. We now know the .pyc was compiled by version 2.7. Let’s look at the timestamp now. It is 4 bytes long, starting at offset 4.

This makes sense because I created the .py file at 2:26 PM on April 30th.
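The header parsing described above can be sketched as follows, assuming the 8-byte Python 2.7 layout (2-byte magic number, 0d0a, 4-byte timestamp). The function name `parse_pyc27_header` is hypothetical, and the header bytes are synthetic: the timestamp is an arbitrary example value, not the one from my hello.pyc.

```python
import struct
from datetime import datetime, timezone


def parse_pyc27_header(data):
    """Return (magic_number, mtime) from the first 8 bytes of a 2.7 .pyc."""
    if len(data) < 8:
        raise ValueError("a Python 2.7 .pyc header is 8 bytes long")
    magic_number, = struct.unpack("<H", data[0:2])
    if data[2:4] != b"\r\n":
        raise ValueError("magic number is not followed by 0d0a")
    mtime, = struct.unpack("<I", data[4:8])
    return magic_number, mtime


# Synthetic header: the 2.7 magic string plus an arbitrary example timestamp.
header = bytes.fromhex("03f30d0a") + struct.pack("<I", 1000000000)
magic_number, mtime = parse_pyc27_header(header)
print(magic_number, datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat())
```

Feeding it a header that starts with 08f30d0a instead returns 62216, the undocumented magic number seen earlier.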

And finally, the serialized code object remains to be read. It can be deserialized with the marshal module, and the object is executable. Hello world!
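The round trip through marshal can be reproduced in a few lines. This sketch runs under Python 3 and serializes a freshly compiled code object instead of reading hello.pyc, because the marshal format is specific to the interpreter version: a payload written by 2.7 has to be loaded by 2.7.

```python
import marshal

code_obj = compile("result = 'Hello world!'", "hello.py", "exec")
payload = marshal.dumps(code_obj)   # the bytes that follow the header in a .pyc

restored = marshal.loads(payload)   # back to an executable code object
namespace = {}
exec(restored, namespace)
print(namespace["result"])
```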

Let’s frame up the problem to be solved. The main goal is to fix the opcodes in a .pyc file so that it can be decompiled. During decompilation, an intermediate step is to disassemble the .pyc code objects into opcodes.

Disassembling Code Objects

Let’s use the dis module to disassemble the code object in hello.pyc.

All of these instructions are required to print ‘Hello world!’. In the first instruction, we can see “0 LOAD_CONST 0 (‘Hello world!’)”. “0 LOAD_CONST” means a LOAD_CONST operation starts at offset 0 in the bytecode. And “0 (‘Hello world!’)” means that the constant at index 0 is loaded (the string is just shown in the disassembly output for clarity). Technically speaking, LOAD_CONST pushes a constant onto the stack.
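To poke at the same fields programmatically, the dis module in Python 3.4 and later offers `dis.get_instructions`, which yields one record per instruction. This is a sketch under Python 3, so the instruction names and offsets differ from the Python 2.7 disassembly discussed here (print is a function call rather than PRINT_ITEM, for instance), but the offset, operand, and resolved constant play the same roles.

```python
import dis

code_obj = compile("print('Hello world!')", "hello.py", "exec")
instructions = list(dis.get_instructions(code_obj))

for ins in instructions:
    # offset: position in co_code; arg: raw operand; argval: resolved operand
    print(ins.offset, ins.opname, ins.arg, repr(ins.argval))
```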

Looking at the code object, the bytecode (co_code) and constants (co_consts) are accessible, along with variable names and other fields.

Here is the raw bytecode:

Here the opcode at offset 0 is ‘d’, which is decimal 100 in ASCII. This can be looked up in the opname sequence.

The next two bytes, “\x00\x00”, represent the operand: the index of the ‘Hello world!’ constant.
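The same decoding can be done by hand, which makes the instruction layout explicit. The sketch below assumes the Python 2.7 encoding: one opcode byte, followed by a 2-byte little-endian operand when the opcode is 90 (HAVE_ARGUMENT) or higher. It ignores EXTENDED_ARG, carries only the handful of standard 2.7 opcode names needed here, and uses a bytecode string of the shape 2.7 emits for a single print statement. The function name `decode_27` is hypothetical.

```python
HAVE_ARGUMENT = 90  # in Python 2.7, opcodes >= 90 carry a 2-byte operand

# A few standard Python 2.7 opcode numbers, enough for this example.
OPNAME_27 = {71: "PRINT_ITEM", 72: "PRINT_NEWLINE",
             83: "RETURN_VALUE", 100: "LOAD_CONST"}


def decode_27(co_code):
    """Yield (offset, opname, operand) for Python 2.7 style bytecode."""
    i = 0
    while i < len(co_code):
        op = co_code[i]
        name = OPNAME_27.get(op, "<%d>" % op)
        if op >= HAVE_ARGUMENT:
            operand = co_code[i + 1] | (co_code[i + 2] << 8)  # little-endian
            yield i, name, operand
            i += 3
        else:
            yield i, name, None
            i += 1


# Bytecode of the shape Python 2.7 emits for a single print statement.
raw = b"d\x00\x00GHd\x01\x00S"
for offset, name, operand in decode_27(raw):
    print(offset, name, operand)
```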

We’ve now established that code objects can be disassembled with the dis module. The disassembly displays instructions consisting of operation names and operands. We can also inspect the raw bytecode (co_code) and constants (co_consts) stored in code objects, among other fields. It gets tricky when code objects contain nested code objects.

Since we have the opcode mappings for both Druva and the normal Python 2.7, we can develop a basic strategy for opcode conversion. My strategy was to disassemble the code object in a .pyc file, convert each operation, and stuff all of this into a new code object. No need to remap operands. However, it’s just a bit more complicated than that. Let’s look at nested code objects.
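For a single, flat bytecode string, the conversion could look like the sketch below. It is not the script from this post: it walks the bytes by hand instead of going through the dis module, and it assumes the obfuscation only renumbers operations, so whether an operation takes an operand can be decided from its normal Python 2.7 opcode (90 or higher). The CALL_FUNCTION pair is from the post; the other source numbers in the table and the name `remap_bytecode` are hypothetical.

```python
HAVE_ARGUMENT = 90  # Python 2.7: normal opcodes >= 90 take a 2-byte operand

# CALL_FUNCTION 111 -> 131 is from the post. The other source numbers are
# HYPOTHETICAL placeholders standing in for the real Druva values.
DRUVA_TO_NORMAL = {111: 131,   # CALL_FUNCTION
                   50: 100,    # LOAD_CONST (hypothetical source value)
                   60: 83}     # RETURN_VALUE (hypothetical source value)


def remap_bytecode(co_code, table):
    """Translate opcode bytes with `table`, copying operand bytes unchanged."""
    out = bytearray()
    i = 0
    while i < len(co_code):
        new_op = table[co_code[i]]      # KeyError signals an unmapped opcode
        out.append(new_op)
        i += 1
        if new_op >= HAVE_ARGUMENT:     # operand width follows the operation
            out += co_code[i:i + 2]
            i += 2
    return bytes(out)


# The first operand is deliberately 111 to show that operands are left alone.
obfuscated = bytes([50, 111, 0, 111, 1, 0, 60])
print(list(remap_bytecode(obfuscated, DRUVA_TO_NORMAL)))
```

Walking instruction by instruction matters: a blind byte-for-byte substitution would also rewrite any operand byte that happens to equal an opcode number.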

Nested Code Objects

Most of the modules you encounter will be more complex than the hello world. They will contain functions and classes as well.

Here is a more advanced hello world example:

Breaking it down, we have a class named “HelloClass”. The class contains functions named “__init__” and “sayHello.” Let’s disassemble the code object after reading the .pyc.

Notice the LOAD_CONST instruction at offset 9. A HelloClass code object is loaded. This HelloClass code object is stored at index 1 in co_consts.

Let’s disassemble that too.

More code objects? Yep. The __init__ and sayHello functions are code objects as well. A code object can have many layers of nested code objects. This requires the opcode remapping algorithm to be recursive!
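The nesting is easy to see with a short recursive walk over co_consts. This is a sketch under Python 3; the class source below is my reconstruction from the names mentioned above (HelloClass, __init__, sayHello), not necessarily the exact module used in this post, and `walk_code` is a hypothetical helper name.

```python
import types

SOURCE = '''
class HelloClass:
    def __init__(self):
        self.greeting = "Hello world!"

    def sayHello(self):
        return self.greeting
'''


def walk_code(code_obj, depth=0):
    """Yield (depth, name) for a code object and everything nested in it."""
    yield depth, code_obj.co_name
    for const in code_obj.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const, depth + 1)


module_code = compile(SOURCE, "hello_class.py", "exec")
for depth, name in walk_code(module_code):
    print("  " * depth + name)
```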

The Algorithm

For reference, here are the opcode mappings again:

Figure: Druva opcode mapping
Figure: Normal Python 2.7 opcode mapping

Here’s my general algorithm.

Starting with the outer code object in the .pyc file (code_obj_in), convert all of the opcodes using the mappings above and store the result in new_co_code. For example, if a CALL_FUNCTION is encountered, the opcode will be converted from 111 to 131. We will then inspect the co_consts sequence and recursively remap any code objects found in there. The resulting new_co_consts will be added into the output code object.
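The recursive shape of that algorithm can be sketched as follows. This is an illustration under Python 3.8 or later, not the original script: it relies on `CodeType.replace()`, which Python 2.7 does not have, and it takes the byte-level translation as a parameter (`remap_bytes`, a hypothetical name) so that the example can run with a do-nothing translation on the host interpreter's own bytecode.

```python
import types


def remap_code(code_obj_in, remap_bytes):
    """Return a copy of code_obj_in with co_code passed through remap_bytes,
    applying the same treatment to every code object nested in co_consts."""
    new_co_code = remap_bytes(code_obj_in.co_code)
    new_co_consts = tuple(
        remap_code(const, remap_bytes) if isinstance(const, types.CodeType)
        else const
        for const in code_obj_in.co_consts
    )
    # CodeType.replace() needs Python 3.8+. Python 2.7 has no such helper, so
    # there the constructor would be called with every field spelled out.
    return code_obj_in.replace(co_code=new_co_code, co_consts=new_co_consts)


SOURCE = '''
class HelloClass:
    def sayHello(self):
        return "Hello world!"
'''

visited = []


def identity(co_code):
    """Stand-in for the real opcode translation; records each call."""
    visited.append(len(co_code))
    return co_code


rebuilt = remap_code(compile(SOURCE, "hello_class.py", "exec"), identity)
namespace = {}
exec(rebuilt, namespace)
print(len(visited), namespace["HelloClass"]().sayHello())
```

Swapping `identity` for a function that applies the Druva-to-normal table gives the full conversion, provided it runs where the obfuscated bytecode can be unmarshalled.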

When the new .pyc file is created (not shown), it will have a magic number of 62211, and all code objects will be populated with remapped opcodes. Let’s see the script in action.

Running the process converts a total of 1773 .pyc files. Notice I copied the Druva python27.dll into C:\Python27. Bytecode was disassembled using the Druva opcode mappings, and then converted.

Figure: Converting opcodes

And after conversion, we can successfully decompile the .pyc files in the inSyncClient folder! Prior to opcode conversion, this was not possible.

Figure: Decompilation is successful

Closing Thoughts

I hope this serves as a useful introduction to how Python opcodes might be obfuscated. There are other tools (e.g. pyREtic) out there that do the same kind of remapping process we’ve discussed here. In fact, after writing this code, I found out that this logic had already been implemented specifically for Druva inSync in the dedrop repository.

I’m sure there are more elegant approaches to opcode conversion, but the script definitely got the job done. If you’re interested in checking out the full source code, I’ve dropped it on our GitHub. Thanks for reading, and check out the Tenable TechBlog for more technical blogs and vulnerability write-ups. Give me a shout on Twitter as well!

-Chris Lyne (@lynerc)
