Principles of De-Compilation

Philippe Giabbanelli
Bishop’s University, Undergraduate Compiler Course
[email protected]

Abstract

Almost all computer scientists know at least a little bit about compilers, but few have heard about decompilers. In this paper, we explain the reasons to decompile code, the structure of such a program, and the problems that have to be solved. For some of these problems, such as the structuring of a control flow graph, we will look in more detail at the research that has been done, especially in the ‘70s. In order to have a deep understanding of the problems that can be encountered, we will also see what the state of the art in obfuscators is. Finally, we will discuss some new techniques concerning optimization.

1. Introduction

A program can be considered as a description of a behaviour. If we want to analyze it, we prefer a high level of abstraction rather than something close to the machine. When a program is compiled, it becomes a sequence of 0s and 1s and we can no longer work with it directly. Therefore, to analyze a program for which we do not have the source code, we have to go from a very machine-dependent level to a higher level of abstraction; we are reverse engineering the program, which is carried out by a disassembler and a decompiler.

[Figure: compilation (source code → scanner & parser → abstract syntax tree → optimization → assembly code → assembler → machine code) contrasted with reverse engineering (machine code → disassembler → assembly code → decompiler → source code, possibly with an obfuscator in between).]

There are many situations in which we want to analyze a program without having the source:
- Viruses, for which a quick understanding is also very important.
- Proving that somebody violated a patent, by showing that he is using a copyrighted algorithm.
- Discovering vulnerabilities in an application.
However, if a virus could be understood easily then it would not be very efficient; the same is true for a security system. Therefore, most people have a strong interest in protecting their software from reverse engineering: they use obfuscators. These are officially sold under the name of software protection systems, and a more common word is packers. They modify the machine code resulting from the compilation process so that it becomes harder to decompile. For example, packers insert code that will never be executed (garbage code), modify the return address of a function (use of branch functions), and apply other techniques.


For example, we need to decode expressions. In an ideal world, each computation would be canonical: it could not be reduced through algebraic simplifications. This means that each computation matters; they are all useful.

    lw  $t3, x
    li  $t2, 1
    add $t2, $t3, $t2
    mul $t2, $t2, $t2

We immediately have $t2 = (x+1)². However, obfuscators can create computations only to confuse the reader, such as inserting the following lines:

    add  $t0, $0, $t2
    mul  $t0, $t0, 1
    move $t2, $t0

If we apply algebraic simplifications to this resulting code, we see that it does nothing [1]. The result is quite obvious in this example, but we can easily have thousands of useless computations all over the code, and then decompiling becomes more difficult. Therefore, we must always think about obfuscators when we are designing a decompiler.
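The effect of such algebraic simplifications can be sketched with a small peephole pass; the tuple encoding of instructions and the function name below are illustrative choices of mine, not part of any real decompiler:

```python
# Minimal sketch of algebraic simplification on three-address code.
# Instructions are (op, dst, src1, src2) tuples; src2 may be a literal.
def simplify(instrs):
    """Drop instructions whose result is algebraically the operand itself."""
    out = []
    copies = {}  # dst -> the value it is a plain copy of
    for op, dst, a, b in instrs:
        if (op == "add" and b == 0) or (op == "mul" and b == 1):
            copies[dst] = copies.get(a, a)  # identity: dst is just a copy of a
            continue
        a = copies.get(a, a)                # forward-propagate known copies
        if op == "move":
            copies[dst] = a
            continue
        out.append((op, dst, a, b))
    return out

obfuscated = [
    ("add",  "$t0", "$t2", 0),     # $t0 = $t2 + 0  -> copy
    ("mul",  "$t0", "$t0", 1),     # $t0 = $t0 * 1  -> copy
    ("move", "$t2", "$t0", None),  # $t2 = $t0      -> $t2 unchanged
]
print(simplify(obfuscated))  # -> [] : the inserted code does nothing
```

A production pass would of course handle many more identities and track aliasing, but this suffices to strip the garbage sequence above.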

2. Structure of a Decompiler

At the beginning, our program is only a set of bytes. We turn it into an intermediate form so that the decompiler can operate on it. Typically, a suitable intermediate form is assembly, so we first use a disassembler to turn our set of bytes into assembly. However, a program written in assembly cannot be clearly understood. One of the problems is the lack of control structures: there are only jumps. Therefore, the decompiler analyzes the control flow of the program in order to recover higher-level control structures, such as while. If we want to analyze the behaviour of a program, we do not like machine structures, such as the stack, and sometimes we are not even interested in memory management. We also want to analyze real information: a part of the code may be garbage that will never be executed, and we should not waste time on it. The other main goals of the decompiler are to find the roles of variables (type…) and to clean the code.

3. Control Flow Graph

In a control flow graph, each node is a basic block: a sequence of statements with only one entrance at the beginning and one exit at the end.

    B0: beqz $t4, B1
        sub  $t4, 1
        move $v0, $a2
        add  $a0, 4
        add  $a1, 4
        j    B0
    B1: li   $v0, 10
        syscall

[Figure: phases of a decompiler — binary program → syntax analyzer → semantic analyzer → intermediate code generator → control flow graph generator → data flow analyzer → control flow analyzer → code generator, with optional optimization steps.]


We can consider that a Control Flow Graph (CFG) represents the possible movements through a program. Therefore, every control structure can be represented with a CFG [2], as below.

[Figure: the D-structures (named after Dijkstra), each drawn as a small CFG — action, composition, if-then-else statement, and while-do loop.]

The goal of the decompiler is to produce higher-level clean code: computations that cannot be reduced by any algebraic simplification, higher-level structures such as loops, and typed variables. Sometimes we want more, such as simplification of structures or names for variables; this belongs to the optimization process.


aiCall is a tool that automatically computes the CFG of an application written in C, which can be viewed with aiSee (see also LLVM [3]).




If a program can find higher-level structures and types of variables in intermediate-level code, then it is a decompiler. As for compilers, there are also optimization steps: these steps are not required, but they improve the readability of the code. Typically, we would like to give suitable names to variables: it is easier to think about an object when we have a good name for it; for example, with ‘sum’, we can easily figure out what it does.


4. Control Flow Graph Generation

This part is mainly based on the corresponding chapter of C. Cifuentes [4], with added explanations.

Definition 1. Let P be a program and I = {i1, …, in} be the instructions of P. Then S is an instruction sequence iff S = [ij, ij+1, …, ij+k], with 1 ≤ j < j + k ≤ n, lies in consecutive memory locations.

In other words, S is an instruction sequence iff each instruction is immediately after the previous one in memory, which we express by ‘consecutive’. The default behaviour is that an instruction transfers the flow of control to the address in memory of the next instruction. However, some instructions (jumps, calls…) can transfer the flow of control to another address. Therefore, we need to classify the instructions into two sets:
- Transfer Instructions (TI). The flow of control is not transferred to the next instruction. This happens with:
  Unconditional jumps. Transfer to the target jump address; jmp in Intel.
  Conditional jumps. Transfer to the target jump address only if the condition is true; otherwise, transfer to the next instruction in the sequence; je, jl in Intel.
  Indexed jumps. The program counter is updated from a table of byte displacements located in the code segment. If the index is out of range, no jump occurs; otherwise, we extract the displacement from the table and add it to the program counter. The flow is transferred to one of these many targets.
  Subroutine calls. Transfer to the invoked subroutine; call in Intel, jal in MIPS.
  Subroutine returns. The flow of control is transferred to the return address, which is normally just after where the subroutine was invoked; ret in Intel, jr $ra in MIPS.
  End of program.
- Non-Transfer Instructions (NTI). The flow of control is transferred to the next instruction in the sequence. Any instruction that does not belong to the previous set has this behaviour.

Definition 2. A basic block b = [i1, …, in-1, in], n ≥ 1, is an instruction sequence that satisfies either:
● [i1, …, in-1] ∈ NTI and in ∈ TI, or
● [i1, …, in-1, in] ∈ NTI and in+1 is the first instruction of another basic block.

The first condition means that there can only be one exit, and it is at the end of the block. The second condition means that if in+1 is an entry point, then we have to begin a new block for it. An easy way to see it from clean assembly code is to begin a block at a label, and to end it when we meet a transfer instruction. However, we will see some obfuscation methods, such as branch functions, that need a more accurate model to create a suitable CFG.
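These definitions translate directly into the classic leader-based partitioning algorithm. A minimal sketch, assuming a toy instruction encoding of my own (a list of (mnemonic, target) pairs, with target = None for NTIs):

```python
# Sketch: split an instruction sequence into basic blocks.
# A new block ("leader") starts at index 0, at every jump target, and at
# the instruction following a transfer instruction (an entry point).
TI = {"jmp", "je", "jne", "call", "ret"}  # illustrative Intel-style TIs

def basic_blocks(instrs):
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in TI:
            if target is not None:
                leaders.add(target)   # the target begins a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)    # so does the instruction after the TI
    starts = sorted(leaders)
    # A block runs from one leader up to just before the next leader.
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

prog = [("mov", None), ("cmp", None), ("jne", 5),
        ("push", None), ("call", 8), ("mov", None),
        ("cmp", None), ("ret", None), ("push", None)]
blocks = basic_blocks(prog)
print(len(blocks))  # -> 4
```

This naive version trusts the disassembly; as the text notes, obfuscations such as branch functions break the assumption that all entry points are visible as labels or jump targets.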

Basic blocks are classified into different types, according to the last instruction of the block. We continue to use the classification of C. Cifuentes:
- 1-way. The last instruction is an unconditional jump; the block has one out-edge.
- 2-way. The last instruction is a conditional jump; therefore the block has two out-edges.
- n-way. The last instruction is an indexed jump; the n branches located in the table become the n out-edges of the node.
- call. The last instruction is a call to a subroutine. There are two out-edges: one to the subroutine that is called, and the other to the instruction following the call (if the subroutine returns).
- return. The last instruction is a procedure return or an end of program; no out-edge.
- fall. The next instruction is the target address of a branching instruction; typically it has a label, but that is not the only case. This node is seen as falling through to the next one; thus there is only one out-edge.
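The classification itself is a straightforward dispatch on the last mnemonic; a sketch, with mnemonic sets that are illustrative Intel-style examples rather than an exhaustive table:

```python
# Sketch: classify a basic block by its last instruction, following the
# taxonomy of Cifuentes (1-way, 2-way, n-way, call, return, fall).
UNCOND  = {"jmp"}
COND    = {"je", "jne", "jl", "jge"}
INDEXED = {"jmp_indexed"}  # hypothetical mnemonic for a table-driven jump

def block_type(last_op):
    if last_op in UNCOND:
        return "1-way"
    if last_op in COND:
        return "2-way"
    if last_op in INDEXED:
        return "n-way"
    if last_op == "call":
        return "call"
    if last_op in ("ret", "halt"):
        return "return"
    return "fall"  # an NTI at the end: the block falls through the next one

print(block_type("jne"))  # -> 2-way
print(block_type("mov"))  # -> fall
```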


     0        PUSH  bp
     1        MOV   bp, sp
     2        SUB   sp, 4
     3        MOV   ax, 0Ah
     4        MOV   [bp-2], ax
     5        MOV   [bp-4], ax
     6        LEA   ax, [bp-4]
     7        PUSH  ax
     8        CALL  near ptr proc_1
     9        POP   cx
    10  L1:   MOV   ax, [bp-4]
    11        CMP   ax, [bp-2]
    12        JNE   L2
    13        PUSH  word ptr [bp-4]
    14        MOV   ax, 0AAh
    15        PUSH  ax
    16        CALL  near ptr printf
    17        POP   cx
    18        POP   cx
    19        MOV   sp, bp
    20        POP   bp
    21        RET
    22  L2:   MOV   ax, [bp-4]
    23        CMP   ax, [bp-2]
    24        JGE   L1
    25        LEA   ax, [bp-2]
    26        PUSH  ax
    27        CALL  near ptr proc_1
    28        POP   cx
    29        JMP   L1

Example from the thesis of C. Cifuentes, p. 79. The basic blocks are:
- 0 to 8: call (ends with a call)
- 9: fall (the next instruction carries the label L1)
- 10 to 12: 2-way (conditional)
- 13 to 16: call (ends with a call)
- 17 to 21: return (ends with a ret)
- 22 to 24: 2-way (conditional)
- 25 to 27: call (ends with a call)
- 28 to 29: 1-way (unconditional)

An implementation using an adjacency matrix would be highly inefficient because the degree of the graph is only 2. Therefore, we use a list of predecessors and a list of successors for each node. If we use Java or the C++ STL, we do not need extra variables for their sizes; otherwise we do, as in C, because the size is not included in the list structure.

We also need to do a little bit of optimization. Typically, if the target of a jump is also a jump, then we need to condense the chain. For example:

    Before:             After:
        jmp Lx              jmp Lz
        …                   …
    Lx: jmp Ly          Lx: jmp Lz
        …                   …
    Ly: jmp Lz          Ly: jmp Lz

Therefore, we check the target address of every jump, and we modify it only if the target is an unconditional jump. Indeed, we cannot do the modification if the target is a conditional jump, because that would involve rearranging several instructions, which we do not wish to do yet.

Very close to the control flow graph is the call graph: each node represents a subroutine of the program, and each edge is a call. We will not go further into detail on the topic; the following sources provide a good introduction:
● B. G. Ryder, Constructing the call graph of a program. IEEE Transactions on Software Engineering, May 1979.
● D. Callahan, A. Carle, M. W. Hall and K. Kennedy, Constructing the procedure call multigraph. IEEE Transactions on Software Engineering, April 1990.
● M. W. Hall and K. Kennedy, Efficient call graph analysis. Letters on Programming Languages and Systems, September 1992.
● J. Bohnet and J. Döllner, Visual Exploration of Function Call Graphs, ACM, September 2006.
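The jump-chain condensation above can be sketched in a few lines; the dictionary encoding of unconditional jumps is my own, and a cycle guard is needed so that a loop of jumps does not make the pass run forever:

```python
# Sketch: condense chains of unconditional jumps (jmp Lx; Lx: jmp Ly; ...).
# 'jumps' maps a label to the label that its *unconditional* jump targets;
# conditional jumps are deliberately absent, per the rule in the text.
def resolve(label, jumps):
    seen = set()
    while label in jumps and label not in seen:  # stop at real code or a cycle
        seen.add(label)
        label = jumps[label]
    return label

uncond = {"Lx": "Ly", "Ly": "Lz"}  # Lz holds a real (non-jump) instruction
print(resolve("Lx", uncond))       # -> Lz
```

Every jump in the program is then rewritten to point at its resolved target.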


● D. Grove and C. Chambers, A framework for call graph construction algorithms. ACM Transactions on Programming Languages and Systems, November 2001.

5. Problems of Structuring a CFG

We have a control flow graph of an assembly program, containing only jumps, and we would like to recover higher-level structures from it, such as loops and if-else statements. Before trying to solve this problem, we would like to know whether a solution is always possible: can we find higher-level structures for every program with jumps (or gotos)?

If it were not possible, it would mean that some structures are only available in low-level programming; in other words, structured programming would be quite weak. Fortunately, this is not the case. Corrado Böhm and Giuseppe Jacopini [5] have proven that every program with gotos can be transformed so that there are only choices and loops, possibly with duplicated code and/or the addition of Boolean variables. The goto-elimination problem was very well known in the ‘70s: many people worked on algorithms to eliminate the gotos in a program so that it becomes structured.

The first reaction of a student at this point would be to say: “we saw that every control structure has its own pattern, therefore we could do pattern-matching in the CFG”. However, finding a homomorphic image of a graph (the pattern) in another graph (the target) is known as the subgraph homomorphism problem and is NP-complete [6]. For example, a naive algorithm generates each possible mapping from the n nodes of the pattern to the m nodes of the target and tests whether it is a graph homomorphism; in the worst case, O(m^n) tests. Therefore, we need to look for other methods.

The resulting structured program must remain functionally equivalent to the original under the transformation. Otherwise, the resulting program could give a result different from the original for some input data, which would not make any sense.

The first method is the introduction of new Boolean variables [7,8,9,10,11]. This is functionally equivalent, but it modifies the semantics. Indeed, some variables will be in the decompiled program even though they were not in the original one. The thought behind the algorithm is not the same, and we can have a difficult time trying to understand a program with a lot of Boolean variables.

The second method uses code replication to structure the graph [12,13,14,15]. As with the previous method, the result is functionally equivalent to the original code, but the semantics and even the structure have been modified.

Some people concluded that it was not possible to structure the graph with only control structures such as while, do-while, and if-then-else without modifying the semantics. Therefore, they investigated other control structures.

Brenda S. Baker [16] used the following control structures: if-then-else, repeat (an infinite loop), multilevel break (a branch to the statement following an enclosing repeat statement), multilevel next (a branch to the next iteration of an enclosing repeat statement), stop, and goto whenever the graph cannot be structured using the previous structures. However, these structures are not available in all languages; for example, her algorithm cannot be applied if our target high-level language is C or Java. We also consider that the resulting program, even if it improves on assembly, is still not very clear in its control structures.
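The Boolean/state-variable family of methods can be illustrated with the classic single-loop transformation: every label becomes a value of a new state variable, and one loop dispatches on it. This is a sketch of the general idea, not the specific algorithm of any of [7,8,9,10,11]; the gcd example and all names are mine:

```python
# Sketch: goto elimination via a state variable. The original pseudo-code is
#   L1: if b == 0 goto end
#       a, b = b, a % b
#       goto L1
#   end: return a
# Each label becomes a state; a single while loop replaces all gotos.
def gcd_without_gotos(a, b):
    state = "L1"
    while state != "done":
        if state == "L1":
            state = "end" if b == 0 else "body"
        elif state == "body":
            a, b = b, a % b
            state = "L1"
        elif state == "end":
            state = "done"
    return a

print(gcd_without_gotos(12, 18))  # -> 6
```

The result is functionally equivalent, but the variable `state` did not exist in the original program — exactly the semantic drawback the text points out.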

Our goal would be to have a structured program which is functionally and semantically equivalent to the original; therefore, we would not like to create variables or duplicate any part of the code. However, such algorithms produce control structures that are not available in most languages (except Ada) and are not very clear. Therefore, we need interval theory.

6. Some Obfuscation Techniques

The design of decompilation algorithms is based on properties of the intermediate code, which is assembly in our case. For example, once we have studied the mechanisms for the flow of control, we are able to create a CFG generator and a call graph generator. The best that an obfuscator can do is to break a property: if a property no longer holds, then everything based on it will lead to wrong results, or the decompilation algorithm may not even terminate.

For example, one could assume that a basic block is simply an instruction sequence whose last instruction transfers the flow of control, or whose first instruction is a label. Some tests are run on classic code samples, and everything works. However, this model is flawed. Indeed, we chose to begin a block at a label because a label can be the target of an instruction, but it is not the only case: a target can also be given by an address, and then any part of the code can be targeted, as we see below.

     0            sub $sp, 4
     4            sw  $ra, ($sp)
     8            jal funct133
    12            add $a0, $a0, 4    ← this part will never be executed
    16            jal funct400
    20            lw  $ra, ($sp)
    24            add $sp, 4
    28            jr  $ra
    32  funct133: add $ra, $ra, 48
    36            jr  $ra
    40  funct134: add $a0, 4
    44            sub $sp, 4
    48            sw  $ra, ($sp)
    52            sub $a0, 4
    56            lw  $ra, ($sp)
    60            add $sp, 4
    64            jr  $ra

funct134 should be divided into two blocks, because there is an entry point at address 60: the jal at address 8 sets $ra to 12, and funct133 adds 48 to it before returning, so control resumes at address 60. This is why the definition of a basic block begins a new block at every entry point. A label is easy to detect; otherwise, it depends on the complexity of the techniques used to obfuscate the flow of control. The technique illustrated above is called ‘modification of the return address’ and is used, among other tricks, by viruses.

Furthermore, this example highlights another important point: a function does not necessarily return to the instruction following the call instruction. Indeed, the jal at address 8 transfers the flow of control to funct133, but the next instruction is never executed. For comparison, this is not the case in modern languages such as Java: after a call to a method foo(), sooner or later it returns and the next instruction is executed (the only exception being the exit of the program).

An obfuscator can modify some return addresses so that some parts of the code are unreachable, and then insert garbage code in these parts. It is hard to detect that these parts will never be executed, so we will try to decompile them: this can be slow because the code has no meaning, and the result can confuse the reader.

One may suggest handling function calls as special cases in the decompilation algorithm, at a higher cost in time, since the number of calls should be reasonable. However, every unconditional jump can be turned into a call, using a branch function, as defined in [17].

Definition 3. Let φ be a finite map over locations in a program: φ = {a1 → b1, …, an → bn}. A branch function fφ is a function that, whenever it is called from one of the locations ai, causes the flow of control to be transferred to the corresponding location bi, 1 ≤ i ≤ n.
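The return-address arithmetic of funct133 can be checked with a toy model of MIPS jal semantics (word size and function name below match the example; the model ignores everything else about the machine):

```python
# Toy model of the 'modification of the return address' trick:
# jal stores the address of the instruction after the call in $ra, and
# funct133 adds an offset before returning, so control does NOT resume
# at the instruction following the call.
WORD = 4

def successor_after_call(call_addr, ra_offset=48):
    ra = call_addr + WORD      # MIPS: $ra = address right after the jal
    return ra + ra_offset      # funct133: add $ra, $ra, 48 ; jr $ra

print(successor_after_call(8))  # -> 60, the hidden entry point in funct134
```

A CFG generator that blindly adds a fall-through edge from the call to address 12 would therefore mark dead code as reachable and miss the real successor.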

Given a branch function fφ, the n unconditional jumps (or branches) can be replaced by calls to the branch function, as illustrated below:

    Before:             After:
    a1: j b1            a1: jal fφ
        …                   …
    a2: j b2            a2: jal fφ
        …                   …
    an: j bn            an: jal fφ

On the right, fφ transfers the flow of control to b1, b2, …, bn respectively, according to the call site.
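The map φ can be sketched as a dictionary from call sites to targets; the dispatcher below is a toy model of fφ (the addresses are hypothetical, and this is not Linn and Debray's actual implementation, which hides φ inside the binary):

```python
# Toy branch function: phi maps each call site a_i to its real target b_i.
# A naive disassembler sees ordinary calls and predicts that control
# returns to the next instruction -- exactly what the obfuscation exploits.
WORD = 4
phi = {0: 40, 8: 52, 16: 24}  # a_i -> b_i (hypothetical word addresses)

def branch_function(call_site):
    # Simulate f_phi: called from a_i, transfer control to b_i instead of
    # falling back to call_site + WORD.
    return phi[call_site]

assumed = {a: a + WORD for a in phi}             # naive disassembler's view
actual = {a: branch_function(a) for a in phi}    # where control really goes
print(assumed == actual)  # -> False: every naive prediction is wrong
```

Since every call now looks identical, the real targets b1, …, bn disappear from the static jump structure, and the CFG built from the naive view is wrong at all n sites.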

Bibliography

[1] Joel Moses, Algebraic Simplification: A Guide for the Perplexed. Project MAC, MIT.
[2] Henry F. Ledgard and Michael Marcotty, A Genealogy of Control Structures. December 1974.
[3] http://www-sal.cs.uiuc.edu/~vadve/, homepage of Associate Professor Vikram Adve.
[4] Cristina Cifuentes, Reverse Compilation Techniques. PhD Thesis, Queensland University of Technology.
[5] C. Böhm and G. Jacopini, Flow diagrams, Turing machines and languages with only two formation rules. Communications of the ACM, May 1966.
[6] G. Valiente and C. Martinez, An algorithm for graph pattern-matching. International Informatics Series, Carleton University Press, 1997.
[7] E. Ashcroft and Z. Manna, The translation of ‘go to’ programs to ‘while’ programs. Technical report, Stanford University, Department of Computer Science, 1971.
[8] M. H. Williams and H. L. Ossher, Conversion of unstructured flow diagrams to structured form. The Computer Journal, 1978.
[9] A. L. Baker and S. H. Zweben, A comparison of measures of control flow complexity. IEEE Transactions on Software Engineering, November 1980.
[10] M. H. Williams and G. Chen, Restructuring Pascal programs containing goto statements. The Computer Journal, 1985.
[11] A. M. Erosa and L. J. Hendren, Taming control flow: A structured approach to eliminating goto statements. ACAPS Technical Memo 76, School of Computer Science, McGill University, September 1993.
[12] D. E. Knuth and R. W. Floyd, Notes on avoiding ‘go to’ statements. Information Processing Letters, 1971.
[13] M. H. Williams, Generating structured flow diagrams: the nature of unstructuredness. The Computer Journal, 1977.
[14] M. H. Williams and H. L. Ossher, Conversion of unstructured flow diagrams to structured form. The Computer Journal, 1978.
[15] G. Oulsnam, Unravelling unstructured programs. The Computer Journal, 1982.
[16] Brenda S. Baker, An Algorithm for Structuring Programs. Bell Laboratories, 1977.
[17] C. Linn and S. Debray, Obfuscation of Executable Code to Improve Resistance to Static Disassembly. University of Arizona.
