Dynamic binary analysis using myrrh

myrrh has progressed substantially since the previous blog article. Development has focused primarily on support for Remote Procedure Calls (RPC), which external applications can issue.

Some of myrrh’s new features are listed below:

  • myrrh now contains both a JSON decoder and a JSON encoder, each nearly fully compliant with the JSON standard. JSON support lets external applications encode a request as a JSON object and send it over the established TCP/IP connection.
  • A VNC server has been implemented, so that one no longer has to run myrrh on one’s local system in order to see the screen output.
  • Many more remote procedure calls are available to the client, giving it more control over the emulated system.
  • Breakpoints can now be grouped in combinations.
  • Register values from both before and after a breakpoint hit can now be requested.
  • The phasing out of myrrh’s internal debugging console has commenced, and a similar console written entirely in Python has been added to the project’s design goals. Eventually myrrh will work as a “headless” emulation server.
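
To make the JSON item concrete, here is a minimal sketch of how a client might encode a request and decode a reply. Only the "ReturnValues" key with string indices such as "1" is taken from the scripts later in this post. The request field names ("Function", "Arguments"), the helper names and the simulated reply are hypothetical and do not describe myrrh's actual wire format. The code is written to run under both Python 2, which the scripts here use, and Python 3.

```python
import json

def encode_request(function_name, *arguments):
    # Hypothetical request layout: the "Function" and "Arguments" keys are
    # assumptions for illustration, not myrrh's documented protocol.
    request = {"Function": function_name, "Arguments": list(arguments)}
    return json.dumps(request, sort_keys=True)

def decode_return_value(reply_text, index=1):
    # The scripts in this post read replies as reply["ReturnValues"]["1"].
    # Return None when the reply carries no return values.
    reply = json.loads(reply_text)
    return reply.get("ReturnValues", {}).get(str(index))

request_text = encode_request("disassemble", 0x1000, 1)

# Made-up reply text standing in for what the server would send back.
simulated_reply = '{"ReturnValues": {"1": "NOP"}}'

print(request_text)
print(decode_return_value(simulated_reply))
```

The encoded string is what would travel over the TCP/IP connection; the socket handling itself is left out of the sketch.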

I will now focus on several Python scripts for interaction with myrrh that I have written. These examples focus on the dynamic or runtime analysis of binary code.

A script to generate control-flow graphs

I’ll start out with this assembly language program:

[assembly]
global main

main:

mov cx, 5

first:

push cx
mov cx, 10

second:
loop second

pop cx

loop first
jmp main

function:
nop
ret
[/assembly]

I assemble it and save the result as program.bin. Now, consider the following Python script:

import myrrh
import json
import matplotlib.pyplot as plt
import networkx as nx

def create_edge_data(previous):

    if previous == False:
        address = m.get_absolute_address()
    else:
        address = m.get_absolute_address_p()

    datastr = hex(address)
    disasm = json.loads(m.disassemble(address, 1))["ReturnValues"]["1"]

    datastr = datastr + " (" + disasm + ")"

    return datastr

G = nx.Graph()

# Create a class instance
m = myrrh.myrrh()

# Connect to the myrrh server
m.connect("localhost", 5000)

# Configure (do not load the BIOSes or ROM BASIC)
print m.configure(m.EXEC_FLAG_NONE)

print m.start()

# Load the binary at absolute address 0
print m.load_binary(0x0000, "program.bin")

# Let CS:IP point to absolute address 0
print m.set_register_value(m.R_CS, 0x0000)
print m.set_register_value(m.R_EIP, 0x0000)

# Set a breakpoint on every type of branch
m.set_breakpoint_branch(m.BRANCH_FLAG_ALL)

print "Collecting branch node data, please wait.."

# Encounter a branch 1000 times
for x in range(0, 1000):

    # Label the current position (absolute address and disassembly)
    fromtotal = create_edge_data(False)

    # Run until a breakpoint is encountered
    m.run()

    # Breakpoint encountered; we want the location from just before the
    # branch was taken, hence get_absolute_address_p() (p for previous)
    tototal = create_edge_data(True)

    print str(x) + "|" + fromtotal + " - " + tototal

    # Add it to the graph
    G.add_edge(fromtotal, tototal)

    # Now graph a line from the previous position to the current position
    # (ie. the position where the branch jumped to)
    fromtotal = tototal

    tototal = create_edge_data(False)

    G.add_edge(fromtotal, tototal)

print "Done"

# Call the exit function of myrrh to make it halt
m.exit()

# Draw the graph

pos=nx.spring_layout(G, scale=3)
# nodes
nx.draw_networkx_nodes(G,pos,node_size=60)
# edges
nx.draw_networkx_edges(G,pos, width=1)
# labels
nx.draw_networkx_labels(G,pos,font_size=35,font_family='sans-serif')

plt.axis('off')
plt.savefig("myrrh_code_path.png") # save as png
plt.show() # display

As you can see, the script above puts a breakpoint on every type of branch. A branch can be a JMP, CALL, RET and so forth: anything that alters (E)IP. On each of the thousand loop iterations, the script records the location (CS:IP) at which execution resumed and the location of the instruction right before the branch took effect, and joins the two with add_edge. It then adds a second edge from that branch instruction to the current CS:IP (where the branch jumped to). This way we get a nice graph of the program’s code paths:

[Image: blogpost1]
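
The edge bookkeeping in the loop can be tried without a running emulator. The sketch below replays a hand-written list of branch events in place of m.run(). The event tuples and the record_branches name are made up for illustration, and the addresses are hypothetical rather than read from program.bin.

```python
def record_branches(events):
    # Each event is (resume_address, branch_address, target_address):
    # where execution resumed, the branch instruction that stopped it,
    # and the address the branch transferred control to.
    edges = set()
    for resume, branch, target in events:
        edges.add((resume, branch))   # straight-line run up to the branch
        edges.add((branch, target))   # the branch itself
    return edges

# Hypothetical trace: an inner loop at 0x7 that branches to itself twice,
# then falls through to an outer loop at 0xa that jumps back to 0x3.
events = [
    (0x0, 0x7, 0x7),
    (0x7, 0x7, 0x7),
    (0x7, 0xa, 0x3),
]

for source, destination in sorted(record_branches(events)):
    print(hex(source) + " -> " + hex(destination))
```

Storing edges as ordered pairs keeps the direction of each transfer. The script's nx.Graph() is undirected; networkx's DiGraph would be the directed alternative.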

Let’s modify the assembly language program slightly:

[assembly]
global main

main:

mov cx, 5

first:

push cx
mov cx, 10

second:
call function
loop second

pop cx

loop first
jmp main

function:
nop
ret
[/assembly]

As you can see, a call to ‘function’ has been added to the inner loop of the program. Running the script again produces the following image:

[Image: blogpost2]

Although this is a simple Python script driving a very basic assembly language program, it does show the power of scripting the emulator: it lets one gather data and produce interesting results with very few lines of code.

Runtime detection of self-modifying code

Now for a more advanced example. Consider the following program:

[assembly]
global main

org 0x1000

main:

mov cx, 5
first:

push cx

mov cx, 10
second:
mov byte [thenop], 0x90
call function
loop second

pop cx

loop first

inc byte [abyte]
jmp main

abyte db 0
function:
thenop db 0
ret
[/assembly]

And consider the following Python script:

import myrrh
import json

def format_current_instruction(previous = False):
    if previous == False:
        addr = m.get_absolute_address()
    else:
        addr = m.get_absolute_address_p()

    disasm = m.disassemble(addr, 1)

    line = hex(addr) + " - " + json.loads(disasm)["ReturnValues"]["1"].lower()

    return line

def load_program():
    m.reboot()
    # Load the binary at absolute address 0x1000 (matches "org 0x1000")
    m.load_binary(0x1000, "program.bin")

    # Let CS:IP point to absolute address 0x1000
    m.set_register_value(m.R_CS, 0x0000)
    m.set_register_value(m.R_EIP, 0x1000)

def find_code_bytes():
    load_program()

    code_bytes = ()

    for x in range(1000):
        addr = m.get_absolute_address()

        code_bytes = code_bytes + (addr, )

        m.run(1)

    code_bytes = tuple(set(code_bytes))

    return code_bytes

def has_self_modifying_code(code_bytes):
    load_program()

    for code_byte in code_bytes:
        m.set_breakpoint_memory_write(code_byte, code_byte)

    self_modifying_instructions = ()

    found = False

    for x in range(1, 100):
        runreturn = json.loads(m.run(1000))

        if "ReturnValues" in runreturn:
            self_modifying_instructions = self_modifying_instructions + (format_current_instruction(True), )
            found = True

    self_modifying_instructions = tuple(set(self_modifying_instructions))

    return (found, self_modifying_instructions)

def find_reads_writes():
    load_program()

    m.set_breakpoint_memory_read(0x00000, 0xFFFFF)
    m.set_breakpoint_memory_write(0x00000, 0xFFFFF)

    reads = []
    writes = []

    for x in range(0, 1000):

        X = json.loads(m.run())

        line = format_current_instruction(True)

        if X["ReturnValues"]["1"] == 1:
            reads.append(line)
        else:
            writes.append(line)

    reads = list(set(reads))
    writes = list(set(writes))

    m.delete_breakpoint(1)
    m.delete_breakpoint(2)

    return (reads, writes)

m = myrrh.myrrh()

m.connect("localhost", 5000)
m.configure(0)
m.start()

reads, writes = find_reads_writes()

print "Reads occurred from these addresses:"
print
for X in reads:
    print X
print

print "Writes occurred from these addresses:"
print
for X in writes:
    print X
print

code_bytes = find_code_bytes()

contains, disasm = has_self_modifying_code(code_bytes)

if contains == True:
    print "Contains self-modifying code:"
    print
    for line in disasm:
        print line
else:
    print "Does not contain self-modifying code"

m.exit()

I will now explain how it works.

First, after a connection with the myrrh server has been established, the script gathers two lists: the instructions that read from memory and the instructions that wrote to it. This is accomplished by putting both a memory read breakpoint and a memory write breakpoint on the entire emulated memory. In a loop, the program is run until a breakpoint is hit, a thousand times over, and each time it is recorded whether a read or a write occurred. Both lists are returned to the caller and their contents are printed to the screen.
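
The range 0x00000 to 0xFFFFF passed to both breakpoints is the 1 MiB that real-mode segment:offset addressing can reach on an 8086. The sketch below shows that arithmetic. absolute_address is a hypothetical helper, and whether myrrh's get_absolute_address() computes exactly this (including the 20-bit wraparound) is an assumption.

```python
def absolute_address(segment, offset):
    # Real-mode rule: shift the 16-bit segment left by four bits and add the
    # 16-bit offset. Masking to 20 bits mimics the 8086's address wraparound;
    # that myrrh does the same is an assumption.
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(absolute_address(0x0000, 0x1000)))
print(hex(absolute_address(0x0100, 0x0000)))
```

Both calls yield 0x1000: different CS:IP pairs can name the same byte, so absolute addresses are the safer key for comparisons.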

Then a list of addresses containing code that is actually executed during the first thousand instructions is gathered. This is done by stepping (running a single instruction) a thousand times and recording the absolute address of CS:IP each time. The resulting list is purged of duplicates and returned to the caller.
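
Note that tuple(set(...)) removes duplicates but does not keep the order in which addresses were first executed. If that order matters, a small helper along these lines could be used instead. unique_in_order is a hypothetical name and the addresses are sample values.

```python
def unique_in_order(addresses):
    # Keep the first occurrence of every address and drop later repeats.
    seen = set()
    ordered = []
    for address in addresses:
        if address not in seen:
            seen.add(address)
            ordered.append(address)
    return tuple(ordered)

# Sample trace with repeats, standing in for a thousand single steps.
trace = [0x1000, 0x1003, 0x1004, 0x1007, 0x100c, 0x101b, 0x101c, 0x100f, 0x1007]
print([hex(a) for a in unique_in_order(trace)])
```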

The function has_self_modifying_code() then puts a write breakpoint on each of those code addresses. If the program tries to modify its own code, the breakpoint is triggered. Finally, a list of the instructions that modified code is returned to the caller.
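
Stripped of the emulator, the check comes down to asking whether any write lands on an address that was also executed. The sketch below replays that idea on hand-written data. The addresses are taken from the sample output further down, while the function name and the (instruction, target) tuple layout are illustrative assumptions.

```python
def find_self_modifying_writes(executed, writes):
    # executed: addresses at which an instruction was seen to start.
    # writes: (instruction_address, written_address) pairs.
    # Returns the instructions whose write target was also executed.
    executed = set(executed)
    return sorted(set(insn for insn, target in writes if target in executed))

executed = [0x1003, 0x1007, 0x100c, 0x1011, 0x1014, 0x101b, 0x101c]
writes = [
    (0x1007, 0x101b),  # mov byte [thenop], 0x90 -> thenop is later executed
    (0x1014, 0x101a),  # inc byte [abyte]        -> abyte is never executed
]

print([hex(a) for a in find_self_modifying_writes(executed, writes)])
```

Only the write from 0x1007 is reported, because 0x101a never shows up among the executed addresses.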

If I assemble the program listed above to program.bin and run the Python script above, the following output is displayed:

Reads occurred from these addresses:

0x101c - ret
0x1011 - pop cx

Writes occurred from these addresses:

0x100c - call 000c
0x1014 - inc [101a]
0x1003 - push cx
0x1007 - mov [101b], 90

Contains self-modifying code:

0x1007 - mov [101b], 90

Notice that:

[assembly]
inc byte [abyte]
[/assembly]

is not flagged as self-modifying code, since abyte is never executed.

Again, this Python script demonstrates how easily and concisely dynamic analysis of binaries can be performed, a task that would likely be much more tedious with other methods or software.

P.S. I am aware that the code above does not detect writes to code bytes other than the first byte of an instruction. It would be easy to modify for that purpose, and the script above serves only as a proof of concept.
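
One way to close that gap would be to watch every byte of each executed instruction rather than only the first. The sketch below assumes the instruction lengths are available from somewhere, which the script above does not provide. The (address, length) pairs and the helper name are hypothetical. The resulting (first, last) pairs have the same shape as the two arguments passed to set_breakpoint_memory_write() in the script, which appears to take an inclusive range.

```python
def instruction_byte_ranges(instructions):
    # instructions: (start_address, length_in_bytes) pairs.
    # Returns inclusive (first_byte, last_byte) ranges, merging ranges
    # that touch or overlap so fewer breakpoints are needed.
    ranges = sorted((start, start + length - 1) for start, length in instructions)
    merged = []
    for first, last in ranges:
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged

# Illustrative lengths, not read from the emulator.
sample = [(0x1007, 5), (0x100c, 3), (0x101b, 1), (0x101c, 1)]

for first, last in instruction_byte_ranges(sample):
    print(hex(first) + " - " + hex(last))
```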