[RevEng] Learn reverse engineering: but where to start?

I've heard many times that x86 assembly is hard to learn.

One reason is that the x86 arch (or its visible API) is baroque. Linus Torvalds used this term as well: "baroque instruction encoding".

Also: It is well-known that the x86 instruction set is baroque, overcomplicated, and redundantly redundant.

Corresponding entries from The Jargon File: baroque, rococo.

The 16-bit 8086 CPU was toyish and not taken very seriously at the time of its introduction (1978). No one could imagine this arch would be number one one day. It was already an extension of the 8-bit 8080 CPU, which is why the 8086 has 8-bit parts of registers, like AL and AH. Since then, it has been extended at least twice: to a 32-bit arch, then to a 64-bit arch.

Now you have the RAX register, which has a 32-bit part (EAX). In turn, the EAX register has a 16-bit part (AX), which, in turn, has 8-bit parts (AH, AL).
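A minimal sketch of that nesting in plain C. The helper names (eax_of, ax_of, ah_of, al_of) are made up for illustration. They only model the layout with integer arithmetic and do not touch real CPU registers.

```c
#include <stdint.h>

/* Hypothetical helpers: each one extracts the part of a 64-bit value
   that the corresponding sub-register of RAX would expose. */

/* EAX: the low 32 bits of RAX. */
uint32_t eax_of(uint64_t rax) { return (uint32_t)(rax & 0xFFFFFFFFu); }

/* AX: the low 16 bits of EAX (and of RAX). */
uint16_t ax_of(uint64_t rax) { return (uint16_t)(rax & 0xFFFFu); }

/* AH: the high byte of AX, i.e. bits 8..15. */
uint8_t ah_of(uint64_t rax) { return (uint8_t)((rax >> 8) & 0xFFu); }

/* AL: the low byte of AX, i.e. bits 0..7. */
uint8_t al_of(uint64_t rax) { return (uint8_t)(rax & 0xFFu); }
```

For example, if RAX holds 0x1122334455667788, then EAX is 0x55667788, AX is 0x7788, AH is 0x77 and AL is 0x88.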

Many other parts of x86 also resemble a matryoshka (Russian nesting doll).

And all this may be very confusing.

(I myself started learning 16-bit 8086 in ~1993-1994, ignoring 32-bit 80386 arch. It was easier at the time.)

When a CPU is developed from scratch, no one would reintroduce such vestiges. RISC CPUs never had anything like that.

So a better idea is to start by learning a RISC arch, ignoring x86 for a moment.

Bottom line (TL;DR): I would recommend starting with ARM64. 32-bit ARM is complicated by its Thumb modes. MIPS is complicated by delay slots. So you have ARM64 left. ARM64 devices are popular these days, pick any.

Then switch to x86/x64 -- there is much more code (software) for this arch, even today. But after ARM64, this transition will be smooth.

Also, I would say that understanding a big endian arch is easier than a little endian one. BE is the natural order: bytes are stored in the same order in which you write the number. LE is so confusing for newbies. In fact, little endian representation was an ad-hoc hack in early Intel CPUs, which the 8086 inherited from its 8-bit predecessors.

BE is still used in many places. TCP/IP protocols use it (AKA 'network order'). Bignums are often encoded in BE, like Diffie-Hellman values in the SSH and SSL/TLS protocols. Hexdumps containing bignums in BE form are easier to understand.
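To see the difference, here is a small sketch in C. The function names store_be32 and store_le32 are my own, not from any library. They write the same 32-bit value into a byte buffer in both orders, using shifts so the result does not depend on the host CPU.

```c
#include <stdint.h>

/* Big endian ('network order'): most significant byte first. */
void store_be32(uint32_t v, uint8_t out[4]) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Little endian: least significant byte first. */
void store_le32(uint32_t v, uint8_t out[4]) {
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}
```

For the value 0x12345678, the BE buffer reads 12 34 56 78 in a hexdump, the same order as the number is written. The LE buffer reads 78 56 34 12.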

So if you can first learn ARM64 switched to BE, that can be easier, I think.

Should you learn pure C/C++ before assembly? This is the question I hear most often.

Now this is my story.

It was the end of the 1980s and the beginning of the 1990s. My first PL was a toyish Basic. My second PL was Pascal.

Pascal wasn't regarded as a 'serious' PL, so I began to learn pure C. It was painful, mainly because of pointers.

So I compiled small C programs and compared them with the assembly output. I used early versions of Borland C++. At that moment I had no intention of learning assembly, because I'd heard that it's so hard.

But learning pure C and assembly 'in parallel' helped me a lot. In other words, I learnt them simultaneously.
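Here is the kind of small program I mean. This is only a sketch: swap_ints is an example I made up, and I assume a GCC-like compiler here (not the Borland C++ I used back then), where something like `gcc -S -O1 swap.c` writes the assembly listing to swap.s.

```c
/* Swap two ints through pointers. In the assembly listing, each
   dereference (*a, *b) typically shows up as a load or a store
   through the register that holds the pointer. */
void swap_ints(int *a, int *b) {
    int tmp = *a;  /* read the value a points to */
    *a = *b;       /* copy the value at b into the location at a */
    *b = tmp;      /* store the saved value at b */
}
```

Reading the listing next to the source shows that a pointer is just an address kept in a register or in memory, which is exactly the part of C that was painful for me.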

Then there was C++, and I had problems with OOP, especially virtual methods. Again, I compiled small programs and dug into the assembly code.
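What you typically find in the assembly code is a table of function pointers. Below is a rough model of that in C. All the names (shape, shape_vtable, make_rect and so on) are invented for this sketch, and real C++ compilers differ in details. The C++ standard doesn't even require this layout, it is just the common implementation.

```c
/* A hand-written model of a virtual call: the object carries a pointer
   to a table of function pointers, and the call goes through it. */
struct shape;

struct shape_vtable {
    int (*area)(const struct shape *self);  /* one slot per 'virtual method' */
};

struct shape {
    const struct shape_vtable *vptr;  /* hidden in C++, explicit here */
    int w, h;
};

static int rect_area(const struct shape *s) { return s->w * s->h; }
static int tri_area(const struct shape *s) { return s->w * s->h / 2; }

static const struct shape_vtable rect_vtable = { rect_area };
static const struct shape_vtable tri_vtable = { tri_area };

struct shape make_rect(int w, int h) {
    struct shape s = { &rect_vtable, w, h };
    return s;
}

struct shape make_tri(int w, int h) {
    struct shape s = { &tri_vtable, w, h };
    return s;
}

/* The 'virtual call': load the table pointer, load the slot, call it. */
int shape_area(const struct shape *s) {
    return s->vptr->area(s);
}
```

Calling shape_area() on an object from make_rect(3, 4) gives 12, and on one from make_tri(3, 4) gives 6, with the same call site in both cases.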

Then I used Watcom C++ 11, which had that 'fastcall' method of passing function arguments via registers. Also, it optimized assembly code aggressively, so it was shorter and somewhat easier to understand.

Also, I liked to write C programs, compile them, then hand-optimize the assembly code and reassemble it using Turbo Assembler (tasm.exe, what nostalgic memories). Sometimes it wasn't (completely) workable after my modifications, but it was such great fun. I think this experience is very important for everyone who wants to get a better feel for assembly.

This experience of mine may also be helpful to someone else.

Good knowledge of pure C is critical for a reverse engineer, because both Hex-Rays and Ghidra sometimes produce horrible output. Often, both these decompilers can't detect arrays and use pointer arithmetic instead. The code is still correct, but harder to understand.
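To illustrate, here are two equivalent functions. The second one is my hand-written imitation of the decompiler flavour, not actual Hex-Rays or Ghidra output. The names (sum_indexed, sum_decompiler_style, a1, a2, v) are made up.

```c
#include <stddef.h>

/* How a programmer would normally write it: array indexing. */
int sum_indexed(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Imitation of decompiler style: the array is not recognized, so the
   element is reached via a byte offset from an untyped pointer. */
int sum_decompiler_style(const void *a1, size_t a2) {
    int v = 0;
    for (size_t i = 0; i < a2; i++)
        v += *(const int *)((const char *)a1 + sizeof(int) * i);
    return v;
}
```

Both functions return the same result for the same input. If you know C well, you can see through the second form and recognize `a[i]` behind the casts.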

(This post was first published on 20221109. Updated in Oct-2023.)
