Compiling to Assembly from Scratch

121 points by AlexeyBrin 2 days ago

God, the title reminds me of when I took an x86 assembly class in college about a decade ago. Only 6 of the dumbest souls in the CS program dared to take the class that semester. The professor for the class was an ex-NASA computer engineer. Our tests used to be writing assembly by hand. We were graded for accuracy too. I swear, at that point in time, I could convert between Hex, Dec, Oct, and Binary almost without thinking. I made a D in the class with 40+ hours a week of studying. However, I learned more in that one class than in my entire degree.

Anyway, thank you, OP, for sharing this. I have been looking into picking up ARM as a way to crawl out of burnout from my career. It's been years since I even touched x86 with any seriousness. I will add this book to my list of resources.

alok-g a day ago

Ever tried writing machine code (not assembly) by hand? I used to do that a few decades back for an 8-bit microprocessor. I am still looking for good resources on how to do that for a modern processor.
- ReleaseCandidat a day ago
  
  You look up the opcodes for each assembler instruction in the ISA specification, like https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-.... Information about the machine code format starts at "2.2 Base Instruction Formats". The actual opcodes you can find in chapter 9, "RV32/64G Instruction Set Listings". But be aware that some assembler instructions (called pseudo instructions) generate more than one machine instruction.
  The linked document is an old one, current ones are at https://riscv.org/technical/specifications/
  - alok-g a day ago
    
    Awesome! Thanks!
- zmodem a day ago
  
  Here's an example for x86: https://www.hanshq.net/ones-and-zeros.html
- tomcam a day ago
  
  I’ve told this before, but it’s so amazing I like to give Tim credit. I worked on Visual Basic at Microsoft with Tim Paterson, who also created the operating system that became MS-DOS. He worked on code generation and debugged by looking at the opcodes in a hex dump. Assembly was too slow for him.
- pjc50 a day ago
  
  "ARM Architecture reference manual": https://documentation-service.arm.com/static/5f8daeb7f86e165... (assuming that link works)
  Start at section A5 describing the encoding. The instruction set is very much designed for clean decode, so instructions are grouped by bit pattern; every instruction in the manual has its bit pattern described.
  Very much a "but why?" situation, since translation from assembly to machine code is so easily automatable and doing it by hand adds so little value.
  - alok-g 20 hours ago
    
    >> but why?
    Agreed. More of a curiosity for me, for learning and research purposes.
- eterps a day ago
  
  You should definitely look at: https://github.com/akkartik/mu/blob/main/subx.md
  - eterps a day ago
    
    Also, running machine code directly can be done like this:
    https://github.com/eterps/loader/blob/master/syscall.nim
    
    alok-g a day ago
    
    This is cool! This is how I used to do also by embedding hand-written machine code within a BASIC program and calling it to run natively.
    
    eterps 20 hours ago
    
    > This is how I used to do also by embedding hand-written machine code within a BASIC program and calling it to run natively
    Me too; that's also the reason why I wanted that possibility back.
  - alok-g a day ago
    
    SubX seems like an assembly language itself following a subset of x86 32-bit instructions. Would looking into this help me understand how to translate from assembly to machine code manually? Thanks.
    
    akkartik 17 hours ago
    
    SubX is a weird thing (I built it) that is somewhere between machine code and Assembly language. You have to type in the opcodes directly, which people typically associate with machine code. But it smooths some aspects of programming in machine code. You'll get nice errors if you accidentally write invalid machine code, it won't just go off and run data as code or something like that.
    I'd be happy to support you if you choose to try it out! Ask as many questions as you like.
    Even if you choose not to, you might like the cheatsheet in the repo (from https://net.cs.uni-bonn.de/fileadmin/user_upload/plohmann/x8...)
    
    alok-g 13 hours ago
    
    Thanks! I understand better now.
    And that cheat sheet PDF is cool, exactly what I was looking for. Any chance you are aware of something similar for x64? Thanks.
    
    akkartik 12 hours ago
    
    No, sorry. Honestly I spent a long time going to the source of the 3-volume Intel manual. (There's a link to them as well in my Readme.) I think that's really what you need to do for machine code, if you're not using Assembly or something Assembly-ish that has done that work for you. But then any Assembly language will have its own manual you need to bone up on. That's mostly why I built SubX: the manual is like 10 pages, and I distilled down the parts of the Intel manual you need to know. But yeah, only for 32-bit. I always found x64 very hacky with the register bits split up between bytes and whatnot. 32-bit is a legitimately nice ergonomic machine.
- bitwize 16 hours ago
  
  Oh man you gave me flashbacks. The TRS-80 Model II's OS, TRSDOS-II, had a built-in debugger that was little more than a monitor. You could step through instructions, examine and write to memory, set breakpoints to absolute memory locations, and that was it. I remember hand-assembling tiny Z80 programs in that thing and jumping into them, just to test my understanding of how machine code programs worked and how the computer executed them, and being super thrilled when I could get an A to appear somewhere on the screen or something.
  The machine had a much more complete assembly language programming toolkit which I also used to write more sophisticated programs, employing this debugger to examine them. But I felt like I'd "cracked the code" of the computer when I plugged hex numbers into RAM and then ran them straight from there.
  Most CPU ISA documentation should give you the opcodes that correspond to instruction mnemonics. You may have to plug in your own operands (registers, etc.) into bit fields in the instruction encoding. If you're serious about hand-assembling to begin with this should be no problem.
- kragen 18 hours ago
  
  arm is the nicest instruction encoding you can get actual hardware for (not thumb or aarch64). risc-v is pretty okay at the assembly level but the instruction encoding is almost deliberately sadistic. amd64 isn't too terrible but not nearly as nice as arm
  older official arm documentation is a lot better than recent, which is very poor quality (though still pretty reliable.) oldnewthing and azeria-labs have good tutorials, though she got some of the condition flags wrong
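
As a rough illustration of the hand-assembly process described in the replies above (look up the opcode, then plug the operands into the bit fields), here is a minimal Python sketch that encodes one RISC-V instruction. It assumes the RV32I I-type layout from the "Base Instruction Formats" section of the spec; the helper names `encode_i_type` and `addi` are made up for this example and are not part of any tool.

```python
# Sketch: hand-assembling one RV32I instruction from its bit fields.
# I-type layout:  imm[11:0] | rs1    | funct3 | rd    | opcode
#                 31..20      19..15   14..12   11..7   6..0

OP_IMM = 0b0010011       # major opcode shared by ADDI and the other reg-immediate ops
FUNCT3_ADDI = 0b000      # funct3 value that selects ADDI within OP_IMM


def encode_i_type(opcode, funct3, rd, rs1, imm):
    """Pack I-type fields into a 32-bit instruction word (hypothetical helper)."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in 12 signed bits")
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register number out of range")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def addi(rd, rs1, imm):
    """Encode `addi x<rd>, x<rs1>, imm`."""
    return encode_i_type(OP_IMM, FUNCT3_ADDI, rd, rs1, imm)


word = addi(1, 0, 5)                         # addi x1, x0, 5
print(f"{word:08x}")                         # 00500093
print(word.to_bytes(4, "little").hex(" "))   # byte order in memory: 93 00 50 00
```

The same pattern (a table of opcodes plus a function per instruction format) extends to the other formats; only the field positions change.
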
kragen 18 hours ago
this could be easy or hard depending on what you had to write in assembly on the test. like it wouldn't be that hard to write a subroutine to add up an array of integers or something
```
    addem:   xor eax, eax
        loop:
             test ecx, ecx
             jnz ok
             ret
        ok:  add eax, [ebx + ecx * 4]
             dec ecx
             jmp loop
```
(i haven't tested this, it'd be hilarious if i got it wrong)
- hirvi74 14 hours ago
  
  The assembly on the test was much larger than the snippet you provided. I wish I had kept a copy of the old test so I could just copy an example problem. Honestly though, that wasn't the worst part of the tests. It was the most dangerous though because many of the questions/instructions relied on previously completed questions/instructions being correct.
  So, like if you passed a wrong value to a certain register, then downstream, every problem that used that register would be off.
  The tests were often 4 pages front and back with the handwritten assembly, definition matching, word problems, essay responses, etc. We had like 75 minutes to take the test too. This was all at a public, no-name state university too. I was no MIT student or anything.
  Anyway, I'll never forget the first day of class. Our professor said we were to have 4 tests and a final and some labs.
  My friend in the class: "If we make a 100 on all 4 tests, do we have to take the final?"
  The professor: "Hell, if you make 100 on all 4 tests, then I will let you write the final."
  My friend: "Why? Has no one ever done that before?"
  Professor: "No, in fact, in the 20 years I have taught this class, no one has ever scored a 100 on any test."
  We all knew we were in for hell after that.
  - kragen 14 hours ago
    
    that's awesome! i wish you'd kept a copy too
- gus_massa 17 hours ago
  
  Assuming ebx is the pointer to the array and ecx is the length, doesn't this sum the slots from 1 to ecx (inclusive) instead of 0 to ecx-1 (inclusive)?
  - kragen 17 hours ago
    
    hahaha, yes! i guess it wasn't as trivial as i thought. that's what i get for trying to be clever — guess i wouldn't have done that well on that exam ;)
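
To make the off-by-one pointed out above concrete, here is a small Python model of the loop. It is only a sketch: `words` stands in for the array that ebx points at, and the function names are invented for this example. The first function records which element indices the posted loop reads; the second shows one possible fix, decrementing ecx before the load.

```python
MASK32 = 0xFFFFFFFF  # eax is a 32-bit register, so sums wrap around


def indices_original(ecx):
    """Element indices the posted loop reads: it adds first, then decrements."""
    read = []
    while ecx != 0:        # test ecx, ecx / jnz ok
        read.append(ecx)   # add eax, [ebx + ecx*4]
        ecx -= 1           # dec ecx
    return read


def addem_fixed(words, ecx):
    """Sum words[0 .. ecx-1] by decrementing before the load."""
    eax = 0                # xor eax, eax
    while ecx != 0:        # test ecx, ecx
        ecx -= 1           # dec ecx first, so the index runs ecx-1 down to 0
        eax = (eax + words[ecx]) & MASK32
    return eax


print(indices_original(3))            # [3, 2, 1]
print(addem_fixed([10, 20, 30], 3))   # 60
```

An equivalent fix on the assembly side would be to keep the original instruction order and subtract one element from the address, as in `[ebx + ecx*4 - 4]`.
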

Joker_vD 13 hours ago

Wait a bloody second. Why does GAS look like this for ARM:

    push {ip, lr}

    ldr r0, =hello
    bl printf

    mov r0, #41
    add r0, r0, #1  // Increment

    pop {ip, lr}
    bx lr

    str r0, [r1]         /* M[r1] = r0; */
    ldr r0, [r1]         /* r0 = M[r1]; */
  
    str r0, [r1, #8]     /* M[r1 + 8] = r0; */
    ldr r0, [r1, #8]     /* r0 = M[r1 + 8]; */
  
    str r0, [r1, -r2]    /* M[r1 - r2] = r0; */
    ldr r0, [r1, -r2]    /* r0 = M[r1 - r2]; */

with the destination on the left-hand side, with no "%" before the register names, and with the square brackets for addresses (with relatively sane looking expressions inside those brackets), basically the same syntax that ARM's own assembler uses, while x86 gets some absolutely unhinged syntax that looks like it just fell out of the sky (or an abyss for that matter) since it has almost no relation to what's written in Intel's docs? I always assumed that GAS used some "unified" style for all its targets and disregarded the conventions of the CPU manufacturers but apparently no, it only did that for x86?

notepad0x90 6 hours ago

it isn't GAS but at&t that I think you're thinking of. but the difference between at&t (with the percent and dest on right) and intel is so minimal, it isn't worth getting frustrated much over, at least in my opinion.
Arm can get a bit gnarly too. that bx lr switches between arm32 and thumb based on the least significant bit of 'lr', which the pop just before it restored. the 'str' is moving data from left to right, but the ldr is moving data from right to left. you also end up doing weird and cumbersome things sometimes because of the instruction width limitations. I think arm64 looks and is much nicer than arm32 and thumb.
halst 12 hours ago

Such a sensible syntax, right? The official ARM docs use it too. Not even sure who invented it, ARM or GAS. Anyway, that's another good reason to start with ARM.
There's also legacy ARMASM syntax that is barely worth mentioning.
kragen 7 hours ago
as i understand it, for x86 gas had to be compatible with vendor assemblers that used shitty at&t assembly syntax, but on arm it had to be compatible with arm's assemblers instead
btw try
```
    .intel_syntax noprefix
```

ggorlen a day ago

As is often the case, the title is unfortunately overloaded. I initially read this as writing code in the Scratch programming language[1] that compiles to assembly.

[1]: https://en.wikipedia.org/wiki/Scratch_(programming_language)

userbinator a day ago

Given that we've seen things like C compilers written in Bash[1] and TLS libraries in Visual Basic[2], I think it's only a matter of time before someone with the right skills and motivation actually does it (or writes a compiler for $language in Scratch).
[1] https://github.com/otakuto/bashcc
[2] https://news.ycombinator.com/item?id=35882985
hyperbolablabla 21 hours ago

Yes, capitalisation of scratch was incredibly misleading in this case. It's easily checkable but slightly annoying. Hard to avoid though!
HumblyTossed 17 hours ago

Even though the S in Scratch is uppercase, I still read it correctly.
ithkuil a day ago

Ironically the word "overloaded" is overloaded too.
__MatrixMan__ a day ago

Me also. The actual thing is probably pretty cool, but now that my hopes were up I'm finding it insufficiently whimsical.
MiguelX413 a day ago

I also misunderstood it in that manner.

pjc50 a day ago

Was slightly assuming from the title that this would be something like "you have brought up a new machine with no tooling whatsoever; how do you bootstrap up to an assembler and your first real language?" such as the PDP-11 front panel toggles. But this version is probably more useful.

evnix a day ago

I really liked how short and concise these chapters are. What took me months of effort has been condensed to these few chapters and it's well worth the read.

Though why use 32-bit instead of 64? Why add so much friction for a first-time learner?

halst 20 hours ago

ARM32 is much simpler to explain compared to ARM64. Registers are so much simpler, conditional execution is orthogonal, three-operand form is consistent. ARM64 made all the right practical choices, but has doubled the complexity. Still much better than x86 with its 10x complexity.
ARM32 is a gem, in my opinion. A truly simple instruction set, and easy to get your hands on with Raspberry Pi or emulation.
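
A sketch of what that regularity looks like at the encoding level, in Python. It assumes the classic A32 data-processing immediate format (condition in bits 31..28, opcode in bits 24..21, then Rn, Rd and an 8-bit immediate) and only covers unrotated immediates and a handful of opcodes and conditions. The tables and the `encode_dp_imm` helper are made up for this example.

```python
# A32 data-processing, immediate operand:
#   cond | 00 | I=1 | opcode | S | Rn | Rd | rotate=0 | imm8
#  31..28        25   24..21  20 19..16 15..12 11..8    7..0

COND = {"eq": 0b0000, "ne": 0b0001, "al": 0b1110}   # subset of condition codes
OPCODE = {"add": 0b0100, "mov": 0b1101}             # subset of opcodes


def encode_dp_imm(op, rd, rn, imm8, cond="al", set_flags=False):
    """Encode `op{cond}{s} rd, rn, #imm8`; for mov, pass rn=0 (hypothetical helper)."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("this sketch only handles unrotated 8-bit immediates")
    if not (0 <= rd < 16 and 0 <= rn < 16):
        raise ValueError("register number out of range")
    return ((COND[cond] << 28) | (1 << 25) | (OPCODE[op] << 21)
            | (int(set_flags) << 20) | (rn << 16) | (rd << 12) | imm8)


print(f"{encode_dp_imm('mov', 0, 0, 41):08x}")             # mov   r0, #41     -> e3a00029
print(f"{encode_dp_imm('add', 0, 0, 1):08x}")              # add   r0, r0, #1  -> e2800001
print(f"{encode_dp_imm('add', 0, 0, 1, cond='eq'):08x}")   # addeq r0, r0, #1  -> 02800001
```

Making an instruction conditional changes only the top four bits, and the destination and first source register always sit in the same two fields.
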

pjmlp a day ago

I got the ebook when it came out, and it is relatively nice as a ramp-up into the world of compiler development.

stonethrowaway a day ago

Speaking of embedded systems, I have an old board, an offshoot of Zilog Z80 called Rabbit. I think recently Dave from EEVBlog took apart one of his ancient projects and I was floored to see a Rabbit. Talk about a left hook. Assuming this is Hacker News, I suspect someone probably knows what I’m talking about. The language used (called “Dynamic C”) has some unconventional additions, a kind of coroutine mechanism in addition to chaining functions to be called one after another in groups. It’s mostly C otherwise, so I suspect some macro shenanigans with interrupt vector priority for managing this coroutine stuff.

Anyhow, so I’ve got a bunch of .bin files around for it, no C source code, just straight assembly output ready to be flashed. And the text and data segments are often interwoven, each fetch being resolved to data or instruction in real time by the rabbit processor. So I’ve been thinking of sitting down, going through the assembly/processor manual for the board and just writing a board simulator hoping to get it back to source code by blackbox reversing in a sense. I’d have to rummage through JEDEC for the standard used by the EEPROM to figure out what pins it’s using there and the edge triggering sequences. Once I can execute it and see what registers and GPIOs are written when, I can probably figure out the original code path. Not sure if anyone has tips or guides or suggestions here.
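
The core of such a simulator can be quite small: a fetch/decode loop that records every externally visible write. Below is a minimal Python sketch of the idea. It handles only a few plain Z80 opcodes, ignores flags, and models I/O as Z80 `OUT` port writes, which is an assumption made for illustration; the Rabbit's instruction set and I/O handling differ from the Z80's, so a real version would need its opcode table taken from the Rabbit manual.

```python
def run(image, max_steps=1000):
    """Execute a tiny Z80 subset and return the (port, value) writes in order."""
    a, pc = 0, 0
    port_writes = []
    for _ in range(max_steps):
        op = image[pc]
        if op == 0x00:                      # NOP
            pc += 1
        elif op == 0x3E:                    # LD A, n
            a = image[pc + 1]
            pc += 2
        elif op == 0x3C:                    # INC A (flags not modelled)
            a = (a + 1) & 0xFF
            pc += 1
        elif op == 0xD3:                    # OUT (n), A
            port_writes.append((image[pc + 1], a))
            pc += 2
        elif op == 0xC3:                    # JP nn (little-endian address)
            pc = image[pc + 1] | (image[pc + 2] << 8)
        elif op == 0x76:                    # HALT
            break
        else:
            raise NotImplementedError(f"opcode {op:#04x} at {pc:#06x}")
    return port_writes


# LD A,1 / OUT (0x10),A / INC A / OUT (0x10),A / HALT
program = bytes([0x3E, 0x01, 0xD3, 0x10, 0x3C, 0xD3, 0x10, 0x76])
print(run(program))   # [(16, 1), (16, 2)]
```

Raising on unknown opcodes is deliberate: when data and code are interleaved in the image, an unexpected opcode is a useful hint that execution has wandered into data.
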

Something1234 a day ago

Any chance a decompiler like ghidra might get you part of the way there?
- stonethrowaway a day ago
  
  The closest I came across was a Z80 simulator, I forget the name of it. But it allowed you to step through command by command giving you ability to query the processor state in a terminal.
  So I don’t know if it would be easier trying to find this, and updating it to support rabbit, vs trying to wedge something into ghidra which itself is an undertaking of a behemoth platform. As far as I know ghidra does not have Z80 support.
  - Jtsummers a day ago
    
    Ghidra has Z80 support (based on the installation on my laptop), but I've never used it.
  - stonethrowaway a day ago
    
    Ended up answering my own question. The Z88-DK[0] seems to be what I found. It supports most Rabbit processors, so I ran one of the binaries through it and it spat out assembly my way which seems to make sense. Will have to run it through the simulator and see if I can get it to act like a microcontroller with setting voltage levels on particular pins.
    [0] https://github.com/z88dk/z88dk
  - westurner a day ago
    
    An Emu86.assembler.virtual_machine.Z8Machine (like MIPSMachine, RISCVMachine, and IntelMachine) could be stepped through in a notebook: https://news.ycombinator.com/item?id=41576922

neuroelectron a day ago

Ctrl-F "bootstrap" zero hits. That seems strange.

ReleaseCandidat a day ago

For a "first" compiler? No, not at all.