If you try translating C into Z80, you'll see that Z80 index registers and stack don't behave quite as you expect. So, let us begin with
Arrays
Suppose you have a standard C construction
```c
int c[10];
for (int i=0; i<10; i++)
    c[i]=0;
```
Your compiler is pretty much required to use a 16-bit value for i. So, you have &c somewhere, maybe even in your index register, so let us have IX=&c. However, operations with index registers only allow constant offsets, which are single signed bytes. So you do not have an instruction to read from (IX + 16-bit value in a register). Thus, you would end up using things like
```
ld ix,c_addr    ; the array address
ld de,(i_addr)  ; the counter value
add ix,de
ld a,0
ld (ix+0),a     ; 14+20+15+7+19 = 75t (per byte)
```
Most compilers will output code that is pretty close to what I wrote. Actually, experienced Z80 programmers know - IX and IY are hopeless for most operations with memory - they are far too slow and awkward. A good compiler writer would probably make his/her compiler do something like
```
ld hl,c_addr    ; the array address
ld de,(i_addr)  ; the counter value
add hl,de
ld a,0
ld (hl),a       ; 10+20+11+7+7 = 55t (per byte)
```
which is about 27% faster without breaking a sweat. Nevertheless, this is far from great Z80 code, even though I made my i variable static to make my - and the compiler's - life easier!
A good Z80 programmer would simply write the equivalent loop as
```
        ld hl,c_addr
        ld b,10
        xor a
loop:   ld (hl),a
        inc hl
        djnz loop
```
The actual full loop takes (7+6+13)*10-5 = 255 t-states, i.e. about 25.5 t-states per byte. And this is really not optimized code; this is the kind of code one writes where optimization does not matter. One can do partial unrolling, or one can make sure that array c does not cross a 256-byte boundary and replace INC HL with INC L. The fastest filling is actually done using the stack. In other words, the Z80 does not fit the C paradigm.
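For completeness, a sketch of the stack-based fill alluded to above (hypothetical labels; interrupts must be disabled, because an interrupt would push its return address into the buffer being filled):

```
        di                  ; an interrupt would push into our buffer
        ld (saved_sp),sp    ; save the real stack pointer
        ld sp,buffer_end    ; PUSH pre-decrements, so start past the top
        ld hl,0             ; fill value, written two bytes at a time
        ld b,128            ; 128 pushes * 2 bytes = 256 bytes
fill:   push hl             ; 11t for two bytes
        djnz fill           ; 11+13 = 24t per 2 bytes ~ 12t/byte
        ld sp,(saved_sp)    ; restore the stack
        ei
```

Fully unrolling the PUSHes removes the DJNZ overhead and brings the average down towards 5.5 t-states per byte, which is where the sub-10t figures for memory filling come from.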
Of course, one can write a similar loop in C (using a pointer instead of an array, and a countdown loop instead of counting up), which would increase the chances of it being translated into decent Z80 code. However, this would not be you writing regular C code; this would be you working around the limitations of C when it is meant to be translated into Z80.
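For illustration, a sketch of what such Z80-friendly C might look like (a hypothetical rewrite of the fill loop; the array is made a byte array here to match the assembly versions above):

```c
unsigned char c[10];

/* Fill c[] by walking a pointer and counting down to zero, the shape
   that at least gives a compiler a chance of mapping the counter onto
   B and the exit test onto DJNZ. */
void fill_down(void)
{
    unsigned char *p = c;
    unsigned char n = 10;
    do {
        *p++ = 0;
    } while (--n != 0);
}
```
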
Let me give you another example.
Local variables.
Raffzahn is correct when he says that one does not have to use the stack for local variables. But there must be a stack of some kind if you want recursive functions. So let us try to do it the PC way, via the stack. How do you implement a call to something like
```c
int inc(int x) { return x+1; }
```
Suppose even that current value for x is in one of your registers, say HL. So, you'd have something like
```
push hl
call addr_inc
...
```
How do we actually recover the address (and value) of x? It is stored at SP+2. However, we have to be careful with SP, because we want to return to the calling program, so maybe we do something like
```
addr_inc:
        ld hl,2
        add hl,sp
        ld e,(hl)
        inc hl
        ld d,(hl)   ; 10+11+7+6+7 = 41t
```
Now we have x in DE. You can see how much work this was.
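To complete the picture, here is one way the rest of the callee and the caller-side cleanup might look (a sketch only; it assumes the result is returned in HL, which is just one possible convention):

```
addr_inc:
        ld hl,2
        add hl,sp       ; HL = address of x on the stack
        ld e,(hl)
        inc hl
        ld d,(hl)       ; DE = x
        ex de,hl
        inc hl          ; HL = x+1, the return value
        ret

        ; caller side:
        push hl         ; pass x
        call addr_inc
        pop bc          ; discard the argument (clobbers BC)
```

Every call thus pays for the push, the stack-relative argument fetch, and the cleanup, which is exactly the overhead the answer is pointing at.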
So, when people complain about C compilers for Z80, they do not mean it would not be possible to do. It is something else entirely. In any kind of programming, there are patterns, some are good, some are not so good. My point is, a lot of things that C does are simply bad patterns from the point of view of Z80 coding. One simply does not do things on Z80 that C pretty much requires you to be fluent at.
answered Mar 28 '18 at 19:45
- IDK about Z80, but if the compiler uses 16-bit for such `i` values then it's a garbage compiler. Most modern compilers for 8-bit microcontrollers know to optimize for those cases when you don't take `i`'s address – phuclv Mar 29 '18 at 4:30
- Re, "Your compiler is pretty much required to use 16-bit value for i." Simply not true. Any modern compiler would be smart enough to know that the values of `i` in your example all fall in the range 0..9, and any modern compiler would be smart enough to allocate whatever register was the most appropriate to hold those values and use them as array indices. The only question is whether any compiler exists with that much smarts and the ability to target the Z80. – Solomon Slow Mar 29 '18 at 14:13
- Compilers already know how to turn array-indexing into pointer-increments, and do so to save a register and to reduce the size of the instruction on x86 (where an index takes an extra byte). Also other advantages, like not breaking micro-fusion on Sandybridge-family, or being able to use the port-7 AGU on Haswell for stores. It's entirely reasonable to expect a compiler to make a loop like your `inc hl`/`djnz loop` for this case where the trip-count is a compile-time constant. Somewhat reasonable otherwise. – Peter Cordes Mar 30 '18 at 3:51
- @phuclv "Most modern compilers for 8-bit microcontrollers know to optimize for those cases when you don't take i's address" -- modern 8-bit microcontrollers typically have somewhere between 32 and 128 general-purpose registers. The Z80 has 6(ish), and 2 of those basically have to be reserved for use as a pointer for almost all nontrivial code. This gives compilers for those architectures a lot more scope to optimize. – Jules Jun 19 '18 at 21:40
The main reason for "historic" CPUs' (non?)-suitability for C programs is the lack of any capability to form an address from more than one register without going through the ALU.

Most modern CPUs can use base + index + offset register addressing modes to address complex data structures like arrays and structures. The Z80 needs to painstakingly go through its 4-bit ALU to add an offset and an index to a base register like HL - most modern CPUs use separate address-calculation units for the various addressing modes.
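To make that concrete, the address arithmetic behind an everyday C expression like `a[i].y` is base + i*size + field-offset. A CPU with base+index+offset (or scaled-index) addressing folds all of this into the load itself; the Z80 must compute it step by step through the ALU. A small illustrative sketch (names invented for the example):

```c
#include <stddef.h>
#include <stdint.h>

struct point { int16_t x, y; };

/* The address a base+index+offset addressing mode computes in one go:
   &a[i].y == (byte*)a + i*sizeof(struct point) + offsetof(struct point, y) */
uint8_t *addr_of_y(struct point *a, unsigned i)
{
    return (uint8_t *)a + i * sizeof(struct point)
                        + offsetof(struct point, y);
}
```
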
Another reason is the lack of real multipurpose registers - you simply cannot do everything with every register on the Z80. Its raw register count is somewhat impressive, but using the alternate register set is probably too complicated for a compiler, and thus the possible choice of registers for a compiler is limited. This is even more true of the 6502, which has even fewer registers.
Yet another downside: you can't get a decently modern C compiler for the Z80 - clang and GCC with their aggressive optimizers don't bother with such old CPUs, and hobbyist efforts are just not that sophisticated. Even if you could, GCC and clang concentrate on optimizing for code locality, something a CPU without a cache can't even benefit from, but which really boosts a modern CPU.
I personally don't think (even non-optimal) compilers would be useless for old CPUs - there is always a lot of stuff in a program that isn't fun to do anyhow and is just tedious to write in assembler (and after all, the only reason why we would still do this is fun, isn't it?). So I tend to write the boring, non-time-critical parts of a program in C, and the other, "fun" parts in assembly. The best of both worlds.
- @LưuVĩnhPhúc What do you consider `LDRLS x,[r1,r0,LSL #2]` then (ARM)? – tofro Mar 29 '18 at 5:14
- I'm not familiar with the ARM ISA, but "Even though the ARM is a RISC architecture, it does not strictly follow the RISC principles as does the MIPS... In addition, it provides a large number of addressing modes and uses a somewhat complex instruction format" – phuclv Mar 29 '18 at 8:55
- ARM is not really a RISC ISA. It's somewhat RISCy, or shares some of their features, like fixed-width instructions (except Thumb2...), but an ISA with an instruction that does anywhere from 1 to 16 loads or stores depending on bits in a bit-field in the instruction is not a RISC. (I'm talking about ARM's `push {r4, r5, r6, ..., lr}` aka STMDB and the corresponding `pop` instruction.) The load/store-multiple instructions are microcoded because they're too complex and do a variable amount of work. – Peter Cordes Mar 30 '18 at 3:22
Quite often people don't know how to use the compilers, or don't fully understand the consequences of the code they write. There is optimization going on in the z80 c compilers, but it's not as complete as, say, gcc's. And I often see people fail to turn up the optimization when they compile.
There is an example here in introspec's post that I am not allowed to comment on due to reputation points:
```c
char i,data[10];
void main(void) {
    for (i=0; i<10; i++)
        data[i]=0;
}
```
There are lots of problems with this code that he is not considering. By declaring i as char, he's possibly making it signed (that is the compiler's discretion). That means that in comparisons the 8-bit quantity may be sign-extended before being compared, because unless you specify otherwise in the code, the c compiler may promote operands to int before doing those comparisons. And by making it global, he makes sure the compiler cannot hold the for-loop index in a register inside the loop.
There are two c compilers in z88dk. One is sccz80, which is the most advanced iteration of Ron Cain's original compiler from the late 70s; it's mostly C90 now. This compiler is not an optimizing compiler - its intention is to generate small code instead. So you will see many compiler primitives being carried out as subroutine calls. The idea behind it is that z88dk provides a substantial c library that is written entirely in asm, so the c compiler is intended to produce glue code while the execution time is spent in hand-written assembler.
The other c compiler is a fork of sdcc called zsdcc. This one has been improved on and produces better & smaller code than sdcc itself does. sdcc is an optimizing compiler but it tends to produce larger code than sccz80 and overuses the z80's index registers. The version in z88dk, zsdcc, fixes many of these sorts of issues and now produces comparable code size to sccz80 when the --opt-code-size switch is used.
This is what I get for the above when I compile using sccz80:
```
zcc +zx -vn -a -clib=new test.c
```
(the -O3 switch is for code size reduction but I prefer the default -O2 most of the time)
```
._main
    ld hl,0 ;const
    ld a,l
    ld (_i),a
    jp i_4
.i_2
    ld hl,_i
    call l_gchar
    inc hl
    ld a,l
    ld (_i),a
    dec hl
.i_4
    ld hl,_i
    call l_gchar
    ld de,10 ;const
    ex de,hl
    call l_lt
    jp nc,i_3
    ld hl,_data
    push hl
    ld hl,_i
    call l_gchar
    pop de
    add hl,de
    ld (hl),#(0 % 256)
    ld l,(hl)
    ld h,0
    jp i_2
.i_3
    ret
```
Here you see the subroutine calls for compiler primitives and the fact the compiler is forced to use memory to hold the for-loop index. "l_lt" is a signed comparison.
A zsdcc compile with optimization turned up:
```
zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node 200000 test.c
```
```
_main:
    ld hl,_i
    ld (hl),0x00
l_main_00102:
    ld hl,(_i)
    ld h,0x00
    ld bc,_data
    add hl,bc
    xor a,a
    ld (hl),a
    ld hl,_i
    ld a,(hl)
    inc a
    ld (hl),a
    sub a,0x0a
    jr C,l_main_00102
    ret
```
By default char is unsigned in zsdcc, and it notices that the comparison "i<10" can be done in 8 bits. C rules say both sides should be promoted to int, but it's ok not to do that if the compiler can figure out that the comparison can be equivalently done another way. When you don't specify that your chars are unsigned, this promotion can lead to insertion of sign-extension code.
If I now make the char explicitly unsigned and declare i inside the for-loop:
```c
unsigned char data[10];
void main(void) {
    for (unsigned char i=0; i<10; i++)
        data[i]=0;
}
```
sccz80 does this:
```
zcc +zx -vn -a -clib=new test.c
```
```
._main
    dec sp
    pop hl
    ld l,#(0 % 256)
    push hl
    jp i_4
.i_2
    ld hl,0 ;const
    add hl,sp
    inc (hl)
.i_4
    ld hl,0 ;const
    add hl,sp
    ld a,(hl)
    cp #(10 % 256)
    jp nc,i_3
    ld de,_data
    ld hl,2-2 ;const
    add hl,sp
    ld l,(hl)
    ld h,0
    add hl,de
    ld (hl),#(0 % 256 % 256)
    ld l,(hl)
    ld h,0
    jp i_2
.i_3
    inc sp
    ret
```
The comparison is now 8-bit and no subroutine calls are used. However, sccz80 cannot put the index i into a register - it does not carry enough information to do that so it instead makes it a stack variable.
The same for zsdcc:
```
zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node 200000 test.c
```
```
_main:
    ld bc,_data+0
    ld e,0x00
l_main_00103:
    ld a, e
    sub a,0x0a
    ret NC
    ld l,e
    ld h,0x00
    add hl, bc
    ld (hl),0x00
    inc e
    jr l_main_00103
```
Comparisons are unsigned and 8-bit. The for loop variable is kept in register E.
What about if we walk the array instead of indexing it?
```c
unsigned char data[10];
void main(void) {
    for (unsigned char *p = data; p != data+10; ++p)
        *p = 0;
}
```
```
zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node 200000 test.c
```
```
_main:
    ld bc,_data
l_main_00103:
    ld a, c
    sub a,+((_data+0x000a) & 0xFF)
    jr NZ,l_main_00116
    ld a, b
    sub a,+((_data+0x000a) / 256)
    jr Z,l_main_00105
l_main_00116:
    xor a, a
    ld (bc), a
    inc bc
    jr l_main_00103
l_main_00105:
    ret
```
The pointer is held in BC, the end condition is a 16-bit comparison and the result is the main loop takes about the same amount of time.
Then the question is why isn't this done with a memset?
```c
#include <string.h>
unsigned char data[10];
void main(void) {
    memset(data, 0, 10);
}
```
```
zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node 200000 test.c
```
```
_main:
    ld b,0x0a
    ld hl,_data
l_main_00103:
    ld (hl),0x00
    inc hl
    djnz l_main_00103
    ret
```
For larger transfers this becomes an inlined ldir.
In general the c compilers cannot currently generate the common z80 cisc instructions ldir, cpir, djnz, etc, though they do in certain circumstances, as shown above. They are also not able to use the exx set. However, the substantial c library that comes with z88dk does make full use of the z80 architecture, so anyone using the library will benefit from asm-level performance (sdcc's own library is written in c, so it is not at the same performance level). Beginner c programmers are usually not using the library, because they're not familiar with it, and that's on top of making performance mistakes when they don't understand how the c maps to the underlying processor.
The c compilers are not able to do everything, however they're not helpless either. To get the best code out, you have to understand the consequences of the kind of c code you write and not just throw something together.
- Lovely answer! I'd like to add here that I made my variable global specifically to conform to the recommendations on the z88dk website: item 2 at z88dk.org/wiki/doku.php?id=optimization I am not using memset intentionally, because there is no ready-made memset for every small loop that you write, so it is the generic behaviour of the compiler on small loops that concerns me. – introspec Mar 31 '18 at 8:45
- "And by making it global, he makes sure the compiler cannot hold the for-loop index in a register inside the loop." Again this is purely a limitation of compilers that don't know how to optimize well. It's not `volatile`, and the compiler can prove the stores into `data[]` don't alias it (because it's also a global array, not a pointer, and the compiler knows that two globals don't overlap each other). So the compiler is allowed to sink the stores to the counter out of the loop and do one store of `10` after the loop. The "as-if" rule allows compile-time reordering of loads/stores. – Peter Cordes Mar 31 '18 at 19:54
- But well spotted, that is a seriously bad way to write code that makes life difficult for compilers. It's disappointing (but not too surprising considering their age) that real Z80 compilers can't do that optimization, or turn simple array indexing into pointer increments. `gcc` could turn the loop into a `memset` call and/or inline known-good memset code :P – Peter Cordes Mar 31 '18 at 19:56
1
– introspec Apr 1 '18 at 10:41
|
Simple answers one easily gets to this question are The Z80 Sucks and C Sucks - depending on which side someone is on. While both are, of course, untrue (*1), there are real issues. Major arguments for both sides are that
- C is at its core tied to a PDP-11(ish) CPU architecture, and the Z80 isn't one.
- The Z80 is a rather special CPU, created with a focus on maximizing capabilities, not beauty.
- C is a language without a runtime, or at best a very minimal one (*2).
All these points are linked. As the question mentioned, C implies a simple and rather symmetric pointer model, which originated in what the PDP-11 offered. This includes the direct conversion of pointers to memory addresses, which in turn allowed skipping the creation of a more sophisticated data model and using pointers to realize functions that would otherwise be handled by some language runtime.
Now the Z80 is (like its predecessor, the 8080) quite able to perform everything needed. Due to its (inherited) structure of a single memory pointer it does, however, need to replace a single (PDP-11 based) C-operation with several machine instructions. So far not a real issue. Except, when an assembly programmer looks at the result, he immediately sees Z80 specific ways to improve the result - like holding two pointers and exchanging HL/DE when needed. That's hard to 'understand' for a C compiler, as it is based on semantics - the knowledge 'why' something is done - not just being told 'how' it's done.
It is not strictly a C problem, but an issue with all high-level languages. They compile best to a simple, symmetric CPU model with a set of equal resources, offering exactly the operations the abstraction layer needs. The higher the language's abstraction, the better the underlying 'CPU' level can perform. That's why the UCSD P-Code System performed so well across many platforms: the offering of its virtual CPU was exactly what a compiler wants. Despite being an interpreter at its core, performance was, on many machines, comparable to native code generated from the same language source. The reason for this platform optimization lies within the interpreter. Here, each rather abstract function gets performed by optimized routines. A string move might have the same invocation (due to the P-Code) across all platforms, but its implementation is CPU-specific, using all the advantages the specific CPU offers - like the mentioned trick on a 6502 of working with 8-bit pointer registers and only adjusting the memory base pointer every 256th cycle. Operating at a greater abstraction in a language allows the compiler and/or runtime to employ greater optimization than fixing low-level detail within the source code.
C, in turn, exaggerates this by being tied to very specific low-level operations and using them all over, in every application source, mostly without an intermediate runtime layer. In this respect, C is far less a high-level language than others, and far more prone to CPU-specific issues.
Learning from History
Looking back (*3), the last 30 years show two developments that bridge the problems of less-than-'simple' CPUs and too-simple languages. The 8086 family is not only an important, but eventually the best, example of changes in CPUs, as it was not a simple CPU at first. Sure, compared to the Z80, it is much more powerful and symmetric - still, not as simple as C assumes it to be.
Over time, the x86 got not only instruction-set additions such as scaling factors, moving array-indexing calculations into microcode, but the whole CPU got redesigned so that instruction sequences are analyzed, reordered and reformed to make C-like operations perform better. Bottom line: the 8086 became more PDP-11ish. One way to close the gap.
At the same time, the C standard's development worked hard to define a common set of data types and functions thereon, which the compiler can now use to get a glimpse of the why instead of the how. These source statements (may) no longer be directly translated into function calls, but be used by the compiler to generate different, more specialized, target-optimized code. In the end, a way to make C a bit more high-level than originally intended.
What's the Lesson for Z80 Users?
Well, one might not use C at all :) (*4)
Another, more practical, way is to go the same path that standard C is doing: Use more task-specific high-level functions and optimize them (in assembly) for the Z80.
The last would be to optimize existing C compilers for the Z80 to generate a more CPU-embracing code structure. For example with different ways of parameter passing depending on functions' use and so on.
BTW: The 6502's short call stack is often cited here, but there is no relation to C. C doesn't require the usage of the return stack for parameters. It can as well be a separate parameter stack. In fact, strictly speaking, C doesn't require a stack at all.
C does require a way of bookkeeping for nested calls, some way of parameter passing (with undefined length) and a way to handle local variables. How this is done is up to the compiler (or its creator). Using some hardware stack is one (simple) way, but not necessarily the best with a given CPU.
*1 - As a 6502 and Assembly guy I do feel deep down they are not false :))
*2 - No, the C-LIB isn't a runtime as part of the language: it is a collection of standard functions, itself (almost) completely written in C, and compiled/linked at compile time.
*3 - Looking back is rather rare in IT, but we are Retrocomputing - we not only play nostalgia but also try to learn from history, don't we?
*4 - A serious choice could be Ada. Due to its declarative nature, code generation can be much better optimized for individual CPUs. After all, it was one of the main goals of Ada's development to be able to produce good code not only for mainframes but also for little bastards like an 8048. There have been several dedicated Z80 compilers during the 1980s; the most prominent may be RR Software's Janus/Ada 83. While no longer mentioned, there was also a Z80 version.
answered Mar 28 '18 at 17:38
- okay but not ADA, Ada. It's a noun, not initials. – Jean-François Fabre Mar 29 '18 at 8:02
- The Z80 sucks a bit and C sucks a bit, but contemporary C compilers sucked a lot. Yesterday I tried compiling a simple C program with Hisoft C on a Spectrum +3. What a pain! And the code sucked. A much better compiler could be developed, but it would take a lot more effort (and be less enjoyable) than just continuing to code in assembler. – Bruce Abbott Mar 29 '18 at 19:33
- @JdeBP I think you are trying to argue with today's compiler technology and philosophy against stuff from 20, 30 years ago. The use of compiler intrinsics for standard features like `memcpy` et al., for example, only started seriously about 10, 15 years ago. So, for today's gcc or clang, you are absolutely right. For a HISOFT C compiler in 1985, quite not so. – tofro Mar 31 '18 at 12:55
The Motorola 6809 is probably the only legacy CPU of the '80s which is well suited to a C compiler, thanks to several features that were advanced for the time:

- orthogonal instruction set
- rich addressing modes
- hardware multiplier, to quickly compute addresses
- position-independent code
This kind of CPU (and the improved 6309) can be found in some home computers (Vectrex, Tandy CoCo, Thomson, ...) and in a lot of embedded systems.
- @Chenmunka that would be a strange argument, as C wasn't an important language back then. Even less a reason to make a CPU fit it. But yes, the 6809 was (much like the 8086) especially designed with high-level languages producing linkable modularized code in mind. – Raffzahn Mar 29 '18 at 20:19
- This could answer the question with a little re-wording. These are features that the 6809 had that made it well suited to C, but what features does the Z80 not have that make it not well suited? – wizzwizz4 ♦ Mar 31 '18 at 8:10
- @ChrisStratton: Perhaps he meant the one 8-bit CPU of that era. Microchip has added some features to some of their 8-bit line in an effort to make them compiler-friendly, though IMHO they made some significant missteps in their design. – supercat Mar 31 '18 at 23:41
Well, I personally find it annoying reading so many comments here about what modern compilers can and cannot easily do. It is terrible what wishful thinking does to your brain. OK. Let me show why people who still remember how to code Z80 hate C compilers. This is a trivial C code that I was hoping to compile:
```c
int i,data[10];
main() {
    for (i=0; i<10; i++)
        data[i]=0;
}
```
This is the Z88DK output using `zcc -O3 -a trivial.c`:
```
._main
    ld hl,0 ;const  ; i=0
    ld (_i),hl
    jp i_5
.i_3
    ld hl,(_i)      ; i++
    inc hl
    ld (_i),hl
    dec hl
.i_5
    ld hl,(_i)      ; if i>=10 GOTO i_4
    ld de,10 ;const
    ex de,hl
    call l_lt
    jp nc,i_4
    ld hl,_data     ; HL = data + i
    push hl
    ld hl,(_i)
    add hl,hl
    pop de
    add hl,de
    ld de,0 ;const  ; (HL) = DE
    ex de,hl
    call l_pint
    jp i_3
.i_4
    ret
```
I am not counting t-states and not including the code for the case when `i` and `data[10]` are declared as `char`, because I do not have a goal to embarrass the compiler authors.
OK, maybe SDCC can do better? At least it can deal with char data type in a sane way. So we create
```c
char i,data[10];
main() {
    for (i=0; i<10; i++)
        data[i]=0;
}
```
and SDCC compiles it using `sdcc -mz80 --opt-code-speed` into
```
;trivial.c:21: for (i=0; i<10; i++)
    ld hl,#_i + 0
    ld (hl), #0x00
    ld bc,#_data+0
00102$:
;trivial.c:22: data[i]=0;
    ld hl,(_i)
    ld h,#0x00
    add hl,bc
    ld (hl),#0x00
;trivial.c:21: for (i=0; i<10; i++)
    ld iy,#_i
    inc 0 (iy)
    ld a,0 (iy)
    sub a, #0x0a
    jr C,00102$
```
So, the addition of char to pointer is done in 16 bits, the index registers are used for some unknown reason, but otherwise this at least begins to look like an assembly program. So, if I ignore the preamble and just count t-states per iteration of the main loop from `00102$`:
16+7+11+10 + 14+23+19+7+12 = 119 t-states per byte
As a comparison, this is what a relatively inefficient assembly version may look like (I wrote this very closely to what my C for-loop implies, so that the compiler at least has a chance of getting it right):
```
        ld hl,data_addr
        ld a,0
loop:   ld (hl),0
        inc hl
        inc a
        cp 10
        jr nz,loop      ; 10+6+4+7+12 = 39t
```
If the counter is allowed to go in the opposite direction, a similar loop in my other answer to this question does the job in 25.5 t-states per byte. The fastest Z80 code for memory filling can average below 10 t-states per byte, but this is not an exercise in memory filling; it is a simple test of what some trivially simple code tends to be compiled into.
So, this is my brutally honest answer to your question why people like myself say that C compilers for Z80 produce poor code: BECAUSE THEY DO.
answered Mar 29 '18 at 20:10
- Just to finish off the thought, presumably if you were writing it yourself you'd store a zero byte then `LDIR` the rest? Without being explicit, it's not likely to be clear to everyone why 119 is a bad number. – Tommy Mar 29 '18 at 20:12
- Actually, it's not too worth getting involved in whether a C compiler should use `LDIR` here, because I think the answer is likely to be: it should, but you should use `memset` or some other overly-specific take on the example when the point is clear as is. But I just meant: to the casual reader, coming along and reading this answer, you assert that the generated code is awful — and I'm not disputing that — but it might be more convincing if you showed non-awful code for comparison. That's all. No dispute as to information and data stated. – Tommy Mar 29 '18 at 20:23
- @introspec: my comments on other answers saying what modern compilers (e.g. for x86) can do were making the same point that you are here. Efficient compilation would be possible given a smart optimizing compiler, so the terrible code-gen from real Z80 compilers is more a result of massive missed optimizations, not of C being inherently impossible to compile efficiently (although C source with multiple pointers used at once would be a problem!) – Peter Cordes Mar 30 '18 at 15:55
- e.g. a Z80 backend for modern gcc or LLVM could do a lot better cross-compiling from a powerful computer (if anyone put in the amount of development time it would take to find target-specific optimizations, too), vs. real historical Z80 compilers. Writing an optimizing compiler is a huge challenge / amount of work. My point was always that compilers could do whatever optimizations (and do for x86 / ARM / whatever), not that any good Z80 compilers exist or could be made easily. – Peter Cordes Mar 30 '18 at 15:58
While the Z80 is definitely an 8-bit processor rather than a 16-bit one, the instruction set makes some operations easier with 16-bit values than with 8-bit values. For example, something like `a=b+c+d;`, with all variables being 16-bit types with static duration, could be realized as:
```
ld hl,(_b)
ld de,(_c)
add hl,de
ld de,(_d)
add hl,de
ld (_a),hl
```
but trying to do it as 8 bits would require a different approach:
```
ld a,(_b)
ld hl,_c
add a,(hl)
ld hl,_d
add a,(hl)
ld (_a),a
```
It's possible to generate efficient code if all operations use 8-bit math or if all use 16-bit math, but 8-bit and 16-bit operations require totally different approaches, and trying to combine them gets awkward (e.g. if `b` and `c` were 16-bit values, but `d` was an 8-bit one, the most efficient way to add `d` would be to load it and the following byte into DE, then clear D, and then add DE to HL). If a compiler wants to try to handle 8-bit math efficiently, it will have to use code generation logic that's very different from what's needed for 16-bit math, and a lot of compiler writers aren't going to want to massively increase the size of their code generator for that.
- Interesting elements, but the answer requires proofreading. A typo misses a closing parenthesis in the first code paragraph. In the second paragraph, `mov` is not a Z80 asm keyword, and there's no addition at all, so that code can't do what it's supposed to do. Can you clarify? Thanks. – Stéphane Gourichon Sep 17 '18 at 9:16
- @StéphaneGourichon: Does that make more sense? – supercat Sep 17 '18 at 14:46
- That's better, yet it addresses only 2 of the 3 items in my comment: `mov` is not part of the usual Z80 ASM syntax. The second paragraph of code still does not make sense. Only one Z80 ADD operation cannot add b+c+d. – Stéphane Gourichon Sep 17 '18 at 15:04
- @StéphaneGourichon: Incidentally, after writing the answer above, I discovered that while the Z80 has some 8-bit and even 16-bit internal data paths, its primary ALU is only 4 bits. An instruction like `INC HL` uses a 16-bit limited-purpose ALU which takes two cycles to perform an operation, but `INC HL` takes six cycles because that ALU gets used twice during each instruction fetch (once to increment PC, and once to increment R), requiring that the two cycles actually performing the operation get added to that. – supercat Sep 17 '18 at 20:12
- @StéphaneGourichon: Something like `INC A` actually requires using the four-bit ALU twice, but it's faster than `INC HL` because both operations can be done at the same time as the 16-bit ALU is being used to increment PC and R. – supercat Sep 17 '18 at 20:13
The answer to this question is bound to be opinion-based anyway, and would best be written by a specialist who has designed a Z80 C compiler. I will give it a try though.
I used the MSX-C compiler, made by ASCII together with Microsoft, back in the old '80s-'90s days; the platform was MSX. I do not recall if it used the stack to pass arguments; however, it would be logical, given that the compiler can use IX and IY, assigning them to the stack pointer and addressing arguments by byte through (IX+n). I am fairly sure the Turbo C 2.0 for PC XT/AT that I used back in the '90s did the same using register BP.
One remarkable thing I recall from using MSX-C was that its output was not Z80 code, but 8080 code. Most probably the compiler was originally designed for the 8080 and then just ported to Z80, and thus it was not aware of the IX and IY registers.
Regarding the (IX+n) and (IY+n) instructions: n is a signed byte, thus you can address -128 to +127 from the base of the index register. Note that n must be a constant; still, changing it at run time is possible when the code is in RAM, by patching the displacement byte of the executable code - another level of optimization which most probably was not considered in those old days.
So what are the reasons C fits badly
My personal opinion:
- For old compiler software developed back in the day, compiler developers were focusing on (1) the reliability of the compiler's job and (2) the speed of compilation, while also keeping in mind that (3) the register set is not big enough to allow much optimization.
- New compiler software must either be developed by real enthusiasts who are also experts in compilers (which is, to my knowledge, a special field in computing), or there must be commercial interest (questionable whether that is possible these days).
In general I would like to see an example. MSX-C did its job in four steps (yes, four!):
- CF.COM parsed the C code, creating an intermediate output file;
- CG.COM was the "code generator", which generated an assembly-language text file;
- M80.COM created the .REL object file, which was then
- linked by L80 with other object code (e.g. libraries).
There are pros and cons to this architecture, and there are also historical reasons for it. CF and CG are about 30-40KB each, so you cannot "merge" them into one executable because it would simply not fit into RAM (not to mention the work area); M80 used human-readable assembly text files, so the programmer had an opportunity to look at the assembly code, get an idea of what the real executable would look like and what s/he could do to improve it, or inject his/her own assembler routines at the linking stage.