On microcontrollers, however, and especially 8- and 16-bit ones, the C compilers aren't "all that", and the additional overhead imposed by a C compiler can kill your application stone dead.
Let's take a concrete example.
In my (very non-optimal) assembler example posted before, I need to clear the interrupt pending flag for the Timer interrupt I'm servicing. This is easily done in assembler: it's a one-line, one-clock-cycle instruction, as follows:
bres 0x5255, #0x00 ; Clear TIM1 Interrupt pending bit
Simple, right? Now, let's look at what the C compiler gives us.
Here's the "C" code we use:
// Clear the interrupt pending bit for TIM1.
Simple enough, right? There's obviously the overhead of a function call, but we might expect the guts of the function to do a simple bit of inline assembler as above. Let's look.
void TIM1_ClearITPendingBit(TIM1_IT_TypeDef TIM1_IT)
{
    /* Check the parameters */
    /* Clear the IT pending Bit */
    TIM1->SR1 = (u8)(~(u8)TIM1_IT);
}
So, let's look at what that C code produces.
0xA601 LD A,#0x01 LD A,#0x01
0xCD8C9C CALL 0x8c9c CALL _TIM1_ClearITPendingBit
stm8s_tim1.c:2156 TIM1->SR1 = (u8)(~(u8)TIM1_IT);
0x8c9c <.ClearITPendingBit> 0x43 CPL A CPL A
0x8c9d <.earITPendingBit+1> 0xC75255 LD 0x5255,A LD 0x5255,A
0x8ca0 <.earITPendingBit+4> 0x81 RET RET
So, we load the accumulator with a value: 1 cycle. Call a function: 4 cycles. Complement the accumulator: 1 cycle. Store the accumulator in the flag register: 1 cycle. Return from the function: 4 cycles.
In total, that's 11 cycles and 10 bytes to do the same thing we did in one cycle and 4 bytes (bres with a 16-bit address encodes as 4 bytes on the STM8). "Inlining" this function doesn't make our code any fatter, either, as the 4 bytes we're using are fewer than the 5 bytes (2 for the LD, 3 for the CALL) we would have used at each call site to call the function.
What we do lose is readability, but that's easily enough got back by writing macros.
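As a rough sketch of what such a macro might look like, here's one way to wrap the flag clear in plain C. The names TIM1_SR1, TIM1_SR1_UIF, TIM1_CLEAR_UPDATE_FLAG and the ON_TARGET build flag are my own inventions for illustration, not anything from the ST library, and I'm assuming the status register sits at 0x5255 with the update flag in bit 0, as in the bres line above. Without ON_TARGET defined, an ordinary variable stands in for the register so the macro can be tried on a PC.

```c
#include <stdint.h>

/* Assumption: TIM1 status register 1 is at 0x5255, as in the bres example.
   ON_TARGET is a hypothetical build flag; without it, a plain variable
   stands in for the hardware register so this compiles on a PC. */
#ifdef ON_TARGET
#define TIM1_SR1 (*(volatile uint8_t *)0x5255)
#else
static volatile uint8_t tim1_sr1_sim = 0xFF;
#define TIM1_SR1 tim1_sr1_sim
#endif

/* Assumption: the update interrupt flag is bit 0 of the register. */
#define TIM1_SR1_UIF 0x01u

/* Clear only the update flag with a read-modify-write on a fixed address.
   No function call is involved, so there is no CALL/RET overhead. */
#define TIM1_CLEAR_UPDATE_FLAG() (TIM1_SR1 &= (uint8_t)~TIM1_SR1_UIF)
```

In the interrupt handler you'd then just write TIM1_CLEAR_UPDATE_FLAG(); and the intent stays as readable as the library call. Whether your compiler actually boils this down to a single bres is something to check in the disassembly rather than assume; if it doesn't, the same macro name could wrap the compiler's inline assembler instead.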