Link Time Optimization (LTO) - Shrinking Embedded Firmware

June 14, 2025

zephyr arm kconfig linking lto optimization

I’m currently building a firmware application in Zephyr for the STM32WB55, a microcontroller with limited flash space (~400KB when using MCUBoot). While exploring kernel configuration options, I came across Link Time Optimization (LTO).

Turning it on took just two Kconfig options:

CONFIG_ISR_TABLES_LOCAL_DECLARATION=y
CONFIG_LTO=y

This alone cut flash usage by 15,828 bytes, about 11.6% of the image (roughly 4 percentage points of the available flash), with zero code changes 🤯.

Before and After: Size Comparison
What is Link Time Optimization (LTO)?
How it works:
Potential Benefits of LTO
Drawbacks and Limitations
LTO in Action: A Minimal Example
Disassembly Comparison
Final Thoughts
References
Extras

Before and After: Size Comparison

Without LTO:

FLASH:     136784 B / 401072 B   (34.10%)
RAM:        45924 B / 192 KB     (23.36%)

With LTO:

FLASH:     120956 B / 401072 B   (30.16%)
RAM:        45860 B / 192 KB     (23.33%)

What is Link Time Optimization (LTO)?

LTO is a type of interprocedural optimization done at the linking stage, not during regular compilation. I know, I know. It sounds confusing, but stick with me.

Normally, compilers treat each source file as an independent unit. With LTO, however, the compiler considers all translation units together, enabling more aggressive and global optimizations, such as inlining and dead code elimination across module boundaries.

How it works:

Without LTO:

Each .c file is compiled into its own object (.o) file.
The linker stitches them together into a final binary.

With LTO:

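Each file is compiled into an intermediate representation instead of final machine code: LLVM bitcode with Clang, or GIMPLE bytecode stored inside the object file with GCC (the compiler used for the example below).
The linker hands these files to an LTO optimizer (libLTO for LLVM; the LTO plugin and lto1 for GCC).
The optimizer performs cross-module analysis and emits optimized machine code.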
The final binary is linked as usual, but with all the global optimizations included.

Potential Benefits of LTO

Smaller binary size: In my project, flash usage dropped by 15,828 bytes (about 11.6% of the image) right away.
Better inlining: Frequently called small functions are embedded directly at call sites.
Dead code removal: Unused functions or variables across files are removed.
Improved runtime performance: Especially in tight loops or performance-critical code.

Drawbacks and Limitations

Nothing is perfect, and LTO has some trade-offs:

Static libraries must be archived using LTO-aware tools (e.g., llvm-ar for LLVM bitcode, or gcc-ar for GCC LTO objects).
Partial linking may not be supported, depending on the toolchain: all object files may have to be linked at once.
Version compatibility: The intermediate files are generally only guaranteed to work with the compiler version that produced them.
Stack size impact: More inlining can increase the stack usage of the calling function. You may need to increase stack sizes or disable inlining for large functions.

LTO in Action: A Minimal Example

This simple benchmark shows a real improvement in runtime thanks to LTO:

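/* main.c */
#include <stdio.h>
#include <time.h>

/* Defined in math_utils.c, so without LTO the compiler cannot inline it here. */
extern int square(int x);

int main(void) {
    /* Note: i * i and sum exceed INT_MAX for this loop count, which is
     * signed overflow; the printed value is the wrapped result. */
    int sum = 0;
    clock_t start = clock();
    for (int i = 0; i < 100000; i++) {
        sum += square(i);
    }
    clock_t end = clock();
    printf("Sum: %d , Time taken: %f seconds\n", sum, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}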

/* math_utils.c */
int square(int x) {
    return x * x;
}

With -flto, square() is inlined into the loop and the standalone function is dropped. The result in this run: faster execution and a slightly smaller binary.

$ ./program_no_lto
Sum: 216474736 , Time taken: 0.000073 seconds

$ ./program_lto
Sum: 216474736 , Time taken: 0.000024 seconds

Binary sizes:

-rwxr-xr-x 1 user user 15760 May 24 20:52 program_no_lto
-rwxr-xr-x 1 user user 15664 May 24 20:52 program_lto

Disassembly Comparison

LTO significantly simplifies the machine code by inlining and optimizing loops. Here’s a snippet showing how square() is removed in the LTO version and replaced by an inline multiply.

No LTO:

$ objdump -D example/program_no_lto
0000000000401050 <main>:
  401050:	41 54                	push   %r12
  401052:	55                   	push   %rbp
  401053:	31 ed                	xor    %ebp,%ebp
  401055:	53                   	push   %rbx
  401056:	31 db                	xor    %ebx,%ebx
  401058:	e8 d3 ff ff ff       	call   401030 <clock@plt>
  40105d:	49 89 c4             	mov    %rax,%r12

  /* Loop to calculate sum of squares */
  401060:	89 df                	mov    %ebx,%edi
  401062:	83 c3 01             	add    $0x1,%ebx
  401065:	e8 36 01 00 00       	call   4011a0 <square> /* call to square() */
  40106a:	01 c5                	add    %eax,%ebp
  40106c:	81 fb a0 86 01 00    	cmp    $0x186a0,%ebx ; /* 0x186a0 = 100000 */
  401072:	75 ec                	jne    401060 <main+0x10>
  /* End of loop */

  401074:	e8 b7 ff ff ff       	call   401030 <clock@plt>
  401079:	66 0f ef c0          	pxor   %xmm0,%xmm0
  40107d:	89 ea                	mov    %ebp,%edx
  40107f:	bf 02 00 00 00       	mov    $0x2,%edi
  401084:	4c 29 e0             	sub    %r12,%rax
  401087:	48 8d 35 7a 0f 00 00 	lea    0xf7a(%rip),%rsi        # 402008 <_IO_stdin_used+0x8>
  40108e:	f2 48 0f 2a c0       	cvtsi2sd %rax,%xmm0
  401093:	b8 01 00 00 00       	mov    $0x1,%eax
  401098:	f2 0f 5e 05 90 0f 00 	divsd  0xf90(%rip),%xmm0        # 402030 <_IO_stdin_used+0x30>
  40109f:	00 
  4010a0:	e8 9b ff ff ff       	call   401040 <__printf_chk@plt>
  4010a5:	5b                   	pop    %rbx
  4010a6:	31 c0                	xor    %eax,%eax
  4010a8:	5d                   	pop    %rbp
  4010a9:	41 5c                	pop    %r12
  4010ab:	c3                   	ret
  4010ac:	0f 1f 40 00          	nopl   0x0(%rax)

  /* Function square */
00000000004011a0 <square>:
  4011a0:	0f af ff             	imul   %edi,%edi
  4011a3:	89 f8                	mov    %edi,%eax
  4011a5:	31 ff                	xor    %edi,%edi
  4011a7:	c3                   	ret

With LTO:

$ objdump -D example/program_lto
0000000000401060 <main>:
  401060:	55                   	push   %rbp
  401061:	53                   	push   %rbx
  401062:	31 db                	xor    %ebx,%ebx
  401064:	48 83 ec 08          	sub    $0x8,%rsp
  401068:	e8 c3 ff ff ff       	call   401030 <clock@plt>
  40106d:	48 89 c5             	mov    %rax,%rbp
  401070:	31 c0                	xor    %eax,%eax 
  401072:	66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
  401079:	00 00 00 00 
  40107d:	0f 1f 00             	nopl   (%rax)

  /* Loop to calculate sum of squares */
  401080:	89 c2                	mov    %eax,%edx  /* edx = eax  (temp = i) */
  401082:	0f af d0             	imul   %eax,%edx  /* multiplication  edx = edx * eax (temp*i = i*i) */
  401085:	83 c0 01             	add    $0x1,%eax  /*  eax = eax + 1 (i++) */
  401088:	01 d3                	add    %edx,%ebx  /* ebx = ebx + edx (ebx + i*i ) , ebx = sum */
  40108a:	3d a0 86 01 00       	cmp    $0x186a0,%eax  /* 0x186a0 = 100000 */
  40108f:	75 ef                	jne    401080 <main+0x20>
  /* End of loop */

  401091:	e8 9a ff ff ff       	call   401030 <clock@plt>
  401096:	66 0f ef c0          	pxor   %xmm0,%xmm0
  40109a:	89 da                	mov    %ebx,%edx 
  40109c:	bf 02 00 00 00       	mov    $0x2,%edi
  4010a1:	48 29 e8             	sub    %rbp,%rax
  4010a4:	48 8d 35 5d 0f 00 00 	lea    0xf5d(%rip),%rsi        # 402008 <_IO_stdin_used+0x8>
  4010ab:	f2 48 0f 2a c0       	cvtsi2sd %rax,%xmm0
  4010b0:	b8 01 00 00 00       	mov    $0x1,%eax
  4010b5:	f2 0f 5e 05 73 0f 00 	divsd  0xf73(%rip),%xmm0        # 402030 <_IO_stdin_used+0x30>
  4010bc:	00 
  4010bd:	e8 7e ff ff ff       	call   401040 <__printf_chk@plt>
  4010c2:	48 83 c4 08          	add    $0x8,%rsp
  4010c6:	31 c0                	xor    %eax,%eax
  4010c8:	5b                   	pop    %rbx
  4010c9:	5d                   	pop    %rbp
  4010ca:	c3                   	ret
  4010cb:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

The inlining is clear in the assembly code. Without LTO, the loop calls square() at 0x401065. With LTO, that call is gone and the multiplication happens inline at 0x401082:

/* From the non-LTO version */
call   4011a0 <square>

/* To an inline multiplication in the LTO version */
mov %eax, %edx   -> edx = eax
imul %eax, %edx  -> edx = edx * eax = i * i

Final Thoughts

In most embedded projects, LTO is a low-effort optimization whose benefits outweigh its trade-offs:

Smaller firmware
Faster execution

References

Extras

Here is the Makefile used to build the example:

# Compiler and flags
CC = gcc
CFLAGS = -O2
LTO_FLAGS = -flto

# Directories
EXAMPLE_DIR = example

EXAMPLE_SRC = $(EXAMPLE_DIR)/main.c $(EXAMPLE_DIR)/math_utils.c
EXAMPLE_OBJ = $(EXAMPLE_SRC:.c=.o)
EXAMPLE_OBJ_LTO = $(EXAMPLE_SRC:.c=_lto.o)

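# Targets (phony: they name actions, not the files they produce)
.PHONY: all clean example1_no_lto example1_lto
all: example1_no_lto example1_lto

# Example 1: Without LTO
example1_no_lto: $(EXAMPLE_OBJ)
	$(CC) $(CFLAGS) $(EXAMPLE_OBJ) -o $(EXAMPLE_DIR)/program_no_lto

# Example 1: With LTO (pass the same flags at link time, where LTO does its work)
example1_lto: $(EXAMPLE_OBJ_LTO)
	$(CC) $(CFLAGS) $(LTO_FLAGS) $(EXAMPLE_OBJ_LTO) -o $(EXAMPLE_DIR)/program_lto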

# Compile source files to object files (without LTO)
$(EXAMPLE_DIR)/%.o: $(EXAMPLE_DIR)/%.c
	$(CC) $(CFLAGS) -c $< -o $@

# Compile source files to object files (with LTO)
$(EXAMPLE_DIR)/%_lto.o: $(EXAMPLE_DIR)/%.c
	$(CC) $(CFLAGS) $(LTO_FLAGS) -c $< -o $@

# Clean up
clean:
	rm -f $(EXAMPLE_DIR)/*.o $(EXAMPLE_DIR)/program_*