This is Xamarin's bug tracking system. For product support, please use the support links listed in your Xamarin Account.
Bug 707 - Thumb breaks things in many ways
Summary: Thumb breaks things in many ways
Alias: None
Product: iOS
Classification: Xamarin
Component: General (show other bugs)
Version: 4.x
Hardware: Macintosh Mac OS
: --- major
Target Milestone: Untriaged
Assignee: Bugzilla
: 539 1568 1686 1992 (view as bug list)
Depends on:
Reported: 2011-09-09 19:49 UTC by Chris Toshok
Modified: 2012-02-20 17:55 UTC (History)
11 users (show)

See Also:
Is this bug a regression?: ---
Last known good build:

partial assembly (32.52 KB, text/plain)
2011-10-27 08:46 UTC, Rolf Bjarne Kvinge
gdb log (1.40 KB, text/plain)
2011-10-27 16:36 UTC, Rolf Bjarne Kvinge
patch (4.69 KB, patch)
2011-10-28 12:24 UTC, Zoltan Varga
Details | Diff
fix 1/3: Follow jumps when getting the plt entry for a call. (1.61 KB, patch)
2011-11-11 08:06 UTC, Rolf Bjarne Kvinge
Details | Diff
fix 2/3: Use direct method addresses instead of calculating offsets since the linker may move/reorder methods (4.87 KB, patch)
2011-11-11 08:07 UTC, Rolf Bjarne Kvinge
Details | Diff
fix 3/3: Use global symbols so the arm linker relocates them properly when methods are reordered. (851 bytes, patch)
2011-11-11 08:08 UTC, Rolf Bjarne Kvinge
Details | Diff

Description Chris Toshok 2011-09-09 19:49:00 UTC
So far we have a number of possible problems scenarios with thumb:

1) the thumb code we generate might be broken

2) if the user links to .a's that are built with thumb enabled, and we might fail to deal with this (or apple's linker might fail)

both of these scenarios involving the app crashing on device in impossible to diagnose ways.

there is really no helpful info anywhere, and the only way to figure out if things are broken in case 2) above is to run otool -h on each .a and figure out if thumb is involved.  when we find out it is, historically our response is "sorry, don't use thumb."  This sucks when the .a isn't built by the user.

we need to figure out the root causes for 1 and 2 and either see if we can fix them, or add whatever checks are necessary in MD to disable the use of thumb in MT apps entirely.
Comment 2 Rolf Bjarne Kvinge 2011-10-20 08:22:22 UTC
*** Bug 1568 has been marked as a duplicate of this bug. ***
Comment 3 Jamie Briant 2011-10-20 21:16:32 UTC
Is there no easy way to detect this? Coz this plagued me for months.
Comment 4 Jamie Briant 2011-10-20 21:17:09 UTC
Comment 5 Rolf Bjarne Kvinge 2011-10-21 18:05:45 UTC
The only known way to do it right is is to run:

  $ otool -tv thelibrary.a | grep ^.......2

and if anything is printed, it's thumb.
Comment 6 Rolf Bjarne Kvinge 2011-10-26 16:33:14 UTC
*** Bug 1686 has been marked as a duplicate of this bug. ***
Comment 9 Rolf Bjarne Kvinge 2011-10-27 08:46:53 UTC
Created attachment 777 [details]
partial assembly

Zoltan, I hope this description can help you fix / workaround this bug:

The issue is that the linker inserts code in between methods for an assembly, and since we store offsets to methods in the aot compiled assembly those offsets end up all wrong once the linker has done its job.

To illustrate attached is a partial disassembled mscorlib.

For System_Runtime_Remoting__Lease_CheckNextSponsor the code offset is correct. The next method we aot compiled is  System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool (line 969), which in mscorlib.s (and mscorlib.o) is just after System_Runtime_Remoting__Lease_CheckNextSponsor, and the offset we store in the aot image makes the runtime return line 162 as the start of System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool, but that is where the linker has decided to do some magic and has inserted some other code. This causes us to randomly execute random code.

I just tried storing the actual location of the method in code_offsets instead of just a calculated offset, and the code_offsets array is then correctly linked to point to the actual methods and not random memory (but this has other consequences so it doesn't work as-is):

	.long .LDIFF_SYM0
Comment 10 Rolf Bjarne Kvinge 2011-10-27 08:48:18 UTC
Zoltan: last comment has a description of the problem
Comment 11 Rolf Bjarne Kvinge 2011-10-27 08:49:41 UTC
Sorry, the attachment is a from the final linked binary, not mscorlib.
Comment 12 Zoltan Varga 2011-10-27 12:46:57 UTC
Storing the actual methods can work, however, it might require runtime relocations, so it might be slower. Can you check that at runtime, the array
contains the actual methods ? i.e. at aot-runtime.c, around line 1209, the
'amodule->code_offsets' array should point to the methods themselves.
Comment 13 Rolf Bjarne Kvinge 2011-10-27 16:36:52 UTC
Created attachment 783 [details]
gdb log

I can confirm that with the following patch: the elements of the code_offsets array point to the methods themselves (see attached gdb log).
Comment 14 Zoltan Varga 2011-10-28 12:23:46 UTC
Please try the attached patch with complex apps.
Comment 15 Zoltan Varga 2011-10-28 12:24:20 UTC
Created attachment 791 [details]
Comment 16 Rolf Bjarne Kvinge 2011-10-28 19:25:40 UTC
This gets me a little bit further, now it crashes with this:

* Assertion at ../../../../../mono/mono/mini/aot-runtime.c:3257, condition `plt_entry' not met
  at MonoTouch.ObjCRuntime.Runtime..cctor () <0x00023>
  at (wrapper runtime-invoke) object.runtime_invoke_dynamic (intptr,intptr,intptr,intptr) <0xffffffff>
Native stacktrace:
 0   test1                        0x00e02e58 mono_handle_native_sigsegv + 456
 1   test1                        0x00e311b8 sigabrt_signal_handler + 168
 2   libsystem_c.dylib  0x309ec539 _sigtramp + 48
 3   libsystem_c.dylib  0x309e1f5b pthread_kill + 54
 4   libsystem_c.dylib  0x309dafeb abort + 94
 5   test1                        0x00f80ed4 monoeg_g_logv + 296
 6   test1                        0x00f80fd4 monoeg_assertion_message + 80
 7   test1                        0x00df7604 mono_aot_get_plt_info_offset + 96
 8   test1                        0x00e05758 mono_aot_plt_trampoline + 48
 9   test1                        0x0087eb98 generic_trampoline_aot_plt + 136
 10  test1                       0x007fb830 wrapper_runtime_invoke_object_runtime_invoke_dynamic_intptr_intptr_intptr_intptr + 200
 11  test1                       0x00ddb5ac mono_jit_runtime_invoke + 2892
 12  test1                       0x00eef824 mono_runtime_invoke + 200
 13  test1                       0x00ee8118 mono_runtime_class_init_full + 2060
 14  test1                       0x00ee78fc mono_runtime_class_init + 28
 15  test1                       0x00dd8f7c mono_jit_compile_method_inner + 224
 16  test1                       0x00dda1a8 mono_jit_compile_method_with_opt + 660
 17  test1                       0x00ddadd0 mono_jit_runtime_invoke + 880
 18  test1                       0x00eef824 mono_runtime_invoke + 200
 19  test1                       0x00fa6e44 monotouch_init + 640
 20  test1                       0x00dd1338 main + 3740
 21  test1                       0x0000346c start + 52

I've seen this crash once in a while before too, without the aot patch, so this is likely another issue. I will continue debugging this next week.

Note that if I don't link my test project with the thumb library everything works as it used to even with the patch, so the patch is definitively not the problem.
Comment 17 Rolf Bjarne Kvinge 2011-10-28 20:06:44 UTC
It's trying to get the plt entry for this branch:

0x98bc00 <MonoTouch_ObjCRuntime_Runtime__cctor+28>:	ldr	r0, [pc, r0]
0x98bc04 <MonoTouch_ObjCRuntime_Runtime__cctor+32>:	bl	0xa07148 <p_17.island.3>
0x98bc08 <MonoTouch_ObjCRuntime_Runtime__cctor+36>:	str	r0, [sp]

but 0xa07148 isn't between amodule->plt (0xd0c300) and amodule->plt_end (0xd21560)

and a disassemble for 0xa07148 shows:

(gdb) disass 0xa07148
Dump of assembler code for function p_17.island.3:
0x00a07148 <p_17.island.3+0>:	b	0xd0c400 <p_17>
End of assembler dump.
(gdb) disass 0xd0c400
Dump of assembler code for function p_17:
0x00d0c400 <p_17+0>:	ldr	r12, [pc, #0]	; 0xd0c408 <p_17+8>
0x00d0c404 <p_17+4>:	ldr	pc, [pc, r12]
0x00d0c408 <p_17+8>:	eorseq	lr, r6, r8, lsl r0
0x00d0c40c <p_17+12>:	andeq	r3, r6, r1, asr #21
End of assembler dump.

Note that the branch in p_17.island.3 is into the plt (an extra indirection?)
Comment 18 Rolf Bjarne Kvinge 2011-10-28 20:23:49 UTC
So with this patch: I get even further, now it just exits at some point without any printf or warning. This will have to wait until next week.
Comment 19 Zoltan Varga 2011-10-29 17:04:53 UTC
That is probably an extra stub inserted by the linker for some reason.
Comment 20 Rolf Bjarne Kvinge 2011-10-31 08:34:29 UTC
It looks like it's stubs the linker inserts when the branch target is out of reach for the branch instruction - another branch instruction is inserted in between the original instruction and its target to bridge them.

ARM calls it veneers, Apple seems to be calling it islands.
Comment 21 Rolf Bjarne Kvinge 2011-10-31 08:41:49 UTC
This could also be reason for #539 - random crashes on startup with huge apps where rebuilds may or may not fix it.
Comment 22 Rolf Bjarne Kvinge 2011-10-31 08:43:09 UTC
*** Bug 539 has been marked as a duplicate of this bug. ***
Comment 23 Rolf Bjarne Kvinge 2011-11-10 20:47:28 UTC
So the bridges/islands the linker inserts are causing more problems, in pseudo c code:

void foo ()
   bar ();
void bar ()

if the linker inserts an island between foo and bar, the branch to bar in foo isn't updated so when it's executed it jumps to a random code location.

The problem seems to be that there are no relocation entries for the call to bar in foo in the .o file (I presume there is a way to add the relocation entry in the .s assembly code?).

I tried creating a test case using Xcode/C/gcc, and all the calls are fixed up correctly when the linker inserts islands (and the call locations have a corresponding relocation entry).
Comment 24 Rolf Bjarne Kvinge 2011-11-11 08:06:41 UTC
Created attachment 871 [details]
fix 1/3: Follow jumps when getting the plt entry for a call.
Comment 25 Rolf Bjarne Kvinge 2011-11-11 08:07:50 UTC
Created attachment 872 [details]
fix 2/3: Use direct method addresses instead of calculating offsets since the linker may move/reorder methods

This is the patch from comment #15 applied to mobile-master
Comment 26 Rolf Bjarne Kvinge 2011-11-11 08:08:17 UTC
Created attachment 873 [details]
fix 3/3: Use global symbols so the arm linker relocates them properly when methods are reordered.
Comment 27 Rolf Bjarne Kvinge 2011-11-11 08:09:56 UTC
All of them should probably be arch-specific somehow.

Zoltan, please review :)
Comment 28 Jamie Briant 2011-11-19 02:06:39 UTC
If I build with LLVM compiler, but *not* Thumb-2 mode, will it work, or will I still get these errors?
Comment 29 Zoltan Varga 2011-11-19 14:23:33 UTC
Jamie: You might/might not, LLVM is not able to compile all mono code, so we still fall back to our compiler, but it might "fix" the issue.
Comment 30 Rolf Bjarne Kvinge 2011-11-28 11:56:35 UTC
This bug has been fixed.

The fix will be included in the first release after 5.1 (which is already about to be released).
Comment 31 Rolf Bjarne Kvinge 2011-11-29 07:45:05 UTC
*** Bug 1992 has been marked as a duplicate of this bug. ***
Comment 32 Rolf Bjarne Kvinge 2012-01-02 18:56:45 UTC
*** Bug 2652 has been marked as a duplicate of this bug. ***
Comment 33 Rolf Bjarne Kvinge 2012-02-20 17:55:09 UTC
*** Bug 3205 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.