So far we have a number of possible problems scenarios with thumb:
1) the thumb code we generate might be broken
2) if the user links to .a's that are built with thumb enabled, and we might fail to deal with this (or apple's linker might fail)
both of these scenarios involving the app crashing on device in impossible to diagnose ways.
there is really no helpful info anywhere, and the only way to figure out if things are broken in case 2) above is to run otool -h on each .a and figure out if thumb is involved. when we find out it is, historically our response is "sorry, don't use thumb." This sucks when the .a isn't built by the user.
we need to figure out the root causes for 1 and 2 and either see if we can fix them, or add whatever checks are necessary in MD to disable the use of thumb in MT apps entirely.
*** Bug 1568 has been marked as a duplicate of this bug. ***
Is there no easy way to detect this? Coz this plagued me for months.
The only known way to do it right is is to run:
$ otool -tv thelibrary.a | grep ^.......2
and if anything is printed, it's thumb.
*** Bug 1686 has been marked as a duplicate of this bug. ***
Created attachment 777 [details]
Zoltan, I hope this description can help you fix / workaround this bug:
The issue is that the linker inserts code in between methods for an assembly, and since we store offsets to methods in the aot compiled assembly those offsets end up all wrong once the linker has done its job.
To illustrate attached is a partial disassembled mscorlib.
For System_Runtime_Remoting__Lease_CheckNextSponsor the code offset is correct. The next method we aot compiled is System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool (line 969), which in mscorlib.s (and mscorlib.o) is just after System_Runtime_Remoting__Lease_CheckNextSponsor, and the offset we store in the aot image makes the runtime return line 162 as the start of System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool, but that is where the linker has decided to do some magic and has inserted some other code. This causes us to randomly execute random code.
I just tried storing the actual location of the method in code_offsets instead of just a calculated offset, and the code_offsets array is then correctly linked to point to the actual methods and not random memory (but this has other consequences so it doesn't work as-is):
Zoltan: last comment has a description of the problem
Sorry, the attachment is a from the final linked binary, not mscorlib.
Storing the actual methods can work, however, it might require runtime relocations, so it might be slower. Can you check that at runtime, the array
contains the actual methods ? i.e. at aot-runtime.c, around line 1209, the
'amodule->code_offsets' array should point to the methods themselves.
Created attachment 783 [details]
I can confirm that with the following patch: https://gist.github.com/924cf21eb19aafce003c the elements of the code_offsets array point to the methods themselves (see attached gdb log).
Please try the attached patch with complex apps.
Created attachment 791 [details]
This gets me a little bit further, now it crashes with this:
* Assertion at ../../../../../mono/mono/mini/aot-runtime.c:3257, condition `plt_entry' not met
at MonoTouch.ObjCRuntime.Runtime..cctor () <0x00023>
at (wrapper runtime-invoke) object.runtime_invoke_dynamic (intptr,intptr,intptr,intptr) <0xffffffff>
0 test1 0x00e02e58 mono_handle_native_sigsegv + 456
1 test1 0x00e311b8 sigabrt_signal_handler + 168
2 libsystem_c.dylib 0x309ec539 _sigtramp + 48
3 libsystem_c.dylib 0x309e1f5b pthread_kill + 54
4 libsystem_c.dylib 0x309dafeb abort + 94
5 test1 0x00f80ed4 monoeg_g_logv + 296
6 test1 0x00f80fd4 monoeg_assertion_message + 80
7 test1 0x00df7604 mono_aot_get_plt_info_offset + 96
8 test1 0x00e05758 mono_aot_plt_trampoline + 48
9 test1 0x0087eb98 generic_trampoline_aot_plt + 136
10 test1 0x007fb830 wrapper_runtime_invoke_object_runtime_invoke_dynamic_intptr_intptr_intptr_intptr + 200
11 test1 0x00ddb5ac mono_jit_runtime_invoke + 2892
12 test1 0x00eef824 mono_runtime_invoke + 200
13 test1 0x00ee8118 mono_runtime_class_init_full + 2060
14 test1 0x00ee78fc mono_runtime_class_init + 28
15 test1 0x00dd8f7c mono_jit_compile_method_inner + 224
16 test1 0x00dda1a8 mono_jit_compile_method_with_opt + 660
17 test1 0x00ddadd0 mono_jit_runtime_invoke + 880
18 test1 0x00eef824 mono_runtime_invoke + 200
19 test1 0x00fa6e44 monotouch_init + 640
20 test1 0x00dd1338 main + 3740
21 test1 0x0000346c start + 52
I've seen this crash once in a while before too, without the aot patch, so this is likely another issue. I will continue debugging this next week.
Note that if I don't link my test project with the thumb library everything works as it used to even with the patch, so the patch is definitively not the problem.
It's trying to get the plt entry for this branch:
0x98bc00 <MonoTouch_ObjCRuntime_Runtime__cctor+28>: ldr r0, [pc, r0]
0x98bc04 <MonoTouch_ObjCRuntime_Runtime__cctor+32>: bl 0xa07148 <p_17.island.3>
0x98bc08 <MonoTouch_ObjCRuntime_Runtime__cctor+36>: str r0, [sp]
but 0xa07148 isn't between amodule->plt (0xd0c300) and amodule->plt_end (0xd21560)
and a disassemble for 0xa07148 shows:
(gdb) disass 0xa07148
Dump of assembler code for function p_17.island.3:
0x00a07148 <p_17.island.3+0>: b 0xd0c400 <p_17>
End of assembler dump.
(gdb) disass 0xd0c400
Dump of assembler code for function p_17:
0x00d0c400 <p_17+0>: ldr r12, [pc, #0] ; 0xd0c408 <p_17+8>
0x00d0c404 <p_17+4>: ldr pc, [pc, r12]
0x00d0c408 <p_17+8>: eorseq lr, r6, r8, lsl r0
0x00d0c40c <p_17+12>: andeq r3, r6, r1, asr #21
End of assembler dump.
Note that the branch in p_17.island.3 is into the plt (an extra indirection?)
So with this patch: https://gist.github.com/0b4f9dfbedd8883f1702 I get even further, now it just exits at some point without any printf or warning. This will have to wait until next week.
That is probably an extra stub inserted by the linker for some reason.
It looks like it's stubs the linker inserts when the branch target is out of reach for the branch instruction - another branch instruction is inserted in between the original instruction and its target to bridge them.
ARM calls it veneers, Apple seems to be calling it islands.
This could also be reason for #539 - random crashes on startup with huge apps where rebuilds may or may not fix it.
*** Bug 539 has been marked as a duplicate of this bug. ***
So the bridges/islands the linker inserts are causing more problems, in pseudo c code:
void foo ()
void bar ()
if the linker inserts an island between foo and bar, the branch to bar in foo isn't updated so when it's executed it jumps to a random code location.
The problem seems to be that there are no relocation entries for the call to bar in foo in the .o file (I presume there is a way to add the relocation entry in the .s assembly code?).
I tried creating a test case using Xcode/C/gcc, and all the calls are fixed up correctly when the linker inserts islands (and the call locations have a corresponding relocation entry).
Created attachment 871 [details]
fix 1/3: Follow jumps when getting the plt entry for a call.
Created attachment 872 [details]
fix 2/3: Use direct method addresses instead of calculating offsets since the linker may move/reorder methods
This is the patch from comment #15 applied to mobile-master
Created attachment 873 [details]
fix 3/3: Use global symbols so the arm linker relocates them properly when methods are reordered.
All of them should probably be arch-specific somehow.
Zoltan, please review :)
If I build with LLVM compiler, but *not* Thumb-2 mode, will it work, or will I still get these errors?
Jamie: You might/might not, LLVM is not able to compile all mono code, so we still fall back to our compiler, but it might "fix" the issue.
This bug has been fixed.
The fix will be included in the first release after 5.1 (which is already about to be released).
*** Bug 1992 has been marked as a duplicate of this bug. ***
*** Bug 2652 has been marked as a duplicate of this bug. ***
*** Bug 3205 has been marked as a duplicate of this bug. ***