This is Xamarin's bug tracking system. For product support, please use the support links listed in your Xamarin Account.
Bug 707 - Thumb breaks things in many ways
: Thumb breaks things in many ways
Status: RESOLVED FIXED
Product: iOS
Classification: Xamarin
Component: General
: 4.x
: Macintosh Mac OS
: --- major
: Untriaged
Assigned To: Bugzilla
:
:
:
:
  Show dependency treegraph
 
Reported: 2011-09-09 19:49 EDT by Chris Toshok
Modified: 2012-02-20 17:55 EST (History)
11 users (show)

See Also:
Tags:
Test Case URL:
External Submit: ---


Attachments
partial assembly (32.52 KB, text/plain)
2011-10-27 08:46 EDT, Rolf Bjarne Kvinge
Details
gdb log (1.40 KB, text/plain)
2011-10-27 16:36 EDT, Rolf Bjarne Kvinge
Details
patch (4.69 KB, patch)
2011-10-28 12:24 EDT, Zoltan Varga
Details | Diff
fix 1/3: Follow jumps when getting the plt entry for a call. (1.61 KB, patch)
2011-11-11 08:06 EST, Rolf Bjarne Kvinge
Details | Diff
fix 2/3: Use direct method addresses instead of calculating offsets since the linker may move/reorder methods (4.87 KB, patch)
2011-11-11 08:07 EST, Rolf Bjarne Kvinge
Details | Diff
fix 3/3: Use global symbols so the arm linker relocates them properly when methods are reordered. (851 bytes, patch)
2011-11-11 08:08 EST, Rolf Bjarne Kvinge
Details | Diff

Description Chris Toshok 2011-09-09 19:49:00 EDT
So far we have a number of possible problems scenarios with thumb:

1) the thumb code we generate might be broken

2) if the user links to .a's that are built with thumb enabled, and we might
fail to deal with this (or apple's linker might fail)

both of these scenarios involving the app crashing on device in impossible to
diagnose ways.

there is really no helpful info anywhere, and the only way to figure out if
things are broken in case 2) above is to run otool -h on each .a and figure out
if thumb is involved.  when we find out it is, historically our response is
"sorry, don't use thumb."  This sucks when the .a isn't built by the user.

we need to figure out the root causes for 1 and 2 and either see if we can fix
them, or add whatever checks are necessary in MD to disable the use of thumb in
MT apps entirely.
Comment 2 Rolf Bjarne Kvinge 2011-10-20 08:22:22 EDT
*** Bug 1568 has been marked as a duplicate of this bug. ***
Comment 3 Jamie Briant 2011-10-20 21:16:32 EDT
Is there no easy way to detect this? Coz this plagued me for months.
Comment 4 Jamie Briant 2011-10-20 21:17:09 EDT
s/easy/automated/
Comment 5 Rolf Bjarne Kvinge 2011-10-21 18:05:45 EDT
The only known way to do it right is is to run:

  $ otool -tv thelibrary.a | grep ^.......2

and if anything is printed, it's thumb.
Comment 6 Rolf Bjarne Kvinge 2011-10-26 16:33:14 EDT
*** Bug 1686 has been marked as a duplicate of this bug. ***
Comment 9 Rolf Bjarne Kvinge 2011-10-27 08:46:53 EDT
Created attachment 777 [details]
partial assembly

Zoltan, I hope this description can help you fix / workaround this bug:

The issue is that the linker inserts code in between methods for an assembly,
and since we store offsets to methods in the aot compiled assembly those
offsets end up all wrong once the linker has done its job.

To illustrate attached is a partial disassembled mscorlib.

For System_Runtime_Remoting__Lease_CheckNextSponsor the code offset is correct.
The next method we aot compiled is 
System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool (line
969), which in mscorlib.s (and mscorlib.o) is just after
System_Runtime_Remoting__Lease_CheckNextSponsor, and the offset we store in the
aot image makes the runtime return line 162 as the start of
System_Runtime_Remoting_Lifetime_Lease_ProcessSponsorResponse_object_bool, but
that is where the linker has decided to do some magic and has inserted some
other code. This causes us to randomly execute random code.

I just tried storing the actual location of the method in code_offsets instead
of just a calculated offset, and the code_offsets array is then correctly
linked to point to the actual methods and not random memory (but this has other
consequences so it doesn't work as-is):

code_offsets:
.LDIFF_SYM0=L_m_0
    .long .LDIFF_SYM0
...
Comment 10 Rolf Bjarne Kvinge 2011-10-27 08:48:18 EDT
Zoltan: last comment has a description of the problem
Comment 11 Rolf Bjarne Kvinge 2011-10-27 08:49:41 EDT
Sorry, the attachment is a from the final linked binary, not mscorlib.
Comment 12 Zoltan Varga 2011-10-27 12:46:57 EDT
Storing the actual methods can work, however, it might require runtime
relocations, so it might be slower. Can you check that at runtime, the array
contains the actual methods ? i.e. at aot-runtime.c, around line 1209, the
'amodule->code_offsets' array should point to the methods themselves.
Comment 13 Rolf Bjarne Kvinge 2011-10-27 16:36:52 EDT
Created attachment 783 [details]
gdb log

I can confirm that with the following patch:
https://gist.github.com/924cf21eb19aafce003c the elements of the code_offsets
array point to the methods themselves (see attached gdb log).
Comment 14 Zoltan Varga 2011-10-28 12:23:46 EDT
Please try the attached patch with complex apps.
Comment 15 Zoltan Varga 2011-10-28 12:24:20 EDT
Created attachment 791 [details]
patch
Comment 16 Rolf Bjarne Kvinge 2011-10-28 19:25:40 EDT
This gets me a little bit further, now it crashes with this:

* Assertion at ../../../../../mono/mono/mini/aot-runtime.c:3257, condition
`plt_entry' not met
Stacktrace:
  at MonoTouch.ObjCRuntime.Runtime..cctor () <0x00023>
  at (wrapper runtime-invoke) object.runtime_invoke_dynamic
(intptr,intptr,intptr,intptr) <0xffffffff>
Native stacktrace:
 0   test1                        0x00e02e58 mono_handle_native_sigsegv + 456
 1   test1                        0x00e311b8 sigabrt_signal_handler + 168
 2   libsystem_c.dylib  0x309ec539 _sigtramp + 48
 3   libsystem_c.dylib  0x309e1f5b pthread_kill + 54
 4   libsystem_c.dylib  0x309dafeb abort + 94
 5   test1                        0x00f80ed4 monoeg_g_logv + 296
 6   test1                        0x00f80fd4 monoeg_assertion_message + 80
 7   test1                        0x00df7604 mono_aot_get_plt_info_offset + 96
 8   test1                        0x00e05758 mono_aot_plt_trampoline + 48
 9   test1                        0x0087eb98 generic_trampoline_aot_plt + 136
 10  test1                       0x007fb830
wrapper_runtime_invoke_object_runtime_invoke_dynamic_intptr_intptr_intptr_intptr
+ 200
 11  test1                       0x00ddb5ac mono_jit_runtime_invoke + 2892
 12  test1                       0x00eef824 mono_runtime_invoke + 200
 13  test1                       0x00ee8118 mono_runtime_class_init_full + 2060
 14  test1                       0x00ee78fc mono_runtime_class_init + 28
 15  test1                       0x00dd8f7c mono_jit_compile_method_inner + 224
 16  test1                       0x00dda1a8 mono_jit_compile_method_with_opt +
660
 17  test1                       0x00ddadd0 mono_jit_runtime_invoke + 880
 18  test1                       0x00eef824 mono_runtime_invoke + 200
 19  test1                       0x00fa6e44 monotouch_init + 640
 20  test1                       0x00dd1338 main + 3740
 21  test1                       0x0000346c start + 52

I've seen this crash once in a while before too, without the aot patch, so this
is likely another issue. I will continue debugging this next week.

Note that if I don't link my test project with the thumb library everything
works as it used to even with the patch, so the patch is definitively not the
problem.
Comment 17 Rolf Bjarne Kvinge 2011-10-28 20:06:44 EDT
It's trying to get the plt entry for this branch:

0x98bc00 <MonoTouch_ObjCRuntime_Runtime__cctor+28>:    ldr    r0, [pc, r0]
0x98bc04 <MonoTouch_ObjCRuntime_Runtime__cctor+32>:    bl    0xa07148
<p_17.island.3>
0x98bc08 <MonoTouch_ObjCRuntime_Runtime__cctor+36>:    str    r0, [sp]

but 0xa07148 isn't between amodule->plt (0xd0c300) and amodule->plt_end
(0xd21560)

and a disassemble for 0xa07148 shows:

(gdb) disass 0xa07148
Dump of assembler code for function p_17.island.3:
0x00a07148 <p_17.island.3+0>:    b    0xd0c400 <p_17>
End of assembler dump.
(gdb) disass 0xd0c400
Dump of assembler code for function p_17:
0x00d0c400 <p_17+0>:    ldr    r12, [pc, #0]    ; 0xd0c408 <p_17+8>
0x00d0c404 <p_17+4>:    ldr    pc, [pc, r12]
0x00d0c408 <p_17+8>:    eorseq    lr, r6, r8, lsl r0
0x00d0c40c <p_17+12>:    andeq    r3, r6, r1, asr #21
End of assembler dump.
(gdb) 

Note that the branch in p_17.island.3 is into the plt (an extra indirection?)
Comment 18 Rolf Bjarne Kvinge 2011-10-28 20:23:49 EDT
So with this patch: https://gist.github.com/0b4f9dfbedd8883f1702 I get even
further, now it just exits at some point without any printf or warning. This
will have to wait until next week.
Comment 19 Zoltan Varga 2011-10-29 17:04:53 EDT
That is probably an extra stub inserted by the linker for some reason.
Comment 20 Rolf Bjarne Kvinge 2011-10-31 08:34:29 EDT
It looks like it's stubs the linker inserts when the branch target is out of
reach for the branch instruction - another branch instruction is inserted in
between the original instruction and its target to bridge them.

ARM calls it veneers, Apple seems to be calling it islands.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0285b/Cchhgfaj.html
Comment 21 Rolf Bjarne Kvinge 2011-10-31 08:41:49 EDT
This could also be reason for #539 - random crashes on startup with huge apps
where rebuilds may or may not fix it.
Comment 22 Rolf Bjarne Kvinge 2011-10-31 08:43:09 EDT
*** Bug 539 has been marked as a duplicate of this bug. ***
Comment 23 Rolf Bjarne Kvinge 2011-11-10 20:47:28 EST
So the bridges/islands the linker inserts are causing more problems, in pseudo
c code:

void foo ()
{
   bar ();
}
void bar ()
{
}

if the linker inserts an island between foo and bar, the branch to bar in foo
isn't updated so when it's executed it jumps to a random code location.

The problem seems to be that there are no relocation entries for the call to
bar in foo in the .o file (I presume there is a way to add the relocation entry
in the .s assembly code?).

I tried creating a test case using Xcode/C/gcc, and all the calls are fixed up
correctly when the linker inserts islands (and the call locations have a
corresponding relocation entry).
Comment 24 Rolf Bjarne Kvinge 2011-11-11 08:06:41 EST
Created attachment 871 [details]
fix 1/3: Follow jumps when getting the plt entry for a call.
Comment 25 Rolf Bjarne Kvinge 2011-11-11 08:07:50 EST
Created attachment 872 [details]
fix 2/3: Use direct method addresses instead of calculating offsets since the
linker may move/reorder methods

This is the patch from comment #15 applied to mobile-master
Comment 26 Rolf Bjarne Kvinge 2011-11-11 08:08:17 EST
Created attachment 873 [details]
fix 3/3: Use global symbols so the arm linker relocates them properly when
methods are reordered.
Comment 27 Rolf Bjarne Kvinge 2011-11-11 08:09:56 EST
All of them should probably be arch-specific somehow.

Zoltan, please review :)
Comment 28 Jamie Briant 2011-11-19 02:06:39 EST
If I build with LLVM compiler, but *not* Thumb-2 mode, will it work, or will I
still get these errors?
Comment 29 Zoltan Varga 2011-11-19 14:23:33 EST
Jamie: You might/might not, LLVM is not able to compile all mono code, so we
still fall back to our compiler, but it might "fix" the issue.
Comment 30 Rolf Bjarne Kvinge 2011-11-28 11:56:35 EST
This bug has been fixed.

The fix will be included in the first release after 5.1 (which is already about
to be released).
Comment 31 Rolf Bjarne Kvinge 2011-11-29 07:45:05 EST
*** Bug 1992 has been marked as a duplicate of this bug. ***
Comment 32 Rolf Bjarne Kvinge 2012-01-02 18:56:45 EST
*** Bug 2652 has been marked as a duplicate of this bug. ***
Comment 33 Rolf Bjarne Kvinge 2012-02-20 17:55:09 EST
*** Bug 3205 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.