Created attachment 6534
A synthetic benchmark
The cold-start of an application takes a significant amount of time on lower-end devices, like the iPhone 4 and iPad 2.
The profiling sessions show that the time is spent in generic_trampoline_delegate, a method that will build the trampolines for generic method calls.
I’ve attached a synthetic benchmark that reproduces the issue. It contains thousands of generic methods, used in a way similar to how our application uses them.
On an iPad 2, the first call of the chain takes about 1150ms, whereas the second call takes about 4ms (four). The caching mechanism of the mini runtime is working properly, since the time drops significantly.
However, during the first calls, the time taken to resolve the methods is significant, and seems to increase linearly with the number of generic types present in the application domain.
We tried parallelizing as a tentative performance improvement, but it does not seem to have any impact: when calling the same code on two threads, the first call takes 2330ms, whereas on the second call both threads take 4ms (four).
Note that the ratio between cold and warm time is *very* different on the latest Apple devices, like the 5S, where the cold duration drops by a factor of 4.
This has a great impact on the performance of the app as perceived by the consumer, even though performance is great once the app is warmed up.
Thanks for the testcase.
Checked in a fix to mono master 078dc0321d53f9e161957656550fd10cc41db618/mono-3.4.0 0081c27e0d6473a83cc856abf67c4a42dc21b53d.
It improves the first run of the benchmark from 1.1s to 0.4s for me.
Thanks Zoltan, that's quite an improvement :)
Would you know if that also improves the performance in multi-thread scenarios?
It probably does.
I'm asking because of this:
Where there is contention when resolving the generic methods. The work being done inside the lock is pretty significant...
This patch introduced a regression: Mono no longer bootstraps, see:
The changes were reverted from master/3.4.0 for now.
@Jerome: Will look at reducing the work done inside the lock.
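For readers unfamiliar with the pattern, here is a generic sketch of what "reducing the work done inside the lock" usually means. It is Python and purely illustrative: the `Resolver` class and its methods are hypothetical, and it does not describe what the actual Mono fix does. Because of Python's global interpreter lock it shows only the structure of the two variants, not a real speedup.

```python
import threading


class Resolver:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def _build(self, key):
        # Placeholder for the expensive resolution work.
        return sum(range(1000)) + key

    def resolve_coarse(self, key):
        # All the work happens under the lock, so threads serialize on it.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._build(key)
            return self._cache[key]

    def resolve_fine(self, key):
        # The lock is held only for the lookup...
        with self._lock:
            hit = self._cache.get(key)
        if hit is not None:
            return hit
        # ...the expensive work runs outside it (two threads may duplicate it)...
        built = self._build(key)
        # ...and the lock is taken again only to publish the result.
        with self._lock:
            # setdefault keeps whichever value was published first.
            return self._cache.setdefault(key, built)


def worker(resolver, keys):
    for k in keys:
        resolver.resolve_fine(k)


resolver = Resolver()
threads = [
    threading.Thread(target=worker, args=(resolver, range(1000)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(resolver._cache))
```

The trade-off in the second variant is possible duplicated work when two threads miss on the same key at the same time, in exchange for a much shorter critical section.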
Committed a corrected fix to mono master ea490c5486af6e1ce6ce8b1a117f1d99cf988df0. It will be in a future mt version after some testing.
The corresponding change on the 3.4.0 branch is 28145e01f42317e685ad1020a47ba746f164c28b.
Using the same PoC, the run time is down from 1150ms to 268ms on the same hardware.
Great improvement Zoltan, thanks!
Note that the behavior for multi-thread is vastly better, but still slower than the single-cpu test. (2330ms down to 380ms)
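To put the reported numbers side by side, the ratios can be computed directly. This is only arithmetic on the figures quoted in this thread; the `speedup` helper is a hypothetical name used for illustration.

```python
def speedup(before_ms, after_ms):
    """How many times faster the 'after' measurement is."""
    return before_ms / after_ms


single = speedup(1150, 268)   # single-thread cold run, before vs. after the fix
dual = speedup(2330, 380)     # two-thread cold run, before vs. after the fix
gap = 380 / 268               # two-thread run relative to single-thread, after the fix

print(f"single thread: {single:.2f}x, two threads: {dual:.2f}x, remaining gap: {gap:.2f}x")
```

That is roughly a 4.3x improvement for the single-thread case and 6.1x for two threads, with the two-thread run still about 1.4x slower than the single-thread one.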
This fix is part of the 7.2.6 release (in the alpha channel right now).
As per comment 9, this is working fine now, i.e. the run time is down from 1150ms to 268ms on the same hardware.
Hence closing this issue.