Bug 366 - Delegate invocation inside generic method is about 50x slower than .NET
Summary: Delegate invocation inside generic method is about 50x slower than .NET
Alias: None
Product: Runtime
Classification: Mono
Component: JIT ()
Version: unspecified
Hardware: PC Linux
Importance: --- normal
Target Milestone: ---
Assignee: Marek Safar
Depends on:
Reported: 2011-08-23 05:49 UTC by Marek Safar
Modified: 2016-04-19 08:19 UTC (History)
6 users

Is this bug a regression?: ---
Last known good build:

Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.

If you are hitting an issue that looks similar to this resolved bug and do not yet see a matching new report, please create one on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files.

Related Links:

Description Marek Safar 2011-08-23 05:49:13 UTC
using System;
using System.Diagnostics;

class Program
{
	static bool b;

	private static void Test<T> ()
	{
		Action foo = () => { if (b) throw null; };
		foo ();
	}

	static void Main ()
	{
		var sw = new Stopwatch ();
		sw.Start ();
		for (int i = 0; i < 10000000; ++i)
			Test<int> ();
		sw.Stop ();
		Console.WriteLine (sw.Elapsed.TotalMilliseconds);
	}
}

.NET 64-bit: 173.5 ms
Mono 64-bit: 9363.5 ms
Comment 1 Mark Probst 2011-08-23 08:25:19 UTC
That's because in the non-generic case mcs uses a cache to avoid generating a new delegate object every time, but not in the generic case.


    // method line 2
    .method private static hidebysig 
           default void Test<T> ()  cil managed 
        // Method begins at RVA 0x2058
	// Code size 20 (0x14)
	.maxstack 2
	.locals init (
		class [System.Core]System.Action	V_0)
	IL_0000:  ldnull 
	IL_0001:  ldftn void class Program::'<Test`1>m__1'<!!0> ()
	IL_0007:  newobj instance void class [System.Core]System.Action::'.ctor'(object, native int)
	IL_000c:  stloc.0 
	IL_000d:  ldloc.0 
	IL_000e:  callvirt instance void class [System.Core]System.Action::Invoke()
	IL_0013:  ret 
    } // end of method Program::Test


    // method line 2
    .method private static hidebysig 
           default void Test ()  cil managed 
        // Method begins at RVA 0x2058
	// Code size 37 (0x25)
	.maxstack 2
	.locals init (
		class [System.Core]System.Action	V_0)
	IL_0000:  ldsfld class [System.Core]System.Action Program::'<>f__am$cache3'
	IL_0005:  brtrue.s IL_0018

	IL_0007:  ldnull 
	IL_0008:  ldftn void class Program::'<Test>m__1'()
	IL_000e:  newobj instance void class [System.Core]System.Action::'.ctor'(object, native int)
	IL_0013:  stsfld class [System.Core]System.Action Program::'<>f__am$cache3'
	IL_0018:  ldsfld class [System.Core]System.Action Program::'<>f__am$cache3'
	IL_001d:  stloc.0 
	IL_001e:  ldloc.0 
	IL_001f:  callvirt instance void class [System.Core]System.Action::Invoke()
	IL_0024:  ret 
    } // end of method Program::Test

As an aside, if you run the generic case with SGen, it's quite a bit faster than on Boehm, because object allocation is faster.  Still nowhere near the cached case, of course.
Comment 2 Marek Safar 2011-08-23 08:28:04 UTC
I ran exactly the same .exe on Mono and .NET; I don't know how the caching could work differently on .NET.
Comment 3 Marek Safar 2011-08-23 08:31:27 UTC
Running the test with sgen takes it down to about 8 times slower than .NET.
Comment 4 Mark Probst 2011-08-23 08:35:32 UTC
Ah, so in .NET the caching is implemented in the JIT.  Do we want that?

Also, I'm curious: In which cases does csc emit a cache?  Generic vs non-generic?
Comment 5 Marek Safar 2011-08-23 08:44:56 UTC
I don't think csc uses a cache in a generic context; it's quite tricky to implement, and we don't do it either.

Maybe their JIT's generic sharing specialises the body, and that way they get the caching for free.
Comment 6 Mark Probst 2011-08-23 08:52:27 UTC
We also specialize in this case, but we don't have caching for this in the JIT, so it doesn't matter.

Why is the caching any harder than in the non-generic case?
Comment 7 Marek Safar 2011-08-23 09:03:13 UTC
For the non-generic case I can use a single field; for a generic context I'd need to somehow handle the mapping between the context and the delegate instance.

When I was considering implementing it in C#, I was thinking of a new cache type for every MVAR context, instantiated with the specific type parameters to access the <>f__am$cache field. I don't know whether this is the best approach.
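The per-MVAR-context cache Marek describes can be sketched with a nested generic holder type, so that each closed instantiation gets its own static cache field. The DelegateCache<T> name and shape here are illustrative, not what mcs actually emits:

```csharp
using System;

class Program
{
	static bool b;

	// Hypothetical cache type: the runtime gives every closed
	// instantiation (DelegateCache<int>, DelegateCache<string>, ...)
	// its own copy of this static field.
	static class DelegateCache<T>
	{
		internal static Action cached;
	}

	private static void Test<T> ()
	{
		// Allocate the delegate only on the first call per T;
		// reuse the cached instance afterwards.
		Action foo = DelegateCache<T>.cached
			?? (DelegateCache<T>.cached = () => { if (b) throw null; });
		foo ();
	}

	static void Main ()
	{
		Test<int> ();
		Test<int> ();
		// After the calls, the cache for the int instantiation is populated.
		Console.WriteLine (DelegateCache<int>.cached != null);
	}
}
```

The trade-off is one extra static field (and type) per generic context, in exchange for avoiding a delegate allocation on every call.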
Comment 8 Rodrigo Kumpera 2011-08-23 09:43:19 UTC
The problem here is that we have a much worse first-time invocation penalty than MS, because we have to hit the delegate trampoline.

If you change your code to manually cache the delegate, numbers will be quite different.

We can fix that by moving some of our caching logic into the JIT'd code. I argued we should do that back when Zoltan implemented caching, but for some reason it wasn't done.
Comment 9 Zoltan Varga 2011-08-23 10:12:37 UTC
The test case creates a delegate object every time it is called, so you are actually benchmarking GC performance.
Moving the:
Action foo = () => { if (b) throw null; };
line to a static variable makes the test case run in 50 ms instead of 5000 ms.
Comment 10 Jonathan Shore 2011-09-01 08:33:22 UTC
One could at least optimize the cases where the lambda does not capture any local context (i.e. does not use local variables or class-scoped state).

In this situation, the manual approach would be to add a field to the class to hold the delegate. The compiler could do the same, adding a field to the class to hold it. I would guess that most would favor the performance gain over the footprint expense of the additional slot.

This would avoid requiring a cache implementation for this class of closure.
Comment 11 Rodrigo Kumpera 2011-09-01 12:09:52 UTC
The compiler already does that.
Comment 12 Rodrigo Kumpera 2014-07-29 10:42:10 UTC
Hey Ludo,

Did your work help here?
Comment 13 Ludovic Henry 2014-07-29 11:06:48 UTC
On master, I get 226.4 on my machine. Unfortunately, I don't have a Windows machine, so I cannot compare it to .NET.
Comment 14 Rodrigo Kumpera 2016-04-19 08:19:56 UTC
We fixed the first-call perf issue.