Recently, Julie Lerman contacted me about some strange SQL that Entity Framework Core (EF Core) was generating. Now, EF Core is fairly new and works with Microsoft’s newest framework .NET Core. I haven’t had a chance to get my hands on EF Core yet so I was excited to see what it was doing. And of course when I saw the SQL my jaw hit the floor. Here’s an example of the SQL she sent me to accomplish an insert:
exec sp_executesql N'SET NOCOUNT ON;
DECLARE @toInsert0 TABLE ([IsPointy] bit, [Name] varchar(250), [_Position] [int]);
INSERT INTO @toInsert0
VALUES (@p0, @p1, 0),
       (@p2, @p3, 1),
       (@p4, @p5, 2);

DECLARE @inserted0 TABLE ([WeaponId] int, [_Position] [int]);
MERGE [Weapons] USING @toInsert0 AS i ON 1=0
WHEN NOT MATCHED THEN
    INSERT ([IsPointy], [Name]) VALUES (i.[IsPointy], i.[Name])
OUTPUT INSERTED.[WeaponId], i._Position
INTO @inserted0;

SELECT [t].[WeaponId]
FROM [Weapons] t
INNER JOIN @inserted0 i ON ([t].[WeaponId] = [i].[WeaponId])
ORDER BY [i].[_Position];
',N'@p0 bit,@p1 varchar(250),@p2 bit,@p3 varchar(250),@p4 bit,@p5 varchar(250)',@p0=1,@p1='Katana',@p2=1,@p3='Kama',@p4=0,@p5='Tessen'
If you’re an experienced SQL tuner, you’ll notice some issues with this statement. First off, the query uses not one but two table variables. It’s generally better to use temp tables because table variables don’t have good statistics by default. Second, the statement uses MERGE. The MERGE statement has had more than its fair share of issues. See Aaron Bertrand’s post “Use Caution with SQL Server’s MERGE Statement” for more details on those issues.
But that got me wondering: why would the EF team use SQL features that perform so poorly? So I decided to take a closer look at the SQL statement. Just so you know, the code that was used to generate the SQL saves three entities (Katana, Kama, and Tessen) to the database in a batch. (Julie used a samurai theme, so I just continued with it.)
A Deeper Look
SET NOCOUNT ON;
DECLARE @toInsert0 TABLE ([IsPointy] bit, [Name] varchar(250), [_Position] [int]);
INSERT INTO @toInsert0
VALUES (@p0, @p1, 0),
       (@p2, @p3, 1),
       (@p4, @p5, 2);
DECLARE @inserted0 TABLE ([WeaponId] int, [_Position] [int]);
MERGE [Weapons] USING @toInsert0 AS i ON 1=0
WHEN NOT MATCHED THEN
    INSERT ([IsPointy], [Name]) VALUES (i.[IsPointy], i.[Name])
OUTPUT INSERTED.[WeaponId], i._Position
INTO @inserted0;
SELECT [t].[WeaponId]
FROM [Weapons] t
INNER JOIN @inserted0 i ON ([t].[WeaponId] = [i].[WeaponId])
ORDER BY [i].[_Position];
With a bit of testing, I realized that this statement isn’t only used for batch requests, but for any save with more than one entity. Also, it appears that the limit per statement is 2,000 attributes. For example, if 6,500 attributes are saved to the database, EF Core will create four of these statements: three statements of 2,000 attributes and one statement of 500 attributes. The number of entities saved per statement depends on the number of attributes being saved for each entity. In this case, 1,000 entities were saved per statement for a total of 3,250 entities saved to the database.
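As a concrete sketch of that arithmetic (my illustration, not EF Core’s actual generated SQL): each row loaded into the table variable consumes one parameter per saved column, so with roughly 2,000 parameters available and two saved columns (IsPointy and Name), a batch tops out near 1,000 rows:

```sql
-- Hypothetical sketch: one parameter per saved column, per row.
-- Two columns x 1,000 rows = 2,000 parameters, the per-statement cap.
INSERT INTO @toInsert0
VALUES (@p0,    @p1,    0),
       (@p2,    @p3,    1),
       -- ...997 more rows elided...
       (@p1998, @p1999, 999);
```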
Rationalizing It All
Now that we’ve taken a look at the generated SQL, shouldn’t we be enraged? Wouldn’t we rip our devs a new one if we saw table variables, merge statements, and unnecessary sorts in their SQL? But let’s take a look at the core problem that the EF team is trying to solve: performance. Yea, I said it. In previous versions of Entity Framework, data was inserted one statement at a time. So if you needed to insert 5,000 rows into a table, you got 5,000 insert statements. This was necessary because the application needed the SQL-Server-generated identity value. But with this new approach, you could insert 5,000 rows of data with one statement.
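For contrast, here’s a rough sketch of what that older row-at-a-time pattern looks like (my approximation of the idea, not EF6’s exact generated SQL):

```sql
-- Hypothetical sketch of the row-at-a-time approach: one round trip
-- per entity, reading back the generated identity value each time.
INSERT INTO [Weapons] ([IsPointy], [Name]) VALUES (@p0, @p1);
SELECT SCOPE_IDENTITY();
-- ...repeated once per entity, so 5,000 rows means 5,000 statements.
```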
But wouldn’t it be faster with temp tables? Sure! But you can’t guarantee a temp table name would be unique in your session (because you may have done other things along the way.) You could name a temp table with a guid, but there’s no guarantee that a guid is unique either. So the safest route would be a table variable and not a temp table.
But why use the MERGE statement if it’s so buggy? I think this goes back to the performance question. We could use n number of inserts but that’s not solving the performance problem that we originally had. In short, I think I’m OK with the MERGE statement. It’s only being used for inserts, and it’s faster than 5,000 insert statements.
But what about that ORDER BY in the SELECT statement? You kinda got me there. They could do the sorting in application code, but they went the SQL route instead. Still, it’s not so bad: since a new SQL statement is created for every batch of at most 2,000 parameters, SQL Server will only be sorting about 1,000 rows at a time.
To decide if it’s a bad thing, let’s compare the older version of Entity Framework (EF 6) with the new version (EF Core). In my totally unscientific tests, saving 10,000 entities to the database using EF 6 took 4.06 seconds, while saving the same 10,000 entities with EF Core took 1.46 seconds. So yes, the SQL isn’t perfect, but it’s nearly three times as fast.
Brent says: hoo, boy. This is…not ideal.
Comments
The use of MERGE here is almost certainly because INSERT doesn’t allow you to OUTPUT any columns that are not from the INSERTED virtual table.
Using MERGE allows you to do exactly this – OUTPUT the generated IDENTITY values along with data from the source (@toInsert0) thus providing a way to link the new rows to the source data.
There is not likely to be a performance issue from the use of TABLE variables here as the row count is so low.
It is a fairly common pattern in my experience.
Mister Magoo – we’re less concerned with the table variable, and more concerned with this hot mess:
https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/
If Microsoft wants to start relying on the MERGE statement, then I surely hope they fix the bugs the community has pointed out. Some of those bugs will give EF users some serious heartburn.
Hi Brent, Thanks for the link, but I can’t see any relevant bugs in that list for an INSERT only pattern of this type, but maybe I’m misunderstanding them. Most of the bugs seem to be related to multiple operations in one MERGE or UPDATE/DELETE actions? If you are aware of a relevant bug for a simple INSERT with OUTPUT I would be interested to read it.
I agree that the bugs make using MERGE for anything else a big issue, but it seems like it not only works OK for a simple INSERT pattern, but is possibly the best option as it allows you to OUTPUT columns from the source data alongside columns from INSERTED – exactly how it works here.
Mister Magoo – when you read this, imagine me having a smile on my face as I say these words. (When I say something like this, sometimes I need to be really clear that I’m having some fun.)
MERGE was introduced to accomplish the common upsert problem in data warehouses where you need to do inserts, updates, and deletes in a single operation. As we started using it, we ran across so many bugs that most of us hit the eject button. It didn’t work for the thing it was designed to do.
Now, you’re coming along and saying, “Yes, but it probably works fine to do something it wasn’t designed to do.”
Okay. I’ll make my bets, you make yours. Good luck with that.
Also with a large grin…
I understand.
Great post! You are mostly right about the motivation, and good point that we could possibly do the ordering on the client (I will create an issue to investigate feasibility and measure the impact).
We actually decided to adopt this MERGE-based solution after hitting a few issues with our previous (simpler) batching implementations:
– https://github.com/aspnet/EntityFramework/issues/4080 Order of IDENTITY values wasn’t deterministic
– https://github.com/aspnet/EntityFramework/issues/4478 Batch inserts don’t work with uncomparable columns
Thanks for the insight, Diego. I’ve been following EF since the alpha of version 1 (now version 4). I actually put an application into production with the EF 1 beta. Since this is a blog for data professionals, I wanted to give a bit of understanding of the challenges that EF is trying to solve. Even though MERGE has its problems, the performance bump is impressive. I’m looking forward to playing with EF Core even more and seeing what other nuggets it stores inside.
Were Table Value Constructors[1] considered instead of the temporary @toInsert0 table variable? I have had good results with this technique in the past and would be interested in your benchmarking results. Something like (apologies for the formatting and any syntax errors as I am presently mobile):
MERGE [Weapons] USING (VALUES
(@p0, @p1, 0),
(@p2, @p3, 1),
(@p4, @p5, 2)
) AS i ([IsPointy], [Name], [_Position]) ON 1=0
WHEN NOT MATCHED THEN
INSERT ([IsPointy], [Name])
VALUES (i.[IsPointy], i.[Name])
OUTPUT INSERTED.[WeaponId], i._Position
INTO @inserted0;
[1]: https://docs.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql
I am not sure we tried this Zac. By the way, we are adding feedback from this thread in https://github.com/aspnet/EntityFramework/issues/8415. Keep it coming if you have more.
We however did try something similar with multi-row INSERT and it turned out that order wasn’t deterministic. That is what https://github.com/aspnet/EntityFramework/issues/4080 is about.
Do you have the c# code that generated this? I’m curious to see what caused this.
Yet another reason I steer clear of all flavours of EF and use a micro ORM.
I use a single bare-metal, single-file one (which should tell you which it is), and that supports most databases you can name without the fancy syntax that isn’t worth the limitations. When that fails me, I’ll rewrite the guts into a nano ORM of my own design and port and reuse that code ad infinitum from application to application.
My problems are solved and yours can be too! 😉
Isn’t there a limit of ~2000 parameters in a query? Are the 2000 rows per batch only if there is one value per row?
Great catch. I mistook the number of variables for the number of rows. Whoops. It does limit the batch to the number of variables. In this case, 1,000 rows were loaded, which makes perfect sense since we’re only inserting the Name and IsPointy values. I’ll update the post to reflect this.
Thanks!
After reading this, I found myself wondering why we couldn’t just avoid the temp tables altogether and do a single insert instead of a merge, outputting the multiple autogenerated IDs right from that insert. The answer seems to be that although SQL Server typically inserts the data in the same order as the provided insert statement and typically outputs the ids in that order as well, those behaviors are explicitly not guaranteed. The merge statement is needed to ensure the generated Ids can be matched back up with the correct entities. More detail here:
https://dba.stackexchange.com/questions/155735/is-it-safe-to-rely-on-the-order-of-an-inserts-output-clause
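To make the trade-off concrete, here’s a rough sketch of the simpler alternative that comment describes, and why it isn’t safe (my illustration, not code from the article):

```sql
-- Hypothetical sketch: a plain multi-row INSERT with OUTPUT.
-- The OUTPUT clause can only return columns from the INSERTED virtual
-- table, and SQL Server does not guarantee the output rows come back
-- in the same order as the VALUES list -- so there is no reliable way
-- to match each generated WeaponId back to its source entity.
INSERT INTO [Weapons] ([IsPointy], [Name])
OUTPUT INSERTED.[WeaponId]
VALUES (@p0, @p1),
       (@p2, @p3),
       (@p4, @p5);
```

MERGE sidesteps both problems: its OUTPUT clause may reference source columns, so EF Core can emit the _Position value alongside each generated identity and join the results back deterministically.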
Generally speaking, I would recommend against the MERGE statement. https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/
Yes, I know you do because you already put that link in your article above and repeated it in the comments. But recall that the topic of the article is *entity framework’s* use of merge for bulk inserts.
The intention of my comment was to explain why EF uses [multiple queries, including the controversial merge, and 2 table variables] to accomplish what seems like it could be done via a single insert statement that just outputs the generated ids. The answer is so that it can match the generated ids back up with the inserted entities.
I just wanted to spare other readers from needing to dig around elsewhere on Google to find that answer, especially since Julie Lerman is still linking to this article from Pluralsight.