Cranktrain

Hi all, no bug report or feature request here. I just wanted to share some findings that I hope will help anyone trying to get good performance in a game with hundreds of characters animating in a crowd.

I've recently converted my entire project of dozens of skeletons from SkeletonMecanim to SkeletonAnimation. I started to suspect that Mecanim had a CPU overhead that couldn't be justified. For context, my game is a spooky zoo management sim that needs to animate and render hundreds of zoo visitors wandering about, not to mention the player zookeeper, the staff of the zoo and the monsters themselves.

To confirm my hunch that SkeletonMecanim produced pretty poor performance when dealing with hundreds of skeletons in comparison to SkeletonAnimation, here's the test I made, first with SkeletonMecanim:



And then with SkeletonAnimation:



That's 625 visitor skeletons doing their walk-cycle in both tests. Yes, I know they look exactly the same, but trust me: the mode of driving the SkeletonRenderer is different! (This is with Unity 2020.1.9, by the way, with the latest Spine and URP shaders, and if you click the image above you can see my quite modest hardware listed at the bottom of the screen.)

For reference, the skeleton I'm using is pretty simple:



One not-so-simple aspect is the Animator Controller in Unity for the SkeletonMecanim characters, where there are 10 separate layers. That's a lot! And that's because I need to mix a lot of different animations on top of one another, affecting different areas of the body. If you don't have as many layers in your Animator Controller, your results may vary. With SkeletonAnimation, you mix an animation 'on top of' another one by playing it on its own track, which gives you a TrackEntry.

In any case, here are my results, courtesy of Unity's Profile Analyzer, which takes 300 frames from both tests and puts them side-by-side with helpful averages and direct comparisons:



The short of it is that the mean average of the SkeletonMecanim test is 27 frames per second, and the mean average of the SkeletonAnimation test is 67 frames per second.
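Frame rates are easier to compare as time per frame. A minimal sketch of that arithmetic in Python, using only the two mean values quoted above (the variable and function names are made up for illustration):

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate into the average time spent per frame, in milliseconds."""
    return 1000.0 / fps


mecanim_fps = 27.0    # mean of the SkeletonMecanim test
animation_fps = 67.0  # mean of the SkeletonAnimation test

mecanim_ms = frame_time_ms(mecanim_fps)
animation_ms = frame_time_ms(animation_fps)

# How many times longer a Mecanim frame takes, and how much time is saved per frame.
slowdown = mecanim_ms / animation_ms
saved_ms = mecanim_ms - animation_ms

print(f"SkeletonMecanim:   {mecanim_ms:.1f} ms per frame")    # 37.0 ms
print(f"SkeletonAnimation: {animation_ms:.1f} ms per frame")  # 14.9 ms
print(f"Slowdown factor:   {slowdown:.2f}x")                  # 2.48x
print(f"Saved per frame:   {saved_ms:.1f} ms")                # 22.1 ms
```

These are whole-frame numbers for this one test scene, so the saving includes everything that changed between the two runs, not just the animation components.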

I knew that SkeletonMecanim was going to be slower and heavier, but I didn't imagine it would be more than twice as slow in my case. (I'm guessing that's the cost of my Animator Controller having too many layers.) A deeper dive into the individual markers on the left of the screenshot above shows that SkeletonMecanim.Update and LateUpdate are both more costly than SkeletonAnimation.Update and LateUpdate. You also have to put up with the cost of the 'Director' (Unity's Mecanim agent) running PreLateUpdate.DirectorUpdateAnimationBegin and Director.PrepareFrame and so on, and then Animators.ProcessWhatever too. There's just more overhead as Mecanim churns through its tasks.

After this test, I immediately went and ripped Mecanim out of my project. While I miss the expressive way of describing how animations chain together that you get in Unity's Animator window, my actual game project now runs about 10 frames per second faster than it did before. The test was well worth running, and so was the week of work it took to migrate.

I'm posting all this here in the hope that future programmers of games with crowds in them won't bother with Mecanim at all. Please do link to it when discussions of performance come up.

Nate

Thanks for sharing! I'd expect it to be doing more work and therefore be slower, but seeing by how much is interesting. Of course, you may not have 600 copies of your character on screen at once.

Those doing mobile apps might be interested in seeing something like this on mobile, possibly with fewer skeleton instances. Mobile hardware varies a lot, so probably best to use something with roughly average performance.

Cranktrain

Nate wrote: Of course, you may not have 600 copies of your character on screen at once.
No, in total my game needs at most 400 copies of all the characters, monsters and creatures. Having 625 is good for a bit of a stress-test, and I think the results are clearer for it. As soon as I saw these numbers I felt pretty confident about putting the time and effort into migrating things over, and was pleasantly surprised that it wasn't too much trouble to do, despite there being a couple dozen different skeletons that needed controlling in a whole new way.

A thought I had earlier today was - I wonder if there's room in spine-unity's tooling for some Unity UI that's like the Mecanim Animator Controller editing window: where you get the same expressive drag-and-drop power with drawing nodes, transitions, mixing curves, add variables/parameters and so on... but it's not Mecanim-powered at all, and... auto-generates some C# that controls a SkeletonAnimation? But I imagine that's a daunting amount of engineering to do, and there's much higher priority Unity-related extensions/upgrades on the docket for you guys.

Nate

It would be pretty sophisticated. We've thought about it, but we have so many other things to get to!

One thing you can do if you have hundreds of skeletons is update some of them only every other frame. Or, if many can play the same animation, you can for example have 10 animations update 200 skeletons. It may not be noticeable, for example for a huge army marching.

Cranktrain

Nate wrote: One thing you can do if you have hundreds of skeletons, you can update some every other frame.
Yes, this has been pretty big for me - there's an instant halving of the workload, and if the resulting frame-rate is high enough then the reduced animation fidelity doesn't matter.

As part of this system, I have it measure how far a character is away from the main camera, and if they're > 50 meters then they only update one-in-four frames. Characters in the distance might only take up a small pixel window on the screen and it's really not very noticeable for me.
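A minimal sketch of that kind of distance-based throttling, written as engine-agnostic Python rather than Unity C#. The 50 meter threshold and the one-in-two / one-in-four intervals come from the post. The `Character` class, the index-based staggering, and passing the accumulated delta time to the skipped update are assumptions added for illustration, not a description of the actual project code.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

FAR_DISTANCE_M = 50.0  # beyond this distance, update less often
NEAR_INTERVAL = 2      # nearby characters: every other frame
FAR_INTERVAL = 4       # distant characters: one frame in four


@dataclass
class Character:
    index: int               # hypothetical stable id, used to stagger updates
    position: Vec3
    pending_dt: float = 0.0  # time accumulated since the last animation update
    updates: int = 0         # number of animation updates, for demonstration

    def update_animation(self, dt: float) -> None:
        # Stand-in for the real per-skeleton animation work.
        self.updates += 1


def update_interval(character: Character, camera_pos: Vec3) -> int:
    """Pick how many frames pass between animation updates for this character."""
    distance = math.dist(character.position, camera_pos)
    return FAR_INTERVAL if distance > FAR_DISTANCE_M else NEAR_INTERVAL


def tick(characters: List[Character], camera_pos: Vec3, frame: int, dt: float) -> None:
    for c in characters:
        c.pending_dt += dt
        # Offsetting by index spreads the updates across frames instead of
        # having every character update on the same frame.
        if (frame + c.index) % update_interval(c, camera_pos) == 0:
            # Pass all the skipped time so playback speed stays correct.
            c.update_animation(c.pending_dt)
            c.pending_dt = 0.0


camera = (0.0, 0.0, 0.0)
near = Character(index=0, position=(0.0, 0.0, 10.0))
far = Character(index=1, position=(0.0, 0.0, 80.0))
for frame in range(8):
    tick([near, far], camera, frame, dt=1.0 / 60.0)

print(near.updates, far.updates)  # 4 2
```

Over 8 frames the nearby character is animated 4 times and the distant one twice, which is the halving and quartering of work described above.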
Nate wrote: Or, if many can play the same animation, you can for example have 10 animations update 200 skeletons. It may not be noticeable, for example for a huge army marching.
I'd love to hear a bit more about how to implement this! Would there be a pool of 10 dummy/invisible SkeletonAnimations that are running the walking animation (at different time offsets to achieve the visual variation)... how do the 200 skeletons draw in that frame data for themselves to display? A sketch of the implementation or a code-snippet would be very helpful.

And a follow-up question, let's say one of my zoo visitors is holding an ice-cream, can I layer on my HoldingIceCream on top of the Walking animation that's been pre-calculated in the shared pool of 10? Or would I need another second pool that's Walking+HoldingIceCream already mixed together?

Harald

Thanks very much for sharing! It is indeed very interesting to see such a high difference, our tests with e.g. Raptor and one or two animation layers typically were not too different. If I interpret your profiler screenshot correctly, you can see that in the mesh generation time (happens in LateUpdate) there is not a big difference between SkeletonAnimation and SkeletonMecanim, but the difference between applying the animation (during Update) is pretty dramatic!
Cranktrain wrote: I knew that SkeletonMecanim was going to be slower and heavier, but I didn't imagine it would be more than twice as slow in my case.
It might be interesting to check how much impact the Mecanim Translator - AutoReset property has in your scenario. It would be great if you could give it a quick check, or could send us your test-asset, as a zip package to contact@esotericsoftware.com. Then we will for sure perform some improvements in this regard!
Cranktrain wrote: [..] with the latest Spine and URP shaders [..]
I'm not sure whether you mean the very latest packages, which were updated just 20 hours ago: now all URP and LWRP shaders are finally SRP batcher compatible, so you might give that a try for some free fps :). I guess you know that, but just be sure to enable SRP batching in the settings for it to take effect.

Regarding optimization:
I'm not sure it suits your scenario, but you could also consider using the RenderExistingMesh example component:
spine-unity Runtime Documentation: RenderExistingMesh

Nate

Cranktrain wrote: I'd love to hear a bit more about how to implement this! Would there be a pool of 10 dummy/invisible SkeletonAnimations that are running the walking animation (at different time offsets to achieve the visual variation)... how do the 200 skeletons draw in that frame data for themselves to display? A sketch of the implementation or a code-snippet would be very helpful.
Ignoring spine-unity for a moment, there are 3 main steps that happen each frame:

1) AnimationState update increments the time for the tracks.
2) AnimationState apply applies all the timelines for all the animations to set the local pose of a skeleton.
3) Skeleton updateWorldTransform sets up the world transforms for the local pose.

Multiple skeletons can be posed between calls to update. Though update is pretty fast, it's apply and updateWorldTransform that take some time.

To skip some of these steps, you could update and apply an AnimationState to a Skeleton and call updateWorldTransform, then render that same skeleton multiple times in different locations, for different GameObjects. Harald can help with the details of how to achieve the rendering for Unity.
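The idea of posing a few skeletons and drawing each of them many times can be sketched as follows. This is engine-agnostic Python with stand-in classes, not the Spine runtime API: `SharedPose`, `Instance`, `build_crowd` and `render_frame` are hypothetical names, the walk-cycle length is an assumed value, and a single number stands in for the result of apply plus updateWorldTransform.

```python
from dataclasses import dataclass
from typing import List, Tuple

WALK_DURATION = 1.0  # assumed length of the walk cycle, in seconds


class SharedPose:
    """One animated skeleton whose computed pose is reused by many instances."""

    def __init__(self, time_offset: float) -> None:
        self.time = time_offset
        self.phase = 0.0       # stand-in for the computed world transforms
        self.apply_calls = 0   # counts the expensive work, for demonstration

    def step(self, dt: float) -> None:
        # 1) update: advance the track time (cheap).
        self.time = (self.time + dt) % WALK_DURATION
        # 2) apply and 3) updateWorldTransform: the expensive part,
        # done once per shared pose rather than once per character.
        self.phase = self.time / WALK_DURATION
        self.apply_calls += 1


@dataclass
class Instance:
    position: Tuple[float, float]
    pose: SharedPose  # which pooled pose this character displays


def build_crowd(pool_size: int, crowd_size: int) -> Tuple[List[SharedPose], List[Instance]]:
    # Different time offsets give visual variation between pooled poses.
    pool = [SharedPose(i * WALK_DURATION / pool_size) for i in range(pool_size)]
    crowd = [
        Instance((float(i % 20), float(i // 20)), pool[i % pool_size])
        for i in range(crowd_size)
    ]
    return pool, crowd


def render_frame(pool: List[SharedPose], crowd: List[Instance], dt: float):
    for pose in pool:
        pose.step(dt)
    # Each instance draws the already-computed pose at its own location.
    return [(inst.position, inst.pose.phase) for inst in crowd]


pool, crowd = build_crowd(pool_size=10, crowd_size=200)
draws = render_frame(pool, crowd, dt=1.0 / 60.0)
print(len(draws), sum(p.apply_calls for p in pool))  # 200 10
```

Here 200 characters are drawn from 10 pose evaluations per frame. Anything that differs per character, such as attachments or slot colours, is not covered by this sketch.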

Harald

In spine-unity all three of the above methods (or their respective counterparts in SkeletonMecanim) are called in Update of SkeletonAnimation / SkeletonMecanim. Since you are seeing large differences between the two components, I would assume that apply is the major part in your case. updateWorldTransform is called only once in both SkeletonAnimation and SkeletonMecanim, without any difference.

Cranktrain

Thanks guys, I'm going to dig into implementing a shared skeleton for Apply, and run some tests there.
Harald wrote: It might be interesting to check how much impact the Mecanim Translator - AutoReset property has in your scenario.
I hadn't actually seen that whole Mecanim Translator section in the Inspector before, I'm assuming that's from an update in recent months. 'AutoReset' is On by default, but toggling it off and on and running my same test I don't see any difference at all in frame rates, unless it's very slight. But I'm just running that in the editor and not going through the effort of building a player, running the profiler and comparing them in the UI, mainly because I'm personally not going to go back to Mecanim.
Harald wrote: I'm not sure whether you mean the very latest packages, which were updated just 20 hours ago: now all URP and LWRP shaders are finally SRP batcher compatible, so you might give that a try for some free fps :). I guess you know that, just be sure to enable SRP batching in the settings for it to take effect.
:o

Very excited by this! I'm already using the SRP batcher for everything else in my game that uses the out-of-the-box URP Lit shader, but have seen in the Frame Debugger that there's no grouping of Spine renderers.

Are there any limitations with this? My Material looks like this:



That's a shared material between all my Zoo Visitors, and I vertex colour all the slots to give different hair, skin and clothing colours - is this all supported by the SRP batcher?

If so, I can't wait to try it.
Harald wrote: I'm not sure it suits your scenario, but you could also consider using the RenderExistingMesh example component:
spine-unity Runtime Documentation: RenderExistingMesh
Probably not for my project - I took a look at the script and it's copying the whole MeshFilter.sharedMesh - which would lose all the per-zoo-visitor unique slot settings.

... wait, hang on, I'm realising that I can't call state.Apply(skeleton) multiple times with a shared skeleton, because that's going to end up with all the slot images and colours the same, with no unique settings possible, right? Because I'm calling skeleton.SetAttachment(...) and slot.SetColor(...). I don't think that route is going to be possible for me, sadly :(

Harald

Cranktrain wrote: I hadn't actually seen that whole Mecanim Translator section in the Inspector before, I'm assuming that's from an update in recent months.
Actually no, this section has been around for a very long time. But we are happy that you haven't needed to tweak any settings there yet and that the defaults were reasonable enough ;).
Cranktrain wrote: 'AutoReset' is On by default but toggling it off and on and running my same test I don't see any difference at all in frame rates, unless it's very slight.
This is interesting. Thanks very much for giving that a try! So my bet was off, and I'm afraid I cannot offer a quick performance patch.
Cranktrain wrote: That's a shared material between all my Zoo Visitors, and I vertex colour all the slots to give different hair, skin and clothing colours - is this all supported by the SRP batcher?

If so, I can't wait to try it.
Yes, vertex color is especially "easy" to batch, as the vertex data is basically just copied over into a larger vertex buffer. But the SRP batcher should even be able to handle different material settings (color values, floating point parameters, etc), the main requirement is that all assigned textures are the same. (Under the hood the SRP batcher most likely creates an array of the material parameters as constant buffers and then just needs to track a material index per batch element.)
Cranktrain wrote: Probably not for my project - I took a look at the script and it's copying the whole MeshFilter.sharedMesh - which would lose all the per-zoo-visitor unique slot settings
If you need individual slot attachments, then the RenderExistingMesh approach is definitely not an option.
Cranktrain wrote: I'm realising that I can't call state.Apply(skeleton) multiple times with a shared skeleton; because that's going to end up with all the slot images and colours the same, with no unique settings possible, right?
Having a shared skeleton will unfortunately not be an option when you need different attachments assigned at each one.
The other way around, having a shared AnimationState and having shared Skeleton.Bones would theoretically be possible, but it requires some effort to be implemented correctly and to not forget anything (e.g. via writing a subclass derived from SkeletonAnimation that overrides at least the Update and LateUpdate calls).
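Conceptually, the split described here keeps the animated bone pose in one shared place while every character keeps its own slot attachments and colours. The following Python sketch only illustrates that data layout under made-up names (`SharedBones`, `Visitor`, `SLOT_TO_BONE`, `build_draw_list`). It is not spine-unity code and it leaves out everything a real subclass of SkeletonAnimation would have to handle.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Color = Tuple[float, float, float, float]  # RGBA

# Hypothetical mapping from each slot to the bone it follows.
SLOT_TO_BONE: Dict[str, str] = {"hair": "head", "shirt": "torso"}


@dataclass
class SharedBones:
    """Bone world transforms computed once and shared by many characters."""
    world: Dict[str, float]  # bone name -> stand-in for a world transform


@dataclass
class Visitor:
    """Per-character state that must NOT be shared."""
    attachments: Dict[str, str]  # slot name -> attachment name
    colors: Dict[str, Color]     # slot name -> vertex colour


def build_draw_list(bones: SharedBones, visitor: Visitor) -> List[Tuple[str, str, Color, float]]:
    # Combine the shared pose with this visitor's own attachments and colours.
    return [
        (slot, visitor.attachments[slot], visitor.colors[slot], bones.world[bone])
        for slot, bone in SLOT_TO_BONE.items()
    ]


walk_pose = SharedBones(world={"head": 1.5, "torso": 1.0})
alice = Visitor(
    attachments={"hair": "hair_long", "shirt": "shirt_plain"},
    colors={"hair": (0.4, 0.2, 0.1, 1.0), "shirt": (1.0, 0.0, 0.0, 1.0)},
)
bob = Visitor(
    attachments={"hair": "hair_short", "shirt": "shirt_plain"},
    colors={"hair": (0.1, 0.1, 0.1, 1.0), "shirt": (0.0, 0.0, 1.0, 1.0)},
)

alice_draws = build_draw_list(walk_pose, alice)
bob_draws = build_draw_list(walk_pose, bob)
print(alice_draws[0][1], bob_draws[0][1])  # hair_long hair_short
```

Both draw lists use the same bone transforms, but each keeps its own attachments and colours, which is the property a fully shared skeleton cannot provide.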

We also have this performance improvement ticket on the roadmap:
https://github.com/EsotericSoftware/spine-runtimes/issues/1348
We would love to get to implement this, it's just that some important tasks and bugfixes are delaying it.

Cranktrain

I've downloaded the latest URP shaders but I can't seem to get batching to work. My scene is the same one as in my first post with the 625 zoo visitors marching in rows, but when I open up the Frame Debugger I'm seeing 625 'Draw Mesh' instructions in RenderLoop.Draw, rather than the 'SRP Batch' like I'm used to seeing for the other Lit shaders in my other game scenes.

I have the SRP Batcher Profiler turned on and I can see each frame is spending 0.00ms in SRP Batcher Code Path.

When I click on the first 'Draw Mesh' entry in the Frame Debugger, I see text that says:



Suggesting that the shader isn't yet compatible with the Batcher?

When I click on any other entry from the second Draw Mesh command onwards I see a different message:



But GPU Instancing != SRP Batching.

I've double-checked the URP Shader version, the package is: com.esotericsoftware.spine.urp-shaders-3.8-Unity2019.3-2020-12-02.zip - What's going on here?

Harald

Strangely, we could not reproduce your problem, neither in Unity 2019.4 nor in 2020.1. We noticed that you have the keyword INSTANCING_ON set on your material, which is not recommended when you don't want to use GPU instancing. It might be a leftover from a previously set material. You could remove it via the debug menu in the Material Inspector by removing the string from Keywords. Interestingly, however, we still could not reproduce the problem this way: it still shows a single SRP Batch in the frame debugger.

Could you please check what the shader reads under SRP compatible when you select the material's shader via the Material Inspector context menu "Select Shader"? With the shader selected, it should read SRP batcher compatible. Perhaps the shader import did not recompile the shaders accordingly. Could you please test whether Reimport on the Spine Universal RP Shaders/Shaders directory has any effect?

Cranktrain

Hmm, okay, so firstly INSTANCING_ON is not in the Material's 'Shader Keywords' list, the contents of which are:
_ALPHAPREMULTIPLY_ON _ALPHA_CLIP _FIXED_NORMALS_VIEWSPACE _RIM_LIGHTING
I expected to find it there because, as you suggested, there was a different Shader previously set on this material. But importing the skeleton from Spine into a different folder and letting it set up a new material from scratch doesn't seem to fix the issue. Somehow the Shader itself has a problem:



Firstly, there's the hint that Unity gives:
Material property is found in another cbuffer than "UnityPerMaterial" (_RimPower)
Not sure what that's about, if you can't reproduce it?

Secondly, I'm seeing INSTANCING_ON as a global keyword in the list on the shader:



Not sure which is the problem here, or what to do about either of them.

Oh, and reimporting all the Spine shaders doesn't change the not compatible label.

---

Oh okay, it looks like Rim Lighting just produces a shader variant that isn't supported by the SRP Batcher, for whatever reason.

If I untick 'Rim Lighting' on the Material, and reimport the Spine shader, then it says 'compatible'. Can you confirm that ticking 'Rim Lighting' causes incompatibility with the SRP Batcher?

Harald

Thanks very much for your investigations! For some reason Unity displayed 'compatible' for the shader, although this was not true for all parameters. This issue has just been fixed, so Rim Lighting should now also be SRP batcher compatible.

New 3.8 LWRP and URP shader extension packages can be downloaded here as usual:
https://esotericsoftware.com/files/runtimes/unity/com.esotericsoftware.spine.urp-shaders-3.8-Unity2019.3-2020-12-09.zip
Thanks for reporting!

Cranktrain

Excellent! Thanks Harald, I've updated and can confirm my game is now batching my Spine shaders.

Some performance results: On the test with 625 characters I was talking about earlier in the thread, the Frame Debugger reports an improvement from 3136 batches... down to 26. Oddly enough, my frame rate is actually lower - 49 down to 44 - in this test. Perhaps there's some overhead from the batching that lands on the CPU, but I'm also just testing this in the Editor with all the EditorLoop overhead meaning this test might not be accurate, so I'm not paying much attention to that.

In the real world of my actual game, built as a standalone player, I'm seeing ~800 batches reduced to ~430. When it comes to frame rate, I'm not seeing much difference, maybe a bit more consistency but definitely no jump upwards - oh well, perhaps batching just isn't so much of an issue on my hardware. I'll have to give it a go on my 2015 Macbook Air and see if batching helps with performance there.

To go back to your post a few days ago:
Harald wrote: We also have this performance improvement ticket on the roadmap:
https://github.com/EsotericSoftware/spine-runtimes/issues/1348
We would love to get to implement this, it's just that some important tasks and bugfixes are delaying it.
Yeah, this is big for me, I remember talking about this two and a bit years ago in this thread. Between then and now I've moved some of the heavier duty routines in my game that could be more easily converted over to DOTS, and found the speedup from Burst-compiled jobs pretty crazy. I'm not in a place to move my entire game over to DOTS, but it's definitely the future - if the guts of spine-unity can just run each skeleton update/late update as a Job on a different core Worker then I think that's going to be many times faster than things are now. The API changes have really settled down into something a bit more manageable to support too.

When I have 300-400 Spine characters doing their thing in my game, SkeletonAnimation.Update + SkeletonAnimation.LateUpdate takes up 20% of every frame. Back when I was using SkeletonMecanim it was 30%+, but I really do think making things run on DOTS is going to be a massive win for performance.

When you're ready to publish a beta package of a DOTS Spine runtime, I'm going to be first in line to test it!

Harald

Thanks for the feedback, very much appreciated! Unfortunately, using the SRP batcher is not always beneficial.
Cranktrain wrote: [..] if the guts of spine-unity can just run each skeleton update/late update as a Job on a different core Worker then I think that's going to be many times faster than things are now.
Yes, this is our plan. We are really looking forward to implementing the parallelization task, we hope to get to it soon after some more required 4.0 changes.
Cranktrain wrote: When you're ready to publish a beta package of a DOTS Spine runtime, I'm going to be first in line to test it!
:nerd: Thanks, we will let you know for sure!

foriero

Look forward to it Harald. We will put it in use extensively in our next game.
Founder & CEO Foriero s.r.o.
https://studio.foriero.com