-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in wrap/unwrap physical device object #1645
base: main
Are you sure you want to change the base?
Conversation
The Vulkan loader wraps VKPhysicalDevice into a loader_physical_device_tramp structure and returns it to the application. This is done by the function setup_loader_tramp_phys_devs. When the application passes a VKPhysicalDevice to the Vulkan loader, it is internally treated as a pointer of loader_physical_device_tramp. The loader then extracts the actual VKPhysicalDevice and forwards it to the ICD driver. However, there is an issue in the function terminator_CreateDevice. Instead of converting the VKPhysicalDevice back into a loader_physical_device_tramp, it incorrectly converts it to a loader_physical_device_term. This mismatch causes a crash in Vulkan CTS during the test case 'dEQP-VK.api.object_management.single.device_group'.
Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build. |
1 similar comment
Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build. |
|
CI Vulkan-Loader build queued with queue ID 361323. |
CI Vulkan-Loader build # 2884 running. |
CI Vulkan-Loader build # 2884 passed. |
This is where I believe this PR is currently mistaken. The loader doesn't pass down a pointer to When I tested the change locally, I get a crash in the exact same place that CI does. Does that mean that this PR isn't needed? No, I haven't finished building CTS in order to reproduce the crash you are describing. The Vulkan-Loader could very well have bad test code which explains the crash in the loader tests. |
I have finished building CTS but cannot reproduce the crash. However I am getting crashes when I use --deqp-vk-device-id= with an index greater than 1 regardless of whether I use your PR or main. The culprit appears to be that the VkPhysicalDevice vector only has 1 element yet I am trying to do index into the 2nd or 3rd element, which is out of bounds. My system has an AMD Rx 480 & Intel UHD 630, as well as lavapipe. What vendor & gpu are you seeing crashes with, that way I can better reproduce it. |
Thanks for the feedback :) the test case 'dEQP-VK.api.object_management.single.device_group' works like this: it calls vkEnumeratePhysicalDeviceGroups to get a pointer of some phydevice object and fill the pointer in VkDeviceGroupDeviceCreateInfo, then pass it to vkCreateDevice. in terminator_CreateDevice, it handles group devices like this:
So it converts the pointer to struct loader_physical_device_term. in trampoline.c, if you check function vkEnumeratePhysicalDeviceGroups,
so you can see it converts phys-devs to loader_physical_device_tramp, but in terminator_CreateDevice it convert the pointer to loader_physical_device_term. I made a simple change in terminator_CreateDevice to dump out some information:
and here is the output I got:
you can see the icd-phy-dev was converted to 0x10aded020210aded instead of 0xaaaab1cca920, this is because the offset of phys_dev in loader_physical_device_term is 16, but in real memory, offset 16 points to magic in loader_physical_device_tramp, which is always the fixed value PHYS_TRAMP_MAGIC_NUMBER (0x10aded020210aded).
I am not very familiar with the internal design of vk loader, not sure what's the difference between loader_physical_device_tramp and loader_physical_device_term, but it seems like a simple logic error in terminator_CreateDevice, if me know if I made anything wrong. Thanks. |
Hi @charles-lunarg |
I've had some time to review the loader source code, around device enumeration and device group enumeration. The logic of the loader is to 'double-wrap' the VkPhysicalDevice handles returned by ICD's. Both for normal EnumeratePhysicalDevices and EnumeratePhysicalDeviceGroups. When Create Device is called, it unwraps the application's VkPhysicalDevice handle to get a loader_physical_device_tramp. The loader calls down the layer dispatch chain to reach terminator_CreateDevice, where it unwraps the handle (which is a loader_physical_device_tramp) to get the ICD's VkPhysicalDevice handle and passes that into the driver's vkCreateDevice function. This PR undoes this double wrapping, hence the failing tests. What I can't be sure of is that there isn't a bug in the loader where this double-wrapping & unwrapping is failing in some what that I wasn't able to recreate in the loader tests. Is there more information about the hardware configuration that fails CTS? Specifics of the driver and what it returns? Does the driver support device groups? The loader implements device groups for drivers that do not support it (and pretend there is just 1 group for each VkPhysicalDevice that contains just he VkPhysicalDevice). |
@charles-lunarg thanks for your feedback, I did more debugging with the crash and found it only happened with release build. In debug build, the VkPhysicalDevice handle is unwrapped correctly and passed to icd vkCreateDevice. I tried removing -O2 in release build and the crash issue went away. @charles-lunarg did you ever hit such an optimisation issue in vkloader before? And last question is about the double-wrap, I only see one wrap in vk loader code(VkPhysicalDevice handle to loader_physical_device_tramp), but I notice in trampoline.c, in function vkEnumeratePhysicalDeviceGroups, it calls Thanks |
ok, I found the function that caused the compiler bug. Is that OK to disable optimisation for function loader_create_device_chain for non-apple platforms ? what's the root cause that this function caused the optimisation issue? Thanks |
Vulkan CTS crashes when running test case dEQP-VK.api.object_management.single.device_group. The root cause is a compiler issue when optimization (-O2) is enabled. The problem occurs on embedded Linux (aarch64) using gcc aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2020.08) 9.3.0
Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build. |
1 similar comment
Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build. |
I have been looking into this, and what the test framework does differently than the 'real world' is return the same VkPhysicalDevice handles from both vkEnumeratePhysicalDevices and vkEnumeratePhysicalDeviceGroups. I will see if fixing the test framework to not do that reveal subtly wrong loader behavior, which matches up with your description. I'd really preferably have a test that forces this issue to exist so a fix can "fix it" once and for all, since this is a really difficult part of the codebase to follow. |
The Vulkan loader wraps VKPhysicalDevice into a loader_physical_device_tramp structure and returns it to the application. This is done by the function setup_loader_tramp_phys_devs.
When the application passes a VKPhysicalDevice to the Vulkan loader, it is internally treated as a pointer of loader_physical_device_tramp. The loader then extracts the actual VKPhysicalDevice and forwards it to the ICD driver.
However, there is an issue in the function terminator_CreateDevice. Instead of converting the VKPhysicalDevice back into a loader_physical_device_tramp, it incorrectly converts it to a loader_physical_device_term.
This mismatch causes a crash in Vulkan CTS during the test case 'dEQP-VK.api.object_management.single.device_group'.