Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in wrap/unwrap physical device object #1645

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

LeoXu240
Copy link

The Vulkan loader wraps VKPhysicalDevice into a loader_physical_device_tramp structure and returns it to the application. This is done by the function setup_loader_tramp_phys_devs.
When the application passes a VKPhysicalDevice to the Vulkan loader, it is internally treated as a pointer of loader_physical_device_tramp. The loader then extracts the actual VKPhysicalDevice and forwards it to the ICD driver.

However, there is an issue in the function terminator_CreateDevice. Instead of converting the VKPhysicalDevice back into a loader_physical_device_tramp, it incorrectly converts it to a loader_physical_device_term.

This mismatch causes a crash in Vulkan CTS during the test case 'dEQP-VK.api.object_management.single.device_group'.

The Vulkan loader wraps VKPhysicalDevice into a loader_physical_device_tramp
structure and returns it to the application. This is done by the function
setup_loader_tramp_phys_devs.
When the application passes a VKPhysicalDevice to the Vulkan loader, it is
internally treated as a pointer of loader_physical_device_tramp. The loader
then extracts the actual VKPhysicalDevice and forwards it to the ICD driver.

However, there is an issue in the function terminator_CreateDevice. Instead of
converting the VKPhysicalDevice back into a loader_physical_device_tramp,
it incorrectly converts it to a loader_physical_device_term.

This mismatch causes a crash in Vulkan CTS during the test case
'dEQP-VK.api.object_management.single.device_group'.
@ci-tester-lunarg
Copy link

Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build.

1 similar comment
@ci-tester-lunarg
Copy link

Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build.

@CLAassistant
Copy link

CLAassistant commented Jan 30, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ci-tester-lunarg
Copy link

CI Vulkan-Loader build queued with queue ID 361323.

@ci-tester-lunarg
Copy link

CI Vulkan-Loader build # 2884 running.

@ci-tester-lunarg
Copy link

CI Vulkan-Loader build # 2884 passed.

@dgkoch dgkoch requested a review from charles-lunarg January 31, 2025 19:26
@charles-lunarg
Copy link
Collaborator

charles-lunarg commented Jan 31, 2025

Instead of converting the VKPhysicalDevice back into a loader_physical_device_tramp, it incorrectly converts it to a loader_physical_device_term.

This is where I believe this PR is currently mistaken. The loader doesn't pass down a pointer to loader_physical_device_tramp, it passes whatever came up the vkEnumeratePhysicalDevices call chain, which starts as loader_physical_device_term but may be wrapped by a layer. That is my intuition of the situation, but I cannot be certain as the loader's logic here is very convoluted (as I'm sure you are well aware after making this PR).

When I tested the change locally, I get a crash in the exact same place that CI does. Does that mean that this PR isn't needed? No, I haven't finished building CTS in order to reproduce the crash you are describing. The Vulkan-Loader could very well have bad test code which explains the crash in the loader tests.

@charles-lunarg
Copy link
Collaborator

I have finished building CTS but cannot reproduce the crash.

However I am getting crashes when I use --deqp-vk-device-id= with an index greater than 1 regardless of whether I use your PR or main. The culprit appears to be that the VkPhysicalDevice vector only has 1 element yet I am trying to do index into the 2nd or 3rd element, which is out of bounds.

My system has an AMD Rx 480 & Intel UHD 630, as well as lavapipe.

What vendor & gpu are you seeing crashes with, that way I can better reproduce it.

@LeoXu240
Copy link
Author

LeoXu240 commented Feb 1, 2025

Instead of converting the VKPhysicalDevice back into a loader_physical_device_tramp, it incorrectly converts it to a loader_physical_device_term.

This is where I believe this PR is currently mistaken. The loader doesn't pass down a pointer to loader_physical_device_tramp, it passes whatever came up the vkEnumeratePhysicalDevices call chain, which starts as loader_physical_device_term but may be wrapped by a layer. That is my intuition of the situation, but I cannot be certain as the loader's logic here is very convoluted (as I'm sure you are well aware after making this PR).

When I tested the change locally, I get a crash in the exact same place that CI does. Does that mean that this PR isn't needed? No, I haven't finished building CTS in order to reproduce the crash you are describing. The Vulkan-Loader could very well have bad test code which explains the crash in the loader tests.

Thanks for the feedback :)
The crash happened on embedded linux with nvidia tegra chip.

the test case 'dEQP-VK.api.object_management.single.device_group' works like this:

it calls vkEnumeratePhysicalDeviceGroups to get a pointer of some phydevice object and fill the pointer in VkDeviceGroupDeviceCreateInfo, then pass it to vkCreateDevice.

in terminator_CreateDevice, it handles group devices like this:

struct loader_physical_device_term *cur_term;                                                                                 
for (uint32_t phys_dev = 0; phys_dev < cur_struct->physicalDeviceCount; phys_dev++) {                                         
                                                                                                                              
    cur_term  = (struct loader_physical_device_term *)cur_struct->pPhysicalDevices[phys_dev];                                 
    phys_dev_array[phys_dev] = cur_term->phys_dev;                                                                            
}                               

So it converts the pointer to struct loader_physical_device_term.

in trampoline.c, if you check function vkEnumeratePhysicalDeviceGroups,
it calls setup_loader_tramp_phys_dev_groups, then it calls setup_loader_tramp_phys_devs to convert icd physical device object to loader_physical_device_tramp:

// Something wasn't found, so it's new so add it to the new list                                             
if (!old_item_found) {                                                                                       
    new_phys_devs[new_idx] = loader_instance_heap_alloc(inst, sizeof(struct loader_physical_device_tramp),   
                                                        VK_SYSTEM_ALLOCATION_SCOPE_INSTANCE);                
    if (NULL == new_phys_devs[new_idx]) {                                                                    
        loader_log(inst, VULKAN_LOADER_ERROR_BIT, 0,                                                         
                   "setup_loader_tramp_phys_devs:  Failed to allocate new trampoline physical device");      
        res = VK_ERROR_OUT_OF_HOST_MEMORY;                                                                   
        goto out;                                                                                            
    }                                                                                                        
                                                                                                             
    // Initialize the new physicalDevice object                                                              
    loader_set_dispatch((void *)new_phys_devs[new_idx], inst->disp);                                         
    new_phys_devs[new_idx]->this_instance = inst;                                                            
    new_phys_devs[new_idx]->phys_dev = phys_devs[new_idx];                                                   
    new_phys_devs[new_idx]->magic = PHYS_TRAMP_MAGIC_NUMBER;                                                 
}                          

so you can see it converts phys-devs to loader_physical_device_tramp, but in terminator_CreateDevice it convert the pointer to loader_physical_device_term.

I made a simple change in terminator_CreateDevice to dump out some information:

                 if (0 < cur_struct->physicalDeviceCount && NULL != cur_struct->pPhysicalDevices) {                                               
                     VkDeviceGroupDeviceCreateInfo *temp_struct = loader_stack_alloc(sizeof(VkDeviceGroupDeviceCreateInfo));                      
                     VkPhysicalDevice *phys_dev_array = NULL;                                                                                     
@@ -5813,12 +5816,21 @@ VKAPI_ATTR VkResult VKAPI_CALL terminator_CreateDevice(VkPhysicalDevice physical                                          
                     // Before calling down, replace the incoming physical device values (which are really loader terminator                      
                     // physical devices) with the ICDs physical device values.                                                                   
                     struct loader_physical_device_term *cur_term;                                                                                
+                   struct loader_physical_device_tramp * cur_tramp;                                                                              
                     for (uint32_t phys_dev = 0; phys_dev < cur_struct->physicalDeviceCount; phys_dev++) {                                        
-                        cur_term = (struct loader_physical_device_term *)cur_struct->pPhysicalDevices[phys_dev];                                 
+                                                                                                                                                 
+                       cur_term  = (struct loader_physical_device_term *)cur_struct->pPhysicalDevices[phys_dev];                                 
+                       cur_tramp = (struct loader_physical_device_tramp *)cur_struct->pPhysicalDevices[phys_dev];                                
+                                                                                                                                                 
                         phys_dev_array[phys_dev] = cur_term->phys_dev;                                                                           
-                    }                                                                                                                            
+                                                                                                                                                 
+                       printf("  input  : pPhysicalDevice = %p \n", cur_term);                                                                   
+                       printf("  icd_phy: pPhysicalDevice = %p \n", cur_term->phys_dev);                                                         
+                       printf("  icd_phy: loader_physical_device_tramp::magic = 0x%llx / 0x%llx \n", cur_tramp->magic, PHYS_TRAMP_MAGIC_NUMBER); 
+                   }                                                                                                                             
                     temp_struct->pPhysicalDevices = phys_dev_array;                                                                              
                                                                                                        

and here is the output I got:

Test case 'dEQP-VK.api.object_management.single.device_group'..
VK Loader:: terminator_CreateDevice { 
  .physicalDeviceCount = 1
  input  : pPhysicalDevice = 0xaaaab1cca920 
  icd_phy: pPhysicalDevice = 0x10aded020210aded 
  icd_phy: loader_physical_device_tramp::magic = 0x10aded020210aded / 0x10aded020210aded 
}
Segmentation fault (core dumped)

you can see the icd-phy-dev was converted to 0x10aded020210aded instead of 0xaaaab1cca920, this is because the offset of phys_dev in loader_physical_device_term is 16, but in real memory, offset 16 points to magic in loader_physical_device_tramp, which is always the fixed value PHYS_TRAMP_MAGIC_NUMBER (0x10aded020210aded).

struct loader_physical_device_tramp {                                                   
    struct loader_instance_dispatch_table *disp;  // must be first entry in structure   
    struct loader_instance *this_instance;                                              
    uint64_t magic;             // Should be PHYS_TRAMP_MAGIC_NUMBER                    
    VkPhysicalDevice phys_dev;  // object from layers/loader terminator                 
};                                                                                      

struct loader_physical_device_term {                                                    
    struct loader_instance_dispatch_table *disp;  // must be first entry in structure   
    struct loader_icd_term *this_icd_term;                                              
    VkPhysicalDevice phys_dev;  // object from ICD                                      
};                                                                                      


I am not very familiar with the internal design of vk loader, not sure what's the difference between loader_physical_device_tramp and loader_physical_device_term, but it seems like a simple logic error in terminator_CreateDevice, if me know if I made anything wrong. Thanks.

@LeoXu240
Copy link
Author

LeoXu240 commented Feb 5, 2025

Hi @charles-lunarg
Regarding the change, please let me know if you have any questions or need further context.
Thanks.

@charles-lunarg
Copy link
Collaborator

I've had some time to review the loader source code, around device enumeration and device group enumeration.

The logic of the loader is to 'double-wrap' the VkPhysicalDevice handles returned by ICD's. Both for normal EnumeratePhysicalDevices and EnumeratePhysicalDeviceGroups.

When Create Device is called, it unwraps the application's VkPhysicalDevice handle to get a loader_physical_device_tramp. The loader calls down the layer dispatch chain to reach terminator_CreateDevice, where it unwraps the handle (which is a loader_physical_device_tramp) to get the ICD's VkPhysicalDevice handle and passes that into the driver's vkCreateDevice function.

This PR undoes this double wrapping, hence the failing tests. What I can't be sure of is that there isn't a bug in the loader where this double-wrapping & unwrapping is failing in some what that I wasn't able to recreate in the loader tests.

Is there more information about the hardware configuration that fails CTS? Specifics of the driver and what it returns? Does the driver support device groups? The loader implements device groups for drivers that do not support it (and pretend there is just 1 group for each VkPhysicalDevice that contains just he VkPhysicalDevice).

@LeoXu240
Copy link
Author

LeoXu240 commented Feb 7, 2025

@charles-lunarg thanks for your feedback, I did more debugging with the crash and found it only happened with release build. In debug build, the VkPhysicalDevice handle is unwrapped correctly and passed to icd vkCreateDevice.
But in release build, the VkPhysicalDevice handle is not correct just as I described before, the handle is always 0x10aded020210aded.

I tried removing -O2 in release build and the crash issue went away. @charles-lunarg did you ever hit such an optimisation issue in vkloader before?

And last question is about the double-wrap, I only see one wrap in vk loader code(VkPhysicalDevice handle to loader_physical_device_tramp), but I notice in trampoline.c, in function vkEnumeratePhysicalDeviceGroups, it calls
res = inst->disp->layer_inst_disp.EnumeratePhysicalDeviceGroups(inst->instance, pPhysicalDeviceGroupCount, pPhysicalDeviceGroupProperties);
and it jump into libVkLayer_MESA_device_select.so and then calls icd version of vkEnumeratePhysicalDeviceGroups,
seems like the physical-device object is wrapped as loader_physical_device_term in libVkLayer_MESA_device_select.so, is that correct ?

Thanks

@LeoXu240
Copy link
Author

LeoXu240 commented Feb 7, 2025

ok, I found the function that caused the compiler bug.
#if defined(__APPLE__) VkResult loader_create_device_chain(const VkPhysicalDevice pd, const VkDeviceCreateInfo *pCreateInfo, const VkAllocationCallbacks *pAllocator, const struct loader_instance *inst, struct loader_device *dev, PFN_vkGetInstanceProcAddr callingLayer, PFN_vkGetDeviceProcAddr *layerNextGDPA) __attribute__((optnone)) { #else VkResult loader_create_device_chain(const VkPhysicalDevice pd, const VkDeviceCreateInfo *pCreateInfo, const VkAllocationCallbacks *pAllocator, const struct loader_instance *inst, struct loader_device *dev, PFN_vkGetInstanceProcAddr callingLayer, PFN_vkGetDeviceProcAddr *layerNextGDPA) { #endif
Seems the similar issue happened in MacOS before ?
My target is embedde linux with aarch64 cpu arch. I just verified that it works without crash if I set O0( attribute((optimize("O0")))) for this function only.

Is that OK to disable optimisation for function loader_create_device_chain for non-apple platforms ? what's the root cause that this function caused the optimisation issue?

Thanks

Vulkan CTS crashes when running test case dEQP-VK.api.object_management.single.device_group.
The root cause is a compiler issue when optimization (-O2) is enabled.

The problem occurs on embedded Linux (aarch64) using gcc
aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2020.08) 9.3.0
@ci-tester-lunarg
Copy link

Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build.

1 similar comment
@ci-tester-lunarg
Copy link

Author LeoXu240 not on autobuild list. Waiting for curator authorization before starting CI build.

@charles-lunarg
Copy link
Collaborator

I have been looking into this, and what the test framework does differently than the 'real world' is return the same VkPhysicalDevice handles from both vkEnumeratePhysicalDevices and vkEnumeratePhysicalDeviceGroups. I will see if fixing the test framework to not do that reveal subtly wrong loader behavior, which matches up with your description.
If an optimizer settings is whats causing the crash, then the loader has some really bad UB that has "just been working".

I'd really preferably have a test that forces this issue to exist so a fix can "fix it" once and for all, since this is a really difficult part of the codebase to follow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants