nak: use MemScope::CTA for shared memory scoped SCOPE_WORKGROUP barriers
CTA scope synchronizes all threads within the same workgroup, so we
should use it instead of GPU scope, which has more severe performance
implications.

Sadly, it doesn't appear that we can rely on .CTA to work for global
memory, so let's keep using GPU for those barriers for now.

Speeds up vk_cooperative_matrix by roughly 40%.
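
The scope selection described above can be sketched roughly as follows.
This is a minimal illustration, not the code from this change: the enum
and function names (MemKind, workgroup_barrier_scope, and the simplified
MemScope) are hypothetical stand-ins rather than the actual NAK types.

```rust
// Hypothetical, simplified stand-ins for illustration only.
// These are NOT the real NAK definitions.

/// Scope at which a memory barrier makes writes visible.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MemScope {
    /// Visible to all threads in the same workgroup (CTA).
    Cta,
    /// Visible to all threads on the GPU; more expensive.
    Gpu,
}

/// Kind of memory a SCOPE_WORKGROUP barrier applies to (assumed split).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MemKind {
    Shared,
    Global,
}

/// Pick the hardware scope for a workgroup-scoped barrier.
fn workgroup_barrier_scope(mem: MemKind) -> MemScope {
    match mem {
        // Shared memory is only visible within the workgroup, so the
        // cheaper CTA scope is sufficient.
        MemKind::Shared => MemScope::Cta,
        // .CTA does not appear to be reliable for global memory, so
        // stay on the wider GPU scope for now.
        MemKind::Global => MemScope::Gpu,
    }
}
```

With this sketch, a workgroup-scoped barrier on shared memory maps to
MemScope::Cta, while the same barrier on global memory stays on
MemScope::Gpu.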
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36482>
1 file changed