amdgpu: Add SDMA copy support.

For SCANOUT images we need to use USWC memory. However, USWC
memory is very slow to read from (~25 MiB/second).

For video decoding, though, some images are allocated with SCANOUT
yet frequently accessed from the CPU. Mapping these directly
results in unsatisfactory performance, so this patch adds a DMA step
that copies them to memory that is faster to access from the CPU.

Benchmarked on Grunt with
android.hardware.camera2.cts.RecordingTest#testVideoPreviewSurfaceSharing

The time that an image (~500 KiB) is kept mapped for processing,
including the time for mapping and unmapping:

plain GTT (cacheable):   1.5-2 ms
USWC:                    45-50 ms
USWC w/ memcpy:          20-30 ms
USWC w/ SDMA copy:       3.5-5.5 ms

We can clearly see that the Android video processing code only gets
a throughput of ~10 MiB/s with USWC memory. Doing a memcpy first is
slightly more efficient at 20-30 MiB/s, but neither of these is
suitable for 30+ fps video, where each frame has at most ~33 ms.
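
For reference, the throughput figures follow from the image size and
the time the image is kept mapped. A minimal sketch of that arithmetic
(the helper name is illustrative and not part of minigbm); with
~500 KiB over 45-50 ms it gives roughly 10-11 MiB/s:

```c
#include <assert.h>

/*
 * Effective throughput in MiB/s when size_kib KiB are processed in
 * elapsed_ms milliseconds. Hypothetical helper, for illustration only.
 */
double throughput_mib_per_s(double size_kib, double elapsed_ms)
{
	assert(elapsed_ms > 0.0);
	/* KiB -> MiB, ms -> s */
	return (size_kib / 1024.0) / (elapsed_ms / 1000.0);
}
```

Note that the measured window includes map, processing and unmap, so
this is an effective rate rather than raw memory bandwidth.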

Furthermore, with SDMA copy, the timing is roughly as follows:

map:
  - Allocate plain GTT BO:        ~400-800 us
  - map src & dst BO into GPU VM: ~25 us
  - submit SDMA copy:             ~80 us
  - wait till SDMA copy finishes: ~400 us
  - unmap src & dst BO from GPU:  ~15 us
  - map dst BO into CPU:          ~30 us

unmap:
  - unmap dst BO from CPU:        ~30 us
  - copy back: not benchmarked (skipped for read-only maps)
  - delete BO:                    ~100 us
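
As a rough cross-check, the per-step numbers above can be summed. The
sketch below is illustrative only (the struct and function names are
made up, and the copy on unmap is left out since it was not measured).
It gives about 950-1350 us for map and ~130 us for unmap, so roughly
1.1-1.5 ms of overhead per map/unmap cycle:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* One measured step; min_us == max_us when a single value was reported. */
struct step_cost {
	const char *name;
	unsigned min_us;
	unsigned max_us;
};

const struct step_cost map_steps[] = {
	{ "allocate plain GTT BO", 400, 800 },
	{ "map src & dst BO into GPU VM", 25, 25 },
	{ "submit SDMA copy", 80, 80 },
	{ "wait till SDMA copy finishes", 400, 400 },
	{ "unmap src & dst BO from GPU", 15, 15 },
	{ "map dst BO into CPU", 30, 30 },
};

const struct step_cost unmap_steps[] = {
	{ "unmap dst BO from CPU", 30, 30 },
	{ "delete BO", 100, 100 },
};

/* Sum the lower (use_max == 0) or upper (use_max != 0) bounds in us. */
unsigned sum_us(const struct step_cost *steps, size_t count, int use_max)
{
	unsigned total = 0;

	for (size_t i = 0; i < count; i++) {
		assert(steps[i].min_us <= steps[i].max_us);
		total += use_max ? steps[i].max_us : steps[i].min_us;
	}
	return total;
}
```

The BO allocation and the wait for the SDMA copy dominate the total,
which is what the improvement ideas below target.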

Ideas for further improvement:
  - BO cache
  - rely on implicit sync and don't wait for the copy during
    unmapping.

Alternatives that have been rejected:
  - Use radeonsi + DRI interface: each plane gets mapped into
    its own BO, which is an issue for gralloc.
  - Map each BO into the GPU VM more persistently: this needs
    proper address space management, which adds complexity.
    libdrm_amdgpu can do it for us but brings its own can of worms
    with dedup of the drm fd (which makes e.g. implicit sync not
    work with any radeonsi instances in the same process).
  - Use SDMA instead of DRI/radeonsi for more images. This is an
    issue because SDMA for images is a whole mess with lots of
    corner cases and lots of changes per generation. Furthermore,
    it wouldn't work for DCC compressed images.

TEST=Run android.hardware.camera2.cts.RecordingTest#testVideoPreviewSurfaceSharing on Grunt.
BUG=b:152378755

Change-Id: I8f5e00ff4b6d9e31f78fd4de7eb62d0d3aa66438
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/minigbm/+/2256228
Tested-by: Bas Nieuwenhuizen <basni@chromium.org>
Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org>
Commit-Queue: Bas Nieuwenhuizen <basni@chromium.org>