
 include::meta/VK_NVX_device_generated_commands.txt[]

 *Last Modified Date*::
     2017-07-25
 *Contributors*::
   - Pierre Boudier, NVIDIA
   - Christoph Kubisch, NVIDIA
   - Mathias Schott, NVIDIA
   - Jeff Bolz, NVIDIA
   - Eric Werness, NVIDIA
   - Detlef Roettger, NVIDIA
   - Daniel Koch, NVIDIA
   - Chris Hebert, NVIDIA

 This extension allows the device to generate a number of critical commands
 for command buffers.

 When rendering a large number of objects, the device can be leveraged to
 implement a number of critical functions, like updating matrices, or
 implementing occlusion culling, frustum culling, front to back sorting, etc.
 Implementing those on the device does not require any special extension,
 since an application is free to define its own data structures and
 process them using shaders.

 However, if the application desires to quickly kick off the rendering of the
 final stream of objects, then unextended Vulkan forces the application to
 read back the processed stream and issue graphics commands from the host.
 For very large scenes, the synchronization overhead and the cost to generate
 the command buffer can become the bottleneck.
 This extension allows an application to generate a device side stream of
 state changes and commands, and convert it efficiently into a command buffer
 without having to read it back on the host.

 Furthermore, it allows incremental changes to such command buffers by
 manipulating only partial sections of a command stream -- for example
 pipeline bindings.
 Unextended Vulkan requires re-creation of entire command buffers in such
 a scenario, or updates synchronized on the host.

 The intended usage for this extension is for the application to:

   * create its objects as in unextended Vulkan
   * create a slink:VkObjectTableNVX, and register the various Vulkan objects
     that are needed to evaluate the input parameters.
   * create a slink:VkIndirectCommandsLayoutNVX, which lists the
     slink:VkIndirectCommandsTokenTypeNVX it wants to dynamically change as
     an atomic command sequence.
     This step likely involves some internal device code compilation, since
     the intent is for the GPU to generate the command buffer in the
     pipeline.
   * fill the input buffers with the data for each of the inputs it needs.
     Each input is an array that will be filled with indices into the object
     table, instead of using CPU pointers.
   * set up a target secondary command buffer
   * reserve command buffer space via flink:vkCmdReserveSpaceForCommandsNVX
     in a target command buffer at the position you want the generated
     commands to be executed.
   * call flink:vkCmdProcessCommandsNVX to create the actual device commands
     for all sequences based on the array contents into a provided target
     command buffer.
   * execute the target command buffer like a regular secondary command
     buffer
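
The index-based indirection described in the steps above can be pictured with a small host-side toy model.
This is only a sketch: `ToyObjectTable`, `toy_register` and `toy_resolve` are hypothetical names invented for illustration and are not part of the extension (real registration goes through flink:vkRegisterObjectsNVX).
It shows why the input buffers can hold plain 32-bit indices instead of CPU pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registered Vulkan object (not a real Vulkan type). */
typedef struct {
    int      registered;  /* nonzero once the slot holds an object */
    uint64_t handle;      /* opaque value standing in for e.g. a pipeline handle */
} ToyTableEntry;

#define TOY_TABLE_SIZE 8u

typedef struct {
    ToyTableEntry entries[TOY_TABLE_SIZE];
} ToyObjectTable;

/* Register an object at a chosen index; returns 0 on success, -1 if out of range. */
static int toy_register(ToyObjectTable *table, uint32_t objectIndex, uint64_t handle)
{
    assert(table != NULL);
    if (objectIndex >= TOY_TABLE_SIZE) {
        return -1;
    }
    table->entries[objectIndex].registered = 1;
    table->entries[objectIndex].handle = handle;
    return 0;
}

/* Resolve an index read from an input buffer back to the registered object.
   Returns 0 for an unregistered or out-of-range index. */
static uint64_t toy_resolve(const ToyObjectTable *table, uint32_t objectIndex)
{
    assert(table != NULL);
    if (objectIndex >= TOY_TABLE_SIZE || !table->entries[objectIndex].registered) {
        return 0;
    }
    return table->entries[objectIndex].handle;
}
```

In this model, an input array for a pipeline token would simply be an array of `uint32_t` values, each passed to `toy_resolve` when a sequence is expanded.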

 For each draw/dispatch, the following can be specified:

   * a different pipeline state object
   * a number of descriptor sets, with dynamic offsets
   * a number of vertex buffer bindings, with an optional dynamic offset
   * a different index buffer, with an optional dynamic offset

 Applications should: register a small number of objects, and use dynamic
 offsets whenever possible.

 While the GPU can be faster than a CPU at generating the commands, generation
 may not happen asynchronously; therefore the primary use case is generating
 "`less`" total work (occlusion culling, classification to use specialized
 shaders, etc.).

 === New Object Types

   * slink:VkObjectTableNVX
   * slink:VkIndirectCommandsLayoutNVX

 === New Flag Types

   * elink:VkIndirectCommandsLayoutUsageFlagsNVX
   * elink:VkObjectEntryUsageFlagsNVX

 === New Enum Constants

 Extending elink:VkStructureType:

   ** ename:VK_STRUCTURE_TYPE_OBJECT_TABLE_CREATE_INFO_NVX
   ** ename:VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_NVX
   ** ename:VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX
   ** ename:VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX
   ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_LIMITS_NVX
   ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_FEATURES_NVX

 Extending elink:VkPipelineStageFlagBits:

   ** ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX

 Extending elink:VkAccessFlagBits:

   ** ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX
   ** ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX

 === New Enums

   * elink:VkIndirectCommandsLayoutUsageFlagBitsNVX
   * elink:VkIndirectCommandsTokenTypeNVX
   * elink:VkObjectEntryUsageFlagBitsNVX
   * elink:VkObjectEntryTypeNVX

 === New Structures

   * slink:VkDeviceGeneratedCommandsFeaturesNVX
   * slink:VkDeviceGeneratedCommandsLimitsNVX
   * slink:VkIndirectCommandsTokenNVX
   * slink:VkIndirectCommandsLayoutTokenNVX
   * slink:VkIndirectCommandsLayoutCreateInfoNVX
   * slink:VkCmdProcessCommandsInfoNVX
   * slink:VkCmdReserveSpaceForCommandsInfoNVX
   * slink:VkObjectTableCreateInfoNVX
   * slink:VkObjectTableEntryNVX
   * slink:VkObjectTablePipelineEntryNVX
   * slink:VkObjectTableDescriptorSetEntryNVX
   * slink:VkObjectTableVertexBufferEntryNVX
   * slink:VkObjectTableIndexBufferEntryNVX
   * slink:VkObjectTablePushConstantEntryNVX

 === New Functions

   * flink:vkCmdProcessCommandsNVX
   * flink:vkCmdReserveSpaceForCommandsNVX
   * flink:vkCreateIndirectCommandsLayoutNVX
   * flink:vkDestroyIndirectCommandsLayoutNVX
   * flink:vkCreateObjectTableNVX
   * flink:vkDestroyObjectTableNVX
   * flink:vkRegisterObjectsNVX
   * flink:vkUnregisterObjectsNVX
   * flink:vkGetPhysicalDeviceGeneratedCommandsPropertiesNVX

 === Issues

 1) How to name this extension ?

 *RESOLVED*: `VK_NVX_device_generated_commands`

 As usual, one of the hardest issues ;)

 Alternatives: `VK_gpu_commands`, `VK_execute_commands`,
 `VK_device_commands`, `VK_device_execute_commands`, `VK_device_execute`,
 `VK_device_created_commands`, `VK_device_recorded_commands`,
 `VK_device_generated_commands`

 2) Should we use serial tokens or redundant sequence description?

 Similarly to slink:VkPipeline, signatures are the most likely to be
 adoptable across vendors.
 They also benefit from being processable in parallel.

 3) How to name the sequence description?

 stext:ExecuteCommandSignature is a bit long.
 Maybe just stext:ExecuteSignature, or actually more following Vulkan
 nomenclature: slink:VkIndirectCommandsLayoutNVX.

 4) Do we want to provide code:indirectCommands inputs with layout or at
 code:indirectCommands time?

 Separate layout from data as Vulkan does.
 Provide full flexibility for code:indirectCommands.

 5) Should the input be provided as SoA or AoS?

 It is desirable for the application to reuse the list of objects and render
 them with some kind of an override.
 This can be done by just selecting a different input for a push constant or
 a descriptor set, if they are defined as independent arrays.
 If the data was interleaved, this would not be as easily possible.

 Allowing input divisors can also reduce the conservative command buffer
 allocation.
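
A minimal sketch of how independent (SoA) input arrays and divisors could interact, assuming that sequence `s` reads element `s / divisor` from each token's own array.
The helper names and the `stride` parameter (the byte size of one input element for a given token type) are assumptions made for illustration; they do not appear in the extension.

```c
#include <assert.h>
#include <stdint.h>

/* Element of a token's input array consumed by a given sequence
   (assumed rule: integer division of the sequence number by the divisor). */
static uint32_t input_element_index(uint32_t sequence, uint32_t divisor)
{
    assert(divisor != 0);
    return sequence / divisor;
}

/* Byte offset of that element inside the token's buffer.
   `stride` is a hypothetical per-token-type element size in bytes. */
static uint64_t input_byte_offset(uint64_t baseOffset, uint32_t stride,
                                  uint32_t sequence, uint32_t divisor)
{
    return baseOffset + (uint64_t)stride * input_element_index(sequence, divisor);
}
```

Under this assumption, a pipeline token with a divisor of 4 is shared by sequences 0 to 3, while a draw token with a divisor of 1 still advances every sequence.
Because each token reads from its own array, rendering the same object list with an override only requires pointing one token at a different buffer.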

 6) How do we know the size of the GPU command buffer generated by
 flink:vkCmdProcessCommandsNVX ?

 pname:maxSequencesCount can give an upper estimate, even if the actual count
 is sourced from the GPU buffer at (buffer, countOffset).
 As such pname:maxSequencesCount must always be set correctly.

 Developers are encouraged to make good use of the
 slink:VkIndirectCommandsLayoutNVX's ptext:pTokens[].divisor, as divisors allow
 less conservative storage costs.
 Pipeline changes on a per-draw basis in particular can be costly memory-wise.
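
The relationship between the upper bound and the device-side count can be sketched as follows, assuming the value read from the count buffer is clamped to pname:maxSequencesCount.
Both function names are hypothetical, and `input_elements_needed` only estimates how many input elements a token reads under the `s / divisor` assumption; the actual command buffer storage is implementation-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of sequences actually processed. `deviceCount` stands in for the
   value stored in the count buffer; NULL means no count buffer is used. */
static uint32_t effective_sequence_count(uint32_t maxSequencesCount,
                                         const uint32_t *deviceCount)
{
    if (deviceCount == NULL) {
        return maxSequencesCount;
    }
    return (*deviceCount < maxSequencesCount) ? *deviceCount : maxSequencesCount;
}

/* Number of distinct input elements a token touches for the worst case,
   i.e. ceil(maxSequencesCount / divisor). */
static uint32_t input_elements_needed(uint32_t maxSequencesCount, uint32_t divisor)
{
    assert(divisor != 0);
    if (maxSequencesCount == 0) {
        return 0;
    }
    return (maxSequencesCount - 1) / divisor + 1;
}
```

For example, with 64 sequences a token with a divisor of 16 touches 4 elements instead of 64, which illustrates why larger divisors lead to less conservative estimates.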

 7) How to deal with dynamic offsets in DescriptorSets?

 Maybe additional token etext:VK_EXECUTE_DESCRIPTOR_SET_OFFSET_COMMAND_NVX
 that works for a "`single dynamic buffer`" descriptor set and then use (32
 bit tableEntry + 32bit offset)

 added dynamicCount field, variable sized input

 8) Should we allow updates to the object table, similar to DescriptorSet?

 Desired yes, people may change "`material`" shaders and not want to recreate
 the entire register table.
 However the developer must ensure to not overwrite a registered objectIndex
 while it is still being used.

 9) Should we allow dynamic state changes?

 Seems a bit excessive for "`per-draw`" type of scenario, but GPU could
 partition work itself with viewport/scissor...

 10) How do we allow re-using already "`filled`" code:indirectCommands
 buffers?

 just use a slink:VkCommandBuffer for the output, and it can be reused
 easily.

 11) How portable should such re-use be?

 Same as secondary command buffer

 12) Should sequenceOrdered be part of IndirectCommandsLayout or
 flink:vkCmdProcessCommandsNVX?

 Seems better for IndirectCommandsLayout, as that is when most heavy lifting
 in terms of internal device code generation is done.

 13) Under which conditions is flink:vkCmdProcessCommandsNVX legal?

 Options:

 a) on the host command buffer like a regular draw call

 b) flink:vkCmdProcessCommandsNVX makes use of slink:VkCommandBufferBeginInfo
    and serves as flink:vkBeginCommandBuffer / flink:vkEndCommandBuffer
    implicitly.

 c) The pname:targetCommandBuffer must be inside the "`begin`" state already
    at the moment of being passed.
    This very likely suggests a new slink:VkCommandBufferUsageFlags
    etext:VK_COMMAND_BUFFER_USAGE_DEVICE_GENERATED_BIT.

 d) The pname:targetCommandBuffer must reserve space via a new function.

 used a) and d).

 14) What if different pipelines have different DescriptorSetLayouts at a
 certain set unit that mismatches in code:token.dynamicCount?

 Considered legal, as long as the maximum dynamic count of all used
 DescriptorSetLayouts is provided.

 15) Should we add "`strides`" to input arrays, so that "`Array of
 Structures`" type setups can be supported more easily?

 Maybe provide a usage flag for packed tokens stream (all inputs from same
 buffer, implicit stride).

 No, given performance test was worse.

 16) Should we allow re-using the target command buffer directly, without
 need to reset command buffer?

 YES: new api flink:vkCmdReserveSpaceForCommandsNVX.

 17) Is flink:vkCmdProcessCommandsNVX copying the input data or referencing
 it ?

 There are multiple implementations possible:

   * one could have some emulation code that parses the inputs, and generates
     an output command buffer, therefore copying the inputs.
   * one could just reference the inputs, and have the processing done in
     pipe at execution time.

 If the data is mandated to be copied, then it puts a penalty on
 implementations that could process the inputs directly in pipe.
 If the data is "`referenced`", then it allows both types of implementation.

 The inputs are "`referenced`", and should not be modified after the call to
 flink:vkCmdProcessCommandsNVX and until after the rendering of the target
 command buffer is finished.

 18) Why is this +NVX+ and not +NV+?

 To allow early experimentation and feedback.
 We expect that a version with a refined design as a multi-vendor variant will
 follow up.

 19) Should we make the availability for each token type a device limit?

 Only distinguish between graphics/compute for now, further splitting up may
 lead to too much fragmentation.

 20) When can the pname:objectTable be modified?

 Similar to the other inputs for flink:vkCmdProcessCommandsNVX, only when all
 device access via flink:vkCmdProcessCommandsNVX or execution of target
 command buffer has completed can an object at a given objectIndex be
 unregistered or re-registered again.
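
One way an application might track when an objectIndex is safe to reuse is host-side bookkeeping along the following lines.
This is purely an illustrative sketch: `SlotTracker`, the "ticket" numbering of submissions, and both helper functions are assumptions, and how `completed` is advanced (for example after waiting on a fence) is left to the application.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_COUNT 4u

typedef struct {
    /* Ticket of the last submission that referenced each objectIndex,
       either through command processing or through execution of a
       target command buffer; 0 means never used. */
    uint64_t lastUse[SLOT_COUNT];
    /* Highest ticket known to have finished on the device. */
    uint64_t completed;
} SlotTracker;

/* Record that a submission with the given ticket references objectIndex. */
static void mark_used(SlotTracker *tracker, uint32_t objectIndex, uint64_t ticket)
{
    assert(objectIndex < SLOT_COUNT);
    if (ticket > tracker->lastUse[objectIndex]) {
        tracker->lastUse[objectIndex] = ticket;
    }
}

/* Nonzero when every submission that referenced objectIndex has completed,
   so the slot may be unregistered or re-registered. */
static int can_reregister(const SlotTracker *tracker, uint32_t objectIndex)
{
    assert(objectIndex < SLOT_COUNT);
    return tracker->lastUse[objectIndex] <= tracker->completed;
}
```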

 21) Which buffer usage flags are required for the buffers referenced by
 flink:vkCmdProcessCommandsNVX?

 reuse existing ename:VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT

   * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesCountBuffer
   * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesIndexBuffer
   * slink:VkIndirectCommandsTokenNVX::pname:buffer

 22) In which pipeline stage does the device generated command expansion
 happen?

 flink:vkCmdProcessCommandsNVX is treated as if it occurs in a separate
 logical pipeline from either graphics or compute, and that pipeline only
 includes ename:VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, a new stage
 ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX, and
 ename:VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT.
 This new stage has two corresponding new access types,
 ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX and
 ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX, used to synchronize reading
 the buffer inputs and writing the command buffer memory output.
 The output written in the target command buffer is considered to be consumed
 by the ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage.

 Thus, to synchronize from writing the input buffers to executing
 flink:vkCmdProcessCommandsNVX, use:

   * pname:dstStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
   * pname:dstAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX

 To synchronize from executing flink:vkCmdProcessCommandsNVX to executing the
 generated commands, use:

   * pname:srcStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
   * pname:srcAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX
   * pname:dstStageMask = ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
   * pname:dstAccessMask = ename:VK_ACCESS_INDIRECT_COMMAND_READ_BIT

 When flink:vkCmdProcessCommandsNVX is used with a pname:targetCommandBuffer
 of `NULL`, the generated commands are immediately executed and there is
 implicit synchronization between generation and execution.

 23) What if most token data is "`static`", but we frequently want to render
 a subsection?

 Added "`sequencesIndexBuffer`".
 This makes it easier to sort and filter what should actually be processed.
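
The effect of an index buffer can be sketched on the host as a simple remapping, assuming that processing step `s` uses the sequence number stored at position `s` of the index buffer, or `s` itself when no index buffer is given.
`resolve_sequences` is a hypothetical helper written for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the sequence numbers that would be processed into `out`
   (which must have room for `count` entries). With indexBuffer == NULL
   the sequences 0..count-1 are processed in order. */
static void resolve_sequences(const uint32_t *indexBuffer, uint32_t count,
                              uint32_t *out)
{
    assert(out != NULL || count == 0);
    for (uint32_t s = 0; s < count; ++s) {
        out[s] = indexBuffer ? indexBuffer[s] : s;
    }
}
```

With mostly static token data, a culling or sorting pass would then only need to rewrite the small index list (and the count) rather than the token inputs themselves.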

 === Example Code

 Open-Source samples illustrating the usage of the extension can be found at
 the following locations:

 https://github.com/nvpro-samples/gl_vk_threaded_cadscene/blob/master/doc/vulkan_nvxdevicegenerated.md

 https://github.com/NVIDIAGameWorks/GraphicsSamples/tree/master/samples/vk10-kepler/BasicDeviceGeneratedCommandsVk

 [source,c]
 ---------------------------------------------------

   // set up the secondary command buffer
     vkBeginCommandBuffer(generatedCmdBuffer, &beginInfo);
     // ... set up its state as usual

   // insert the reservation (there can only be one per command buffer)
   // where the generated calls should be filled into
     VkCmdReserveSpaceForCommandsInfoNVX reserveInfo = { VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX };
     reserveInfo.objectTable = objectTable;
     reserveInfo.indirectCommandsLayout = deviceGeneratedLayout;
     reserveInfo.maxSequencesCount = myCount;
     vkCmdReserveSpaceForCommandsNVX(generatedCmdBuffer, &reserveInfo);

     vkEndCommandBuffer(generatedCmdBuffer);

   // trigger the generation at some point in another primary command buffer
     VkCmdProcessCommandsInfoNVX processInfo = { VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX };
     processInfo.objectTable = objectTable;
     processInfo.indirectCommandsLayout = deviceGeneratedLayout;
     processInfo.maxSequencesCount = myCount;
     // set the target of the generation (if null we would directly execute with mainCmd)
     processInfo.targetCommandBuffer = generatedCmdBuffer;
     // provide input data
     processInfo.indirectCommandsTokenCount = 3;
     processInfo.pIndirectCommandsTokens = myTokens;

   // If you modify the input buffer data referenced by VkCmdProcessCommandsInfoNVX,
   // ensure you have added the appropriate barriers prior to the generation process.
   // When regenerating the content of the same reserved space, ensure prior operations have completed.

     VkMemoryBarrier memoryBarrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
     memoryBarrier.srcAccessMask = ...;
     memoryBarrier.dstAccessMask = VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX;

     vkCmdPipelineBarrier(mainCmd,
                          /*srcStageMask*/VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                          /*dstStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
                          /*dependencyFlags*/0,
                          /*memoryBarrierCount*/1,
                          /*pMemoryBarriers*/&memoryBarrier,
                          ...);

     vkCmdProcessCommandsNVX(mainCmd, &processInfo);
     ...
   // execute the secondary command buffer and ensure the processing that modifies command-buffer content
   // has completed

     memoryBarrier.srcAccessMask = VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX;
     memoryBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

     vkCmdPipelineBarrier(mainCmd,
                          /*srcStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
                          /*dstStageMask*/VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                          /*dependencyFlags*/0,
                          /*memoryBarrierCount*/1,
                          /*pMemoryBarriers*/&memoryBarrier,
                          ...);
     vkCmdExecuteCommands(mainCmd, 1, &generatedCmdBuffer);

 ---------------------------------------------------

 === Version History

  * Revision 3, 2017-07-25 (Chris Hebert)
    - Correction to specification of dynamicCount for push_constant token in
      VkIndirectCommandsLayoutNVX.
      Stride was incorrectly computed as dynamicCount was not treated as byte
      size.
  * Revision 2, 2017-06-01 (Christoph Kubisch)
    - header compatibility break: add missing _TYPE to
      VkIndirectCommandsTokenTypeNVX and VkObjectEntryTypeNVX enums to follow
      Vulkan naming convention
    - behavior clarification: only allow a single work provoking token per
      sequence when creating a slink:VkIndirectCommandsLayoutNVX
  * Revision 1, 2016-10-31 (Christoph Kubisch)
    - Initial draft