EnhancePerf

Improving Spice performance: problematic scenarios and ideas

  1. Unnecessary qxl commands rendering
    Rendering qxl commands affects not only CPU but potentially also bandwidth:
    when a command is rendered, it is removed from current. Thus, if it has not yet been sent to the client and is later hidden by another command, it will still be sent to the client.
    1. When reading a qxl command for surface X, we go over all the commands in current whose
      src/mask/bitmap come from surface X, and we call update_area for their destination surface and bbox.
      These calls ensure the correctness of the rendering order in subsequent update_area calls,
      and the correctness of current tree manipulations.
      However, this is a brute-force solution. By extending current lists and trees to keep and handle dependencies across surfaces we can completely avoid rendering when processing qxl commands, and render only upon guest requests.
    2. update_area logic does not utilize the current tree.
      The code for utilizing the tree exists but is not used due to an unsolved rendering bug.
  2. Find the sources of update_area calls when pressing the Start button / Win key in Windows 7 32-bit, which causes 50 update_area calls.
    Here are some findings:
    1. There are many more calls to update_area when using off-screen surfaces.
    2. Most of the off-screen surfaces are short-lived.
    3. Many of the off-screen surfaces are not even used, or used only once for other surfaces, and are mostly used for gdi-managed bitmaps. Hence, their relevant messages are unnecessarily rendered by the server and sent to the client.
    4. Disabling font-smoothing helps to reduce the number of update_area calls. GDI doesn't support hw accelerated text rendering. So the rendering is done by software on gdi-managed bitmaps.
      DirectX contains api for text rendering, so in the future we may have a solution for this. At least for apps that use it.
      The Linux driver suffers from similar problems (for font smoothing, until support for X11 Render extension ops is implemented).
  3. Parallelizing red_worker
    The only thing we should do when we read a command from the ring is
    to push it to the client send queue. The “current tree” is used for (1) deleting from the drawables' history those drawables that are already covered by other ones (we keep the history in case we need to render something on the server side), and
    (2) removing such drawables from the send queue.
    Maintaining the tree can be done in a different thread (with lower priority?), and it only needs to be synchronized on calls to update_area (i.e., when rendering surfaces).
    • Bitmap compression can also be separated from sending. Currently compression and sending are coupled, but we can start compressing a pending bitmap while sending the current message. There can also be several bitmaps in one message; we can compress them at the same time.
    • This feature is also essential for good support of multiple clients.
    • Parallelization can also be relevant for the sound channel and other channels.
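A minimal sketch of decoupling compression from sending, under stated assumptions: Python's zlib stands in for the Spice image compressors, and `send_messages` and its `send` callback are hypothetical names, not part of red_worker. All bitmaps of one message are compressed concurrently, and the next message starts compressing before the previous one is sent.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def send_messages(messages, send, workers=4):
    """Send each message while the bitmaps of the next one are compressing.

    `messages` is a list of lists of raw bitmap bytes; `send` is a
    hypothetical callback that transmits one list of compressed bitmaps.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = None
        for bitmaps in messages:
            # Queue compression of this message before sending the previous one.
            futures = [pool.submit(zlib.compress, b) for b in bitmaps]
            if pending is not None:
                send([f.result() for f in pending])
            pending = futures
        if pending is not None:
            send([f.result() for f in pending])
```

Messages are still sent in their original order; only the compression work overlaps with sending.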
  4. Parallelizing client channels
    If the client has enough CPU, we can decode compressed bitmaps and render qxl commands in parallel.
  5. Video streaming
    • reevaluate the fps calculation
    • dynamic jpeg quality
  6. Faster compression
  7. Concatenating messages
    When working with Microsoft Excel and detail-rich graphs, drawing a graph takes a long time. Many small DRAW_STROKE and DRAW_FILL commands are sent. We need to check whether disabling TCP_NODELAY when streams of small messages are sent solves this.
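The experiment could be sketched as follows, using the standard socket option from Python. The helper names `set_nodelay` and `send_small_messages` are hypothetical; whether batching actually helps in this scenario is exactly what needs to be measured.

```python
import socket


def set_nodelay(sock, enabled):
    """Enable or disable TCP_NODELAY (disabling it re-enables Nagle's algorithm)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


def send_small_messages(sock, messages):
    """Hypothetical helper: let the kernel coalesce a burst of small messages."""
    set_nodelay(sock, False)      # allow small writes to be batched
    try:
        for msg in messages:
            sock.sendall(msg)
    finally:
        set_nodelay(sock, True)   # back to low-latency mode after the burst
```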
  8. Utilization of bitmaps that are almost identical
    • When ClearType is enabled in Microsoft Word, each time a small number of characters is typed, the whole Word page is sent.
    • ideas:
      • Caching by bitmaps' segments
        Add a grid to bitmaps and compute a unique id for each rectangle.
        Cache bitmap rectangles and combine caching with compression of bitmaps.
        Note that GLZ currently does not support bitmap lines that are not consecutive in memory, so it will be harder to combine it with caching.
        This solution is good for bitmaps that are aligned and whose differences are very local.
      • Store a short history of bitmaps (maybe employ the GLZ dictionary). When a new bitmap arrives, use its dimensions or other properties to fetch a previous bitmap that may be similar to it. Partially compare the bitmaps (sampling). If the images are very similar, compress the new one using the other (with GLZ, for example).
        This solution is good for bitmaps that are aligned but whose differences can be scattered. For example: a PowerPoint presentation with a constant background picture.
    • TODO: Check if the bitmaps sent while typing in Microsoft Word are aligned.
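The grid idea ("caching by bitmaps' segments") can be sketched roughly as below. Everything here is an assumption made for illustration: the 64-pixel tile size, the use of SHA-1 as the rectangle id (a digest is only unique up to hash collisions), the tightly packed top-down pixel layout, and the names `tile_ids` and `split_cached`.

```python
import hashlib


def tile_ids(pixels, width, height, bpp=4, tile=64):
    """Yield ((x, y), digest) for each grid rectangle of a bitmap.

    Assumes `pixels` is raw bytes, top-down, with stride == width * bpp.
    """
    stride = width * bpp
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            tile_w = min(tile, width - tx)
            h = hashlib.sha1()
            for row in range(ty, min(ty + tile, height)):
                start = row * stride + tx * bpp
                h.update(pixels[start:start + tile_w * bpp])
            yield (tx, ty), h.digest()


def split_cached(pixels, width, height, cache, **kw):
    """Return (hits, misses): rectangles already cached vs. rectangles to send."""
    hits, misses = [], []
    for pos, digest in tile_ids(pixels, width, height, **kw):
        if digest in cache:
            hits.append(pos)
        else:
            misses.append(pos)
            cache.add(digest)
    return hits, misses
```

With aligned bitmaps that differ only locally, most rectangles of the second bitmap land in `hits` and only the changed ones need to be compressed and sent.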
  9. PNG or TIFF instead of ZLIB-GLZ over WAN
    • TODO: compare compression ratio
  10. Clipping - The current tree contains information about the visible part of a drawable (w.r.t the current drawables in the tree). Maybe we can send to the client only the visible part.
    First, we need to test how often we send the client a full bitmap, while we know it is partially hidden.
    Note that the same issue of GLZ not handling non-consecutive bitmaps arises here.
  11. Squash pipe
    When the size of surface X drawables inside the pipe >> size of surface X ==> render and send surface X (or only its relevant boxes).
    • Yonit: Actually, if the current tree and caching work as expected, I don't think such occurrences are common. Unless there are many non opaque operations on the same rectangle that hold large bitmaps which are not cached.
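A rough sketch of the squash rule, assuming the pipe is a simple ordered list of items with a surface id and a payload size. The dictionary layout, the `squash_pipe` name, and the threshold `factor` (standing in for ">>") are all hypothetical, and dependencies between surfaces are ignored here.

```python
def squash_pipe(pipe, surface_id, surface_size, factor=4):
    """Replace a surface's queued drawables with one full-surface image item.

    Squashes only when the queued drawables are more than `factor` times
    larger than the surface itself; otherwise returns the pipe unchanged.
    """
    indices = [i for i, item in enumerate(pipe) if item["surface"] == surface_id]
    queued = sum(pipe[i]["size"] for i in indices)
    if queued <= factor * surface_size:
        return list(pipe)
    last = indices[-1]
    drop = set(indices[:-1])
    squashed = {"surface": surface_id, "size": surface_size, "kind": "image"}
    # The image takes the place of the surface's last queued drawable.
    return [squashed if i == last else item
            for i, item in enumerate(pipe) if i not in drop]
```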
  12. QXL driver - SSE2 for 64 bit and maybe for unaligned as well. See RHBZ #705785
  13. Surfaces BAR - why do we need it? Can't we get the guest to allocate contiguous memory for surfaces and use that? That would avoid allocating a whole BAR upfront and allow limitless (up to RAM) surface memory. For OpenGL/DirectX we would probably do that anyway.
  14. spice protocol
    • add DESTROY_ALL_SURFACES msg instead of destroying the surfaces one by one when QXL_IO_DESTROY_ALL_SURFACE happens. i.e., instead of sending SPICE_MSG_DESTROY_SURFACE for each surface. In addition, don't send SPICE_MSG_DESTROY_SURFACE for surfaces that weren't created in the client.
  15. Allocate surfaces only on demand. Currently, if a surface create is followed by a surface destroy with nothing in between, we waste the allocation effort. We can simply keep a bit on the server (no need for it on the device or in memory) that says "unallocated", and allocate when update_area occurs, not before (so a surface can have commands directed at it and from it and still not be allocated; it will still be allocated on the client).
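A minimal sketch of on-demand allocation, assuming a hypothetical server-side `Surface` class in which `pixels is None` plays the role of the "unallocated" bit and commands are plain callables:

```python
class Surface:
    """Server-side surface whose pixel buffer is allocated only when rendered."""

    def __init__(self, width, height, bpp=4):
        self.width, self.height, self.bpp = width, height, bpp
        self.pixels = None      # None is the "unallocated" bit
        self.pending = []       # commands directed at this surface

    def add_command(self, cmd):
        # Queuing a command never forces an allocation.
        self.pending.append(cmd)

    def update_area(self):
        # First render request: allocate now, then replay the queued commands.
        if self.pixels is None:
            self.pixels = bytearray(self.width * self.height * self.bpp)
        for cmd in self.pending:
            cmd(self.pixels)
        self.pending.clear()
        return self.pixels
```

A surface that is created and destroyed without any update_area call in between never pays for the buffer.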
  16. Fix slow spice client start
    1. Remove the bandwidth test
      • First, replace it with a profile that can be set by the client. i.e. the user will choose if he uses WAN/LAN etc.
      • Later, we can add automatic bandwidth monitoring which is based on actual spice traffic
    2. When the agent is up, display settings are reloaded from the registry or user settings upon spice client connection.
      This causes updates to the display, and makes the client start very slowly over WAN. We should change the display settings only if they are different from the current ones.
  17. Consider implementing DrvLineTo, DrvFillPath, DrvStrokeAndFillPath for XDDM driver (or their WDDM equivalent). Test how common they are in real user applications (and not just benchmarks like Tom2D).
  18. Handle detailed clipping
    A draw command can hold a clip that contains a lot of small rectangles (for example, since we don't implement DrvLineTo, it is replaced with DrawFill and the line is represented by such a detailed clip). Creating a region for the current tree item of a command like this is expensive. Since it seems unlikely that such a command will cover another one, or that only the clipped area will be covered, we can add it to the current tree as a non opaque operation and without the clip. We can classify such commands by the number of rects in the clip and the size of the largest rectangle in the clip.
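The classification could look roughly like this. The function name and both thresholds are illustrative guesses, not values taken from the Spice server; they would have to be tuned against real workloads.

```python
def is_detailed_clip(rects, max_rects=32, max_rect_area=64):
    """Heuristic: the clip has many rectangles and none of them is large.

    `rects` is a list of (x, y, width, height) tuples.
    """
    if len(rects) <= max_rects:
        return False
    largest = max(w * h for _, _, w, h in rects)
    return largest <= max_rect_area
```

A command whose clip matches would be added to the current tree as a non opaque item without its clip, skipping the region construction.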
  19. Improve command polling
    When Spice server attempts to read from the command ring and it is empty, it goes back to epoll_wait with CMD_RING_POLL_TIMEOUT (10ms). This means that it will attempt another read after 10ms or, if a client is connected, when the client’s socket is ready for writing. When there is a steady flow of commands, this timeout might be too large. Decreasing it to 2ms improved Tom2D benchmark results.
  20. client MSGC_ACK timing
    1. spicec acks can be delayed: when the channel receives the <window>th message, it pushes the ack message to the send queue, and continues receiving and processing messages. Thus, the ack will be sent only after the channel completes receiving messages. When there are many small messages in the receive queue, this might cause a delay in the display channel. Is this relevant to spice-gtk?
    2. Consider the latency and bandwidth when setting the window size and the ack intervals. In addition, should the window be configured by number of messages, or by size?
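To make the "messages or size" question concrete, here is a small sketch of a receive-side window that can count either. The `AckWindow` class is hypothetical and not taken from spicec or spice-gtk; the intent is that the caller sends the ack as soon as `on_message` returns True, instead of queuing it behind further receive processing.

```python
class AckWindow:
    """Tracks when the receiver owes the sender an ack.

    Counts messages by default, or payload bytes when `by_bytes` is set.
    """

    def __init__(self, window, by_bytes=False):
        self.window = window
        self.by_bytes = by_bytes
        self.received = 0

    def on_message(self, size):
        """Return True when an ack is due after this message."""
        self.received += size if self.by_bytes else 1
        if self.received >= self.window:
            self.received = 0
            return True
        return False
```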