In order to simplify the lockdep annotation, as they become more complex
in the future with deferred execution and multiple paths through the
same functions, create a separate lockclass for the user timeline and
the hardware execution timeline.
We should only ever be locking the user timeline and the execution
timeline in parallel so we only need to create two lock classes, rather
than a separate class for every timeline.
v2: Rename the lock classes to be more consistent with other lockdep.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161114204105.29171-2-chris@chris-wilson.co.uk
When we release the shmem backing storage, we make sure that the pages
are coherent with the cpu cache. However, our clflush routine was
skipping the flush as the object had no pages at release time. Fix this by
explicitly flushing the sg_table we are decoupling.
Fixes: 03ac84f183 ("drm/i915: Pass around sg_table to get_pages/put_pages backend")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161111145809.9701-2-chris@chris-wilson.co.uk
A small selection of macros which can only accept dev_priv from
now on and a resulting trickle of fixups.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: David Weinehall <david.weinehall@linux.intel.com>
As a side product, had to split two other files;
- i915_gem_fence_reg.h
- i915_gem_object.h (only parts that needed immediate untanglement)
I tried to move code in as big chunks as possible, to make review
easier. i915_vma_compare was moved to a header temporarily.
v2:
- Use i915_gem_fence_reg.{c,h}
v3:
- Rebased
v4:
- Fix building when DEBUG_GEM is enabled by reordering a bit.
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1478861034-30643-1-git-send-email-joonas.lahtinen@linux.intel.com
At the moment we allocate enough sg table entries assuming we
will not be able to do any coalescing. But since in practice
we most often can, and more so very effectively, this ends up
wasting a lot of memory.
A simple and effective way of trimming the over-allocated
entries is to copy the table over to a new one allocated to the
exact size.
Experiments on my freshly logged and idle desktop (KDE) showed
that by doing this we can save approximately 1 MiB of RAM, or
when running a typical benchmark like gl_manhattan I have
even seen a 6 MiB saving.
More complicated techniques such as only copying the last used
page and freeing the rest are left to the reader.
v2:
* Update commit message.
* Use temporary sg_table on stack. (Chris Wilson)
v3:
* Commit message update.
* Comment added.
* Replace memcpy with copy assignment.
(Chris Wilson)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1478704423-7447-1-git-send-email-tvrtko.ursulin@linux.intel.com
We always flush the chipset prior to executing with the GPU, so we can
skip the flush during ordinary domain management.
This should help mitigate some of the potential performance regressions,
but likely trivial, from doing the flush unconditionally before execbuf
introduced in commit dcd79934b0 ("drm/i915: Unconditionally flush any
chipset buffers before execbuf")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20161106130001.9509-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
During resume we will reset the SW/HW tracking for each ring head/tail
pointers and so are not prepared to replay any pending requests (as
opposed to GPU reset time). Add an assert for this both to the suspend
and the resume code.
v2:
- Check for ELSP port idle already during suspend and check !gt.awake
during resume. (Chris)
v3:
- Move the !gt.awake check to i915_gem_resume().
v4:
- s/intel_lr_engines_idle/intel_execlists_idle/ (Chris)
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1478510405-11799-4-git-send-email-imre.deak@intel.com
We assume that the GPU is idle once receiving the seqno via the last
request's user interrupt. In execlist mode the corresponding context
completed interrupt can be delayed though and until this latter
interrupt arrives we consider the request to be pending on the ELSP
submit port. This can cause a problem during system suspend where this
last request will be seen by the resume code as still pending. Such
pending requests are normally replayed after a GPU reset, but during
resume we reset both SW and HW tracking of the ring head/tail pointers,
so replaying the pending request with its stale tail pointer will leave
the ring in an inconsistent state. A subsequent request submission can
lead then to the GPU executing from uninitialized area in the ring
behind the above stale tail pointer.
Fix this by making sure any pending request on the ELSP port is
completed before suspending. I used a polling wait since the completion
time I measured was <1ms and since normally we only need to wait during
system suspend. GPU idling during runtime suspend is scheduled with a
delay (currently 50-100ms) after the retirement of the last request at
which point the context completed interrupt must have arrived already.
The chance of this bug was increased by
commit 1c777c5d1d
Author: Imre Deak <imre.deak@intel.com>
Date: Wed Oct 12 17:46:37 2016 +0300
drm/i915/hsw: Fix GPU hang during resume from S3-devices state
but it could happen even without the explicit GPU reset, since we
disable interrupts afterwards during the suspend sequence.
v2:
- Do an unlocked poll-wait first. (Chris)
v3-4:
- s/intel_lr_engines_idle/intel_execlists_idle/ and move
i915.enable_execlists check to the new helper. (Chris)
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1478510405-11799-3-git-send-email-imre.deak@intel.com
There is a small race where a new request can be submitted and retired
after the idle worker started to run which leads to idling the GPU too
early. Fix this by deferring the idling to the pending instance of the
worker.
This scenario was pointed out by Chris.
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1478510405-11799-2-git-send-email-imre.deak@intel.com
Valleyview appears to be limited to only scanning out from the first 512MiB
of the Global GTT. Lets presume that this behaviour was inherited from the
display block copied from g4x (not Ironlake) and all earlier generations
are similarly affected, though testing suggests different symptoms. For
simplicity, impose that these platforms must scanout from the mappable
region. (For extra simplicity, use HAS_GMCH_DISPLAY even though this
catches Cherryview which does not appear to be limited to the low
aperture for its scanout.)
v2: Use HAS_GMCH_DISPLAY() to more clearly convey my intent about
limiting this workaround to the old style of display engine.
v3: Update changelog to reflect testing by Ville Syrjälä
v4: Include the changes to the comments as well
Reported-by: Luis Botello <luis.botello.ortega@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98036
Fixes: 2efb813d53 ("drm/i915: Fallback to using unmappable memory for scanout")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Akash Goel <akash.goel@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.9-rc1+
Link: http://patchwork.freedesktop.org/patch/msgid/20161107110128.28762-1-chris@chris-wilson.co.uk
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com
When we split a large object up into chunks for GTT faulting (because we
can't fit the whole object into the aperture) we have to align our cuts
with the fence registers. Each partial VMA must cover a complete set of
tile rows or the offset into each partial VMA is not aligned with the
whole image. Currently we enforce a minimum size on each partial VMA,
but this minimum size itself was not aligned to the tile row causing
distortion.
Reported-by: Andreas Reis <andreas.reis@gmail.com>
Reported-by: Chris Clayton <chris2553@googlemail.com>
Reported-by: Norbert Preining <preining@logic.at>
Tested-by: Norbert Preining <preining@logic.at>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Fixes: 03af84fe7f ("drm/i915: Choose partial chunksize based on tile row size")
Fixes: a61007a83a ("drm/i915: Fix partial GGTT faulting") # enabling patch
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98402
Testcase: igt/gem_mmap_gtt/medium-copy-odd
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.9-rc1+
Link: http://patchwork.freedesktop.org/patch/msgid/20161107105443.27855-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
commit bc0629a767 ("drm/i915: Track pages pinned due to swizzling
quirk") fixed one problem, but revealed a whole lot more. The root cause
of the pin count mismatch for the swizzle quirk (for L-shaped memory on
gen3/4) was that we were incrementing the pages_pin_count upon getting
the backing pages but then overwriting the pages_pin_count to set it to
1 afterwards. With a little bit of adjustment to satisfy the GEM_BUG_ON
sanitychecks, the fix is to replace the explicit atomic_set with an
atomic_inc.
v2: Consistently use atomics (not mix atomics and helpers) within the
lowlevel get_pages routines. This makes the atomic operations much
clearer.
Fixes: 1233e2db19 ("drm/i915: Move object backing storage manipulation")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161104103001.27643-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Commit 1bec9b0bda ("drm/i915/shrinker: Only shmemfs objects
are backed by swap") stopped considering the userptr objects
in shrinker callbacks.
Restore that so idle userptr objects can be discarded in order
to free up memory.
One change further to what was introduced in 1bec9b0bda is
to start considering userptr objects in oom but that should
also be a correct thing to do.
v2: Introduce I915_GEM_OBJECT_IS_SHRINKABLE. (Chris Wilson)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 1bec9b0bda ("drm/i915/shrinker: Only shmemfs objects are backed by swap")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://patchwork.freedesktop.org/patch/msgid/1478011450-6634-1-git-send-email-tvrtko.ursulin@linux.intel.com
The shrinker may appear to recurse into obj->mm.lock as the shrinker may
be called from a direct reclaim path whilst handling get_pages. We
filter out recursing on the same obj->mm.lock by inspecting
obj->mm.pages, but we do want to take the lock on a second object in
order to reap their pages. lockdep spots the recursion on the same
lockclass and needs annotation to avoid a false positive. To keep the
two paths distinct, create an enum to indicate which subclass of
obj->mm.lock we are using. This removes the false positive and avoids
masking real bugs.
Suggested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161101121134.27504-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
With full-ppgtt one of the main bottlenecks is the lookup of the VMA
underneath the object. For execbuf there is merit in having a very fast
direct lookup of ctx:handle to the vma using a hashtree, but that still
leaves a large number of other lookups. One way to speed up the lookup
would be to use a rhashtable, but that requires extra allocations and
may exhibit poor worse case behaviour. An alternative is to use an
embedded rbtree, i.e. no extra allocations and deterministic behaviour,
but at the slight cost of O(lgN) lookups (instead of O(1) for
rhashtable). The major of such tree will be very shallow and so not much
slower, and still scales much, much better than the current unsorted
list.
v2: Bump vma_compare() to return a long, as we return the result of
comparing two pointers.
References: https://bugs.freedesktop.org/show_bug.cgi?id=87726
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161101115400.15647-1-chris@chris-wilson.co.uk
If we have a tiled object and an unknown CPU swizzle pattern, we pin the
pages to prevent the object from being swapped out (and us corrupting
the contents as we do not know the access pattern and so cannot convert
it to linear and back to tiled on reuse). This requires us to remember
to drop the extra pinning when freeing the object, or else we trigger
warnings about the pin leak. In commit fbbd37b36f ("drm/i915: Move
object release to a freelist + worker"), the object free path was
deferred to a worker, but the unpinning of the quirk, along with marking
the object as reclaimable, was left on the immediate path (so that if
required we could reclaim the pages under memory pressure as early as
possible). However, this split introduced a bug where the pages were no
longer being unpinned if they were marked as unneeded.
[ 231.800401] WARNING: CPU: 1 PID: 90 at drivers/gpu/drm/i915/i915_gem.c:4275 __i915_gem_free_objects+0x326/0x3c0 [i915]
[ 231.800403] WARN_ON(i915_gem_object_has_pinned_pages(obj))
[ 231.800405] Modules linked in:
[ 231.800406] snd_hda_intel i915 snd_hda_codec_generic mei_me snd_hda_codec coretemp snd_hwdep mei lpc_ich snd_hda_core snd_pcm e1000e ptp pps_core [last unloaded: i915]
[ 231.800426] CPU: 1 PID: 90 Comm: kworker/1:4 Tainted: G U 4.9.0-rc2-CI-CI_DRM_1780+ #1
[ 231.800428] Hardware name: LENOVO 7465CTO/7465CTO, BIOS 6DET44WW (2.08 ) 04/22/2009
[ 231.800456] Workqueue: events __i915_gem_free_work [i915]
[ 231.800459] ffffc9000034fc80 ffffffff8142dd65 ffffc9000034fcd0 0000000000000000
[ 231.800465] ffffc9000034fcc0 ffffffff8107e4e6 000010b300000001 0000000000001000
[ 231.800469] ffff88011d3db740 ffff880130ef0000 0000000000000000 ffff880130ef5ea0
[ 231.800474] Call Trace:
[ 231.800479] [<ffffffff8142dd65>] dump_stack+0x67/0x92
[ 231.800484] [<ffffffff8107e4e6>] __warn+0xc6/0xe0
[ 231.800487] [<ffffffff8107e54a>] warn_slowpath_fmt+0x4a/0x50
[ 231.800491] [<ffffffff811d12ac>] ? kmem_cache_free+0x2dc/0x340
[ 231.800520] [<ffffffffa009ef36>] __i915_gem_free_objects+0x326/0x3c0 [i915]
[ 231.800548] [<ffffffffa009effe>] __i915_gem_free_work+0x2e/0x50 [i915]
[ 231.800552] [<ffffffff8109c27c>] process_one_work+0x1ec/0x6b0
[ 231.800555] [<ffffffff8109c1f6>] ? process_one_work+0x166/0x6b0
[ 231.800558] [<ffffffff8109c789>] worker_thread+0x49/0x490
[ 231.800561] [<ffffffff8109c740>] ? process_one_work+0x6b0/0x6b0
[ 231.800563] [<ffffffff8109c740>] ? process_one_work+0x6b0/0x6b0
[ 231.800566] [<ffffffff810a2aab>] kthread+0xeb/0x110
[ 231.800569] [<ffffffff810a29c0>] ? kthread_park+0x60/0x60
[ 231.800573] [<ffffffff818164a7>] ret_from_fork+0x27/0x40
Moving to a separate flag for tracking the quirked pin is overkill for
the bug (since we only have to interchange the two tests in
i915_gem_free_object) but it does reduce a complicated test on all
objects and provide a sanitycheck for uncommon code paths.
Fixes: fbbd37b36f ("drm/i915: Move object release to a freelist + worker")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161101100317.11129-2-chris@chris-wilson.co.uk
With the infrastructure converted over to tracking multiple timelines in
the GEM API whilst preserving the efficiency of using a single execution
timeline internally, we can now assign a separate timeline to every
context with full-ppgtt.
v2: Add a comment to indicate the xfer between timelines upon submission.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-35-chris@chris-wilson.co.uk
A restriction on our global seqno is that they cannot wrap, and that we
cannot use the value 0. This allows us to detect when a request has not
yet been submitted, its global seqno is still 0, and ensures that
hardware semaphores are monotonic as required by older hardware. To
meet these restrictions when we defer the assignment of the global
seqno, we must check that we have an available slot in the global seqno
space during request construction. If that test fails, we wait for all
requests to be completed and reset the hardware back to 0.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-33-chris@chris-wilson.co.uk
Though we will have multiple timelines, we still have a single timeline
of execution. This we can use to provide an execution and retirement order
of requests. This keeps tracking execution of requests simple, and vital
for preserving a single waiter (i.e. so that we can order the waiters so
that only the earliest to wakeup need be woken). To accomplish this we
distinguish the seqno used to order requests per-context (external) and
that used internally for execution.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-26-chris@chris-wilson.co.uk
Before suspend, we wait for the switch to the kernel context. In order
for all the other context images to be complete upon suspend, that
switch must be the last operation by the GPU (i.e. this idling request
must not overtake any pending requests). To make this request execute last,
we make it depend on every other inflight request.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-24-chris@chris-wilson.co.uk
Our timelines are more than just a seqno. They also provide an ordered
list of requests to be executed. Due to the restriction of handling
individual address spaces, we are limited to a timeline per address
space but we use a fence context per engine within.
Our first step to introducing independent timelines per context (i.e. to
allow each context to have a queue of requests to execute that have a
defined set of dependencies on other requests) is to provide a timeline
abstraction for the global execution queue.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-23-chris@chris-wilson.co.uk
In preparation to support many distinct timelines, we need to expand the
activity tracking on the GEM object to handle more than just a request
per engine. We already use the struct reservation_object on the dma-buf
to handle many fence contexts, so integrating that into the GEM object
itself is the preferred solution. (For example, we can now share the same
reservation_object between every consumer/producer using this buffer and
skip the manual import/export via dma-buf.)
v2: Reimplement busy-ioctl (by walking the reservation object), postpone
the ABI change for another day. Similarly use the reservation object to
find the last_write request (if active and from i915) for choosing
display CS flips.
Caveats:
* busy-ioctl: busy-ioctl only reports on the native fences, it will not
warn of stalls (in set-domain-ioctl, pread/pwrite etc) if the object is
being rendered to by external fences. It also will not report the same
busy state as wait-ioctl (or polling on the dma-buf) in the same
circumstances. On the plus side, it does retain reporting of which
*i915* engines are engaged with this object.
* non-blocking atomic modesets take a step backwards as the wait for
render completion blocks the ioctl. This is fixed in a subsequent
patch to use a fence instead for awaiting on the rendering, see
"drm/i915: Restore nonblocking awaits for modesetting"
* dynamic array manipulation for shared-fences in reservation is slower
than the previous lockless static assignment (e.g. gem_exec_lut_handle
runtime on ivb goes from 42s to 66s), mainly due to atomic operations
(maintaining the fence refcounts).
* loss of object-level retirement callbacks, emulated by VMA retirement
tracking.
* minor loss of object-level last activity information from debugfs,
could be replaced with per-vma information if desired
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-21-chris@chris-wilson.co.uk
Having moved the locked phase of freeing an object to a separate worker,
we can now declare to the core that we only need the unlocked variant of
driver->gem_free_object, and can use the simple unreference internally.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-20-chris@chris-wilson.co.uk
We want to hide the latency of releasing objects and their backing
storage from the submission, so we move the actual free to a worker.
This allows us to switch to struct_mutex freeing of the object in the
next patch.
Furthermore, if we know that the object we are dereferencing remains valid
for the duration of our access, we can forgo the usual synchronisation
barriers and atomic reference counting. To ensure this we defer freeing
an object til after an RCU grace period, such that any lookup of the
object within an RCU read critical section will remain valid until
after we exit that critical section. We also employ this delay for
rate-limiting the serialisation on reallocation - we have to slow down
object creation in order to prevent resource starvation (in particular,
files).
v2: Return early in i915_gem_tiling() ioctl to skip over superfluous
work on error.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-19-chris@chris-wilson.co.uk
We only need struct_mutex within pwrite for a brief window where we need
to serialise with rendering and control our cache domains. Elsewhere we
can rely on the backing storage being pinned, and forgive userspace any
races against us.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-17-chris@chris-wilson.co.uk
We only need struct_mutex within pread for a brief window where we need
to serialise with rendering and control our cache domains. Elsewhere we
can rely on the backing storage being pinned, and forgive userspace any
races against us.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-16-chris@chris-wilson.co.uk
Break the allocation of the backing storage away from struct_mutex into
a per-object lock. This allows parallel page allocation, provided we can
do so outside of struct_mutex (i.e. set-domain-ioctl, pwrite, GTT
fault), i.e. before execbuf! The increased cost of the atomic counters
are hidden behind i915_vma_pin() for the typical case of execbuf, i.e.
as the object is typically bound between execbufs, the page_pin_count is
static. The cost will be felt around set-domain and pwrite, but offset
by the improvement from reduced struct_mutex contention.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-14-chris@chris-wilson.co.uk
The plan is to move obj->pages out from under the struct_mutex into its
own per-object lock. We need to prune any assumption of the struct_mutex
from the get_pages/put_pages backends, and to make it easier we pass
around the sg_table to operate on rather than indirectly via the obj.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-13-chris@chris-wilson.co.uk
The plan is to make obtaining the backing storage for the object avoid
struct_mutex (i.e. use its own locking). The first step is to update the
API so that normal users only call pin/unpin whilst working on the
backing storage.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-12-chris@chris-wilson.co.uk
A while ago we switched from a contiguous array of pages into an sglist,
for that was both more convenient for mapping to hardware and avoided
the requirement for a vmalloc array of pages on every object. However,
certain GEM API calls (like pwrite, pread as well as performing
relocations) do desire access to individual struct pages. A quick hack
was to introduce a cache of the last access such that finding the
following page was quick - this works so long as the caller desired
sequential access. Walking backwards, or multiple callers, still hits a
slow linear search for each page. One solution is to store each
successful lookup in a radix tree.
v2: Rewrite building the radixtree for clarity, hopefully.
v3: Rearrange execbuf to avoid calling i915_gem_object_get_sg() from
within an atomic section and so relax the allocation context to a simple
GFP_KERNEL and mutex.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-10-chris@chris-wilson.co.uk
We only need the active reference to keep the object alive after the
handle has been deleted (so as to prevent a synchronous gem_close). Why
then pay the price of a kref on every execbuf when we can insert that
final active ref just in time for the handle deletion?
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-6-chris@chris-wilson.co.uk
Our low-level wait routine has evolved from our generic wait interface
that handled unlocked, RPS boosting, waits with time tracking. If we
push our GEM fence tracking to use reservation_objects (required for
handling multiple timelines), we lose the ability to pass the required
information down to i915_wait_request(). However, if we push the extra
functionality from i915_wait_request() to the individual callsites
(i915_gem_object_wait_rendering and i915_gem_wait_ioctl) that make use
of those extras, we can both simplify our low level wait and prepare for
extending the GEM interface for use of reservation_objects.
v2: Rewrite i915_wait_request() kerneldocs
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-4-chris@chris-wilson.co.uk
The throttle-ioctl never touches the struct_mutex. It does, however, as
part of its ABI report whether the hardware is terminally wedged. For
that purposes, it only has to report the current state and not incur the
cost of checking/waiting every invocation, as we do not have to wait for
a reset before waiting on a request to ensure completion (that is baked
into the wait request implementation).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-3-chris@chris-wilson.co.uk
We will need to wait on DMA completion (as signaled via struct fence)
before executing our i915_gem_request. Therefore we want to expose a
method for adding the await on the fence itself to the request.
v2: Add a comment detailing a failure to handle a signal-on-any
fence-array.
v3: Pretend that magic numbers don't exist.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-1-chris@chris-wilson.co.uk
Objects can have multiple VMAs used for display in which
case assertion that objects must not be pinned for display
more times than the current VMA is incorrect.
v2: Commit message update. (Chris Wilson)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 058d88c433 ("drm/i915: Track pinned VMA")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/1477413635-3876-1-git-send-email-tvrtko.ursulin@linux.intel.com
We do not need to set up a fence for the rotated view.
Display does not need it and no one can access it.
v2: Move code to __i915_vma_set_map_and_fenceable. (Chris Wilson)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 05a20d098d ("drm/i915: Move map-and-fenceable tracking to the VMA")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
As well as knowing when the error occurred, it is more interesting to me
to know how long after booting the error occurred, and for good measure
record the time since last hw initialisation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161025121602.1457-1-chris@chris-wilson.co.uk
Backmerge because Chris Wilson needs the very latest&greates of
Gustavo Padovan's sync_file work, specifically the refcounting changes
from:
commit 30cd85dd6e
Author: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Date: Wed Oct 19 15:48:32 2016 -0200
dma-buf/sync_file: hold reference to fence when creating sync_file
Also good to sync in general since git tends to get confused with the
cherry-picking going on.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
- first slice of the gvt device model (Zhenyu et al)
- compression support for gpu error states (Chris)
- sunset clause on gpu errors resulting in dmesg noise telling users
how to report them
- .rodata diet from Tvrtko
- switch over lots of macros to only take dev_priv (Tvrtko)
- underrun suppression for dp link training (Ville)
- lspcon (hmdi 2.0 on skl/bxt) support from Shashank Sharma, polish
from Jani
- gen9 wm fixes from Paulo&Lyude
- updated ddi programming for kbl (Rodrigo)
- respect alternate aux/ddc pins (from vbt) for all ddi ports (Ville)
* tag 'drm-intel-next-2016-10-24' of git://anongit.freedesktop.org/drm-intel: (227 commits)
drm/i915: Update DRIVER_DATE to 20161024
drm/i915: Stop setting SNB min-freq-table 0 on powersave setup
drm/i915/dp: add lane_count check in intel_dp_check_link_status
drm/i915: Fix whitespace issues
drm/i915: Clean up DDI DDC/AUX CH sanitation
drm/i915: Respect alternate_ddc_pin for all DDI ports
drm/i915: Respect alternate_aux_channel for all DDI ports
drm/i915/gen9: Remove WaEnableYV12BugFixInHalfSliceChicken7
drm/i915: KBL - Recommended buffer translation programming for DisplayPort
drm/i915: Move down skl/kbl ddi iboost and n_edp_entires fixup
drm/i915: Add a sunset clause to GPU hang logging
drm/i915: Stop reporting error details in dmesg as well as the error-state
drm/i915/gvt: do not ignore return value of create_scratch_page
drm/i915/gvt: fix spare warnings on odd constant _Bool cast
drm/i915/gvt: mark symbols static where possible
drm/i915/gvt: fix sparse warnings on different address spaces
drm/i915/gvt: properly access enabled intel_engine_cs
drm/i915/gvt: Remove defunct vmap_batch()
drm/i915/gvt: Use common mapping routines for shadow_bb object
drm/i915/gvt: Use common mapping routines for indirect_ctx object
...
At the moment, we have dependency on the RPM as a barrier itself in both
i915_gem_release_all_mmaps() and i915_gem_restore_fences().
i915_gem_restore_fences() is also called along !runtime pm paths, but we
can move the markup of lost fences alongside releasing the mmaps into a
common i915_gem_runtime_suspend(). This has the advantage of locating
all the tricky barrier dependencies into one location.
v2: Just mark the fence as invalid (fence->dirty) so that upon waking we
will be sure to clear the fence after use, or restore it to the correct
value before use. This makes sure that if the fence is left intact
across the sleep, we do not leave it pointing to a region of GTT for the
next unsuspecting user.
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Imre Deak <imre.deak@linux.intel.com>
Reviewed-by: Imre Deak <imre.deak@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161024124218.18252-5-chris@chris-wilson.co.uk
Now that we have reduced the access to the list to either (a) under the
struct_mutex whilst holding the RPM wakeref (so that concurrent writers to
the list are serialised by struct_mutex) and (b) under the atomic
runtime suspend (which cannot run concurrently with any other accessor due
to the atomic nature of the runtime suspend) we can remove the extra
locking around the list itself.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161024124218.18252-3-chris@chris-wilson.co.uk
We can remove the false coupling between RPM and struct mutex by the
observation that we can use the RPM wakeref as the barrier around user
mmap access. That is as we tear down the user's PTE atomically from
within rpm suspend and then to fault in new PTE requires the rpm
wakeref, means that no user access is possible through those PTE without
RPM being awake. Having made that observation, we can then remove the
presumption of having to take rpm outside of struct_mutex and so allow
fine grained acquisition of a wakeref around hw access rather than
having to remember to acquire the wakeref early on.
v2: Rejig placement of the new intel_runtime_pm_get() to be as tight
as possible around the GTT pread/pwrite.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/20161024124218.18252-2-chris@chris-wilson.co.uk