Pull fuse fix from Miklos Szeredi:
"This fixes a deadlock when fuse, direct I/O and loop device are
combined"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: direct-io: don't dirty ITER_BVEC pages
Pull overlayfs fix from Miklos Szeredi:
"This fixes a regression caused by the last pull request"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix workdir creation
Pull btrfs fixes from Chris Mason:
"I'm not proud of how long it took me to track down that one liner in
btrfs_sync_log(), but the good news is the patches I was trying to
blame for these problems were actually fine (sorry Filipe)"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
btrfs: do not decrease bytes_may_use when replaying extents
We've got quite a few fixes at this time, and all are stable patches.
syzkaller strikes back again (episode 19 or so), and we had to plug
some holes in ALSA core part (mostly timer). In addition, a couple of
FireWire audio fixes for the invalid copy user calls in locks, and a
few quirks for HD-audio and USB-audio as usual are included.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIrBAABCAAVBQJX0u3IDhx0aXdhaUBzdXNlLmRlAAoJEGwxgFQ9KSmk8iYP/1cy
t+Ap/5yKOGtdAoQDmB7yzXOh3Z1gbrjpSFx+/kOJnP/muHtJVcHGJkJpyMjshY0J
6qdQi6pq8HJakJWf9alZOcW260RYuXng9JwSt1xlo2z4zFjORNHyLeBA+FTwKQpl
g1yyWoAyVHZC02alLe8YpNPQl1W06cHu4Tg3iTeZcMPwUEZ9QknUVCEj26Od61qR
M2YK193F0Q32wsI26nyW/fOytE0avWc99ob0kJVC0GadduwkYODuiBDNtkEtur/h
lvNPq7tOQVFOTTGdnMMTl1QhK8UDI3IxGnVAXWI3wqRMQolJtmimoJBHq4OXroK/
Nz+XOO4UAn+Tha7Lq+q8H8rjopB+EkaLfU+zq8+GourOh/aaVOL2EkClFA75qrxt
w9F1HUMFICEgsn4vyaYxQx6hzL8YVEKXu4hCNpW36XxIc5RAnzEQmZvOp0xLLqfY
OiDAnw1ciMKbtTNfgEOCubjN8owZBqrhV5b2ZGXSmpc2cVeKOn9p25FMVGuzqqMA
cWMd/TvwU0yna/DxSRXCdRVKDX5RhcfP1qJD4R2YgzFWQHH6jzg8jy0nQ6D2TfIo
6leGUdIiDbQUO/lEezgew3BPB0fg6pzyFGisaU+v8p/Zg8C08Q2w0Pmv8pPT1BJt
P1nEjhKWALhXrOBgvIPjpjRA4oXKkJBD92FS0eX+
=Jobo
-----END PGP SIGNATURE-----
Merge tag 'sound-4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"We've got quite a few fixes at this time, and all are stable patches.
syzkaller strikes back again (episode 19 or so), and we had to plug
some holes in ALSA core part (mostly timer).
In addition, a couple of FireWire audio fixes for the invalid copy
user calls in locks, and a few quirks for HD-audio and USB-audio as
usual are included"
* tag 'sound-4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: rawmidi: Fix possible deadlock with virmidi registration
ALSA: timer: Fix zero-division by continue of uninitialized instance
ALSA: timer: fix NULL pointer dereference in read()/ioctl() race
ALSA: fireworks: accessing to user space outside spinlock
ALSA: firewire-tascam: accessing to user space outside spinlock
ALSA: hda - Enable subwoofer on Dell Inspiron 7559
ALSA: hda - Add headset mic quirk for Dell Inspiron 5468
ALSA: usb-audio: Add sample rate inquiry quirk for B850V3 CP2114
ALSA: timer: fix NULL pointer dereference on memory allocation failure
ALSA: timer: fix division by zero after SNDRV_TIMER_IOCTL_CONTINUE
virtio_console uses a small DMA buffer for control requests. Move
that buffer into heap memory.
Doing virtio DMA on the stack is normally okay on non-DMA-API virtio
systems (which is currently most of them), but it breaks completely
if the stack is virtually mapped.
Tested by typing both directions using picocom aimed at /dev/hvc0.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
We get 1 warning when building kernel with W=1:
drivers/virtio/virtio_ring.c:170:16: warning: no previous prototype for 'vring_dma_dev' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is
declared and don't need a declaration, but can be made static.
so this patch marks this function with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
generic definition to smp_wmb() is not sufficient
- avoid a recursive loop with the graph tracer by using using
preempt_(enable|disable)_notrace in _percpu_(read|write)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX0vFPAAoJEGvWsS0AyF7x1jsP/0PuYxMPfkBBrL6dKfcwsBXR
9eKhIui8gwkjtpvpeDMSia8+5V+UvFYhB/oaPS5tbSaP1DzS/1Uhmr8Nd17fEBZH
nJZfodCRISSy+kOIiEIMAb00pXrkJpgn4Fy6Yf5jgKjrUvLYkLaeNNR9WQ1Nrwbl
tOmK+974fTtRN+46Ht3vS4hX6kqSDr7Ze8ezFLZc5Nt1tS3R7D0gfVnE3DBQ+x9y
gg+01WvGa2K/Iz0ur7zZo5eLVV7L+JfFvkBQCNxfFMkSt8SzOR+i6bPzqTGjy9Ge
rW1l4GOCXw/Di2PfPvQ0+BpJx/HY4vjmEdItdKm1ZOySPqJkl08wOF2NFse2ZYka
cUC+vlZE7zDjoHVOVYgjYTJtXvBZnwakIZ9Oxhsjw3h93hGPOXrJF5y6uYLiMCJa
5yASzwX617hFBxO8VrIeqjVw7z2BLHLMiPE6uhiQ8TJiiwHXMxJzaVnBaGL0xidA
qPzn7DhAHUsEb3bxYixu5uA/uyvJ7l33iRjOSObTUWxtBxMEvgajnxh1rkMl7lfY
izCvwjtIU/1mAae3MA4qddzrxuPwp3GtmXC3Tb/rxzD4fgvkvqrGHrXyufeLYUWC
z40kuGyp5ZMxTArPbno6TurseXeIWJQOW9/ya28FLzNbHC7RenIHBl7bSyoDYnFu
RGCmr4g2q3hKG5pjJ+1P
=QLnH
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- smp_mb__before_spinlock() changed to smp_mb() on arm64 since the
generic definition to smp_wmb() is not sufficient
- avoid a recursive loop with the graph tracer by using using
preempt_(enable|disable)_notrace in _percpu_(read|write)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: use preempt_disable_notrace in _percpu_read/write
arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb()
- Don't alias user region to other regions below PAGE_OFFSET from Paul Mackerras
- Fix again csum_partial_copy_generic() on 32-bit from Christophe Leroy
- Fix corrupted PE allocation bitmap on releasing PE from Gavin Shan
Fixes for code merged this cycle:
- Fix crash on releasing compound PE from Gavin Shan
- Fix processor numbers in OPAL ICP from Benjamin Herrenschmidt
- Fix little endian build with CONFIG_KEXEC=n from Thiago Jung Bauermann
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX0qKlAAoJEFHr6jzI4aWAu30QAK88plLxJ2z5Lyf7axdHf7P0
NfM6xQLyuUQ1xP3ksUB4/4eps3jINJ0bZRd4mMJ3fO/YgNsBACayB9GeE478tMbp
1KmK94qpLk1u4BUxXNtsWJLEuzVqAPr2cGh6jddmkPXGCUx1MFatEVNVJupX+Vt9
sJsmhLatUucZEQI6r4sK5wDOdLYIQgcgTIWW5qHH7jyJDKLGyJbNPtmQhbMWU0a5
zBwD+paecJSGTJEVjd3UwBic+oXt8chwiZkaHLu4Rh6JQ0yVRL4If4EYCodHIpDR
H7b0P9De9W6a+IWLjVDMhYKq9rBjjgZwcjMplkO7gBE2P+v/NGzbfORJtNXeOgKE
/RSWufpTbpiGyUzP1Lr/j0O59ZoijRGBK8zuha5FtsTlhl909ifc6KuHO5aqHY9r
I5o7ws+hSBM1u9cf0Bl011P4uToYzy1auMsZsjDW2SdDEFtJ+WK+0I2vp+M9Jv73
/F48n/EWUuul5oS2Uar+V2AUADpnYPRi50OR1zVJxdJSM8bZFue4brBFfx1bI/2/
jmK87hxNwJtYT45KiuEXr2FWMiB1iNHHxL/OEwWbitf2MfRjq8+LHbdt9FxOSj3/
+8cw3f1zyEjNsvH380HhkUBZknKmD7z8V5Ko5Dx5h8cuRlL+QEW2GnW+1NN7VMoQ
T7QTHRR4ziHSKdzAIlTe
=q0Jo
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.8-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fixes marked for stable:
- Don't alias user region to other regions below PAGE_OFFSET from
Paul Mackerras
- Fix again csum_partial_copy_generic() on 32-bit from Christophe
Leroy
- Fix corrupted PE allocation bitmap on releasing PE from Gavin Shan
Fixes for code merged this cycle:
- Fix crash on releasing compound PE from Gavin Shan
- Fix processor numbers in OPAL ICP from Benjamin Herrenschmidt
- Fix little endian build with CONFIG_KEXEC=n from Thiago Jung
Bauermann"
* tag 'powerpc-4.8-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
powerpc/32: Fix again csum_partial_copy_generic()
powerpc/powernv: Fix corrupted PE allocation bitmap on releasing PE
powerpc/powernv: Fix crash on releasing compound PE
powerpc/xics/opal: Fix processor numbers in OPAL ICP
powerpc/pseries: Fix little endian build with CONFIG_KEXEC=n
Pull ARM fixes from Russell King:
"A few ARM fixes:
- Robin Murphy noticed that the non-secure privileged entry was
relying on undefined behaviour, which needed to be fixed.
- Vladimir Murzin noticed that prov-v7 fails to build for MMUless
configurations because a required header file wasn't included.
- A bunch of fixes for StrongARM regressions found while testing
4.8-rc on such platforms"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: sa1100: clear reset status prior to reboot
ARM: 8600/1: Enforce some NS-SVC initialisation
ARM: 8599/1: mm: pull asm/memory.h explicitly
ARM: sa1100: register clocks early
ARM: sa1100: fix 3.6864MHz clock
Unfortunately we have a bogus dwc3 patch leaked through the cracks and
got merged into Linus' HEAD. That patch ended up causing off-by-1 error
in our TRB accounting logic. Thankfully John Youn found out the problem
and we provided a revert to the bogus dwc3 patch in no time.
Apart from this off-by-1 error, we have two fixes to the Renesas drivers,
a small fix to our generic phy driver, a NULL pointer dereference fix for
f_eem and a build warning fix in dwc3.
-----BEGIN PGP SIGNATURE-----
iQI6BAABCAAkBQJX0q5rHRxmZWxpcGUuYmFsYmlAbGludXguaW50ZWwuY29tAAoJ
EMy+uJnhGpkGH6wQAMDzaBu5uIieW8b/GmYw6Ujx7omXvPWUHPEE1AQbqjjT92dE
jR5AoVafirIlEFNesfKFPnNV/egK0tkYk7QxIcjh+4K6VCTwUCsjwyy9nocfVWT3
34aR9B4dTWCGqaUGE+p8qNgd/Zvsnb4ISW6w14k0iQdFpXrLkSmPpZBK+22a6Brl
RRRbqV+nMEPpjK6zcpuio39X5njk9yLi7Nru0zkbKr9mS9uTPs4F2VELclxApIjP
WxSrXoXl2g2FkVEVpRjtzzye5/Ie4GLLXiirLEjM9EM8txIS4lLOW5y9QvFAeWzY
m8Gmaej64Os53zMfHQgHJTGoxGhA2msKngzeqowAnwzdCHcMo1zELzOc0rWcsFMZ
85C/WEngMZEHdJXX5mcQf875x1WDfYJgGTM0OFRAEIoi8KFgpY3hUJmMPTyLlzvs
p+nlWYscGIbca9pP47fs1GQb2w2DrDacBgsKZco33tfSPL+8p90O6ywfzrsZo0oL
QPxVhrU4v+HXl8k4MSGmoEIrhupwxJWreHkS9YXPnTKJ/YDUQCG5OqIrcAod9Yur
eBBo7fQ5R3p5QbRsuKmieZGptvWmEsyrx3hG1DEc53Q4145uPzvMaZSYe5zUzHjS
IvoYwluyWFGSruLsny+WHZp0AhGvVAAQOI15f3P4B5DDC+0qdwVGWGQQK449
=UcMZ
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:
usb: fixes for v4.8-rc6
Unfortunately we have a bogus dwc3 patch leaked through the cracks and
got merged into Linus' HEAD. That patch ended up causing off-by-1 error
in our TRB accounting logic. Thankfully John Youn found out the problem
and we provided a revert to the bogus dwc3 patch in no time.
Apart from this off-by-1 error, we have two fixes to the Renesas drivers,
a small fix to our generic phy driver, a NULL pointer dereference fix for
f_eem and a build warning fix in dwc3.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJX0oXaAAoJEEhZKYFQ1nG7298H/RKxcA44iet+zvv8z8EX2yZt
/H26bzu1/5S6GilXmI92KdEdpNgT1bBwp1ZqrGmMh9Zoi13X1aJJEnQGiyn9kxwK
VHNb8UZ7OT2M6Sqf29zlrNfK6zxh3ou8UMK7Gv3YHjL1s0nFtkCsED4vdbisIMnj
UuHzkT5C9l/mnJPhM9G44vXOwJFEZhWdfL5D+KSIi5JlMuBj0z9FZkGiSnchqrzb
IbQvoPWjt+eUCkktTY8yEIU8RThqzVRJrqjeT3TBt3wYZHfCqvVnmCc81QpxvBub
ZIJ1vhPd8zjP/FO6zMD//Rt/B4aTTodqWPtU5EaLwSkYASDBBYJunpyzZWZ+B7s=
=TwAO
-----END PGP SIGNATURE-----
Merge tag 'usb-ci-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-linus
Peter writes:
Fix the possible kernel panic when the hardware signal is bad for chipidea udc.
We have a big rework of the kxsd9 driver queued up behind the fix below and
a fix for a recent fix that was marked for stable.
Hence this fix series is perhaps a little more urgent than average for IIO.
* core
- a fix for a fix in the last set. The recent fix for blocking ops when
! task running left a path (unlikely one) in which the function return
value was not set - so initialise it to 0.
- The IIO_TYPE_FRACTIONAL code previously didn't cope with negative
fractions. Turned out a fix for this was in Analog's tree but hadn't made
it upstream.
* bmc150
- reset chip at init time. At least one board out there ends up coming up
in an unstable state due to noise during power up. The reset does no
harm on other boards.
* kxsd9
- Fix a bug in the reported scaling due to failing to set the integer
part to 0.
* hid-sensors-pressure
- Output was in the wrong units to comply with the IIO ABI.
* tools
- iio_generic_buffer: Fix the trigger-less mode by ensuring we don't fault
out for having no trigger when we explicitly said we didn't want to have
one.
-----BEGIN PGP SIGNATURE-----
iQIuBAABCAAYBQJX0GOMERxqaWMyM0BrZXJuZWwub3JnAAoJEFSFNJnE9BaI2rwQ
AIA4M6uKqsdLLdmjT3bB1mGuXedGzwrvSObEvTGKEi/LDFreKsnqmjAlltuo7kwl
RxBXOOwb9Is3HccGEsFFWBNaMEKaW0ZoH9n0dn7VdlGThloKxE57S386GUGKT0Xq
tKNHbw7ITO+UIC1CvavmFsNk5Qe7AMiSNTnlBizcVrfdtbX91aqRe4X3fP8X1kFv
zcE/sbip9tjKwrXTsZ5/D4FCAtKjkamouv3tZHsp/qwhUQ9AJwklw0pySVgi+Ou1
3xzFwVATrH5bMK0JGwKts7zaMRvrbaf4CRdHhSLUXqmSYCYiN7THFnb/xD1nbFAy
AMsHDeXDs30cQD+Wiks9oNqf37EkcMBEsYpoFvoFE3xcVdX6NTs9ZRpGe+xizB1u
yI+Howt9sb4CcdPBtpPb3ffKK9X2ccPT+q1J8OpNlwZH/yta4wogFt8eKZlvdPtc
pHstwHUvv+s5EXYI8266DIkd9ErRI2+13v08rBwex1C8S0Xt5qQFOe/k7a5wzn2O
/DNUQ/mfOXJinxbt1aBakn2PFoMjyNWEXua7haW3pbr+fpUsZTN/MT57p2W3E4Ak
PfZxvHARW0MY3ShKpRmwzhrzmddGWZYb07xOtcoktHt2esV9MLUQLCUlDPaLk2H5
gUzvzEDQQJpyj+SmxAyEejeoZWMWIIW4dfoxBCU1iSF+
=Ob6a
-----END PGP SIGNATURE-----
Merge tag 'iio-fixes-for-4.8b' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
Second set of IIO fixes for the 4.8 cycle.
We have a big rework of the kxsd9 driver queued up behind the fix below and
a fix for a recent fix that was marked for stable.
Hence this fix series is perhaps a little more urgent than average for IIO.
* core
- a fix for a fix in the last set. The recent fix for blocking ops when
! task running left a path (unlikely one) in which the function return
value was not set - so initialise it to 0.
- The IIO_TYPE_FRACTIONAL code previously didn't cope with negative
fractions. Turned out a fix for this was in Analog's tree but hadn't made
it upstream.
* bmc150
- reset chip at init time. At least one board out there ends up coming up
in an unstable state due to noise during power up. The reset does no
harm on other boards.
* kxsd9
- Fix a bug in the reported scaling due to failing to set the integer
part to 0.
* hid-sensors-pressure
- Output was in the wrong units to comply with the IIO ABI.
* tools
- iio_generic_buffer: Fix the trigger-less mode by ensuring we don't fault
out for having no trigger when we explicitly said we didn't want to have
one.
When debug preempt or preempt tracer is enabled, preempt_count_add/sub()
can be traced by function and function graph tracing, and
preempt_disable/enable() would call preempt_count_add/sub(), so in Ftrace
subsystem we should use preempt_disable/enable_notrace instead.
In the commit 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap
like events do") the function this_cpu_read() was added to
trace_graph_entry(), and if this_cpu_read() calls preempt_disable(), graph
tracer will go into a recursive loop, even if the tracing_on is
disabled.
So this patch change to use preempt_enable/disable_notrace instead in
this_cpu_read().
Since Yonghui Yang helped a lot to find the root cause of this problem,
so also add his SOB.
Signed-off-by: Yonghui Yang <mark.yang@spreadtrum.com>
Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
smp_mb__before_spinlock() is intended to upgrade a spin_lock() operation
to a full barrier, such that prior stores are ordered with respect to
loads and stores occuring inside the critical section.
Unfortunately, the core code defines the barrier as smp_wmb(), which
is insufficient to provide the required ordering guarantees when used in
conjunction with our load-acquire-based spinlock implementation.
This patch overrides the arm64 definition of smp_mb__before_spinlock()
to map to a full smp_mb().
Cc: <stable@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Problems with the signal integrity of the high speed USB data lines or
noise on reference ground lines can cause the i.MX6 USB controller to
violate USB specs and exhibit unexpected behavior.
It was observed that USBi_UI interrupts were triggered first and when
isr_setup_status_phase was called, ci->status was NULL, which lead to a
NULL pointer dereference kernel panic.
This patch fixes the kernel panic, emits a warning once and returns
-EPIPE to halt the device and let the host get stalled.
It also adds a comment to point people, who are experiencing this issue,
to their USB hardware design.
Cc: <stable@vger.kernel.org> #4.1+
Signed-off-by: Clemens Gruber <clemens.gruber@pqgruber.com>
Signed-off-by: Peter Chen <peter.chen@nxp.com>
In commit f02db315b8 ("ipv4: IP_TOS and IP_TTL can be specified as
ancillary data") Francesco added IP_TOS values specified as integer.
However, kernel sends to userspace (at recvmsg() time) an IP_TOS value
in a single byte, when IP_RECVTOS is set on the socket.
It can be very useful to reflect all ancillary options as given by the
kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with
EINVAL after Francesco patch.
So this patch extends IP_TOS ancillary to accept an u8, so that an UDP
server can simply reuse same ancillary block without having to mangle
it.
Jesper can then augment
https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
to add TOS reflection ;)
Fixes: f02db315b8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LLVM can generate code that tests for direct packet access via
skb->data/data_end in a way that currently gets rejected by the
verifier, example:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)
The reason why this gets rejected despite a proper test is that we
currently call find_good_pkt_pointers() only in case where we detect
tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
derived, for example, from a register of type pkt(id=Y,off=0,r=0)
pointing to skb->data. find_good_pkt_pointers() then fills the range
in the current branch to pkt(id=Y,off=0,r=Z) on success.
For above case, we need to extend that to recognize pkt_end >= rX
pattern and mark the other branch that is taken on success with the
appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
only two practical options to test for from what LLVM could have
generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
that we would need to take into account as well.
After the fix:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
29: (bf) r1 = r8
30: (25) if r8 > 0x3c goto pc+47
R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
R9=pkt(id=0,off=0,r=54) R10=fp
31: (b7) r1 = 1
[...]
Verifier test cases are also added in this work, one that demonstrates
the mentioned example here and one that tries a bad packet access for
the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
with both test variants (>, >=).
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski says:
====================
nfp: fixes and trivial cleanup
First patch drops unnecessary version.h includes. Second one
drops support for pre-release versions of FW ABI. Removing
FW ABI 0.0 from supported set is particularly good since 0
could just be uninitialized memory. Last but not least I drop
unnecessary padding of frames on RX which makes us count bytes
incorrectly for the VF2VF traffic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to pad frames to ETH_ZLEN on RX.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Be more strict about FW versions. Drop support for old
transitional revisions which were never used in production.
Dropping support for FW ABI version 0.0.0.0 is particularly
useful because 0 could just be uninitialized memory.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unnecessary version.h includes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 76174004a0
(tcp: do not slow start when cwnd equals ssthresh )
introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual
ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP
connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth
~100KB/s.
That is caused by stalled cwnd when cwnd equals ssthresh. This patch
fixes it by proper increasing cwnd in this case.
Signed-off-by: Artem Germanov <agermanov@anchorfree.com>
Acked-by: Dmitry Adamushko <d.adamushko@anchorfree.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Garver says:
====================
openvswitch: add 802.1ad support
This series adds 802.1ad support to openvswitch. It is a continuation of the
work originally started by Thomas F Herbert - hence the large rev number.
The extra VLAN is implemented by using an additional level of the
OVS_KEY_ATTR_ENCAP netlink attribute.
In OVS flow speak, this looks like
eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200),
encap(eth_type(0x0800), ...))
The userspace counterpart has also seen recent activity on the ovs-dev mailing
lists. There are some new 802.1ad OVS tests being added - also on the ovs-dev
list. This patch series has been tested using the most recent version of
userspace (v3) and tests (v2).
v22 changes:
- merge patch 4 into patch 3
- fix checkpatch.pl errors
- Still some 80 char warnings for long string literals
- refresh pointer after pskb_may_pull()
- refactor vlan nlattr parsing to remove some double checks
- introduce ovs_nla_put_vlan()
- move triple VLAN check to after ethertype serialization
- WARN_ON_ONCE() on triple VLAN and unexpected encap values
v21 changes:
- Fix (and simplify) netlink attribute parsing
- re-add handling of truncated VLAN tags
- fix if/else dangling assignment in {push,pop}_vlan()
- simplify parse_vlan()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for 802.1ad including the ability to push and pop double
tagged vlans. Add support for 802.1ad to netlink parsing and flow
conversion. Uses double nested encap attributes to represent double
tagged vlan. Inner TPID encoded along with ctci in nested attributes.
This is based on Thomas F Herbert's original v20 patch. I made some
small clean ups and bug fixes.
Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to simplify using double tagged vlans. This function allows all
valid vlan ethertypes to be checked in a single function call.
Also replace some instances that check for both ETH_P_8021Q and
ETH_P_8021AD.
Patch based on one originally by Thomas F Herbert.
Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
openvswitch: Add support for 8021.AD
Change the description of the VLAN tpid field.
Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
Mellanox 100G mlx5 fixes 2016-09-07
The following series contains bug fixes for the mlx5e driver.
from Gal,
- Static code checker cleanup (casting overflow)
- Fix global PFC counter statistics reading
- Fix HW LRO when vlan stripping is off
From Bodong,
- Deprecate old autoneg capability bit and use new one.
From Tariq,
- Fix xmit more counter race condition
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently vlan tagged packets were not parsed correctly
and assumed to be regular IPv4/IPv6 packets.
We should check for 802.1Q/802.1ad tags and update the lro header
accordingly.
This fixes the use case where LRO is on and rxvlan is off
(vlan stripping is off).
Fixes: e586b3b0ba ('net/mlx5: Ethernet Datapath files')
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when reading global PFC statistics we left the counter
iterator out of the equation and we ended up reading the same counter
over and over again.
Instead of reading the counter at index 0 on every iteration we now read
the counter at index (i).
Fixes: e989d5a532 ('net/mlx5e: Expose flow control counters to ethtool')
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On 64 bits architectures unsigned long is longer than u32,
casting to unsigned long will result in overflow.
We need to first allocate an unsigned long variable, then assign the
wanted value.
Fixes: 665bc53969 ('net/mlx5e: Use new ethtool get/set link ksettings API')
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous an_disable_cap position bit31 is deprecated to be use in driver
with newer firmware. New firmware will advertise the same capability
in bit29.
Old capability didn't allow setting more than one protocol for a
specific speed when autoneg is off, while newer firmware will allow
this and it is indicated in the new capability location.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the xmit_more counter before notifying the HW,
to prevent a possible use-after-free of the skb.
Fixes: c8cf78fe10 ("net/mlx5e: Add ethtool counter for TX xmit_more")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the capability for a process that has CAP_NET_ADMIN on
a socket to see the socket mark in socket dumps.
Commit a52e95abf7 ("net: diag: allow socket bytecode filters to
match socket marks") recently gave privileged processes the
ability to filter socket dumps based on mark. This patch is
complementary: it ensures that the mark is also passed to
userspace in the socket's netlink attributes. It is useful for
tools like ss which display information about sockets.
Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
we append an skb in socket receive queue, but we forget to call
sk_forced_mem_schedule().
Effect is that the socket has a negative sk->sk_forward_alloc as long as
the message is not read by the application.
Josh Hunt fixed a similar issue in commit d22e153718 ("tcp: fix tcp
fin memory accounting")
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MIPS based xilfpga platform uses this driver.
Enable it for MIPS
Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cpufreq-stats code can no longer be built as a module, so it now
appears with square brackets in menuconfig.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: 1aefc75b24 (cpufreq: stats: Make the stats code non-modular)
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Depending on a number of factors including:
- Which exact Rockchip SoC we're working with
- How deep we suspend
- Which i2c port we're on
We might lose the state of the i2c registers at suspend time.
Specifically we've found that on rk3399 the i2c ports that are not in
the PMU power domain lose their state with the current suspend depth
configured by ARM Tursted Firmware.
Note that there are very few actual i2c registers that aren't configured
per transfer anyway so all we actually need to re-configure are the
clock config registers. We'll just add a call to rk3x_i2c_adapt_div()
at resume time and be done with it.
NOTE: On rk3399 on ports whose power was lost, I put printouts in at
resume time. I saw things like:
before: con=0x00010300, div=0x00060006
after: con=0x00010200, div=0x00180025
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: David Wu <david.wu@rock-chips.com>
Tested-by: David Wu <david.wu@rock-chips.com>
[wsa: removed duplicate const]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
There are several ways to set the SDA hold time for i2c controller,
including: Device Tree, built-in device properties and ACPI. However,
if the SDA hold time is not specified by above method, we should
read the value, where it is preset by firmware, and save it to
sda_hold_time. This is needed because when i2c controller enters
runtime suspend, the DW_IC_SDA_HOLD value will be reset to chipset
default value. And during runtime resume, i2c_dw_init will be called
to reconfigure i2c controller. If sda_hold_time is zero, the chipset
default hold time will be used, that will be too short for some
platforms. Therefore, to have a better tolerance, the DW_IC_SDA_HOLD
value should be kept by sda_hold_time.
Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Steffen Klassert says:
====================
ipsec 2016-09-08
1) Fix a crash when xfrm_dump_sa returns an error.
From Vegard Nossum.
2) Remove some incorrect WARN() on normal error handling.
From Vegard Nossum.
3) Ignore socket policies when rebuilding hash tables,
socket policies are not inserted into the hash tables.
From Tobias Brunner.
4) Initialize and check tunnel pointers properly before
we use it. From Alexey Kodanev.
5) Fix l3mdev oif setting on xfrm dst lookups.
From David Ahern.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
ipsec-next 2016-09-08
1) Constify the xfrm_replay structures. From Julia Lawall
2) Protect xfrm state hash tables with rcu, lookups
can be done now without acquiring xfrm_state_lock.
From Florian Westphal.
3) Protect xfrm policy hash tables with rcu, lookups
can be done now without acquiring xfrm_policy_lock.
From Florian Westphal.
4) We don't need to have a garbage collector list per
namespace anymore, so use a global one instead.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
condition.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJX0WAXAAoJEEp/3jgCEfOL8bEIAIeVmaU+M/1HuPucc+xiUooy
IP+LNjhSc52neqDsN/eYjTn4zipu0ZAojPW9geGkPHOnKTKdRqkNLpyWaQcSv0AG
yOUiJxnfwZ2LuOD2L/EIzqSKWo1lrdhIP/tSyRk0cOhlXn8urKatEOQi1+8H3Jlr
gzD2PZzXL1jnpCiGPnuYLlh6wmk9Su+7+GCkNOwi1z+j+sM5AupZESPsh6Ze+bYk
xgii4e21cBQv4NXojLECnfNez3HjYJCqYAUZ55lzSneUkbTRdBoY+pye/FeFKIBe
9Wb5IRYzvih9Bcdpx+o5RZtCoKhq1XRngjjvv7cHYELQ0JVM4kFwEKrUHMoHhWU=
=VDDm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.8-rc6' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a 4.7 performance regression, caused by a typo in an if
condition"
* tag 'ceph-for-4.8-rc6' of git://github.com/ceph/ceph-client:
ceph: do not modify fi->frag in need_reset_readdir()
Pull dmi fix from Jean Delvare.
* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
dmi-id: don't free dev structure after calling device_register
This is a slightly larger batch of fixes that we've been sitting on a few
-rcs. Most of them are simple oneliners, but there are two sets that are
slightly larger and worth pointing out:
- A set of patches to OMAP to deal with hwmod for RTC on am33xx (beaglebone
SoC, among others). It's the only clock that ever has a valid offset of 0,
so a new flag needed introduction once this problem was discovered.
- A collection of CCI fixes for performance counters discovered once people
started using it on X-Gene CPUs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJX0OnNAAoJEIwa5zzehBx3C80P/3Adca+87uQPI57AWJL7TIU3
i9YAf1povBg/u5PHn2XiNj2gBeYtDL6vKTbca9zDX/ezSPaGFAp1nPsUG3m4yKLG
0FjkNRan+FvFoi195TxANrBeYE39/ZNWJCsPn01+2Q2EsYnlCn9wsFglxTQGObsN
MI2rgYVxabdJC8TO5WuGUdJEa9TynU65fdqdD+KR+DfA8yOyu9wZEmE1JuHprUCQ
H2MYYHHFZD5Y41/Y/SvdDVprCOd8lq8Syt7Vk1O6pQKQSGgojgGn36YJeBIapnOy
4y1/FFqoRe5FwSbt934q/4P/QpHwnGvIFi0edS+Reb6/3gUWRchQ8Dt1I1C/2QoJ
J+7KQKR4SBSiu6LnsP8gNOAqulpyGzwE2aHBVD1mpbPIynA7G5MfI1pEJw4nVvFQ
rAlBmuVzjaysuhKAdM32fd34dVwQljDiqh4qWtk0hbgjyPpdxT7QjIwSCR6vfbzo
toOKQ2eiIqF+syOAnXbpPSmtaURavxvBim4DEXQfVHUbFHVErZ5zHmwGWjDc5vLT
WixQ62hilHc08pcalY31RF3k21GwBXBVDf6clQ/9TDCErlCAlpjyN0/4ZrRpcYwz
NHpFdbBNLzW0VJ0gHtwb+Xz4qgcUneTxXDUTaU46MYRquML/uqXMFrkJUQEOq1h5
nPGe55ItupK8REaMCJv/
=8pPd
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"This is a slightly larger batch of fixes that we've been sitting on a
few -rcs. Most of them are simple oneliners, but there are two sets
that are slightly larger and worth pointing out:
- A set of patches to OMAP to deal with hwmod for RTC on am33xx
(beaglebone SoC, among others). It's the only clock that ever has
a valid offset of 0, so a new flag needed introduction once this
problem was discovered.
- A collection of CCI fixes for performance counters discovered once
people started using it on X-Gene CPUs"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (37 commits)
arm-cci: pmu: Fix typo in event name
Revert "ARM: tegra: fix erroneous address in dts"
ARM: dts: imx6qdl: Fix SPDIF regression
ARM: imx6: add missing BM_CLPCR_BYPASS_PMIC_READY setting for imx6sx
ARM: dts: imx7d-sdb: fix ti,x-plate-ohms property name
ARM: dts: kirkwood: Fix PCIe label on OpenRD
ARM: kirkwood: ib62x0: fix size of u-boot environment partition
bus: arm-ccn: make event groups reliable
bus: arm-ccn: fix hrtimer registration
bus: arm-ccn: fix PMU interrupt flags
ARM: tegra: Correct polarity for Tegra114 PMIC interrupt
MAINTAINERS: add tree entry for ARM/UniPhier architecture
ARM: sun5i: Fix typo in trip point temperature
MAINTAINERS: Switch to kernel.org account for Krzysztof Kozlowski
ARM: imx6ul: populates platform device at .init_machine
bus: arm-ccn: Add missing event attribute exclusions for host/guest
bus: arm-ccn: Correct required arguments for XP PMU events
bus: arm-ccn: Fix XP watchpoint settings bitmask
bus: arm-ccn: Do not attempt to configure XPs for cycle counter
bus: arm-ccn: Fix PMU handling of MN
...
Make it clear that adding slave support shall not disable master
functionality. We can have both, so we should.
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
We can't use a static property for all the changesets, so we now create
dynamic ones for each changeset.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Fixes: 50a5ba8769 ("i2c: mux: demux-pinctrl: add driver")
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Rewrite the data and ack handling code such that:
(1) Parsing of received ACK and ABORT packets and the distribution and the
filing of DATA packets happens entirely within the data_ready context
called from the UDP socket. This allows us to process and discard ACK
and ABORT packets much more quickly (they're no longer stashed on a
queue for a background thread to process).
(2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
keep track of the offset and length of the content of each packet in
the sk_buff metadata. This means we don't do any allocation in the
receive path.
(3) Jumbo DATA packet parsing is now done in data_ready context. Rather
than cloning the packet once for each subpacket and pulling/trimming
it, we file the packet multiple times with an annotation for each
indicating which subpacket is there. From that we can directly
calculate the offset and length.
(4) A call's receive queue can be accessed without taking locks (memory
barriers do have to be used, though).
(5) Incoming calls are set up from preallocated resources and immediately
made live. They can than have packets queued upon them and ACKs
generated. If insufficient resources exist, DATA packet #1 is given a
BUSY reply and other DATA packets are discarded).
(6) sk_buffs no longer take a ref on their parent call.
To make this work, the following changes are made:
(1) Each call's receive buffer is now a circular buffer of sk_buff
pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
between the call and the socket. This permits each sk_buff to be in
the buffer multiple times. The receive buffer is reused for the
transmit buffer.
(2) A circular buffer of annotations (rxtx_annotations) is kept parallel
to the data buffer. Transmission phase annotations indicate whether a
buffered packet has been ACK'd or not and whether it needs
retransmission.
Receive phase annotations indicate whether a slot holds a whole packet
or a jumbo subpacket and, if the latter, which subpacket. They also
note whether the packet has been decrypted in place.
(3) DATA packet window tracking is much simplified. Each phase has just
two numbers representing the window (rx_hard_ack/rx_top and
tx_hard_ack/tx_top).
The hard_ack number is the sequence number before base of the window,
representing the last packet the other side says it has consumed.
hard_ack starts from 0 and the first packet is sequence number 1.
The top number is the sequence number of the highest-numbered packet
residing in the buffer. Packets between hard_ack+1 and top are
soft-ACK'd to indicate they've been received, but not yet consumed.
Four macros, before(), before_eq(), after() and after_eq() are added
to compare sequence numbers within the window. This allows for the
top of the window to wrap when the hard-ack sequence number gets close
to the limit.
Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
to indicate when rx_top and tx_top point at the packets with the
LAST_PACKET bit set, indicating the end of the phase.
(4) Calls are queued on the socket 'receive queue' rather than packets.
This means that we don't need have to invent dummy packets to queue to
indicate abnormal/terminal states and we don't have to keep metadata
packets (such as ABORTs) around
(5) The offset and length of a (sub)packet's content are now passed to
the verify_packet security op. This is currently expected to decrypt
the packet in place and validate it.
However, there's now nowhere to store the revised offset and length of
the actual data within the decrypted blob (there may be a header and
padding to skip) because an sk_buff may represent multiple packets, so
a locate_data security op is added to retrieve these details from the
sk_buff content when needed.
(6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
individually secured and needs to be individually decrypted. The code
to do this is broken out into rxrpc_recvmsg_data() and shared with the
kernel API. It now iterates over the call's receive buffer rather
than walking the socket receive queue.
Additional changes:
(1) The timers are condensed to a single timer that is set for the soonest
of three timeouts (delayed ACK generation, DATA retransmission and
call lifespan).
(2) Transmission of ACK and ABORT packets is effected immediately from
process-context socket ops/kernel API calls that cause them instead of
them being punted off to a background work item. The data_ready
handler still has to defer to the background, though.
(3) A shutdown op is added to the AF_RXRPC socket so that the AFS
filesystem can shut down the socket and flush its own work items
before closing the socket to deal with any in-progress service calls.
Future additional changes that will need to be considered:
(1) Make sure that a call doesn't hog the front of the queue by receiving
data from the network as fast as userspace is consuming it to the
exclusion of other calls.
(2) Transmit delayed ACKs from within recvmsg() when we've consumed
sufficiently more packets to avoid the background work item needing to
run.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need. The idea
is to cut out the background thread usage as much as possible.
[Note that the preallocated structs are not actually used in this patch -
that will be done in a future patch.]
If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.
To this end:
(1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
maximum number each of the listen backlog size. The backlog size is
limited to a maxmimum of 32. Only this many of each can be in the
preallocation buffer.
(2) For userspace sockets, the preallocation is charged initially by
listen() and will be recharged by accepting or rejecting pending
new incoming calls.
(3) For kernel services {,re,dis}charging of the preallocation buffers is
handled manually. Two notifier callbacks have to be provided before
kernel_listen() is invoked:
(a) An indication that a new call has been instantiated. This can be
used to trigger background recharging.
(b) An indication that a call is being discarded. This is used when
the socket is being released.
A function, rxrpc_kernel_charge_accept() is called by the kernel
service to preallocate a single call. It should be passed the user ID
to be used for that call and a callback to associate the rxrpc call
with the kernel service's side of the ID.
(4) Discard the preallocation when the socket is closed.
(5) Temporarily bump the refcount on the call allocated in
rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
preallocation ref on service calls unconditionally. This will no
longer be necessary once the preallocation is used.
Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.
A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer. This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.
Signed-off-by: David Howells <dhowells@redhat.com>
Add two tracepoints:
(1) Record the RxRPC protocol header of packets retrieved from the UDP
socket by the data_ready handler.
(2) Record the outcome of the data_ready handler.
Signed-off-by: David Howells <dhowells@redhat.com>