Commit Graph

7214 Commits

Author SHA1 Message Date
Hans Verkuil 3d58ffe2aa V4L/DVB (5867): videodev2.h: add missing <sys/time.h> for userspace
When videodev2.h is included by an application, it needs to include
<sys/time.h> for the timeval struct.

Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
2007-07-20 17:35:52 -03:00
Bob Nelson 1474855d08 [CELL] oprofile: add support to OProfile for profiling CELL BE SPUs
From: Maynard Johnson <mpjohn@us.ibm.com>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling code.
Exports spu_set_profile_private_kref and spu_get_profile_private_kref which
are used by OProfile to store private profile information in spufs data
structures.

Also incorporated several fixes from other patches (rrn).  Check pointer
returned from kzalloc.  Eliminated unnecessary cast.  Better error
handling and cleanup in the related area.  64-bit unsigned long parameter
was being demoted to 32-bit unsigned int and eventually promoted back to
unsigned long.

Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
Signed-off-by: Bob Nelson <rrnelson@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
2007-07-20 21:42:24 +02:00
Arnd Bergmann 8e68e2f248 [CELL] spufs: extension of spu_create to support affinity definition
This patch adds support for additional flags at spu_create, which relate
to the establishment of affinity between contexts and contexts to memory.
A fourth, optional, parameter is supported. This parameter represents
an affinity neighbor of the context being created, and is used when defining
SPU-SPU affinity.
Affinity is represented as a doubly linked list of spu_contexts.

Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
2007-07-20 21:42:15 +02:00
Roland Dreier 1d4ec7b1d6 Fix ZERO_OR_NULL_PTR(ZERO_SIZE_PTR)
The comparison with ZERO_SIZE_PTR in ZERO_OR_NULL_PTR() needs to be <=
(not just <) so that ZERO_OR_NULL_PTR(ZERO_SIZE_PTR) is 1.
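
As a hedged illustration of why the comparison has to be inclusive, the check can be modelled in userspace C. This sketch assumes ZERO_SIZE_PTR is a small non-NULL sentinel address (16 here); the _OLD and _NEW macro names are made up for the comparison and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ZERO_SIZE_PTR is a small, non-NULL sentinel address. */
#define ZERO_SIZE_PTR ((void *)16)

/* Strict comparison: true for NULL, but false for ZERO_SIZE_PTR itself. */
#define ZERO_OR_NULL_PTR_OLD(x) \
	((uintptr_t)(x) < (uintptr_t)ZERO_SIZE_PTR)

/* Inclusive comparison: true for both NULL and ZERO_SIZE_PTR. */
#define ZERO_OR_NULL_PTR_NEW(x) \
	((uintptr_t)(x) <= (uintptr_t)ZERO_SIZE_PTR)

/* Hypothetical helper that exercises both variants. */
void zero_or_null_demo(void)
{
	assert(ZERO_OR_NULL_PTR_OLD(ZERO_SIZE_PTR) == 0);	/* the reported bug */
	assert(ZERO_OR_NULL_PTR_NEW(ZERO_SIZE_PTR) == 1);	/* after the fix */
	assert(ZERO_OR_NULL_PTR_NEW((void *)0) == 1);
}
```

With the strict form, a caller that receives ZERO_SIZE_PTR from a zero-length allocation would treat it as an ordinary pointer.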

Signed-off-by: Roland Dreier <rolandd@cisco.com>
[ Duh!  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 12:33:44 -07:00
Linus Torvalds efa7e8673c Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] Prevent people from directly including <asm/rwsem.h>.
  [IA64] remove time interpolator
  [IA64] Convert to generic timekeeping/clocksource
  [IA64] refresh some config files for 64K pagesize
  [IA64] Delete iosapic_free_rte()
  [IA64] fallocate system call
  [IA64] Enable percpu vector domain for IA64_DIG
  [IA64] Enable percpu vector domain for IA64_GENERIC
  [IA64] Support irq migration across domain
  [IA64] Add support for vector domain
  [IA64] Add mapping table between irq and vector
  [IA64] Check if irq is sharable
  [IA64] Fix invalid irq vector assumption for iosapic
  [IA64] Use dynamic irq for iosapic interrupts
  [IA64] Use per iosapic lock for indirect iosapic register access
  [IA64] Cleanup lock order in iosapic_register_intr
  [IA64] Remove duplicated members in iosapic_rte_info
  [IA64] Remove block structure for locking in iosapic.c
2007-07-20 12:02:20 -07:00
Tony Luck c36c282b88 Pull ia64-clocksource into release branch
2007-07-20 11:26:47 -07:00
Bob Picco 1f564ad6d4 [IA64] remove time interpolator
Remove time_interpolator code (This is generic code, but
only user was ia64.  It has been superseded by the
CONFIG_GENERIC_TIME code).

Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Peter Keilty <peter.keilty@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-07-20 11:23:02 -07:00
Tony Luck 0aa366f351 [IA64] Convert to generic timekeeping/clocksource
This is a merge of Peter Keilty's initial patch (which was
revived by Bob Picco) for this with Hidetoshi Seto's fixes
and scaling improvements.

Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-07-20 11:22:30 -07:00
Linus Torvalds 2cb7e71422 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfr/ofcons
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfr/ofcons:
  Create drivers/of/platform.c
  Create linux/of_platorm.h
  [SPARC/64] Rename some functions like PowerPC
  Begin consolidation of of_device.h
  Begin to consolidate of_device.c
  Consolidate of_find_node_by routines
  Consolidate of_get_next_child
  Consolidate of_get_parent
  Consolidate of_find_property
  Consolidate of_device_is_compatible
  Start split out of common open firmware code
  Split out common parts of prom.h
2007-07-20 09:18:08 -07:00
Linus Torvalds d638d4990b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: appletouch - improve powersaving for Geyser3 devices
  Input: lifebook - fix an oops on Panasonic CF-18
  Input: document intended meaning of KEY_SWITCHVIDEOMODE
  Input: switch to using seq_list_xxx helpers
  Input: i8042 - give more trust to PNP data on i386
  Input: add driver for Fujitsu serial touchscreens
  Input: ads7846 - re-check pendown status before reporting events
  Input: ads7846 - introduce sample settling delay
  Input: xpad - add support for leds on xbox 360 pad
2007-07-20 09:16:07 -07:00
Linus Torvalds 7f46e6ca01 Merge branch 'linus' of master.kernel.org:/pub/scm/linux/kernel/git/perex/alsa
* 'linus' of master.kernel.org:/pub/scm/linux/kernel/git/perex/alsa: (102 commits)
  [ALSA] version 1.0.14
  [ALSA] remove duplicate Logitech Quickcam USB ID in usbquirks.h
  [ALSA] hda-codec - Fix input with STAC92xx
  [ALSA] hda-intel: support for iMac 24'' released on 09/2006
  [ALSA] hda-codec - Add quirk for Asus P5LD2
  [ALSA] snd-ca0106: Add support for X-Fi Extreme Audio.
  [ALSA] snd-emu10k1:Enable E-Mu 1616m notebook firmware loading.
  [ALSA] snd-emu10k1: Initial support for E-Mu 1616 and 1616m.
  [ALSA] cs46xx - Fix PM resume
  [ALSA] hda: Enable SPDIF in/out on some stac9205 boards
  [ALSA] timer: check for incorrect device state in non-debug compiles, too
  [ALSA] snd-aoa-codec-onyx: fix typo
  [ALSA] hda-codec - Add quirks for HP dx2200/dx2250
  [ALSA] hda-codec - Rename HP model-specific quirks
  [ALSA] hda-codec - Add quirk for HP Samba
  [ALSA] hda-codec - Add LG LW20 line-in capture source
  [ALSA] usb-audio - Fix AC3 with M-Audio Audiophile USB
  [ALSA] hda: stac9202 mixer fix
  [ALSA] Make s3c24xx_i2s_set_clkdiv() change the correct bits
  [ALSA] hda-codec - Add LG LW20 si3054 modem id
  ...
2007-07-20 08:52:06 -07:00
Linus Torvalds 6936b17ea0 Merge branch 'cfq' of git://git.kernel.dk/data/git/linux-2.6-block
* 'cfq' of git://git.kernel.dk/data/git/linux-2.6-block:
  cfq: Write-only stuff in CFQ data structures
  cfq: async queue allocation per priority
2007-07-20 08:50:49 -07:00
Linus Torvalds dee2383784 Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev: (29 commits)
  libata: implement EH fast drain
  libata: schedule probing after SError access failure during autopsy
  libata: clear HOTPLUG flag after a reset
  libata: reorganize ata_ehi_hotplugged()
  libata: improve SCSI scan failure handling
  libata: quickly trigger SATA SPD down after debouncing failed
  libata: improve SATA PHY speed down logic
  The SATA controller device ID is different according to
  ahci: implement SCR_NOTIFICATION r/w
  ahci: make NO_NCQ handling more consistent
  libata: make ->scr_read/write callbacks return error code
  libata: implement AC_ERR_NCQ
  libata: improve EH report formatting
  sata_sil24: separate out sil24_do_softreset()
  sata_sil24: separate out sil24_exec_polled_cmd()
  sata_sil24: replace sil24_update_tf() with sil24_read_tf()
  ahci: separate out ahci_do_softreset()
  ahci: separate out ahci_exec_polled_cmd()
  ahci: separate out ahci_kick_engine()
  ahci: use deadline instead of fixed timeout for 1st FIS for SRST
  ...
2007-07-20 08:46:42 -07:00
Dan Williams eb0645a8b1 async_tx: fix kmap_atomic usage in async_memcpy
Andrew Morton:
	[async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and
	ASYNC_TX_KMAP_SRC can ever be set.  We'll end up using the same kmap
	slot for both src and dest and we get either corrupted data or a BUG.

Evgeniy Polyakov:
	Btw, shouldn't it always be kmap_atomic() even if flag is not set.
	That pages are usual one returned by alloc_page().

So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and
ASYNC_TX_KMAP_SRC flags.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 08:44:19 -07:00
Al Viro d046943cba fix gfp_t annotations for slub
Since we have use like ~SLUB_DMA, we ought to have the type
set right in both cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 08:24:50 -07:00
Tejun Heo 5ddf24c5ea libata: implement EH fast drain
In most cases, when EH is scheduled, all in-flight commands are
aborted causing EH to kick in immediately.  However, in some cases
(especially with PMP), it's unclear which commands are affected by the
error condition and although aborting all in-flight commands works, it
isn't optimal and may cause unnecessary disruption.  On the other
hand, waiting for in-flight commands to drain themselves can take up
to 30 seconds.

This patch implements EH fast drain to handle such situations.  It
gives in-flight commands some time to finish up but doesn't wait for
too long.  After EH is scheduled, fast drain timer is started and if
no other completion occurs in ATA_EH_FASTDRAIN_INTERVAL all in-flight
commands are aborted.  If any completion occurred in the interval, the
port is given another interval to finish up itself.

Currently ATA_EH_FASTDRAIN_INTERVAL is 3 secs which should be enough
for finishing up most commands.
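
The drain decision described above can be sketched as a small state machine. This is a userspace model, not the libata code: the drain_* names are hypothetical, and detecting "a completion occurred" by comparing the in-flight count against the count seen when the timer was armed is an assumption of the sketch.

```c
#include <assert.h>

/* Possible outcomes when the fast drain timer fires. */
enum drain_action { DRAIN_DONE, DRAIN_REARM, DRAIN_ABORT_ALL };

/* Hypothetical per-port bookkeeping for the sketch. */
struct drain_state {
	int last_in_flight;	/* in-flight count when the timer was armed */
};

/* Called when EH is scheduled: remember how many commands are pending. */
void drain_arm(struct drain_state *st, int in_flight)
{
	assert(in_flight >= 0);
	st->last_in_flight = in_flight;
}

/* Called once per interval (3 secs in the commit message). */
enum drain_action drain_timer_expired(struct drain_state *st, int in_flight)
{
	assert(in_flight >= 0);
	if (in_flight == 0)
		return DRAIN_DONE;		/* everything finished on its own */
	if (in_flight == st->last_in_flight)
		return DRAIN_ABORT_ALL;		/* no progress during the interval */
	st->last_in_flight = in_flight;		/* progress: grant another interval */
	return DRAIN_REARM;
}
```

A port that keeps completing commands is re-armed each interval; one that stalls is aborted after a single quiet interval instead of waiting out the full command timeout.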

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:26:26 -04:00
Tejun Heo f8f1e1cc0c libata: reorganize ata_ehi_hotplugged()
__ata_ehi_hotplugged() now has no users.  Reorganize
ata_ehi_hotplugged() such that a new function ata_ehi_schedule_probe()
deals with scheduling probing.  ata_ehi_hotplugged() calls it and
additionally marks hotplug specific flags.  ata_ehi_schedule_probe()
will be used later.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:26:25 -04:00
Tejun Heo 008a78961e libata: improve SATA PHY speed down logic
sata_down_spd_limit() first reads the current SPD from SStatus and
limits the speed to the lower one of one below the current limit or one
below the current SPD in SStatus.  SPD may not be accessible or valid
when SPD down is requested making sata_down_spd_limit() fail when it's
most needed.

This patch makes the current SPD cached after each successful reset
and forces GEN I speed (1.5Gbps) if neither SStatus nor the cached
value is valid, so sata_down_spd_limit() is now guaranteed to lower
the speed limit if lower speed is available.
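
The fallback order can be sketched as a pure function over a speed bitmask. This is an illustrative model under stated assumptions, not the libata implementation: the function name is hypothetical, bit n of the mask is taken to mean "Gen n+1 allowed", and a speed of 0 is taken to mean "not available".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the lowered limit mask, or 0 if no lower speed is available.
 * sstatus_spd: current speed read from SStatus, 0 if unreadable or invalid.
 * cached_spd:  speed cached after the last successful reset, 0 if none.
 */
uint32_t spd_limit_down(uint32_t limit_mask, unsigned int sstatus_spd,
			unsigned int cached_spd)
{
	unsigned int spd = sstatus_spd ? sstatus_spd : cached_spd;
	uint32_t mask = limit_mask;
	int highbit = 31;

	assert(spd <= 32);
	if (mask <= 1)
		return 0;		/* already at the lowest speed */

	/* One below the current limit: drop the highest allowed speed. */
	while (!(mask & (1u << highbit)))
		highbit--;
	mask &= ~(1u << highbit);

	/* One below the current speed, or Gen I if the speed is unknown. */
	if (spd > 1)
		mask &= (1u << (spd - 1)) - 1;
	else
		mask &= 1;

	return mask;
}
```

For a limit of Gen I to Gen III (0x7) with neither speed source valid, the sketch returns 0x1, which corresponds to forcing 1.5Gbps.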

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:19:05 -04:00
Tejun Heo da3dbb17a0 libata: make ->scr_read/write callbacks return error code
Convert ->scr_read/write callbacks to return error code to better
indicate failure.  This will help handling of SCR_NOTIFICATION.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:02:11 -04:00
Tejun Heo 5335b72906 libata: implement AC_ERR_NCQ
When an NCQ command fails, all commands in flight are aborted and the
offending one is reported using log page 10h.  Depending on controller
characteristics and LLD implementation, all commands may appear as
having a device error due to shared TF status making it hard to
determine what's actually going on.

This patch adds AC_ERR_NCQ, marks the command reported by log page 10h
with it and prints an extra "<F>" after the error report for the command
to help distinguish the offending command.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:02:11 -04:00
Tejun Heo b64bbc39f2 libata: improve EH report formatting
Requiring LLDs to format multiple error description messages properly
doesn't work too well.  Help LLDs a bit by making ata_ehi_push_desc()
insert ", " on each invocation.  __ata_ehi_push_desc() is the raw
version without the automatic separator.
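
The separator behaviour can be sketched in userspace as follows. The struct, the buffer size and the function names are hypothetical stand-ins rather than the libata definitions; only the idea (a raw append plus a wrapper that inserts ", " when the description is non-empty) comes from the message above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define DESC_LEN 80	/* assumed buffer size for the sketch */

struct eh_info_sim {
	char desc[DESC_LEN];
	size_t desc_len;
};

void eh_info_init(struct eh_info_sim *ehi)
{
	memset(ehi, 0, sizeof(*ehi));
}

/* Append formatted text, truncating if the buffer is full. */
static void push_desc_va(struct eh_info_sim *ehi, const char *fmt, va_list args)
{
	size_t room;
	int n;

	assert(ehi->desc_len < DESC_LEN);
	room = DESC_LEN - 1 - ehi->desc_len;
	n = vsnprintf(ehi->desc + ehi->desc_len, room + 1, fmt, args);
	if (n > 0)
		ehi->desc_len += ((size_t)n < room) ? (size_t)n : room;
}

/* Raw version: no automatic separator. */
void raw_push_desc(struct eh_info_sim *ehi, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	push_desc_va(ehi, fmt, args);
	va_end(args);
}

/* Separator-aware version: inserts ", " before every entry but the first. */
void push_desc(struct eh_info_sim *ehi, const char *fmt, ...)
{
	va_list args;

	if (ehi->desc_len)
		raw_push_desc(ehi, ", ");
	va_start(args, fmt);
	push_desc_va(ehi, fmt, args);
	va_end(args);
}
```

Callers then push each message without worrying about whether another message came first.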

While at it, make ehi_desc interface proper functions instead of
macros.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:02:11 -04:00
Tejun Heo 9977126c4b libata: add @is_cmd to ata_tf_to_fis()
Add @is_cmd to ata_tf_to_fis().  This controls bit 7 of the second
byte which tells the device whether this H2D FIS is for a command or
not.  This cleans up ahci a bit and will be used by PMP.
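
A minimal sketch of the bit in question, showing only the first two bytes of a Register Host-to-Device FIS. The function name is hypothetical; the FIS type value 0x27 and the port multiplier field in the low nibble of the second byte follow the SATA FIS layout, and the rest of the taskfile is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Fill the first two bytes of a Register H2D FIS (sketch only). */
void fis_header(uint8_t pmp, int is_cmd, uint8_t *fis)
{
	assert(fis);
	fis[0] = 0x27;		/* Register - Host to Device FIS type */
	fis[1] = pmp & 0xf;	/* port multiplier port number */
	if (is_cmd)
		fis[1] |= 0x80;	/* bit 7: this FIS carries a command */
}
```

With is_cmd set to 0 the same FIS updates the device control register instead of issuing a command.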

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-07-20 08:02:10 -04:00
Liam Girdwood 7a05f067c0 [ALSA] ASoC S3C24xx machine drivers - I2C ID for LM4857
This patch adds I2C ID for the LM4857 audio amp and corrects the spacing
of the WM8731, WM8750 and WM8753 ID's.

Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jaroslav Kysela <perex@suse.cz>
2007-07-20 11:11:16 +02:00
Vasily Tarasov c2dea2d1fd cfq: async queue allocation per priority
If we have two processes with different ioprio_class, but the same
ioprio_data, their async requests will fall into the same queue. I guess
such behavior is not expected, because it's not right to put real-time
requests and best-effort requests in the same queue.

The attached patch fixes the problem by introducing additional *cfqq
fields on cfqd, pointing to per-(class,priority) async queues.
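
One way to picture the per-(class, priority) lookup is the sketch below. All names are hypothetical and the layout is an assumption: eight priority levels for the real-time and best-effort classes, and a single slot for the idle class.

```c
#include <assert.h>
#include <stddef.h>

#define PRIO_LEVELS 8	/* assumed number of priority levels per class */

enum io_class { IO_CLASS_RT, IO_CLASS_BE, IO_CLASS_IDLE };

struct queue_sim { int id; };

/* Hypothetical scheduler data: one async queue pointer per (class, prio). */
struct sched_sim {
	struct queue_sim *async_q[2][PRIO_LEVELS];	/* RT and BE */
	struct queue_sim *async_idle_q;			/* idle has no levels */
};

/* Return the slot holding the async queue for this class and priority. */
struct queue_sim **async_slot(struct sched_sim *s, enum io_class cls, int prio)
{
	assert(prio >= 0 && prio < PRIO_LEVELS);
	switch (cls) {
	case IO_CLASS_RT:
		return &s->async_q[0][prio];
	case IO_CLASS_BE:
		return &s->async_q[1][prio];
	case IO_CLASS_IDLE:
		return &s->async_idle_q;
	}
	return NULL;
}
```

Keying on the class as well as the priority means a real-time request and a best-effort request with the same priority value no longer share a queue.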

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-20 10:06:38 +02:00
Stephen Rothwell 3f23de10f2 Create drivers/of/platform.c
and populate it with the common parts from PowerPC and Sparc[64].

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-20 14:25:51 +10:00
Stephen Rothwell b41912ca34 Create linux/of_platorm.h
Move common stuff from asm-powerpc/of_platform.h to here and
move the common bits from asm-sparc*/of_device.h here as well.

Create asm-sparc*/of_platform.h and move appropriate parts of
of_device.h to them.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-20 14:25:22 +10:00
Stephen Rothwell f898f8dbce Begin consolidation of of_device.h
This just moves the common stuff from the arch of_device.h files to
linux/of_device.h.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-20 13:41:56 +10:00
Stephen Rothwell 76c1ce7870 Split out common parts of prom.h
This creates linux/of.h and includes asm/prom.h from it.

We also include linux/of.h from asm/prom.h while we transition.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
2007-07-20 13:10:22 +10:00
Paul Mundt 20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Bartlomiej Zolnierkiewicz 4099d14322 ide: add PIO masks
* Add ATA_PIO[0-6] defines to <linux/ata.h>.

* Add ->pio_mask field to ide_pci_device_t and ide_hwif_t.

* Add PIO masks to host drivers.

<linux/ata.h> change ACK-ed by Jeff Garzik <jeff@garzik.org>.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:59 +02:00
Bartlomiej Zolnierkiewicz 6a824c92db ide: remove ide_find_best_pio_mode()
* Add ->host_flags to ide_hwif_t to store ide_pci_device_t.host_flags,
  assign it in setup-pci.c:ide_pci_setup_ports().

* Add IDE_HFLAG_PIO_NO_{BLACKLIST,DOWNGRADE} to ide_pci_device_t.host_flags
  and teach ide_get_best_pio_mode() about them.  Also remove needless
  !drive->id check while at it (drive->id is always present).

* Convert amd74xx, via82cxxx and ide-timing.h to use ide_get_best_pio_mode()
  and then remove no longer needed ide_find_best_pio_mode().

There should be no functionality changes caused by this patch.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:58 +02:00
Bartlomiej Zolnierkiewicz 2134758d2a ide: drop "PIO data" argument from ide_get_best_pio_mode()
* Drop no longer needed "PIO data" argument from ide_get_best_pio_mode()
  and convert all users accordingly.

* Remove no longer needed ide_pio_data_t.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:58 +02:00
Bartlomiej Zolnierkiewicz 7dd00083b1 ide: add ide_pio_cycle_time() helper (take 2)
* Add ide_pio_cycle_time() helper.

* Use it in ali14xx/ht6560b/qd65xx/cmd64{0,x}/sl82c105 and pmac host drivers
  (previously cycle time given by the device was only used for "pio" == 255).

* Remove no longer needed ide_pio_data_t.cycle_time field.

v2:
* Fix "ata_" prefix (Noticed by Jeff).

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:56 +02:00
Bartlomiej Zolnierkiewicz a5d8c5c834 ide: add ide_pci_device_t.host_flags (take 2)
* Rename ide_pci_device_t.flags to ide_pci_device_t.host_flags
  and IDEPCI_FLAG_ISA_PORTS flag to IDE_HFLAG_ISA_PORTS.

* Add IDE_HFLAG_SINGLE flag for single channel devices.

* Convert core code and all IDE PCI drivers to use IDE_HFLAG_SINGLE
  and remove no longer needed ide_pci_device_t.channels field.

v2:
* Fix issues noticed by Sergei:
  - correct code alignment in scc_pata.c
  - s/IDE_HFLAG_SINGLE/~IDE_HFLAG_SINGLE/ in serverworks.c

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:55 +02:00
Bartlomiej Zolnierkiewicz 2229833c13 ide: add ide_dev_has_iordy() helper (take 4)
* Add ide_dev_has_iordy() helper and use it in the sl82c105 host driver.

* Remove no longer needed ide_pio_data_t.use_iordy field.

v2/v3:
* Fix issues noticed by Sergei:
  - correct patch description
  - fix comment in ide_get_best_pio_mode()

v4:
* Fix "ata_" prefix (Noticed by Jeff).

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:55 +02:00
Bartlomiej Zolnierkiewicz 342cdb6d47 ide: make ide_get_best_pio_mode() print info if overriding PIO mode
* Print info about overriding PIO mode in ide_get_best_pio_mode().

* Remove info about overriding PIO mode from cmd64{0,x} host drivers.

* Remove no longer needed ide_pio_data_t.overridden field.

Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-07-20 01:11:55 +02:00
Linus Torvalds fdb64f93b3 Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
  [XFS] Fix inode size update before data write in xfs_setattr
  [XFS] Allow punching holes to free space when at ENOSPC
  [XFS] Implement ->page_mkwrite in XFS.
  [FS] Implement block_page_mkwrite.

Manually fix up conflict with Nick's VM fault handling patches in
fs/xfs/linux-2.6/xfs_file.c

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 14:41:33 -07:00
Linus Torvalds 3e1f900bff Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
  NFSv4: handle lack of clientaddr in option string
  NFSv4: debug print ntohl(status) in nfs client callback xdr code
  SUNRPC: Clean up the sillyrename code
  NFS: Introduce struct nfs_removeargs+nfs_removeres
  NFS: Use dentry->d_time to store the parent directory verifier.
  SUNRPC: move bkl locking and xdr proc invocation into a common helper
  NFSv4: Fix the nfsv4 readlink reply buffer alignment
  NFSv4: Fix the readdir reply buffer alignment
  NFSv4: More NFSv4 xdr cleanups
  NFSv4: Try to recover from getfh failures in nfs4_xdr_dec_open
  NFSv4: 'constify' lookup arguments.
  NFSv4: Don't fail nfs4_xdr_dec_open if decode_restorefh() failed
  NFSv4: Fix open state recovery
  NFSD/SUNRPC: Fix the automatic selection of RPCSEC_GSS
2007-07-19 14:33:41 -07:00
Linus Torvalds 40b42f1ebf Merge branch 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6
* 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6: (44 commits)
  i2c: Delete the i2c-isa pseudo bus driver
  hwmon: refuse to load abituguru driver on non-Abit boards
  hwmon: fix Abit Uguru3 driver detection on some motherboards
  hwmon/w83627ehf: Be quiet when no chip is found
  hwmon/w83627ehf: No need to initialize fan_min
  hwmon/w83627ehf: Export the thermal sensor types
  hwmon/w83627ehf: Enable VBAT monitoring
  hwmon/w83627ehf: Add support for the VID inputs
  hwmon/w83627ehf: Fix timing issues
  hwmon/w83627ehf: Add error messages for two error cases
  hwmon/w83627ehf: Convert to a platform driver
  hwmon/w83627ehf: Update the Kconfig entry
  make coretemp_device_remove() static
  hwmon: Add LM93 support
  hwmon: Improve the pwmN_enable documentation
  hwmon/smsc47b397: Don't report missing fans as spinning at 82 RPM
  hwmon: Add support for newer uGuru's
  hwmon/f71805f: Add temperature-tracking fan control mode
  hwmon/w83627ehf: Preserve speed reading when changing fan min
  hwmon: fix detection of abituguru volt inputs
  ...

Manual fixup of trivial conflict in MAINTAINERS file
2007-07-19 14:24:57 -07:00
Linus Torvalds ff86303e30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  [PATCH] sched: implement cpu_clock(cpu) high-speed time source
  [PATCH] sched: fix the all pinned logic in load_balance_newidle()
  [PATCH] sched: fix newly idle load balance in case of SMT
  [PATCH] sched: sched_cacheflush is now unused
2007-07-19 14:11:14 -07:00
Serge E. Hallyn 626ac545c1 user namespace: fix copy_user_ns return value
When CONFIG_USER_NS=n and a user tries to unshare some namespace other
than the user namespace, the dummy copy_user_ns returns NULL rather than
the old_ns.

This value then gets assigned to task->nsproxy->user_ns, so that a
subsequent setuid, which uses task->nsproxy->user_ns, causes a NULL
pointer deref.

Fix this by returning old_ns.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 14:05:08 -07:00
Ingo Molnar e436d80085 [PATCH] sched: implement cpu_clock(cpu) high-speed time source
Implement the cpu_clock(cpu) interface for kernel-internal use:
high-speed (but slightly incorrect) per-cpu clock constructed from
sched_clock().

This API, unused at the moment, will be used in the future by blktrace,
by the softlockup-watchdog, by printk and by lockstat.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-19 21:28:35 +02:00
Trond Myklebust e4eff1a622 SUNRPC: Clean up the sillyrename code
Fix a couple of bugs:
 - Don't rely on the parent dentry still being valid when the call completes.
   Fixes a race with shrink_dcache_for_umount_subtree()

 - Don't remove the file if the filehandle has been labelled as stale.

Fix a couple of inefficiencies
 - Remove the global list of sillyrenamed files. Instead we can cache the
   sillyrename information in the dentry->d_fsdata
 - Move common code from unlink_setup/unlink_done into fs/nfs/unlink.c

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-07-19 15:21:39 -04:00
Trond Myklebust 4fdc17b2a7 NFS: Introduce struct nfs_removeargs+nfs_removeres
We need a common structure for setting up an unlink() rpc call in order to
fix the asynchronous unlink code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-07-19 15:21:39 -04:00
J. Bruce Fields be879c4e24 SUNRPC: move bkl locking and xdr proc invocation into a common helper
Since every invocation of xdr encode or decode functions takes the BKL now,
there's a lot of redundant lock_kernel/unlock_kernel pairs that we can pull
out into a common function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-07-19 15:21:39 -04:00
Jean Delvare e24b8cb4fa i2c: Delete the i2c-isa pseudo bus driver
There are no users of i2c-isa left, so we can finally get rid of it.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
2007-07-19 14:25:20 -04:00
Linus Torvalds ce8c2293be Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (25 commits)
  [TG3]: Fix msi issue with kexec/kdump.
  [NET] XFRM: Fix whitespace errors.
  [NET] TIPC: Fix whitespace errors.
  [NET] SUNRPC: Fix whitespace errors.
  [NET] SCTP: Fix whitespace errors.
  [NET] RXRPC: Fix whitespace errors.
  [NET] ROSE: Fix whitespace errors.
  [NET] RFKILL: Fix whitespace errors.
  [NET] PACKET: Fix whitespace errors.
  [NET] NETROM: Fix whitespace errors.
  [NET] NETFILTER: Fix whitespace errors.
  [NET] IPV4: Fix whitespace errors.
  [NET] DCCP: Fix whitespace errors.
  [NET] CORE: Fix whitespace errors.
  [NET] BLUETOOTH: Fix whitespace errors.
  [NET] AX25: Fix whitespace errors.
  [PATCH] mac80211: remove rtnl locking in ieee80211_sta.c
  [PATCH] mac80211: fix GCC warning on 64bit platforms
  [GENETLINK]: Dynamic multicast groups.
  [NETLIKN]: Allow removing multicast groups.
  ...
2007-07-19 10:23:21 -07:00
Douglas Thompson 53078ca84b include/linux/pci_id.h: add amd northbridge defines
pci_ids.h needs two of the AMD NB device IDs, namely the Addressmap and the
Memory Controller devices.

This patch adds those to the pci_ids.h include file.

Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:55 -07:00
Dave Jiang 66ee2f940a drivers/edac: mod assert_error check
Change error check and clear variable from an atomic to an int

Signed-off-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:54 -07:00
Jason Uhlenkott 535c6a5303 drivers/edac: new intel 30x0 MC driver
Here's a driver for the Intel 3000 and 3010 memory controllers,
relative to today's Sourceforge code drop.  This has only had light
testing (I've yet to actually see it handle a memory error) but it
detects my hardware correctly.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:54 -07:00
Dave Jiang c0d1217202 drivers/edac: add new nmi rescan
Provides a way for NMI reported errors on x86 to notify the EDAC
subsystem of pending ECC errors by writing to a software state variable.

Here's the reworked patch. I added an EDAC stub to the kernel so we can
have variables that are in the kernel even if EDAC is a module. I also
implemented the idea of using the chip driver to select error detection
mode via module parameter and eliminate the kernel compile option.
Please review/test. Thx!

Also, I only made changes to some of the chipset drivers since I am
unfamiliar with the other ones. We can add similar changes as we go.

Signed-off-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:53 -07:00
Rusty Russell d7e28ffe6c lguest: the host code
This is the code for the "lg.ko" module, which allows lguest guests to
be launched.

[akpm@linux-foundation.org: update for futex-new-private-futexes]
[akpm@linux-foundation.org: build fix]
[jmorris@namei.org: lguest: use hrtimers]
[akpm@linux-foundation.org: x86_64 build fix]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:52 -07:00
Rusty Russell 07ad157f6e lguest: the guest code
lguest is a simple hypervisor for Linux on Linux.  Unlike kvm it doesn't need
VT/SVM hardware.  Unlike Xen it's simply "modprobe and go".  Unlike both, it's
5000 lines and self-contained.

Performance is ok, but not great (-30% on kernel compile).  But given its
hackability, I expect this to improve, along with the paravirt_ops code which
it supplies a complete example for.  There's also a 64-bit version being
worked on and other craziness.

But most of all, lguest is awesome fun!  Too much of the kernel is a big ball
of hair.  lguest is simple enough to dive into and hack, plus has some warts
which scream "fork me!".

This patch:

This is the code and headers required to make an i386 kernel an lguest guest.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:52 -07:00
J. Bruce Fields c7d51402d2 knfsd: clean up EX_RDONLY
Share a little common code, reverse the arguments for consistency, drop the
unnecessary "inline", and lowercase the name.

Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:52 -07:00
J. Bruce Fields e22841c637 knfsd: move EX_RDONLY out of header
EX_RDONLY is only called in one place; just put it there.

Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:52 -07:00
Andrew Morton d688abf50b move page writeback accounting out of macros
page-writeback accounting is presently performed in the page-flags macros.
This is inconsistent and a bit ugly and makes it awkward to implement
per-backing_dev under-writeback page accounting.

So move this accounting down to the callsite(s).

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:52 -07:00
Johannes Berg 3b5ad0797c stacktrace: fix header file for !CONFIG_STACKTRACE
The print_stack_trace macro in stacktrace.h has a wrong number of
arguments, fix it.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Peter Zijlstra 96645678cd lockstat: measure lock bouncing
__acquire
        |
       lock _____
        |        \
        |    __contended
        |         |
        |        wait
        | _______/
        |/
        |
   __acquired
        |
   __release
        |
     unlock

We measure acquisition and contention bouncing.

This is done by recording a cpu stamp in each lock instance.

Contention bouncing requires the cpu stamp to be set on acquisition. Hence we
move __acquired into the generic path.

__acquired is then used to measure acquisition bouncing by comparing the
current cpu with the old stamp before replacing it.

__contended is used to measure contention bouncing (only useful for preemptable
locks)
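
As a rough illustration of the cpu stamp idea described above, here is a
minimal userspace sketch.  The struct and function names are hypothetical
and are not taken from the patch; it only shows the "compare the current
cpu with the old stamp, then replace it" step.

```c
#include <stdbool.h>

/* Hypothetical per-lock record; field names are illustrative only. */
struct lock_stamp {
	int cpu;               /* CPU that last acquired the lock, -1 if none */
	unsigned long bounces; /* acquisitions that came from a different CPU */
};

/* Record an acquisition on `cpu`; returns true if the lock bounced. */
static bool stamp_acquired(struct lock_stamp *ls, int cpu)
{
	/* Compare the current cpu with the old stamp ... */
	bool bounced = (ls->cpu != -1 && ls->cpu != cpu);

	if (bounced)
		ls->bounces++;
	/* ... and only then replace the stamp. */
	ls->cpu = cpu;
	return bounced;
}
```

Two acquisitions in a row on the same cpu count as no bounce; an
acquisition on another cpu counts as one.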

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Peter Zijlstra 4b32d0a4e9 lockdep: various fixes
- update the copyright notices
- use the default hash function
- fix a thinko in a BUILD_BUG_ON
- add a WARN_ON to spot inconsistent naming
- fix a termination issue in /proc/lock_stat

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Peter Zijlstra f20786ff4d lockstat: core infrastructure
Introduce the core lock statistics code.

Lock statistics provides lock wait-time and hold-time (as well as the count
of corresponding contention and acquisition events). Also, the first few
call-sites that encounter contention are tracked.

Lock wait-time is the time spent waiting on the lock. This provides insight
into the locking scheme, that is, a heavily contended lock is indicative of
a too coarse locking scheme.

Lock hold-time is the duration the lock was held; this provides a reference for
the wait-time numbers, so they can be put into perspective.

  1)
    lock
  2)
    ... do stuff ..
    unlock
  3)

The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
hold-time.
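
The arithmetic can be sketched as follows, assuming three nanosecond
timestamps taken at points 1, 2 and 3.  The struct and helper names are
hypothetical, chosen for illustration only.

```c
#include <stdint.h>

/* Hypothetical timestamps (in ns) taken at points 1, 2 and 3 above. */
struct lock_times {
	uint64_t t_request;  /* 1) just before lock */
	uint64_t t_acquired; /* 2) lock obtained */
	uint64_t t_released; /* 3) just after unlock */
};

/* Time between 1 and 2. */
static uint64_t wait_time(const struct lock_times *t)
{
	return t->t_acquired - t->t_request;
}

/* Time between 2 and 3. */
static uint64_t hold_time(const struct lock_times *t)
{
	return t->t_released - t->t_acquired;
}
```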

The lockdep held-lock tracking code is reused, because it already collects locks
into meaningful groups (classes), and because it is an existing infrastructure
for lock instrumentation.

Currently lockdep tracks lock acquisition with two hooks:

  lock()
    lock_acquire()
    _lock()

 ... code protected by lock ...

  unlock()
    lock_release()
    _unlock()

We need to extend this with two more hooks, in order to measure contention.

  lock_contended() - used to measure contention events
  lock_acquired()  - completion of the contention

These are then placed the following way:

  lock()
    lock_acquire()
    if (!_try_lock())
      lock_contended()
      _lock()
      lock_acquired()

 ... do locked stuff ...

  unlock()
    lock_release()
    _unlock()

(Note: the try_lock() 'trick' is used to avoid instrumenting all platform
       dependent lock primitive implementations.)
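
A minimal userspace sketch of the hook placement, assuming a C11
atomic_flag as a stand-in for the platform lock primitive.  All names
(sk_lock, sk_acquire, ...) are hypothetical, and the counters merely stand
in for the lock_contended() and lock_acquired() hooks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock with counters standing in for the two new hooks. */
struct sk_lock {
	atomic_flag held;
	atomic_ulong contended; /* bumped where lock_contended() would run */
	atomic_ulong acquired;  /* bumped where lock_acquired() would run */
};

static bool sk_try_lock(struct sk_lock *l)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set(&l->held);
}

static void sk_acquire(struct sk_lock *l)
{
	/* Fast path: an uncontended lock touches neither counter. */
	if (!sk_try_lock(l)) {
		atomic_fetch_add(&l->contended, 1);
		while (!sk_try_lock(l)) /* stand-in for the blocking _lock() */
			;
		atomic_fetch_add(&l->acquired, 1);
	}
}

static void sk_release(struct sk_lock *l)
{
	atomic_flag_clear(&l->held);
}
```

Only the try-lock wrapper needs to know about the instrumentation; the
underlying primitive is left untouched, which is the point of the trick.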

It is also possible to toggle the two lockdep features at runtime using:

  /proc/sys/kernel/prove_locking
  /proc/sys/kernel/lock_stat

(esp. turning off the O(n^2) prove_locking functionality can help)

[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: nuke unneeded ifdefs]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Peter Zijlstra 21f8ca3bf6 fix raw_spinlock_t vs lockdep
Use the lockdep infrastructure to track lock contention and other lock
statistics.

It tracks lock contention events, and the first four unique call-sites that
encountered contention.

It also measures lock wait-time and hold-time in nanoseconds. The minimum and
maximum times are tracked, as well as a total (which together with the number
of events can give the avg).
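
The min/max/total bookkeeping might look roughly like this.  It is a
sketch with hypothetical names, not the structure used by the patch.

```c
#include <stdint.h>

/* Hypothetical accumulator for one kind of time (wait or hold), in ns. */
struct time_stat {
	uint64_t min, max, total, nr;
};

static void time_stat_add(struct time_stat *s, uint64_t ns)
{
	if (s->nr == 0 || ns < s->min)
		s->min = ns;
	if (ns > s->max)
		s->max = ns;
	s->total += ns;
	s->nr++;
}

/* The average is derived from the total and the number of events. */
static uint64_t time_stat_avg(const struct time_stat *s)
{
	return s->nr ? s->total / s->nr : 0;
}
```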

All statistics are done per lock class, per write (exclusive state) and per read
(shared state).

The statistics are collected per-cpu, so that the collection overhead is
minimized via having no global cachemisses.

This new lock statistics feature is independent of the lock dependency checking
traditionally done by lockdep; it just shares the lock tracking code. It is
also possible to enable both and disable either component at runtime, thereby
avoiding the O(n^2) lock chain walks, for instance.

This patch:

raw_spinlock_t should not use lockdep (and doesn't) since lockdep itself
relies on it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:49 -07:00
Jan Harkes 3cf01f28c3 coda: remove statistics counters from /proc/fs/coda
Similar information can easily be obtained with strace -c.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:48 -07:00
Jan Harkes a1b0aa8764 coda: remove struct coda_sb_info
The sb_info structure only contains a single pointer to the character device,
there is no need for the added indirection.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:48 -07:00
Jan Harkes d9664c95af coda: block signals during upcall processing
We ignore signals for about 30 seconds to give userspace a chance to see the
upcall.  As we did not block signals we ended up in a busy loop for the
remainder of the period when a signal is received.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:48 -07:00
Kawai, Hidehiro 3cb4a0bb1e coredump masking: add an interface for core dump filter
This patch adds an interface to set/reset flags which determine whether each
memory segment should be dumped when a core file is generated.

/proc/<pid>/coredump_filter file is provided to access the flags.  You can
read or change the flag status for a particular process by reading from or
writing to the file.

The flag status is inherited by the child process when it is created.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:47 -07:00
Kawai, Hidehiro 6c5d523826 coredump masking: reimplementation of dumpable using two flags
This patch changes mm_struct.dumpable to a pair of bit flags.

set_dumpable() converts three-value dumpable to two flags and stores it into
lower two bits of mm_struct.flags instead of mm_struct.dumpable.
get_dumpable() behaves in the opposite way.
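
One plausible encoding is sketched below.  The bit names, the sk_ prefix
and the exact mapping of the values 0, 1 and 2 onto the two bits are
assumptions made for illustration; the patch itself may encode them
differently.

```c
/* Hypothetical bit positions in the flags word; names are illustrative. */
#define SK_DUMPABLE      (1UL << 0)
#define SK_DUMP_SECURELY (1UL << 1)
#define SK_DUMPABLE_MASK (SK_DUMPABLE | SK_DUMP_SECURELY)

/* Store a three-value dumpable setting (0, 1 or 2) in the lower two bits. */
static void sk_set_dumpable(unsigned long *flags, int value)
{
	*flags &= ~SK_DUMPABLE_MASK; /* leave all other bits alone */
	switch (value) {
	case 1:
		*flags |= SK_DUMPABLE;
		break;
	case 2:
		*flags |= SK_DUMP_SECURELY;
		break;
	default:
		break;
	}
}

/* The opposite direction: two flags back to the three-value setting. */
static int sk_get_dumpable(unsigned long flags)
{
	if (flags & SK_DUMP_SECURELY)
		return 2;
	return (flags & SK_DUMPABLE) ? 1 : 0;
}
```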

[akpm@linux-foundation.org: export set_dumpable]
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:46 -07:00
Josef 'Jeff' Sipek f79c20f525 fs: remove path_walk export
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Josef 'Jeff' Sipek c4a7808fc3 fs: mark link_path_walk static
Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Josef 'Jeff' Sipek 16f1820028 fs: introduce vfs_path_lookup
Stackable file systems, among others, frequently need to lookup paths or
path components starting from an arbitrary point in the namespace
(identified by a dentry and a vfsmount).  Currently, such file systems use
lookup_one_len, which is frowned upon [1] as it does not pass the lookup
intent along; not passing a lookup intent, for example, can trigger BUG_ON's
when stacking on top of NFSv4.

The first patch introduces a new lookup function to allow lookup starting
from an arbitrary point in the namespace.  This approach has been suggested
by Christoph Hellwig [2].

The second patch changes sunrpc to use vfs_path_lookup.

The third patch changes nfsctl.c to use vfs_path_lookup.

The fourth patch marks link_path_walk static.

The fifth, and last patch, unexports path_walk because it is no longer
necessary to call it directly, and using the new vfs_path_lookup is
cleaner.

For example, the following snippet of code, looks up "some/path/component"
in a directory pointed to by parent_{dentry,vfsmnt}:

err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
		      "some/path/component", 0, &nd);
if (!err) {
	/* exists */

	...

	/* once done, release the references */
	path_release(&nd);
} else if (err == -ENOENT) {
	/* doesn't exist */
} else {
	/* other error */
}

VFS functions such as lookup_create can be used on the nameidata structure
to pass the create intent to the file system.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Ollie Wild b6a2fea393 mm: variable length argument support
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.

We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space.  Once the binfmt code runs and figures out
where the stack should be, we move it downwards.

It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.

[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: Ollie Wild <aaw@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Peter Zijlstra bdf4c48af2 audit: rework execve audit
The purpose of audit_bprm() is to log the argv array to a userspace daemon at
the end of the execve system call.  Since user-space hasn't had time to run,
this array is still in pristine state on the process' stack; so no need to
copy it, we can just grab it from there.

In order to minimize the damage to audit_log_*() copy each string into a
temporary kernel buffer first.

Currently the audit code requires that the full argument vector fits in a
single packet.  So currently it does clip the argv size to a (sysctl) limit,
but only when execve auditing is enabled.

If the audit protocol gets extended to allow for multiple packets this check
can be removed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ollie Wild <aaw@google.com>
Cc: <linux-audit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Michael Ellerman 3d7e33825d jprobes: make jprobes a little safer for users
I realise jprobes are a razor-blades-included type of interface, but that
doesn't mean we can't try and make them safer to use.  This guy I know once
wrote code like this:

struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" };

And then his kernel exploded. Oops.

This patch adds an arch hook, arch_deref_entry_point() (I don't like it
either) which takes the void * in a struct jprobe, and gives back the text
address that it represents.

We can then use that in register_jprobe() to check that the entry point we're
passed is actually in the kernel text, rather than just some random value.
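
Conceptually the check is a range test on the resolved address.  The
helper below is a hypothetical userspace sketch with made-up bounds
parameters; the kernel has its own way of answering this question.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: is `addr` inside the half-open text range
 * [text_start, text_end)?  The bounds are assumed to be supplied by
 * the caller.
 */
static bool addr_in_text(uintptr_t addr, uintptr_t text_start,
			 uintptr_t text_end)
{
	return addr >= text_start && addr < text_end;
}
```

A pointer to a string literal, as in the broken initializer above, would
land outside the text range and could be rejected at registration time.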

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Michael Ellerman 9e367d8592 jprobes: remove JPROBE_ENTRY()
AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything
useful - so remove it ..

I've left a do-nothing version so that out-of-tree jprobes code will still
compile without modifications.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Michael Ellerman 81eae375ec jprobes: make struct jprobe.entry a void *
Currently jprobe.entry is a kprobe_opcode_t *, but that's a lie.  On some
platforms it doesn't point to an opcode at all, it points to a function
descriptor.

It's really a pointer to something that the arch code can turn into a function
entry point.  And that's what actually happens, none of the generic code ever
looks at jprobe.entry, it's only ever dereferenced by arch code.

So just make it a void *.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu f9acc8c7b3 readahead: sanify file_ra_state names
Rename some file_ra_state variables and remove some accessors.

It results in much simpler code.
Kudos to Rusty!

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Rusty Russell cf914a7d65 readahead: split ondemand readahead interface into two functions
Split ondemand readahead interface into two functions.  I think this makes it
a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument is
changed to an obvious boolean flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu fe3cba17c4 mm: share PG_readahead and PG_reclaim
Share the same page flag bit for PG_readahead and PG_reclaim.

One is used only on file reads, another is only for emergency writes.  One
is used mostly for fresh/young pages, another is for old pages.

Combinations of possible interactions are:

a) clear PG_reclaim => implicit clear of PG_readahead
	it will delay an asynchronous readahead into a synchronous one
	it actually does _good_ for readahead:
		the pages will be reclaimed soon, it's readahead thrashing!
		in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
	one(and only one) page will not be reclaimed in time
	it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
	will confuse readahead and make it restart the size rampup process
	it's a trivial problem, and can mostly be avoided by checking
	PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
	PG_readahead will never be set on already cached pages.
	PG_reclaim will always be cleared on dirtying a page.
	so not a problem.

In summary,
	a)   we get better behavior
	b,d) possible interactions can be avoided
	c)   racy condition exists that might affect readahead, but the chance
	     is _really_ low, and the hurt on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu c743d96b6d readahead: remove the old algorithm
Remove the old readahead algorithm.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu 122a21d11c readahead: on-demand readahead logic
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance.  Also it is fully integrated with adaptive readahead.

It is designed to be called on demand:
	- on a missing page, to do synchronous readahead
	- on a lookahead page, to do asynchronous readahead

In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read.  It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.

HEURISTICS

The logic deals with four cases:

	- sequential-next
		found a consistent readahead window, so push it forward

	- random
		standalone small read, so read as is

	- sequential-first
		create a new readahead window for a sequential/oversize request

	- lookahead-clueless
		hit a lookahead page not associated with the readahead window,
		so create a new readahead window and ramp it up

In each case, three parameters are determined:

	- readahead index: where the next readahead begins
	- readahead size:  how much to readahead
	- lookahead size:  when to do the next readahead (for pipelining)
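
A simplified sketch of how a request could be sorted into the four cases.
The struct, its field names and the order of the checks are assumptions
made for illustration; this is not the code added by the patch.

```c
#include <stdbool.h>

enum ra_case {
	RA_SEQUENTIAL_NEXT,
	RA_RANDOM,
	RA_SEQUENTIAL_FIRST,
	RA_LOOKAHEAD_CLUELESS,
};

/* Hypothetical, reduced readahead state. */
struct ra_sketch {
	unsigned long next_index; /* where the window expects the next read */
	unsigned long prev_index; /* index of the last page read */
};

static enum ra_case ra_classify(const struct ra_sketch *ra,
				unsigned long offset, unsigned long req_size,
				unsigned long max, bool hit_lookahead_page)
{
	/* Adjacent to the previous read, or an oversize request. */
	bool sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);

	if (offset && offset == ra->next_index)
		return RA_SEQUENTIAL_NEXT;   /* consistent window: push it on */
	if (!hit_lookahead_page && !sequential)
		return RA_RANDOM;            /* standalone small read */
	if (!hit_lookahead_page)
		return RA_SEQUENTIAL_FIRST;  /* start a new window */
	return RA_LOOKAHEAD_CLUELESS;        /* lookahead hit, no matching window */
}
```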

BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

	- It no longer imposes strict sequential checks.
	  It might help some interleaved cases, and clustered random reads.
	  It does introduce risks of a random lookahead hit triggering an
	  unexpected readahead. But in general it is more likely to do good
	  than to do evil.

	- Interleaved reads are supported in a minimal way.
	  Their chances of being detected and properly handled are still low.

	- Readahead thrashings are better handled.
	  The current readahead leads to tiny average I/O sizes, because it
	  never turns back for the thrashed pages.  They have to be faulted in
	  by do_generic_mapping_read() one by one, whereas the on-demand
	  readahead will redo readahead for them.

OVERHEADS

The new code reduces the overheads of

	- excessively calling the readahead routine on small sized reads
	  (the current readahead code insists on seeing all requests)

	- doing a lot of pointless page-cache lookups for small cached files
	  (the current readahead only turns itself off after 256 cache hits,
	  unfortunately most files are < 1MB, so never see that chance)

That accounts for speedups of
	- 0.3% on 1-page sequential reads on a sparse file
	- 1.2% on 1-page cache hot sequential reads
	- 3.2% on 256-page cache hot sequential reads
	- 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's a 1% overhead for 1-page random reads on
a sparse file.

PERFORMANCE

The basic benchmark setup is
	- 2.6.20 kernel with on-demand readahead
	- 1MB max readahead size
	- 2.9GHz Intel Core 2 CPU
	- 2GB memory
	- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that
	- it maintains the same performance for trivial sequential/random reads
	- sysbench/OLTP performance on MySQL gains up to 8%
	- performance on readahead thrashing gains up to 3 times

iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

			       2.6.20          on-demand      gain
first run
	  "  Initial write "   61437.27        64521.53      +5.0%
	  "        Rewrite "   47893.02        48335.20      +0.9%
	  "           Read "   62111.84        62141.49      +0.0%
	  "        Re-read "   62242.66        62193.17      -0.1%
	  "   Reverse Read "   50031.46        49989.79      -0.1%
	  "    Stride read "    8657.61         8652.81      -0.1%
	  "    Random read "   13914.28        13898.23      -0.1%
	  " Mixed workload "   19069.27        19033.32      -0.2%
	  "   Random write "   14849.80        14104.38      -5.0%
	  "         Pwrite "   62955.30        65701.57      +4.4%
	  "          Pread "   62209.99        62256.26      +0.1%

second run
	  "  Initial write "   60810.31        66258.69      +9.0%
	  "        Rewrite "   49373.89        57833.66     +17.1%
	  "           Read "   62059.39        62251.28      +0.3%
	  "        Re-read "   62264.32        62256.82      -0.0%
	  "   Reverse Read "   49970.96        50565.72      +1.2%
	  "    Stride read "    8654.81         8638.45      -0.2%
	  "    Random read "   13901.44        13949.91      +0.3%
	  " Mixed workload "   19041.32        19092.04      +0.3%
	  "   Random write "   14019.99        14161.72      +1.0%
	  "         Pwrite "   64121.67        68224.17      +6.4%
	  "          Pread "   62225.08        62274.28      +0.1%

In summary, writes are unstable, reads are pretty close on average:

			  access pattern  2.6.20  on-demand   gain
				   Read  62085.61  62196.38  +0.2%
				Re-read  62253.49  62224.99  -0.0%
			   Reverse Read  50001.21  50277.75  +0.6%
			    Stride read   8656.21   8645.63  -0.1%
			    Random read  13907.86  13924.07  +0.1%
	 		 Mixed workload  19055.29  19062.68  +0.0%
				  Pread  62217.53  62265.27  +0.1%

aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

					2.6.20      on-demand  delta
			sequential	 92.57s      92.54s    -0.0%
			random		311.87s     312.15s    +0.1%

sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
	 --file-total-size=4G --file-block-size=64K \
	 --num-threads=001 --max-requests=10000 --max-time=900 run

				threads    2.6.20   on-demand    delta
		first run
				      1   59.1974s    59.2262s  +0.0%
				      2   58.0575s    58.2269s  +0.3%
				      4   48.0545s    47.1164s  -2.0%
				      8   41.0684s    41.2229s  +0.4%
				     16   35.8817s    36.4448s  +1.6%
				     32   32.6614s    32.8240s  +0.5%
				     64   23.7601s    24.1481s  +1.6%
				    128   24.3719s    23.8225s  -2.3%
				    256   23.2366s    22.0488s  -5.1%

		second run
				      1   59.6720s    59.5671s  -0.2%
				      8   41.5158s    41.9541s  +1.1%
				     64   25.0200s    23.9634s  -4.2%
				    256   22.5491s    20.9486s  -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

                sum all up               495.046s    491.514s   -0.7%

sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
	 --mysql-socket=/var/run/mysqld/mysqld.sock \
	 --mysql-user=root --mysql-password=readahead \
	 --num-threads=064 --max-requests=10000 --max-time=900 run

	10000-transactions run
				threads    2.6.20   on-demand    gain
				      1     62.81       64.56   +2.8%
				      2     67.97       70.93   +4.4%
				      4     81.81       85.87   +5.0%
				      8     94.60       97.89   +3.5%
				     16     99.07      104.68   +5.7%
				     32     95.93      104.28   +8.7%
				     64     96.48      103.68   +7.5%
	5000-transactions run
				      1     48.21       48.65   +0.9%
				      8     68.60       70.19   +2.3%
				     64     70.57       74.72   +5.9%
	2000-transactions run
				      1     37.57       38.04   +1.3%
				      2     38.43       38.99   +1.5%
				      4     45.39       46.45   +2.3%
				      8     51.64       52.36   +1.4%
				     16     54.39       55.18   +1.5%
				     32     52.13       54.49   +4.5%
				     64     54.13       54.61   +0.9%

Those are interesting results. Some investigation shows that
	- MySQL is accessing the db file non-uniformly: some parts are
	  hotter than others
	- It is mostly doing 4-page random reads, and sometimes doing two
	  reads in a row, the latter of which triggers a 16-page readahead.
	- The on-demand readahead leaves many lookahead pages (flagged
	  PG_readahead) there. Many of them will be hit, and trigger
	  more readahead pages, which might save more seeks.
	- Naturally, the readahead windows tend to lie in hot areas,
	  and the lookahead pages in hot areas are more likely to be hit.
	- The higher the overall read density, the greater the possible gain.

That also explains the adaptive readahead tricks for clustered random reads.

readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start a new 100KB/s stream every
second, until reaching 200 streams.

			      max throughput     min avg I/O size
		2.6.20:            5MB/s               16KB
		on-demand:        15MB/s              140KB

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu 5ce1110b92 readahead: data structure and routines
Extend struct file_ra_state to support the on-demand readahead logic.  Also
define some helpers for it.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:44 -07:00
Fengguang Wu d77c2d7cc5 readahead: introduce PG_readahead
Introduce a new page flag: PG_readahead.

It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
it runs out of cached pages!

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
David Brownell 2ba2d00363 AIO sparse fix (type of ki_flags)
Fix type issue reported by latest 'sparse': kiocb.ki_flags should be
"unsigned long" (not "long"), to match bitop type signature.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Akinobu Mita e53252d97e unregister_chrdev() return void
unregister_chrdev() does not return meaningful value.  This patch makes it
return void like most unregister_* functions.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Pavel Machek 77afcf78a2 PM: Integrate beeping flag with existing acpi_sleep flags
Move "debug during resume from s2ram" into the variable we already use
for real-mode flags to simplify code. It also closes a nasty trap for
the user in acpi_sleep_setup: the order of parameters actually mattered there,
with acpi_sleep=s3_bios,s3_mode doing something different from
acpi_sleep=s3_mode,s3_bios.
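
To see why a single flags variable makes the order irrelevant, consider
this userspace sketch of a comma-separated option parser.  The function
name, the bit values and the "s3_beep" option string are assumptions made
for illustration, not copied from the patch.

```c
#include <string.h>

/* Hypothetical bit values for the combined flags variable. */
#define SK_S3_BIOS 1u
#define SK_S3_MODE 2u
#define SK_S3_BEEP 4u

/* OR one bit per recognised option into a single word. */
static unsigned int sk_parse_acpi_sleep(const char *str)
{
	unsigned int flags = 0;

	while (str && *str) {
		size_t len = strcspn(str, ","); /* length of this option */

		if (len == 7 && !strncmp(str, "s3_bios", 7))
			flags |= SK_S3_BIOS;
		else if (len == 7 && !strncmp(str, "s3_mode", 7))
			flags |= SK_S3_MODE;
		else if (len == 7 && !strncmp(str, "s3_beep", 7))
			flags |= SK_S3_BEEP;

		str += len;
		if (*str == ',')
			str++; /* skip the separator */
	}
	return flags;
}
```

Because each option only sets its own bit, "s3_bios,s3_mode" and
"s3_mode,s3_bios" produce the same value.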

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Nigel Cunningham 5a60d6235c PM: Optional beeping during resume from suspend to RAM
Add a feature allowing the user to make the system beep during a resume from
suspend to RAM, on x86_64 and i386.

This is useful for users with a broken resume from RAM, who can then verify
whether control reaches the kernel after a wake-up event.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:43 -07:00
Rafael J. Wysocki bd804eba1c PM: Introduce pm_power_off_prepare
Introduce the pm_power_off_prepare() callback that can be registered by the
interested platforms in analogy with pm_idle() and pm_power_off(), used for
preparing the system to power off (needed by ACPI).

This allows us to drop acpi_sysclass and device_acpi that are only defined in
order to register the ACPI power off preparation callback, which is needed by
pm_power_off() registered in a much different way.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:42 -07:00
Rafael J. Wysocki b10d911749 PM: introduce hibernation and suspend notifiers
Make it possible to register hibernation and suspend notifiers, so that
subsystems can perform hibernation-related or suspend-related operations that
should not be carried out by device drivers' .suspend() and .resume()
routines.
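
The notifier idea can be sketched in user space as a linked list of callbacks invoked on each event. This is a simplified model, not the kernel's notifier chain implementation; the event names, struct notifier, and both helper functions are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical event codes; the kernel defines its own. */
enum pm_event { EV_SUSPEND_PREPARE, EV_POST_SUSPEND };

struct notifier {
    int (*call)(struct notifier *nb, enum pm_event ev);
    struct notifier *next;
};

static struct notifier *chain;   /* head of the registered list */

static void notifier_register(struct notifier *nb)
{
    nb->next = chain;
    chain = nb;
}

/* Invoke every registered callback; stop at the first failure. */
static int notifier_call_chain(enum pm_event ev)
{
    for (struct notifier *nb = chain; nb; nb = nb->next) {
        int err = nb->call(nb, ev);

        if (err)
            return err;
    }
    return 0;
}

/* Example subsystem: counts how many events it has seen. */
static int events_seen;

static int count_events(struct notifier *nb, enum pm_event ev)
{
    (void)nb;
    (void)ev;
    events_seen++;
    return 0;
}

static struct notifier counter = { .call = count_events };
```

A subsystem registers once with notifier_register(&counter) and is then called for every event passed to notifier_call_chain().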

[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:42 -07:00
Rafael J. Wysocki 0c1eecfb34 Freezer: avoid freezing kernel threads prematurely
Kernel threads should not have TIF_FREEZE set when user space processes are
being frozen, since otherwise some of them might be frozen prematurely.
To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
when user space processes are to be frozen.

Namely, when user space processes are being frozen, we should only set
TIF_FREEZE for tasks that have p->mm different from NULL and don't have
PF_BORROWED_MM set in p->flags.  For this reason task_lock() must be used to
prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p).  Also, we
need to prevent the following scenario from happening:

* daemonize() is called by a task spawned from a user space code path
* freezer checks if the task has p->mm set and the result is positive
* task enters exit_mm() and clears its TIF_FREEZE
* freezer sets TIF_FREEZE for the task
* task calls try_to_freeze() and goes to the refrigerator, which is wrong at
  that point

This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
out that TIF_FREEZE should not be set).
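
The locking rule can be modelled in user space as follows. This is only a sketch: struct task, the PF_BORROWED_MM value, and the trivial task_lock()/task_unlock() stand-ins are assumptions for illustration and do not reproduce the kernel code. What it shows is that the check and the flag update sit inside one critical section.

```c
#include <stdbool.h>
#include <stddef.h>

#define PF_BORROWED_MM 0x1u   /* illustrative value, not the kernel's */

struct task {
    void *mm;             /* NULL for kernel threads */
    unsigned int flags;
    bool tif_freeze;
    bool locked;          /* stand-in for the real task lock */
};

static void task_lock(struct task *p)   { p->locked = true; }
static void task_unlock(struct task *p) { p->locked = false; }

/*
 * Mark p for freezing only if it is a genuine user space task.
 * The check and the flag update happen under the same lock, so
 * p->mm and p->flags cannot change in between.
 */
static bool freeze_user_task(struct task *p)
{
    bool marked = false;

    task_lock(p);
    if (p->mm != NULL && !(p->flags & PF_BORROWED_MM)) {
        p->tif_freeze = true;
        marked = true;
    }
    task_unlock(p);
    return marked;
}
```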

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:42 -07:00
Rafael J. Wysocki a634cc1016 swsusp: introduce restore platform operations
At least on some machines it is necessary to prepare the ACPI firmware for the
restoration of the system memory state from the hibernation image if the
"platform" mode of hibernation has been used.  Namely, in that case we need
to disable the GPEs before replacing the "boot" kernel with the "frozen"
kernel (cf. http://bugzilla.kernel.org/show_bug.cgi?id=7887).  After the
restore they will be re-enabled by hibernation_ops->finish(), but if the
restore fails, they have to be re-enabled by the restore code explicitly.

For this purpose we can introduce two additional hibernation operations,
called pre_restore() and restore_cleanup() and call them from the restore code
path.  Still, they should be called if the "platform" mode of hibernation has
been used, so we need to pass the information about the hibernation mode from
the "frozen" kernel to the "boot" kernel in the image header.
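
A minimal user-space sketch of this call pattern, under stated assumptions: struct restore_ops, restore_image(), and the load_* helpers are hypothetical, and a boolean stands in for the GPE state. It only shows the ordering: pre_restore() before the image is loaded, restore_cleanup() only when the restore fails, and neither unless the "platform" mode was recorded.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the platform hibernation operations. */
struct restore_ops {
    int  (*pre_restore)(void);      /* e.g. disable the GPEs */
    void (*restore_cleanup)(void);  /* e.g. re-enable them on failure */
};

static bool gpes_enabled = true;

static int  sketch_pre_restore(void)     { gpes_enabled = false; return 0; }
static void sketch_restore_cleanup(void) { gpes_enabled = true; }

static const struct restore_ops sketch_ops = {
    .pre_restore     = sketch_pre_restore,
    .restore_cleanup = sketch_restore_cleanup,
};

/*
 * platform_mode models the flag read from the image header.
 * load_image() returns 0 when control would pass to the restored
 * kernel and nonzero when the restore fails.
 */
static int restore_image(const struct restore_ops *ops, bool platform_mode,
                         int (*load_image)(void))
{
    int err;

    if (platform_mode) {
        err = ops->pre_restore();
        if (err)
            return err;
    }

    err = load_image();
    if (err && platform_mode)
        ops->restore_cleanup();   /* undo pre_restore() after a failure */

    return err;
}

static int load_ok(void)   { return 0; }
static int load_fail(void) { return -1; }
```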

Apparently, we can't drop the disabling of GPEs before the restore because of
Bug #7887.  We also can't do it unconditionally, because the GPEs wouldn't
have been enabled after a successful restore if the suspend had been done in
the 'shutdown' or 'reboot' mode.

In principle we could (and probably should) unconditionally disable the GPEs
before each snapshot creation *and* before the restore, but then we'd have to
unconditionally enable them after the snapshot creation as well as after the
restore (or restore failure).  Still, for this purpose we'd need to modify
acpi_enter_sleep_state_prep() and acpi_leave_sleep_state() and we'd have to
introduce some mechanism synchronizing the disabling/enabling of the GPEs with
the device drivers' .suspend()/.resume() routines and with
disable_/enable_nonboot_cpus().  However, this would have affected the
suspend (i.e. s2ram) code as well as the hibernation, which I'd like to avoid
in this patch series.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:42 -07:00
Mel Gorman bb2d5ce164 Remove alloc_zeroed_user_highpage()
alloc_zeroed_user_highpage() has no in-tree users and it is not exported.
As it is not exported, it can simply be removed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin 83c54070ee mm: fault feedback #2
This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer.  This requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).
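
For illustration, a caller that consumes bit-flag return codes might look like the sketch below. The SK_FAULT_* values, struct fault_stats, and account_fault() are hypothetical and do not match the kernel's VM_FAULT_* constants; the sketch only shows error bits being tested first, followed by major/minor accounting.

```c
/* Hypothetical flag values; the kernel's VM_FAULT_* constants differ. */
#define SK_FAULT_OOM    0x1u
#define SK_FAULT_SIGBUS 0x2u
#define SK_FAULT_MAJOR  0x4u

struct fault_stats {
    unsigned long maj_flt;
    unsigned long min_flt;
};

/* Returns 0 on success, -1 if the fault failed. */
static int account_fault(unsigned int fault, struct fault_stats *st)
{
    /* Any error bit means the fault was not handled. */
    if (fault & (SK_FAULT_OOM | SK_FAULT_SIGBUS))
        return -1;

    if (fault & SK_FAULT_MAJOR)
        st->maj_flt++;
    else
        st->min_flt++;
    return 0;
}
```

Because the codes are independent bits, one return value can carry both an outcome and extra information, which a plain enumerated code cannot do.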

[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin d0217ac04c mm: fault feedback #1
Change ->fault prototype.  We now return an int, which contains a
VM_FAULT_xxx code in the low byte and a FAULT_RET_xxx code in the next byte.
The FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things.  This is not quite the way Linus wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
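
The two-byte layout can be sketched as follows. The constant values and the fault_pack()/fault_vm_code()/fault_ret_code() helpers are hypothetical, chosen only to show one code stored in bits 0-7 and the other in bits 8-15 of the returned int.

```c
/* Hypothetical code values, chosen only for the example. */
#define SK_VM_FAULT_MINOR   0x01u
#define SK_FAULT_RET_LOCKED 0x01u

/* Combine the two codes: vm_code in the low byte, ret_code in the next. */
static int fault_pack(unsigned int vm_code, unsigned int ret_code)
{
    return (int)((vm_code & 0xffu) | ((ret_code & 0xffu) << 8));
}

/* Extract the low byte. */
static unsigned int fault_vm_code(int ret)
{
    return (unsigned int)ret & 0xffu;
}

/* Extract the second byte. */
static unsigned int fault_ret_code(int ret)
{
    return ((unsigned int)ret >> 8) & 0xffu;
}
```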

This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked, which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.

struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.

The page is now returned via a page pointer in the vm_fault struct.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin 54cb8821de mm: merge populate and nopage into fault (fixes nonlinear)
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.

->populate is a layering violation because the filesystem/pagecache code
should not need to know anything about the virtual memory mapping.  The hitch
here is that the ->nopage handler didn't pass down enough information (i.e.
pgoff).  But it is more logical to pass pgoff rather than have the ->nopage
function calculate it itself anyway (because that's a similar layering
violation).

Having the populate handler install the pte itself is likewise a nasty thing
to be doing.

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn.  Most of the old mechanism is still in place
so there is a lot of duplication that can be removed, and some nice cleanups
that become possible, if everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache.  Seems like a fringe functionality anyway.

NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
users have hit mainline yet.

[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
Nick Piggin d00806b183 mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache.  Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.

The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
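
The retry protocol described above can be modelled in a single thread. This is a sketch under assumptions: struct file_state, fault_once(), and the racer callback (which simulates a truncate landing inside the window) are hypothetical, and no real locking is involved.

```c
#include <stddef.h>

enum fault_result { FAULT_MAPPED, FAULT_RETRY, FAULT_EOF };

struct file_state {
    long i_size;                  /* file size in pages, for simplicity */
    unsigned int truncate_count;  /* bumped by every truncate */
};

/* Truncate: shrink i_size first, then bump the counter. */
static void truncate_to_zero(struct file_state *f)
{
    f->i_size = 0;
    f->truncate_count++;
}

/*
 * One fault attempt. racer, if non-NULL, runs between the page lookup
 * and the final check to simulate a concurrent truncate.
 */
static enum fault_result fault_once(struct file_state *f, long pgoff,
                                    void (*racer)(struct file_state *))
{
    unsigned int seq = f->truncate_count;   /* 1. sample the counter */

    if (pgoff >= f->i_size)                 /* 2. look the page up */
        return FAULT_EOF;

    if (racer)
        racer(f);

    if (seq != f->truncate_count)           /* 3. recheck (under ptl) */
        return FAULT_RETRY;

    return FAULT_MAPPED;
}
```

Note that the model detects only truncation, which is exactly the limitation the next paragraph of the message describes.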

Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size.  do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size).  The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data.  Valid mappings to the same
place will see a different page.

Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit.  He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight.  However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit.  Scalability is not an issue.

This patch implements this latter approach.  ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so).  do_no_page only unlocks the page after setting up the mapping
completely.  Invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).

This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:41 -07:00
David Chinner 5417169026 [FS] Implement block_page_mkwrite.
Many filesystems need a ->page_mkwrite callout to correctly
set up pages that have been written to by mmap. This is especially
important when mmap is writing into holes as it allows filesystems
to correctly account for and allocate space before the mmap
write is allowed to proceed.

Protection against truncate races is provided by locking the page
and checking to see whether the page mapping is correct and whether
it is beyond EOF so we don't end up allowing allocations beyond
the current EOF or changing EOF as a result of a mmap write.
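
The two checks can be sketched in user space as below. struct page_sketch and mkwrite_allowed() are hypothetical and greatly simplified; the sketch assumes the caller already holds the page lock, so neither field can change while it is examined.

```c
#include <stdbool.h>

struct page_sketch {
    const void *mapping;   /* which file the page currently belongs to */
    long long offset;      /* byte offset of the page in that file */
};

/*
 * Decide whether an mmap write to this page may proceed.
 * Assumes the page lock is held by the caller.
 */
static bool mkwrite_allowed(const struct page_sketch *page,
                            const void *inode_mapping, long long i_size)
{
    if (page->mapping != inode_mapping)   /* page was truncated away */
        return false;
    if (page->offset >= i_size)           /* page lies beyond EOF */
        return false;
    return true;
}
```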

SGI-PV: 940392
SGI-Modid: 2.6.x-xfs-melb:linux:29146a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-07-19 19:50:50 +10:00
Linus Torvalds ce524c8360 Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  eHEA: Fix bonding support
  Blackfin ethernet driver: on chip ethernet MAC controller driver
  fix wrong argument of tc35815_read_plat_dev_addr()
  ARM/ETHER3: Handle multicast frames.
  SAA9730: Handle multicast frames.
  NI5010: Handle multicast frames.
  NS83820: Handle multicast frames.
  Fix RGMII-ID handling in gianfar
  Fix Vitesse RGMII-ID support
  Add phy-connection-type to gianfar nodes
  Fix Vitesse 824x PHY interrupt acking
  [PATCH] zd1211rw: Add ID for Siemens Gigaset USB Stick 54
  [PATCH] zd1211rw: Add ID for Planex GW-US54GXS
  [PATCH] Update version ipw2200 stamp to 1.2.2
  [PATCH] ipw2200: Fix ipw_isr() comments error on shared IRQ
  [PATCH] Fix ipw2200 set wrong power parameter causing firmware error
  [PATCH] ipw2100: Fix `iwpriv set_power` error
  [PATCH] softmac: Channel is listed twice in scan output
2007-07-18 18:33:45 -07:00
Linus Torvalds 29e7ee378e Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  sysfs: cosmetic clean up on node creation failure paths
  sysfs: kill an extra put in sysfs_create_link() failure path
  Driver core: check return code of sysfs_create_link()
  HOWTO: Add the knwon_regression URI to the documentation
  dev_vdbg() documentation
  dev_vdbg(), available with -DVERBOSE_DEBUG
  sysfs: make sysfs_init_inode() static
  sysfs: fix sysfs root inode nlink accounting
  Documentation fix devres.txt: lib/iomap.c -> lib/devres.c
  sysfs: avoid kmem_cache_free(NULL)
  PM: remove deprecated dpm_runtime_* routines
  PM: Remove deprecated sysfs files
  Driver core: accept all valid action-strings in uevent-trigger
  debugfs: remove rmdir() non-empty complaint
2007-07-18 18:28:08 -07:00
Linus Torvalds fc15bc817e Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/uio-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/uio-2.6:
  UIO: Hilscher CIF card driver
  UIO: Documentation
  UIO: Add the User IO core code
2007-07-18 18:27:50 -07:00
Linus Torvalds a8dcf12f9e Merge branch 'for-linus' of git://linux-nfs.org/~bfields/linux
* 'for-linus' of git://linux-nfs.org/~bfields/linux:
  locks: fix vfs_test_lock() comment
  locks: make posix_test_lock() interface more consistent
  nfs: disable leases over NFS
  gfs2: stop giving out non-cluster-coherent leases
  locks: export setlease to filesystems
  locks: provide a file lease method enabling cluster-coherent leases
  locks: rename lease functions to reflect locks.c conventions
  locks: share more common lease code
  locks: clean up lease_alloc()
  locks: convert an -EINVAL return to a BUG
  leases: minor break_lease() comment clarification
2007-07-18 18:27:00 -07:00
J. Bruce Fields 6d34ac199a locks: make posix_test_lock() interface more consistent
Since posix_test_lock(), like fcntl() and ->lock(), indicates absence or
presence of a conflicting lock by setting fl_type to, respectively, F_UNLCK
or something other than F_UNLCK, the return value is no longer needed.
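
The same convention is visible from user space through fcntl(F_GETLK), which reports "no conflict" by setting l_type to F_UNLCK in the caller's struct flock. The has_conflicting_lock() wrapper below is a hypothetical helper written for this example; it tests the whole file.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask whether a lock of the given type (F_RDLCK or F_WRLCK) on the
 * whole file would conflict with a lock held by another process.
 * Returns 1 if a conflicting lock exists, 0 if not, -1 on error.
 */
static int has_conflicting_lock(int fd, short type)
{
    struct flock fl = {
        .l_type   = type,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,        /* 0 means "to end of file" */
    };

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;

    /* The answer is carried in l_type, not in the return value. */
    return fl.l_type != F_UNLCK;
}
```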

Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-07-18 19:17:19 -04:00