Commit Graph

13435 Commits

Author SHA1 Message Date
Marc Eshel 85f3f1b3f7 lockd: pass cookie in nlmsvc_testlock
Change NLM internal interface to pass more information for test lock; we
need this to make sure the cookie information is pushed down to the place
where we do request deferral, which is handled for testlock by the
following patch.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2007-05-06 20:38:50 -04:00
Marc Eshel 2b36f412ab lockd: save lock state on deferral
We need to keep some state for a pending asynchronous lock request, so this
patch adds that state to struct nlm_block.

This also adds a function which defers the request, by calling
rqstp->rq_chandle.defer and storing the resulting deferred request in a
nlm_block structure which we insert into lockd's global block list.  That
new function isn't called yet, so it's dead code until a later patch.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2007-05-06 20:38:50 -04:00
Marc Eshel 2beb6614f5 locks: add fl_grant callback for asynchronous lock return
Acquiring a lock on a cluster filesystem may require communication with
remote hosts, and to avoid blocking lockd or nfsd threads during such
communication, we allow the results to be returned asynchronously.

When a ->lock() call needs to block, the file system will return
-EINPROGRESS, and then later return the results with a call to the
routine in the fl_grant field of the lock_manager_operations struct.

This differs from the case when ->lock returns -EAGAIN to a blocking
lock request; in that case, the filesystem calls fl_notify when the lock
is granted, and the caller retries the original lock.  So while
fl_notify is merely a hint to the caller that it should retry, fl_grant
actually communicates the final result of the lock operation (with the
lock already acquired in the successful case).

Therefore fl_grant takes a lock, a status and, for the test lock case, a
conflicting lock.  We also allow fl_grant to return an error to the
filesystem, to handle the case where the fl_grant request arrives after
the lock manager has already given up waiting for it.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2007-05-06 20:38:49 -04:00
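
Note on 2beb6614f5: the asynchronous return path described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the toy_* names are hypothetical and are not kernel APIs, and the -ENOENT return for a late grant is an illustrative choice rather than something the commit message specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct toy_request {
    int waiting;   /* 1 while the lock manager still wants the answer */
    int result;    /* final status delivered through the grant callback */
};

/* Plays the role of ->lock(): the filesystem cannot answer immediately,
 * so it marks the request as pending and returns -EINPROGRESS. */
int toy_fs_lock(struct toy_request *req)
{
    assert(req != NULL);
    req->waiting = 1;
    req->result = -EINPROGRESS;
    return -EINPROGRESS;
}

/* Plays the role of fl_grant: delivers the final result of the lock call.
 * Returns an error (here -ENOENT, an assumption) if the lock manager has
 * already given up waiting for this request. */
int toy_grant(struct toy_request *req, int status)
{
    assert(req != NULL);
    if (!req->waiting)
        return -ENOENT;
    req->result = status;
    req->waiting = 0;
    return 0;
}

/* The lock manager stops waiting, for example after a timeout. */
void toy_give_up(struct toy_request *req)
{
    assert(req != NULL);
    req->waiting = 0;
}
```

Unlike a retry hint in the style of fl_notify, the grant call in this model carries the final status itself, and its return value tells the filesystem whether anyone was still listening.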
Marc Eshel 9b9d2ab415 locks: add lock cancel command
Lock managers need to be able to cancel pending lock requests.  In the case
where the exported filesystem manages its own locks, it's not sufficient just
to call posix_unblock_lock(); we need to let the filesystem know what's
happening too.

We do this by adding a new fcntl lock command: FL_CANCELLK.  Some day this
might also be made available to userspace applications that could benefit from
an asynchronous locking api.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-05-06 20:38:28 -04:00
Marc Eshel 150b393456 locks: allow {vfs,posix}_lock_file to return conflicting lock
The nfsv4 protocol's lock operation, in the case of a conflict, returns
information about the conflicting lock.

It's unclear how clients can use this, so for now we're not going so far as to
add a filesystem method that can return a conflicting lock, but we may as well
return something in the local case when it's easy to.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-05-06 19:23:24 -04:00
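
Note on 150b393456: returning the conflicting lock "in the local case" might look something like the sketch below. It is a toy user-space model, not the kernel's implementation; struct toy_lock, toy_lock_file() and the optional conflock out-parameter are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lock record: an owner id and an inclusive byte range. */
struct toy_lock { int owner; long start, end; };

static int toy_overlaps(const struct toy_lock *a, const struct toy_lock *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Check whether `req` could be granted against held[0..n). On conflict,
 * return -EAGAIN and, if `conflock` is non-NULL, copy the first
 * conflicting lock into it so the caller can report it. */
int toy_lock_file(const struct toy_lock *held, size_t n,
                  const struct toy_lock *req, struct toy_lock *conflock)
{
    size_t i;

    assert(held != NULL || n == 0);
    assert(req != NULL);
    for (i = 0; i < n; i++) {
        if (held[i].owner != req->owner && toy_overlaps(&held[i], req)) {
            if (conflock != NULL)
                *conflock = held[i];
            return -EAGAIN;
        }
    }
    return 0;
}
```

Making the out-parameter optional means callers that do not care about the conflicting lock can pass NULL and behave as before.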
Marc Eshel 7723ec9777 locks: factor out generic/filesystem switch from setlock code
Factor out the code that switches between generic and filesystem-specific lock
methods; eventually we want to call this from lock managers (lockd and nfsd)
too; currently they only call the generic methods.

This patch does that for all the setlk code.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-05-06 18:08:49 -04:00
J. Bruce Fields 3ee17abd14 locks: factor out generic/filesystem switch from test_lock
Factor out the code that switches between generic and filesystem-specific lock
methods; eventually we want to call this from lock managers (lockd and nfsd)
too; currently they only call the generic methods.

This patch does that for test_lock.

Note that this hasn't been necessary until recently, because the few
filesystems that define ->lock() (nfs, cifs...) aren't exportable via NFS.
However GFS (and, in the future, other cluster filesystems) need to implement
their own locking to get cluster-coherent locking, and also want to be able to
export locking to NFS (lockd and NFSv4).

So we accomplish this by factoring out code such as this and exporting it for
the use of lockd and nfsd.

Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-05-06 18:06:44 -04:00
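
Note on 3ee17abd14: the "generic/filesystem switch" being factored out is, in essence, a dispatch on whether the file provides its own lock method. A minimal user-space sketch, assuming hypothetical toy_* types in place of the kernel's structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; all names here are hypothetical. */
struct toy_lock { long start, end; };

struct toy_file {
    /* Optional filesystem-specific lock method; NULL means "use generic". */
    int (*lock)(struct toy_file *f, struct toy_lock *fl);
    int generic_calls;   /* counts fallbacks to the generic path */
    int fs_calls;        /* counts calls into the filesystem method */
};

/* Generic (local-only) path, standing in for the POSIX lock code. */
static int toy_generic_test_lock(struct toy_file *f, struct toy_lock *fl)
{
    (void)fl;
    f->generic_calls++;
    return 0;
}

/* The factored-out switch: one entry point usable by any caller,
 * including lock managers that previously called only the generic path. */
int toy_test_lock(struct toy_file *f, struct toy_lock *fl)
{
    assert(f != NULL);
    if (f->lock != NULL)
        return f->lock(f, fl);
    return toy_generic_test_lock(f, fl);
}

/* Example filesystem method, as a cluster filesystem might supply. */
int toy_cluster_lock(struct toy_file *f, struct toy_lock *fl)
{
    (void)fl;
    f->fs_calls++;
    return 0;
}
```

With the switch in one exported helper, a caller does not need to know whether the underlying filesystem implements its own locking.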
Marc Eshel 9d6a8c5c21 locks: give posix_test_lock same interface as ->lock
posix_test_lock() and ->lock() do the same job but have gratuitously
different interfaces.  Modify posix_test_lock() so the two agree,
simplifying some code in the process.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
2007-05-06 17:39:00 -04:00
Linus Torvalds 15700770ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (38 commits)
  kconfig: fix mconf segmentation fault
  kbuild: enable use of code from a different dir
  kconfig: error out if recursive dependencies are found
  kbuild: scripts/basic/fixdep segfault on pathological string-o-death
  kconfig: correct minor typo in Kconfig warning message.
  kconfig: fix path to modules.txt in Kconfig help
  usr/Kconfig: fix typo
  kernel-doc: alphabetically-sorted entries in index.html of 'htmldocs'
  kbuild: be more explicit on missing .config file
  kbuild: clarify the creation of the LOCALVERSION_AUTO string.
  kbuild: propagate errors from find in scripts/gen_initramfs_list.sh
  kconfig: refer to qt3 if we cannot find qt libraries
  kbuild: handle compressed cpio initramfs-es
  kbuild: ignore section mismatch warning for references from .paravirtprobe to .init.text
  kbuild: remove stale comment in modpost.c
  kbuild/mkuboot.sh: allow spaces in CROSS_COMPILE
  kbuild: fix make mrproper for Documentation/DocBook/man
  kbuild: remove kconfig binaries during make mrproper
  kconfig/menuconfig: do not hardcode '.config'
  kbuild: override build timestamp & version
  ...
2007-05-06 13:21:57 -07:00
Linus Torvalds 6de410c2b0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (66 commits)
  KVM: Remove unused 'instruction_length'
  KVM: Don't require explicit indication of completion of mmio or pio
  KVM: Remove extraneous guest entry on mmio read
  KVM: SVM: Only save/restore MSRs when needed
  KVM: fix an if() condition
  KVM: VMX: Add lazy FPU support for VT
  KVM: VMX: Properly shadow the CR0 register in the vcpu struct
  KVM: Don't complain about cpu erratum AA15
  KVM: Lazy FPU support for SVM
  KVM: Allow passing 64-bit values to the emulated read/write API
  KVM: Per-vcpu statistics
  KVM: VMX: Avoid unnecessary vcpu_load()/vcpu_put() cycles
  KVM: MMU: Avoid heavy ASSERT at non debug mode.
  KVM: VMX: Only save/restore MSR_K6_STAR if necessary
  KVM: Fold drivers/kvm/kvm_vmx.h into drivers/kvm/vmx.c
  KVM: VMX: Don't switch 64-bit msrs for 32-bit guests
  KVM: VMX: Reduce unnecessary saving of host msrs
  KVM: Handle guest page faults when emulating mmio
  KVM: SVM: Report hardware exit reason to userspace instead of dmesg
  KVM: Retry sleeping allocation if atomic allocation fails
  ...
2007-05-06 13:21:18 -07:00
Linus Torvalds c6799ade4a Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (82 commits)
  [ARM] Add comments marking in-use ptrace numbers
  [ARM] Move syscall saving out of the way of utrace
  [ARM] 4360/1: S3C24XX: regs-udc.h remove unused macro
  [ARM] 4358/1: S3C24XX: mach-qt2410.c: remove linux/mmc/protocol.h header
  [ARM] mm 10: allow memory type to be specified with ioremap
  [ARM] mm 9: add additional device memory types
  [ARM] mm 8: define mem_types table L1 bit 4 to be for ARMv6
  [ARM] iop: add missing parens in macro
  [ARM] mm 7: remove duplicated __ioremap() prototypes
  ARM: OMAP: fix OMAP1 mpuio suspend/resume oops
  ARM: OMAP: MPUIO wake updates
  ARM: OMAP: speed up gpio irq handling
  ARM: OMAP: plat-omap changes for 2430 SDP
  ARM: OMAP: gpio object shrinkage, cleanup
  ARM: OMAP: /sys/kernel/debug/omap_gpio
  ARM: OMAP: Implement workaround for GPIO wakeup bug in OMAP2420 silicon
  ARM: OMAP: Enable 24xx GPIO autoidling
  [ARM] 4318/2: DSM-G600 Board Support
  [ARM] 4227/1: minor head.S fixups
  [ARM] 4328/1: Move i.MX UART regs to driver
  ...
2007-05-06 13:20:10 -07:00
Russell King 5cd4715515 Merge branch 'ixp4xx' into devel
Conflicts:

	include/asm-arm/arch-ixp4xx/io.h
2007-05-06 20:58:29 +01:00
Russell King 6f95416ebe Merge branches 'arm-mm', 'at91', 'clkevts', 'imx', 'iop', 'misc', 'netx', 'ns9xxx', 'omap', 'pxa', 'rpc', 's3c' and 'sa1100' into devel
2007-05-06 20:57:51 +01:00
Russell King 1b11652286 [ARM] Add comments marking in-use ptrace numbers
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-06 14:49:56 +01:00
Russell King 5ba6d3febd [ARM] Move syscall saving out of the way of utrace
utrace removes the ptrace_message field in task_struct.  Move our use
of this field into a new member in thread_info called "syscall".

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-06 13:56:26 +01:00
Linus Torvalds ea62ccd00f Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (231 commits)
  [PATCH] i386: Don't delete cpu_devs data to identify different x86 types in late_initcall
  [PATCH] i386: type may be unused
  [PATCH] i386: Some additional chipset register values validation.
  [PATCH] i386: Add missing !X86_PAE dependincy to the 2G/2G split.
  [PATCH] x86-64: Don't exclude asm-offsets.c in Documentation/dontdiff
  [PATCH] i386: avoid redundant preempt_disable in __unlazy_fpu
  [PATCH] i386: white space fixes in i387.h
  [PATCH] i386: Drop noisy e820 debugging printks
  [PATCH] x86-64: Fix allnoconfig error in genapic_flat.c
  [PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems
  [PATCH] x86-64: Share identical video.S between i386 and x86-64
  [PATCH] x86-64: Remove CONFIG_REORDER
  [PATCH] x86-64: Print type and size correctly for unknown compat ioctls
  [PATCH] i386: Remove copy_*_user BUG_ONs for (size < 0)
  [PATCH] i386: Little cleanups in smpboot.c
  [PATCH] x86-64: Don't enable NUMA for a single node in K8 NUMA scanning
  [PATCH] x86: Use RDTSCP for synchronous get_cycles if possible
  [PATCH] i386: Add X86_FEATURE_RDTSCP
  [PATCH] i386: Implement X86_FEATURE_SYNC_RDTSC on i386
  [PATCH] i386: Implement alternative_io for i386
  ...

Fix up trivial conflict in include/linux/highmem.h manually.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-05 14:55:20 -07:00
Ralf Baechle 989485c190 Fix nfsroot build
CC      fs/nfs/nfsroot.o
fs/nfs/nfsroot.c:131: error: tokens causes a section type conflict
make[2]: *** [fs/nfs/nfsroot.o] Error 1

This is due to mixing const and non-const content in the same section
which halfway recent gccs absolutely hate.  Fixed by dropping the const.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-05 14:15:32 -07:00
Linus Torvalds 68762f3d8e Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [TG3]: Add TG3_FLAG_SUPPORT_MSI flag.
  [TG3]: Eliminate the TG3_FLAG_5701_REG_WRITE_BUG flag.
  [TG3]: Eliminate the TG3_FLAG_GOT_SERDES_FLOWCTL flag.
  [TG3]: Remove reset during MAC address changes.
  [TG3]: WoL fixes.
  [TG3]: Clear GPIO mask before storing.
  [TG3]: Improve NVRAM sizing.
  [TG3]: Fix TSO bugs.
  [MAC80211]: Add maintainers entry for mac80211.
  [MAC80211]: Add debugfs attributes.
  [MAC80211]: Add mac80211 wireless stack.
  [MAC80211]: Add generic include/linux/ieee80211.h
  [NETLINK]: Remove references to process ID
  [AF_IUCV]: Compile fix - adopt to skbuff changes.
2007-05-05 14:13:36 -07:00
Linus Torvalds 4f7a307dc6 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (87 commits)
  [SCSI] fusion: fix domain validation loops
  [SCSI] qla2xxx: fix regression on sparc64
  [SCSI] modalias for scsi devices
  [SCSI] sg: cap reserved_size values at max_sectors
  [SCSI] BusLogic: stop using check_region
  [SCSI] tgt: fix rdma transfer bugs
  [SCSI] aacraid: fix aacraid not finding device
  [SCSI] aacraid: Correct SMC products in aacraid.txt
  [SCSI] scsi_error.c: Add EH Start Unit retry
  [SCSI] aacraid: [Fastboot] Panics for AACRAID driver during 'insmod' for kexec test.
  [SCSI] ipr: Driver version to 2.3.2
  [SCSI] ipr: Faster sg list fetch
  [SCSI] ipr: Return better qc_issue errors
  [SCSI] ipr: Disrupt device error
  [SCSI] ipr: Improve async error logging level control
  [SCSI] ipr: PCI unblock config access fix
  [SCSI] ipr: Fix for oops following SATA request sense
  [SCSI] ipr: Log error for SAS dual path switch
  [SCSI] ipr: Enable logging of debug error data for all devices
  [SCSI] ipr: Add new PCI-E IDs to device table
  ...
2007-05-05 13:30:44 -07:00
Linus Torvalds fabb5c4e4a Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/voyager-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/voyager-2.6:
  [VOYAGER] add smp alternatives
  [VOYAGER] Use modern techniques to setup and teardown low identiy mappings.
  [VOYAGER] Convert the monitor thread to use the kthread API
  [VOYAGER] clockevents driver: bring voyager in to line
  [VOYAGER] clockevents: correct boot cpu is zero assumption
  [VOYAGER] add smp_call_function_single
2007-05-05 13:30:23 -07:00
Arnaud Patard d0fdb5a58e [ARM] 4360/1: S3C24XX: regs-udc.h remove unused macro
The S3C2410_UDC_SETIX() macro is not used and won't be used by the udc
driver, so delete it.

Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 21:09:42 +01:00
Sergei Shtylyov e93df705af sl82c105: rework PIO support (take 2)
Get rid of the 'pio_speed' member of 'ide_drive_t' that was only used by this
driver by storing the PIO mode timings in the 'drive_data' instead -- this
allows us to greatly  simplify the process of "reloading" of the chip's timing
register and do it right in sl82c150_dma_off_quietly() and to get rid of two
extra arguments to config_for_pio() -- which got renamed to sl82c105_tune_pio()
and now returns a PIO mode selected, with ide_config_drive_speed() call moved
into the tuneproc() method, now called sl82c105_tune_drive() with the code to
set drive's 'io_32bit' and 'unmask' flags in its turn moved to its proper place
in the init_hwif() method.
Also, while at it, rename get_timing_sl82c105() into get_pio_timings() and get
rid of the code in it clamping cycle counts to 32 which was both incorrect and
never executed anyway...

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2007-05-05 22:03:49 +02:00
Russell King 3603ab2b62 [ARM] mm 10: allow memory type to be specified with ioremap
__ioremap() took a set of page table flags (specifically the cacheable
and bufferable bits) to control the mapping type.  However, with
the advent of ARMv6, this is far too limited.

Replace the page table flags with a memory type index, so that the
desired attributes can be selected from the mem_type table.

Finally, to prevent silent miscompilation due to the differing
arguments, rename the __ioremap() and __ioremap_pfn() functions.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 20:59:27 +01:00
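
Note on 3603ab2b62: replacing raw page table flags with an index into a table of memory types can be sketched as below. The enum values, flag bits and table contents are hypothetical and chosen only to show the lookup; they are not the ARM mem_type definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memory types and attribute bits, for illustration only. */
enum toy_mtype { TOY_MT_DEVICE, TOY_MT_DEVICE_CACHED, TOY_MT_COUNT };

struct toy_mem_type { unsigned int prot; };

#define TOY_CACHEABLE  0x1u
#define TOY_BUFFERABLE 0x2u

static const struct toy_mem_type toy_mem_types[TOY_MT_COUNT] = {
    [TOY_MT_DEVICE]        = { 0 },
    [TOY_MT_DEVICE_CACHED] = { TOY_CACHEABLE | TOY_BUFFERABLE },
};

/* Callers pass an index; the table decides the actual attributes.
 * Returns NULL for an index outside the table. */
const struct toy_mem_type *toy_get_mem_type(unsigned int mtype)
{
    if (mtype >= TOY_MT_COUNT)
        return NULL;
    return &toy_mem_types[mtype];
}

/* Convenience wrapper for callers that use a known-good enum value. */
unsigned int toy_prot_for(enum toy_mtype mtype)
{
    const struct toy_mem_type *t = toy_get_mem_type(mtype);

    assert(t != NULL);
    return t->prot;
}
```

Because the argument changes meaning from flag bits to an index while keeping an integer type, an old caller would still compile against the new signature, which is presumably why the commit also renames the functions.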
Russell King 0af92befeb [ARM] mm 9: add additional device memory types
Add cached device type for ioremap_cached().  Group all device memory
types together, and ensure that they all have a "MT_DEVICE" prefix.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 20:28:16 +01:00
Jiri Benc f0706e828e [MAC80211]: Add mac80211 wireless stack.
Add mac80211, the IEEE 802.11 software MAC layer.

Signed-off-by: Jiri Benc <jbenc@suse.cz>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-05-05 11:45:53 -07:00
Jiri Benc a9de8ce094 [MAC80211]: Add generic include/linux/ieee80211.h
Add generic IEEE 802.11 definitions.

Signed-off-by: Jiri Benc <jbenc@suse.cz>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-05 11:43:04 -07:00
Herbert Xu cf130cb102 [NETLINK]: Remove references to process ID
People treating the *_pid fields in netlink as a process ID has caused
endless confusion over the years.  The fact that our own netlink.h
does this only adds to the confusion.

So here is a patch to change the comments to refer to it as the port
ID which hopefully will make it clear what the purpose of the fields
really is.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-05 11:42:03 -07:00
Russell King ad902cb9e2 [ARM] iop: add missing parens in macro
Fix:

 drivers/serial/8250.c:1837: warning: suggest parentheses around arithmetic in operand of |

due to a macro argument being used without the required parentheses.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 11:59:13 +01:00
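
Note on ad902cb9e2: the class of bug behind that warning can be reproduced with a pair of toy macros. These are hypothetical and are not the actual IOP macro; they only show how an unparenthesized argument and body interact with operator precedence.

```c
#include <assert.h>

/* Hypothetical macros; not the actual IOP definition. */
#define TOY_FLAGS_BAD(x)   x | 0x10      /* argument and body unparenthesized */
#define TOY_FLAGS_GOOD(x)  ((x) | 0x10)  /* safe in any expression context */

/* Expands to: a + b | 0x10 * 2, i.e. (a + b) | (0x10 * 2). */
int toy_bad(int a, int b)  { return TOY_FLAGS_BAD(a + b) * 2; }

/* Expands to: ((a + b) | 0x10) * 2, which is what the caller meant. */
int toy_good(int a, int b) { return TOY_FLAGS_GOOD(a + b) * 2; }

/* Sanity check of the arithmetic described in the comments above. */
void toy_selfcheck(void)
{
    assert(toy_bad(1, 2) == 35);   /* 3 | 32 */
    assert(toy_good(1, 2) == 38);  /* (3 | 16) * 2 */
}
```

Compilers that implement this diagnostic may warn about the expansion inside toy_bad(), much like the message quoted in the commit.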
Russell King 0058ca32c3 [ARM] mm 7: remove duplicated __ioremap() prototypes
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 11:57:39 +01:00
Michael-Luke Jones 28bd3a0dcc [ARM] 4318/2: DSM-G600 Board Support
This patch adds support for the D-Link DSM-G600 Rev A.
This is an ARM XScale IXP4xx system relatively similar to
the NSLU2 and NAS-100D already supported by mainline. An
important difference is Gigabit Ethernet support using
the Via Velocity chipset.

This patch is the combined work of Michael Westerhof and
Alessandro Zummo, with contributions from Michael-Luke
Jones. This version addresses review comments from rmk
and Deepak Saxena.

Signed-off-by: Michael-Luke Jones <mlj28@cam.ac.uk>
Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Michael Westerhof <mwester@dls.net>
Signed-off-by: Deepak Saxena <dsaxena@plexity.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-05 10:06:49 +01:00
Linus Torvalds 62ea6d8021 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc: (46 commits)
  mmc-omap: Clean up omap set_ios and make MMC_POWER_ON work
  mmc-omap: Fix omap to use MMC_POWER_ON
  mmc-omap: add missing '\n'
  mmc: make tifm_sd_set_dma_data() static
  mmc: remove old card states
  mmc: support unsafe resume of cards
  mmc: separate out reading EXT_CSD
  mmc: break apart switch function
  MMC: Fix handling of low-voltage cards
  MMC: Consolidate voltage definitions
  mmc: add bus handler
  wbsd: check for data opcode earlier
  mmc: Separate out protocol ops
  mmc: Move core functions to subdir
  mmc: deprecate mmc bus topology
  mmc: remove card upon suspend
  mmc: allow suspended block driver to be removed
  mmc: Flush pending detects on host removal
  mmc: Move host and card drivers to subdirs
  mmc: Move queue functions to mmc_block
  ...
2007-05-04 21:44:34 -07:00
Linus Torvalds 4d4700707c Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (28 commits)
  NFS: Fix a compile glitch on 64-bit systems
  NFS: Clean up nfs_create_request comments
  spkm3: initialize hash
  spkm3: remove bad kfree, unnecessary export
  spkm3: fix spkm3's use of hmac
  NFS4: invalidate cached acl on setacl
  NFS: Fix directory caching problem - with test case and patch.
  NFS: Set meaningful value for fattr->time_start in readdirplus results.
  NFS: Added support to turn off the NFSv3 READDIRPLUS RPC.
  SUNRPC: RPC client should retry with different versions of rpcbind
  SUNRPC: remove old portmapper
  NFS: switch NFSROOT to use new rpcbind client
  SUNRPC: switch the RPC server to use the new rpcbind registration API
  SUNRPC: switch socket-based RPC transports to use rpcbind
  SUNRPC: introduce rpcbind: replacement for in-kernel portmapper
  SUNRPC: Eliminate side effects from rpc_malloc
  SUNRPC: RPC buffer size estimates are too large
  NLM: Shrink the maximum request size of NLM4 requests
  NFS: Use pgoff_t in structures and functions that pass page cache offsets
  NFS: Clean up nfs_sync_mapping_wait()
  ...
2007-05-04 19:55:11 -07:00
Linus Torvalds 7e20ef030d Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (49 commits)
  [SCTP]: Set assoc_id correctly during INIT collision.
  [SCTP]: Re-order SCTP initializations to avoid race with sctp_rcv()
  [SCTP]: Fix the SO_REUSEADDR handling to be similar to TCP.
  [SCTP]: Verify all destination ports in sctp_connectx.
  [XFRM] SPD info TLV aggregation
  [XFRM] SAD info TLV aggregationx
  [AF_RXRPC]: Sort out MTU handling.
  [AF_IUCV/IUCV] : Add missing section annotations
  [AF_IUCV]: Implementation of a skb backlog queue
  [NETLINK]: Remove bogus BUG_ON
  [IPV6]: Some cleanups in include/net/ipv6.h
  [TCP]: zero out rx_opt in tcp_disconnect()
  [BNX2]: Fix TSO problem with small MSS.
  [NET]: Rework dev_base via list_head (v3)
  [TCP] Highspeed: Limited slow-start is nowadays in tcp_slow_start
  [BNX2]: Update version and reldate.
  [BNX2]: Print bus information for PCIE devices.
  [BNX2]: Add 1-shot MSI handler for 5709.
  [BNX2]: Restructure PHY event handling.
  [BNX2]: Add indirect spinlock.
  ...
2007-05-04 19:36:58 -07:00
Linus Torvalds a3d52136ee Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/dtor/input: (65 commits)
  Input: gpio_keys - add support for switches (EV_SW)
  Input: cobalt_btns - convert to use polldev library
  Input: add skeleton for simple polled devices
  Input: update some documentation
  Input: wistron - fix typo in keymap for Acer TM610
  Input: add input_set_capability() helper
  Input: i8042 - add Fujitsu touchscreen/touchpad PNP IDs
  Input: i8042 - add Panasonic CF-29 to nomux list
  Input: lifebook - split into 2 devices
  Input: lifebook - add signature of Panasonic CF-29
  Input: lifebook - activate 6-byte protocol on select models
  Input: lifebook - work properly on Panasonic CF-18
  Input: cobalt buttons - separate device and driver registration
  Input: ati_remote - make button repeat sensitivity configurable
  Input: pxa27x - do not use deprecated SA_INTERRUPT flag
  Input: ucb1400 - make delays configurable
  Input: misc devices - switch to using input_dev->dev.parent
  Input: joysticks - switch to using input_dev->dev.parent
  Input: touchscreens - switch to using input_dev->dev.parent
  Input: mice - switch to using input_dev->dev.parent
  ...

Fixed up conflicts with core device model removal of "struct subsystem" manually.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 18:16:12 -07:00
Linus Torvalds 5b33991576 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  remove "struct subsystem" as it is no longer needed
  sysfs: printk format warning
  DOC: Fix wrong identifier name in Documentation/driver-model/devres.txt
  platform: reorder platform_device_del
  Driver core: fix show_uevent from taking up way too much stack
2007-05-04 18:04:48 -07:00
Linus Torvalds 89661adaae Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: (59 commits)
  PCI: Free resource files in error path of pci_create_sysfs_dev_files()
  pci-quirks: disable MSI on RS400-200 and RS480
  PCI hotplug: Use menuconfig objects
  PCI: ZT5550 CPCI Hotplug driver fix
  PCI: rpaphp: Remove semaphores
  PCI: rpaphp: Ensure more pcibios_add/pcibios_remove symmetry
  PCI: rpaphp: Use pcibios_remove_pci_devices() symmetrically
  PCI: rpaphp: Document is_php_dn()
  PCI: rpaphp: Document find_php_slot()
  PCI: rpaphp: Rename rpaphp_register_pci_slot() to rpaphp_enable_slot()
  PCI: rpaphp: refactor tail call to rpaphp_register_slot()
  PCI: rpaphp: remove rpaphp_set_attention_status()
  PCI: rpaphp: remove print_slot_pci_funcs()
  PCI: rpaphp: Remove setup_pci_slot()
  PCI: rpaphp: remove a call that does nothing but a pointer lookup
  PCI: rpaphp: Remove another wrappered function
  PCI: rpaphp: Remve another call that is a wrapper
  PCI: rpaphp: remove a function that does nothing but wrap debug printks
  PCI: rpaphp: Remove un-needed goto
  PCI: rpaphp: Fix a memleak; slot->location string was never freed
  ...
2007-05-04 18:04:29 -07:00
Linus Torvalds 6adae5d9e6 Merge master.kernel.org:/pub/scm/linux/kernel/git/herbert/crypto-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  [CRYPTO] padlock: Remove pointless padlock module
  [CRYPTO] api: Add ablkcipher_request_set_tfm
  [CRYPTO] cryptd: Add software async crypto daemon
  [CRYPTO] api: Do not remove users unless new algorithm matches
  [CRYPTO] cryptomgr: Fix parsing of nested templates 
  [CRYPTO] api: Add async blkcipher type
  [CRYPTO] templates: Pass type/mask when creating instances
  [CRYPTO] tcrypt: Use async blkcipher interface
  [CRYPTO] api: Add async block cipher interface
  [CRYPTO] api: Proc functions should be marked as unused
2007-05-04 18:01:17 -07:00
Masashi Kimoto 640729014e ps3: Make `ps3videomode -v 0' (auto mode) work again
ps3: Make `ps3videomode -v 0' (auto mode) work again

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:08 -07:00
Geert Uytterhoeven fffe52e86b ps3av: misc updates
ps3av:
  - Move the definition of struct ps3av to ps3av.c, as it's locally used only.
  - Kill ps3av.sem, use the existing ps3av.mutex instead.
  - Make the 512-byte buffer in ps3av_do_pkt() static to reduce stack usage.
    Its use is protected by a semaphore anyway.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:08 -07:00
Geert Uytterhoeven 5caf5db887 ps3av: thread updates
ps3av: Replace the kernel_thread and the ping pong semaphores by a singlethread
workqueue and a completion.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:08 -07:00
Geert Uytterhoeven 254f9c5cd2 Convert non-highmem kmap_atomic() to static inline function
Convert kmap_atomic() in the non-highmem case from a macro to a static
inline function, for better type-checking and the ability to pass void
pointers instead of struct page pointers.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:08 -07:00
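
Note on 254f9c5cd2: the type-checking benefit of a static inline function over a macro can be illustrated with a toy example. The names below are hypothetical and the bodies are deliberately simplified; this is not the kernel's kmap_atomic().

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { char data[64]; };

/* Macro version: there is no prototype, so the compiler accepts any
 * expression that can be cast to a pointer, whatever its type. */
#define toy_kmap_macro(page)  ((void *)(page))

/* Inline version: the argument is checked against the prototype, and the
 * result is a plain void pointer that converts to any object pointer. */
static inline void *toy_kmap_inline(struct toy_page *page)
{
    assert(page != NULL);
    return page->data;
}
```

Passing an argument of the wrong type to toy_kmap_inline() is diagnosed at the call site, whereas the macro version would silently cast it.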
Finn Thain f877958879 NuBus header update
Sync the nubus defines with the latest code in the mac68k repo. Some of these
are needed for DP8390 driver update in the next patch.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:07 -07:00
Finn Thain df7e7d6a89 m68k: remove unused adb.h
The asm-m68k/adb.h header is unused. Some definitions are wrong and the rest
are duplicated in linux/adb.h. Remove it.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:07 -07:00
Roman Zippel b3e2fd9ceb lockdep: Add missing disable/enable irq variant
Add missing disable/enable irq variant

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:06 -07:00
Michael Schmitz c04cb856e2 m68k: Atari keyboard and mouse support.
Atari keyboard and mouse support.
(reformatting and Kconfig fixes by Roman Zippel)

Signed-off-by: Michael Schmitz <schmitz@debian.org>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:59:05 -07:00
Linus Torvalds 8d41f0e8d5 Merge branch 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
* 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6: (44 commits)
  i2c-s3c2410: Fix bug in releasing driver
  i2c-s3c2410: Fix I2C SDA to SCL setup time
  i2c: New i2c-tiny-usb bus driver
  i2c: Documentation update
  i2c: SPIN_LOCK_UNLOCKED cleanup
  i2c: Obsolete i2c-ixp2000, i2c-ixp4xx and scx200_i2c
  i2c: New Simtec I2C bus driver
  i2c: Bitbanging I2C bus driver using the GPIO API
  Use menuconfig objects - I2C
  i2c: Restore i2c_smbus_read_block_data
  i2c-pxa: Clean transaction stop
  i2c-algo-bit: Improve debugging
  i2c-algo-bit: Implement a 50/50 SCL duty cycle
  i2c-omap: Switch to static adapter numbering
  i2c: Blackfin Two Wire Interface driver
  i2c-algo-sgi: Comment and whitespace cleanups
  i2c: Make i2c_del_driver a void function
  i2c: Move i2c-isa-only exported symbol declarations
  i2c: Document i2c_new_device()
  i2c: Add i2c_new_probed_device()
  ...

Fixed trivial conflict in Documentation/feature-removal-schedule.txt manually.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-04 17:46:27 -07:00
Linus Torvalds ded1504dfa Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
  [CPUFREQ] Report the number of processors in PowerNow-k8 correctly
  [CPUFREQ] do not declare undefined functions
  [CPUFREQ] cleanup kconfig options
  [CPUFREQ] Longhaul - Revert Longhaul ver. 2
  [CPUFREQ] Remove deprecated /proc/acpi/processor/performance write support
  [CPUFREQ] Fix limited cpufreq when booted on battery
  Fix preemption warnings in speedstep-centrino.c
  [CPUFREQ] Longhaul - Correct PCI code
  [CPUFREQ] p4-clockmod: switch to rdmsr_on_cpu/wrmsr_on_cpu
2007-05-04 17:38:48 -07:00
Linus Torvalds 98b96173c7 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart
* master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
  [AGPGART] sworks-agp: Switch to PCI ref counting APIs
  [AGPGART] Nvidia AGP: Use refcount aware PCI interfaces
  [AGPGART] Fix sparse warning in sgi-agp.c
  [AGPGART] Intel-agp adjustments
  [AGPGART] Move [un]map_page_into_agp into asm/agp.h
  [AGPGART] Add missing calls to global_flush_tlb() to ali-agp
  [AGPGART] prevent probe collision of sis-agp and amd64_agp
2007-05-04 17:38:16 -07:00
Vlad Yasevich 07d9396771 [SCTP]: Set assoc_id correctly during INIT collision.
During the INIT/COOKIE-ACK collision cases, it's possible to get
into a situation where the association id is not yet set at the time
of the user event generation.  As a result, user events have an
association id set to 0 which will confuse applications.

This happens if we hit case B of duplicate cookie processing.
In the particular example found and provided by Oscar Isaula
<Oscar.Isaula@motorola.com>, flow looks like this:
A				B
---- INIT------->  (lost)
	    <---------INIT------
---- INIT-ACK--->
	    <------ Cookie ECHO

When the Cookie Echo is received, we end up trying to update the
association that was created on A as a result of the (lost) INIT,
but that association doesn't have the ID set yet.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 13:55:27 -07:00
Sridhar Samudrala 827bf12236 [SCTP]: Re-order SCTP initializations to avoid race with sctp_rcv()
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 13:36:30 -07:00
Jamal Hadi Salim 5a6d34162f [XFRM] SPD info TLV aggregation
Aggregate the SPD info TLVs.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 12:55:39 -07:00
Jamal Hadi Salim af11e31609 [XFRM] SAD info TLV aggregation
Aggregate the SAD info TLVs.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 12:55:13 -07:00
Jennifer Hunt 561e036006 [AF_IUCV]: Implementation of a skb backlog queue
With the initial implementation we failed to implement an skb backlog
queue. The result is that socket receive processing tossed packets.
Since AF_IUCV connections work synchronously, this leads to
connection hangs. Problems with read, close and select also occurred.

Using an skb backlog queue fixes all of these problems.

Signed-off-by: Jennifer Hunt <jenhunt@us.ibm.com>
Signed-off-by: Frank Pavlic <fpavlic@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 12:22:07 -07:00
Martin Schwidefsky cf8ba7a955 [S390] add hardware capability support (ELF_HWCAP).
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-05-04 18:48:35 +02:00
Cornelia Huck 52706ec903 [S390] cio: Deprecate read_dev_chars() and read_conf_data{,_lpm}().
These helper functions are a leftover from 2.4 sync I/O and are a
notorious source for bugs. They lead to device driver specific code
creeping into cio, and some issues can't really be fixed at all.

Device drivers can easily implement those functions themselves in a
more robust manner, so let's get rid of them.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-05-04 18:48:25 +02:00
Christoph Hellwig 33464e3b57 [S390] get rid of kprobes notifier call chain.
And here's a port of the powerpc patch to get rid of the notifier
chain completely to s390.  It's on top of Martin's patch as that one
is in mainline already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-05-04 18:48:24 +02:00
Eric Dumazet db3459d1a7 [IPV6]: Some cleanups in include/net/ipv6.h
1) struct ip6_flowlabel : moves 'users' field to avoid two 32bits
   holes for 64bit arches. Shrinks by 8 bytes sizeof(struct
   ip6_flowlabel)

2) ipv6_addr_cmp() and ipv6_addr_copy() don't need (void *) casts:
   The compiler might take into account the natural alignment of in6_addr
   structs to emit better code for memcpy()/memcmp(). Casts to (void *)
   force byte accesses.

3) ipv6_addr_prefix() optimization :

Better to clear whole struct, as compiler can emit better code for
memset(addr, 0, 16) (2 stores on x86_64), and avoid some conditional
branches.

# size vmlinux.after vmlinux.before
   text    data     bss     dec     hex filename
5262262  647612  557432 6467306  62aeea vmlinux.after
5262550  647612  557432 6467594  62b00a vmlinux.before

That's 288 bytes saved.
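
The effect described in point 1 can be reproduced in userspace. The sketch
below is only an illustration: the struct and field names are hypothetical
stand-ins (not the real ip6_flowlabel layout), and the sizes mentioned in the
comments assume a typical 64-bit ABI with 8-byte pointers and 4-byte ints.

```c
#include <stddef.h>

/* Hypothetical layout: each 32-bit field sits in front of an 8-byte-aligned
 * slot, so a 64-bit compiler inserts 4 bytes of padding after both. */
struct label_before {
	void *next;
	int   users;	/* followed by a 4-byte hole on 64-bit */
	void *opt;
	int   label;	/* followed by 4 bytes of tail padding on 64-bit */
};

/* Same fields, with the two 32-bit members moved next to each other. */
struct label_after {
	void *next;
	void *opt;
	int   users;
	int   label;
};

/* Bytes saved by the reordering (8 on the ABI assumed above). */
static size_t bytes_saved(void)
{
	return sizeof(struct label_before) - sizeof(struct label_after);
}
```

On an ABI with 4-byte pointers both layouts have the same size, which matches
the commit's wording that the holes exist "for 64bit arches".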

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 17:39:04 -07:00
Pavel Emelianov 7562f876cd [NET]: Rework dev_base via list_head (v3)
Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().
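
A minimal userspace sketch of the pattern, assuming a cut-down list_head and
a hypothetical demo_dev record in place of struct net_device. The macro name
for_each_demo_dev is made up for illustration and is not the kernel's
for_each_netdev definition.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical device record; only the embedded list node matters here. */
struct demo_dev {
	int ifindex;
	struct list_head dev_list;
};

/* Walk every record on the list without touching a ->next pointer by hand. */
#define for_each_demo_dev(d, head) \
	for ((d) = container_of((head)->next, struct demo_dev, dev_list); \
	     &(d)->dev_list != (head); \
	     (d) = container_of((d)->dev_list.next, struct demo_dev, dev_list))

static int sum_ifindex(struct list_head *head)
{
	struct demo_dev *d;
	int sum = 0;

	for_each_demo_dev(d, head)
		sum += d->ifindex;
	return sum;
}
```

Because callers only see the iteration macro, the list head behind it can
later be swapped (for example, for a per-namespace one) without rewriting
each loop.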

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 15:13:45 -07:00
Michael Chan 27a005b883 [BNX2]: Add support for 5709 Serdes.
Add PCI ID and code to support the 5709 Serdes PHY.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 13:23:41 -07:00
Michael Chan 427c2196b9 [ETHTOOL]: Add 2.5G bit definitions.
Add 2.5G supported and advertising bit definitions.  2.5G is supported
by the bnx2 driver.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 13:17:25 -07:00
Sascha Hauer ff4bfb2163 [ARM] 4328/1: Move i.MX UART regs to driver
This patch moves the i.MX UART register descriptions from
include/asm-arm/arch-imx/imx-regs.h to the serial driver itself.
This makes it easier to use the driver on other architectures such as mx31.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 20:24:21 +01:00
Sascha Hauer fe7fdb80e9 [ARM] 4329/1: fix position of NETX_SYSTEM_REG
This patch fixes the position of the netx reset control register

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 20:22:49 +01:00
Russell King 5559bca8e6 [ARM] ecard: Convert card type enum to a flag
'type' in the struct expansion_card is only used to indicate
whether this card is an EASI card or not.  Therefore, having
it as an enum is wasteful (and introduces additional noise
when we come to remove the enum.)  Convert it to a mere flag
instead.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:16:56 +01:00
Russell King c0b04d1b2c [ARM] ecard: Move private ecard junk out of asm/ecard.h
Move ecard.c private junk from asm/ecard.h to a local header file.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:16:56 +01:00
Andrew Victor 7776a94c31 [ARM] 4352/1: AT91: Platform data for LCD and AC97.
Define resources, platform_device and device registration functions for
the LCD and AC97 controllers on the AT91SAM9263.
Also update the AT91SAM9261 to use the common atmel_lcdfb driver.

Signed-off-by: Nicolas Ferre <nicolas.ferre@rfo.atmel.com>
Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:10:22 +01:00
Andrew Victor ce813b97e5 [ARM] 4350/1: AT91: Hardware header for ADC peripheral
Definitions for Analog-to-Digital Converter (ADC) found on the Atmel
AT91SAM9260 processor.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:10:20 +01:00
Dan Williams d2dd8b1fed [ARM] 4342/2: iop13xx: add resource definitions for the tpmi units
The tpmi units interface with the SAS controller on iop348.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:03:54 +01:00
Dan Williams e90ddd813d [ARM] 4348/4: iop3xx: Give Linux control over PCI initialization
Currently the iop3xx platform support code assumes that RedBoot is the
bootloader and has already initialized the ATU.  Linux should handle this
initialization for three reasons:

1/ The memory map that RedBoot sets up is not optimal (page_to_dma and
virt_to_phys return different addresses).  The effect of this is that using
the dma mapping API for the internal bus dma units generates pci bus
addresses that are incorrect for the internal bus.

2/ Not all iop platforms use RedBoot

3/ If the ATU is already initialized it indicates that the iop is an add-in
card in another host, it does not own the PCI bus, and should not be
re-initialized.

Changelog:
* rather than change nr_controllers to zero, simply do not call
  pci_common_init

Cc: Lennert Buytenhek <kernel@wantstofly.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-05-03 14:02:48 +01:00
Patrick McHardy fc38582db9 [NETFILTER]: bridge netfilter: consolidate header pushing/pulling code
Consolidate the common push/pull sequences into a few helper functions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:36:16 -07:00
Jorge Boncompte c2a1910b06 [NETFILTER]: nf_nat_proto_gre: do not modify/corrupt GREv0 packets through NAT
While porting some changes of the 2.6.21-rc7 pptp/proto_gre conntrack
and nat modules to a 2.4.32 kernel I noticed that the gre_key function
returns a wrong pointer to the GRE key of a version 0 packet thus
corrupting the packet payload.

The intended behaviour for GREv0 packets is to act like
nf_conntrack_proto_generic/nf_nat_proto_unknown so I have ripped the
offending functions (not used anymore) and modified the
nf_nat_proto_gre modules to not touch version 0 (non PPTP) packets.

Signed-off-by: Jorge Boncompte <jorge@dti2.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:34:42 -07:00
Ilpo Järvinen 0ec96822d5 [TCP]: Use S+L catcher only with SACK for now
TCP has a transitional state when SACK is not in use during
which this invariant is temporarily broken. Without SACK,
tcp_clean_rtx_queue does not decrement sacked_out. Therefore
calls to tcp_sync_left_out before sacked_out is again
corrected by tcp_fastretrans_alert can trigger this trap as
sacked_out still has a couple of segments that are already out
of window.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:30:34 -07:00
Patrick McHardy 4e9cac2ba4 [NET]: Add __dev_getfirstbyhwtype
Add __dev_getfirstbyhwtype for callers that don't want a reference but
some data from the device and thus need to take the rtnl anyway.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:28:13 -07:00
Randy Dunlap be52178b9f [NET] skbuff: fix kernel-doc
Fix skbuff.h kernel-doc:
linux-2.6.21-git4//include/linux/skbuff.h:316): No description found for parameter 'transport_header'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:16:20 -07:00
David Howells ef4533f8af [AFS]: Make the match_*() functions take const options.
Make the match_*() functions take a const pointer to the options table
and make strings pointers in the options table const too.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:10:39 -07:00
Eric Dumazet 709525fad8 [IPV6]: Get rid of __HAVE_ARCH_ADDR_SET.
__HAVE_ARCH_ADDR_SET seems unused these days, just get rid of it.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:08:43 -07:00
Avi Kivity 2ff81f70b5 KVM: Remove unused 'instruction_length'
As we no longer emulate in userspace, this is meaningless.  We don't
compute it on SVM anyway.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:32 +03:00
Avi Kivity 02c8320972 KVM: Don't require explicit indication of completion of mmio or pio
It is illegal to return from a pio or mmio request without completing
it, as mmio or pio is an atomic operation.  Therefore, we can simplify
the userspace interface by avoiding the completion indication.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:32 +03:00
Avi Kivity b8836737d9 KVM: Add fpu get/set operations
These are really helpful when migrating a floating point app to another
machine.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity e8207547d2 KVM: Add physical memory aliasing feature
With this, we can specify that accesses to one physical memory range will
be remapped to another.  This is useful for the vga window at 0xa0000 which
is used as a movable window into the (much larger) framebuffer.
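
The remapping idea can be sketched as a small lookup table. Everything below
is a hypothetical illustration: the mem_alias struct, the unalias() helper
and the target address used in the usage note are assumptions, not the
interface this patch adds.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alias record: [base, base + size) is redirected to target. */
struct mem_alias {
	uint64_t base;
	uint64_t size;
	uint64_t target;
};

/* Translate addr through the alias table; unaliased addresses pass through. */
static uint64_t unalias(const struct mem_alias *tbl, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= tbl[i].base && addr - tbl[i].base < tbl[i].size)
			return tbl[i].target + (addr - tbl[i].base);
	}
	return addr;
}
```

With one entry mapping a 128 KiB window at 0xa0000 onto a made-up framebuffer
base, an access at 0xa0010 resolves to that base plus 0x10, and moving the
window only requires changing the entry's target.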

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity 039576c03c KVM: Avoid guest virtual addresses in string pio userspace interface
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions.  This
interface cannot fully support guest smp, as the check needs to take into
account two pages at once in case an unaligned string transfer straddles a
page boundary.

Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there.  The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity 07c45a366d KVM: Allow kernel to select size of mmap() buffer
This allows us to store offsets in the kernel/user kvm_run area, and be
sure that userspace has them mapped.  As offsets can be outside the
kvm_run struct, userspace has no way of knowing how much to mmap.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity 1961d276c8 KVM: Add guest mode signal mask
Allow a special signal mask to be used while executing in guest mode.  This
allows signals to be used to interrupt a vcpu without requiring signal
delivery to a userspace handler, which is quite expensive.  Userspace still
receives -EINTR and can get the signal via sigwait().
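
The userspace half of this (collecting a signal with sigwait() instead of a
handler) can be sketched with plain POSIX calls. The example does not touch
KVM at all: raise() stands in for whatever would interrupt the vcpu, and the
helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Block signo, make it pending, then collect it synchronously with
 * sigwait(); no handler is installed. Returns the signal number
 * received, or -1 on error. */
static int take_signal_synchronously(int signo)
{
	sigset_t set, old;
	int got = -1;

	sigemptyset(&set);
	sigaddset(&set, signo);
	if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
		return -1;

	raise(signo);		/* stays pending because it is blocked */
	if (sigwait(&set, &got) != 0)
		got = -1;

	sigprocmask(SIG_SETMASK, &old, NULL);
	return got;
}
```

The signal stays blocked in normal execution, so no handler frame is ever
set up; the cost is one sigwait() call when userspace wants to consume it.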

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity 1b19f3e61d KVM: Add a special exit reason when exiting due to an interrupt
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity 8eb7d334bd KVM: Fold kvm_run::exit_type into kvm_run::exit_reason
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more).  So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity b4e63f560b KVM: Allow userspace to process hypercalls which have no kernel handler
This is useful for paravirtualized graphics devices, for example.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity 5d308f4550 KVM: Add method to check for backwards-compatible API extensions
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:24 +03:00
Avi Kivity 739872c56f KVM: Renumber ioctls
The recent changes have left the ioctl numbers in complete disarray.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity 2a4dac3952 KVM: Remove minor wart from KVM_CREATE_VCPU ioctl
That ioctl does not transfer any data, so it should be an _IO rather than an
_IOW.
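
The difference is visible in the encoded command number. A small sketch using
the standard Linux ioctl encoding macros; the type character 'k' and command
number 0x41 are placeholders chosen for illustration, not KVM's real values,
and the header is Linux-specific.

```c
#include <linux/ioctl.h>

/* Hypothetical ioctl numbers; only the direction/size encoding matters. */
#define DEMO_CREATE_OLD _IOW('k', 0x41, int)	/* claims an int is passed in */
#define DEMO_CREATE_NEW _IO('k', 0x41)		/* no data transferred */

/* True if the command number advertises a data transfer. */
static int transfers_data(unsigned int cmd)
{
	return _IOC_DIR(cmd) != _IOC_NONE;
}
```

Note that the two macros yield different numeric values for the same command
number, so a change like this alters the userspace ABI.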

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity 106b552b43 KVM: Remove the 'emulated' field from the userspace interface
We no longer emulate single instructions in userspace.  Instead, we service
mmio or pio requests.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity 06465c5a3a KVM: Handle cpuid in the kernel instead of punting to userspace
KVM used to handle cpuid by letting userspace decide what values to
return to the guest.  We now handle cpuid completely in the kernel.  We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).

The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity 46fc147788 KVM: Do not communicate to userspace through cpu registers during PIO
Currently when passing a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions).  This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.

So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity 9a2bb7f486 KVM: Use a shared page for kernel/user communication when running a vcpu
Instead of passing a 'struct kvm_run' back and forth between the kernel and
userspace, allocate a page and allow the user to mmap() it.  This reduces
needless copying and makes the interface expandable by providing lots of
free space.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity ff42697436 KVM: Export <linux/kvm.h>
This allows users to actually build programs that use kvm without
the entire source tree.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:22 +03:00
Avi Kivity bbe4432e66 KVM: Use own minor number
Use the minor number (232) allocated to kvm by lanana.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:22 +03:00
Adrian Bunk ecf36501bc PCI: the overdue removal of pci_module_init()
Unless we finally completely remove it, people will always add new users.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:38 -07:00
Adrian Bunk 5adc55da4a PCI: remove the broken PCI_MULTITHREAD_PROBE option
This patch removes the PCI_MULTITHREAD_PROBE option that had already 
been marked as broken.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:38 -07:00
Michael Ellerman 032de8e2fe MSI: Give archs the option to free all MSI/Xs at once.
This patch introduces an optional function, arch_teardown_msi_irqs(),
which gives an arch the opportunity to do per-device teardown for
MSI/X. If that's not required, the default version simply calls
arch_teardown_msi_irq() for each msi irq required.

arch_teardown_msi_irqs() is simply passed a pdev. Attached to the pdev
is a list of msi_descs; it is up to the arch to free the irq associated
with each of these as appropriate.

For archs that _don't_ implement arch_teardown_msi_irqs(), all msi_descs
with irq == 0 are considered unallocated, and the arch teardown routine
is not called on them.
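
A rough userspace model of the default behaviour described above, assuming an
array in place of the kernel's descriptor list. The demo_* names are
hypothetical and only mirror the roles of arch_teardown_msi_irq() and
arch_teardown_msi_irqs().

```c
#include <stddef.h>

/* Hypothetical stand-in for a descriptor; irq == 0 means "not allocated". */
struct demo_msi_desc {
	unsigned int irq;
};

/* Stand-in for the per-irq arch hook: releases one irq. */
static void demo_teardown_irq(struct demo_msi_desc *desc)
{
	desc->irq = 0;
}

/* Default per-device teardown: call the per-irq hook for every allocated
 * descriptor and skip the ones with irq == 0. Returns how many were freed. */
static size_t demo_teardown_irqs(struct demo_msi_desc *descs, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		if (descs[i].irq == 0)
			continue;
		demo_teardown_irq(&descs[i]);
		done++;
	}
	return done;
}
```

An arch that needs per-device work would replace demo_teardown_irqs() as a
whole rather than the per-irq hook.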

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:38 -07:00
Michael Ellerman 9c8313343c MSI: Give archs the option to allocate all MSI/Xs at once.
This patch introduces an optional function, arch_setup_msi_irqs(),
(note the plural) which gives an arch the opportunity to do per-device
setup for MSI/X and then allocate all the requested MSI/Xs at once.

If that's not required by the arch, the default version simply calls
arch_setup_msi_irq() for each MSI irq required.

arch_setup_msi_irqs() is passed a pdev. Attached to the pdev is a list
of msi_descs with irq == 0; it is up to the arch to connect these up to
an irq (via set_irq_msi()) or return an error. For convenience, the number
of vectors and the type are also passed.

All msi_descs with irq != 0 are considered allocated, and the arch
teardown routine will be called on them when necessary.

The existing semantics of pci_enable_msix() are that if the requested
number of irqs can not be allocated, the maximum number that _could_ be
allocated is returned. To support that, we define that in case of an
error from arch_setup_msi_irqs(), the number of msi_descs with irq != 0
are considered allocated, and are counted toward the "max that could be
allocated".
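
The partial-failure rule can be modelled in a few lines. This is a sketch
under stated assumptions: the descriptors live in an array, the arch hook is
a toy that runs out of irqs after `avail` of them, and all demo_* names are
hypothetical.

```c
#include <stddef.h>

struct demo_desc {
	unsigned int irq;	/* 0 means "no irq connected" */
};

/* Hypothetical arch hook: hands out irqs 100, 101, ... until `avail`
 * is exhausted, then fails and leaves the remaining descriptors at 0. */
static int demo_setup_irqs(struct demo_desc *d, size_t nvec, size_t avail)
{
	for (size_t i = 0; i < nvec; i++) {
		if (i >= avail)
			return -1;
		d[i].irq = 100 + (unsigned int)i;
	}
	return 0;
}

/* Returns 0 on success. On failure, returns the number of descriptors
 * that did get an irq, i.e. the "max that could be allocated". */
static int demo_enable(struct demo_desc *d, size_t nvec, size_t avail)
{
	int count = 0;

	if (demo_setup_irqs(d, nvec, avail) == 0)
		return 0;
	for (size_t i = 0; i < nvec; i++)
		if (d[i].irq != 0)
			count++;
	return count;
}
```

The sketch leaves out the case where nothing at all could be allocated,
which a real caller would have to report as an error rather than as 0.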


Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:38 -07:00
Michael Ellerman 314e77b3ee MSI: Remove dev->first_msi_irq
Now that we keep a list of msi descriptors, we don't need first_msi_irq
in the pci dev.

If we somehow have zero MSIs configured list_entry() will give us weird
oopses or nice memory corruption bugs. So be paranoid. Add BUG_ONs and also
a check in pci_msi_check_device() to make sure nvec > 0.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:37 -07:00
Michael Ellerman 4aa9bc955d MSI: Use a list instead of the custom link structure
The msi descriptors are linked together with what looks a lot like
a linked list, but isn't a struct list_head list. Make it one.

The only complication is that previously we walked a list of irqs, and
got the descriptor for each with get_irq_msi(). Now we have a list of
descriptors and need to get the irq out of it, so it needs to be in the
actual struct msi_desc. We use 0 to indicate no irq is setup.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:37 -07:00
Michael Ellerman 65891215e6 PCI: Create alloc_pci_dev(), the one true way to create a struct pci_dev
There are currently several places in the kernel where we kmalloc()
a struct pci_dev and start initialising it. It'd be preferable to
have an allocator so we can ensure the pci_dev is correctly initialised
in one place.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:37 -07:00
Michael Ellerman c9953a73e9 MSI: Add an arch_msi_check_device()
Add an arch_msi_check_device(), which gives archs a chance to check the input
to pci_enable_msi/x. The arch might be interested in the value of nvec so
pass it in. Propagate the error value returned from the arch routine out
to the caller.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:37 -07:00
Sergei Shtylyov 0da0ead901 PCI: define pci_request/release_regions() for CONFIG_PCI=n
Balance declarations of pci_request_regions() and pci_release_regions() with
empty inline definitions for the CONFIG_PCI=n case -- otherwise my patch to
drivers/net/3c59x.c in the -mm tree doesn't compile. :-)
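
The stub pattern looks roughly like this. DEMO_HAVE_BUS is a hypothetical
stand-in for CONFIG_PCI, the demo_* functions stand in for
pci_request_regions() and pci_release_regions(), and the -1 return value is
an arbitrary placeholder rather than the error code the real stub uses.

```c
/* DEMO_HAVE_BUS is left undefined here, so the inline stubs are used. */
#ifdef DEMO_HAVE_BUS
int demo_request_regions(void *dev, const char *name);
void demo_release_regions(void *dev);
#else
static inline int demo_request_regions(void *dev, const char *name)
{
	(void)dev;
	(void)name;
	return -1;	/* nothing to claim when the bus is compiled out */
}

static inline void demo_release_regions(void *dev)
{
	(void)dev;
}
#endif

/* Caller compiles unchanged in both configurations. */
static int demo_probe(void *dev)
{
	int err = demo_request_regions(dev, "demo");

	if (err)
		return err;
	demo_release_regions(dev);
	return 0;
}
```

Keeping the stubs in the header means drivers need no #ifdef of their own
around these calls.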

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:35 -07:00
Jean Delvare 6473d160b4 PCI: Cleanup the includes of <linux/pci.h>
I noticed that many source files include <linux/pci.h> while they do
not appear to need it. Here is an attempt to clean it all up.

In order to find all possibly affected files, I searched for all
files including <linux/pci.h> but without any other occurrence of "pci"
or "PCI". I removed the include statement from all of these, then I
compiled an allmodconfig kernel on both i386 and x86_64 and fixed the
false positives manually.

My tests covered 66% of the affected files, so there could be false
positives remaining. Untested files are:

arch/alpha/kernel/err_common.c
arch/alpha/kernel/err_ev6.c
arch/alpha/kernel/err_ev7.c
arch/ia64/sn/kernel/huberror.c
arch/ia64/sn/kernel/xpnet.c
arch/m68knommu/kernel/dma.c
arch/mips/lib/iomap.c
arch/powerpc/platforms/pseries/ras.c
arch/ppc/8260_io/enet.c
arch/ppc/8260_io/fcc_enet.c
arch/ppc/8xx_io/enet.c
arch/ppc/syslib/ppc4xx_sgdma.c
arch/sh64/mach-cayman/iomap.c
arch/xtensa/kernel/xtensa_ksyms.c
arch/xtensa/platform-iss/setup.c
drivers/i2c/busses/i2c-at91.c
drivers/i2c/busses/i2c-mpc.c
drivers/media/video/saa711x.c
drivers/misc/hdpuftrs/hdpu_cpustate.c
drivers/misc/hdpuftrs/hdpu_nexus.c
drivers/net/au1000_eth.c
drivers/net/fec_8xx/fec_main.c
drivers/net/fec_8xx/fec_mii.c
drivers/net/fs_enet/fs_enet-main.c
drivers/net/fs_enet/mac-fcc.c
drivers/net/fs_enet/mac-fec.c
drivers/net/fs_enet/mac-scc.c
drivers/net/fs_enet/mii-bitbang.c
drivers/net/fs_enet/mii-fec.c
drivers/net/ibm_emac/ibm_emac_core.c
drivers/net/lasi_82596.c
drivers/parisc/hppb.c
drivers/sbus/sbus.c
drivers/video/g364fb.c
drivers/video/platinumfb.c
drivers/video/stifb.c
drivers/video/valkyriefb.c
include/asm-arm/arch-ixp4xx/dma.h
sound/oss/au1550_ac97.c

I would welcome test reports for these files. I am fine with removing
the untested files from the patch if the general opinion is that these
changes aren't safe. The tested part would still be nice to have.

Note that this patch depends on another header fixup patch I submitted
to LKML yesterday:
  [PATCH] scatterlist.h needs types.h
  http://lkml.org/lkml/2007/3/01/141

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:35 -07:00
Jean Delvare a9dfd281a7 PCI: scatterlist.h needs types.h
Most architectures' scatterlist.h use the type dma_addr_t, but omit to
include <asm/types.h> which defines it.  This could lead to build failures,
so let's add the missing includes.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:34 -07:00
Brian King f7bdd12d23 pci: New PCI-E reset API
Adds a new API which can be used to issue various types
of PCI-E reset, including PCI-E warm reset and PCI-E hot reset.
This is needed for an ipr PCI-E adapter which does not properly
implement BIST. Running BIST on this adapter results in PCI-E
errors. The only reliable reset mechanism that exists on this
hardware is PCI Fundamental reset (warm reset). Since driving
this type of reset is architecture unique, this provides the
necessary hooks for architectures to add this support.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:34 -07:00
Greg Kroah-Hartman 823bccfc40 remove "struct subsystem" as it is no longer needed
We need to work on cleaning up the relationship between kobjects, ksets and
ktypes.  The removal of 'struct subsystem' is the first step of this,
especially as it is not really needed at all.

Thanks to Kay for fixing the bugs in this patch.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 18:57:59 -07:00
Sam Ravnborg dc24f0e708 kbuild: remove dependency on input.h from file2alias
Almost all definitions used by file2alias were already
present in mod_devicetable.h.
Added the last definition and killed the input.h usage.

The erroneous include was pointed out
by: Jan Engelhardt <jengelh@linux01.gwdg.de>

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Cc: Deepak Saxena <dsaxena@plexity.net>
2007-05-02 20:58:08 +02:00
Jan Kiszka c41bf8fa5e [PATCH] i386: avoid redundant preempt_disable in __unlazy_fpu
There are two callers of __unlazy_fpu, unlazy_fpu and __switch_to, and
none of them appear to require additional preempt_disable/enable here.
Let's open-code save_init_fpu in __unlazy_fpu to save a few ops.

Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:21 +02:00
Jan Kiszka 02b64dab56 [PATCH] i386: white space fixes in i387.h
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:21 +02:00
Andi Kleen c5bcb5635a [PATCH] x86: Use RDTSCP for synchronous get_cycles if possible
RDTSCP is already synchronous and doesn't need an explicit CPUID.
This is a little faster and more importantly avoids VMEXITs on Hypervisors.

Original patch from Joerg Roedel, but reworked by AK
Also includes miscompilation fix by Eric Biederman

Cc: "Joerg Roedel" <joerg.roedel@amd.com>

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:21 +02:00
Andi Kleen 9bccb23dc5 [PATCH] i386: Add X86_FEATURE_RDTSCP
Following x86-64
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen 3aefbe0746 [PATCH] i386: Implement X86_FEATURE_SYNC_RDTSC on i386
Syncs up with x86-64.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen e859dc553c [PATCH] i386: Implement alternative_io for i386
Ported from x86-64.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen 3671df8572 [PATCH] i386: Evaluate constant cpu features at compile time
Redefine cpu_has() to evaluate cpu features already checked in early
boot at compile time.  This way the compiler might eliminate some dead code.
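
One way such a macro can be built is with the GCC/Clang builtin
__builtin_constant_p. The sketch below is an assumption about the general
technique, not the patch's actual definition; the feature numbers, the
required mask and the demo_* names are all hypothetical.

```c
/* Hypothetical feature numbers and "already verified at boot" mask. */
#define DEMO_FEATURE_FPU	0
#define DEMO_FEATURE_SSE2	5
#define DEMO_REQUIRED_MASK	(1u << DEMO_FEATURE_FPU)

/* Runtime capability word, normally filled in from CPUID. */
static unsigned int demo_runtime_caps =
	(1u << DEMO_FEATURE_FPU) | (1u << DEMO_FEATURE_SSE2);

static int demo_test_bit(int bit)
{
	return (demo_runtime_caps >> bit) & 1u;
}

/* If `bit` is a compile-time constant and part of the required mask, the
 * expression folds to 1; otherwise fall back to the runtime test. */
#define demo_cpu_has(bit) \
	((__builtin_constant_p(bit) && ((1u << (bit)) & DEMO_REQUIRED_MASK)) \
		? 1 : demo_test_bit(bit))
```

For a required feature the compiler sees a constant 1, so any `else` branch
guarded by demo_cpu_has(DEMO_FEATURE_FPU) becomes dead code it can drop.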
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen c7f81c9453 [PATCH] i386: Verify important CPUID bits in real mode
Check some CPUID bits that are needed for compiler generated code early in boot.
When the system is still in real mode before changing the VESA BIOS mode
it is still possible to display a visible error message on the screen.

Similar to x86-64.

Includes cleanups from Eric Biederman

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen 05cb007dac [PATCH] x86-64: Use the 32bit wd_ops for 64bit too.
This mainly removes a lot of code, replacing it with calls into the new 32bit
perfctr-watchdog.c

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Andi Kleen 09198e6850 [PATCH] i386: Clean up NMI watchdog code
- Introduce a wd_ops structure
- Convert the various nmi watchdogs over to it
- This allows the perfctr reservation to be split cleanly from the
watchdog setup.
- Do perfctr reservation globally as it should have always been
- Remove dead code referenced only by unused EXPORT_SYMBOLs

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:20 +02:00
Zachary Amsden 9e5e3162b2 [PATCH] i386: pte simplify ops
Add comment and condense code to make use of native_local_ptep_get_and_clear
function.  Also, it turns out the 2-level and 3-level paging definitions were
identical, so move the common definition into pgtable.h

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:19 +02:00
Zachary Amsden 142dd97591 [PATCH] i386: pte xchg optimization
In situations where page table updates need only be made locally, and there is
no cross-processor A/D bit races involved, we need not use the heavyweight
xchg instruction to atomically fetch and clear page table entries.  Instead,
we can just read and clear them directly.

This introduces a neat optimization for non-SMP kernels; drop the atomic xchg
operations from page table updates.
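
The two variants can be contrasted in portable C11. This is a userspace
sketch in which a uint64_t stands in for a page table entry and the demo_*
names are hypothetical; the kernel helpers themselves are not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomic variant: needed when another CPU (or hardware setting the
 * accessed/dirty bits) can modify the entry concurrently. */
static uint64_t demo_get_and_clear_atomic(_Atomic uint64_t *pte)
{
	return atomic_exchange(pte, 0);
}

/* Local variant: a plain read followed by a plain write; only valid when
 * nothing else can touch the entry in between. */
static uint64_t demo_get_and_clear_local(uint64_t *pte)
{
	uint64_t old = *pte;

	*pte = 0;
	return old;
}
```

Both return the old value and leave the entry cleared; the local variant
simply avoids the cost of the locked exchange.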

Thanks to Michel Lespinasse for noting this potential optimization.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:19 +02:00
Zachary Amsden c2c1accd4b [PATCH] i386: pte clear optimization
When exiting from an address space, no special hypervisor notification of page
table updates needs to occur; direct page table hypervisors, such as Xen,
switch to another address space first (init_mm) and unprotect the page tables
to avoid the cost of trapping to the hypervisor for each pte_clear.  Shadow
mode hypervisors, such as VMI and lhype don't need to do the extra work of
calling through paravirt-ops, and can just directly clear the page table
entries without notifying the hypervisor, since all the page tables are about
to be freed.

So introduce native_pte_clear functions which bypass any paravirt-ops
notification.  This results in a significant performance win for VMI and
removes some indirect calls from zap_pte_range.

Note the 3-level paging already had a native_pte_clear function, thus
demanding argument conformance and extra args for the 2-level definition.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:19 +02:00
Andi Kleen 57a4f91ae5 [PATCH] x86-64: Auto compute __NR_syscall_max at compile time
No need to maintain it anymore

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:18 +02:00
Fernando Luis VázquezCao 70ae77f497 [PATCH] x86-64: Use safe_apic_wait_icr_idle in __send_IPI_dest_field - x86_64
Use safe_apic_wait_icr_idle to check ICR idle bit if the vector is
NMI_VECTOR to avoid potential hangups in the event of crash when kdump
tries to stop the other CPUs.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:18 +02:00
Fernando Luis VázquezCao 9062d888aa [PATCH] x86-64: __send_IPI_dest_field - x86_64
Implement __send_IPI_dest_field which can be used to send IPIs when the
"destination shorthand" field of the ICR is set to 00 (destination
field). Use it whenever possible.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:18 +02:00
Fernando Luis VazquezCao 8339e9fba3 [PATCH] x86-64: safe_apic_wait_icr_idle - x86_64
apic_wait_icr_idle looks like this:

static __inline__ void apic_wait_icr_idle(void)
{
  while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
    cpu_relax();
}

The busy loop in this function would not be problematic if the
corresponding status bit in the ICR were always updated, but that does
not seem to be the case under certain crash scenarios. Kdump uses an IPI
to stop the other CPUs in the event of a crash, but when any of the
other CPUs are locked-up inside the NMI handler the CPU that sends the
IPI will end up looping forever in the ICR check, effectively
hard-locking the whole system.

Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:

"A local APIC unit indicates successful dispatch of an IPI by
resetting the Delivery Status bit in the Interrupt Command
Register (ICR). The operating system polls the delivery status
bit after sending an INIT or STARTUP IPI until the command has
been dispatched.

A period of 20 microseconds should be sufficient for IPI dispatch
to complete under normal operating conditions. If the IPI is not
successfully dispatched, the operating system can abort the
command. Alternatively, the operating system can retry the IPI by
writing the lower 32-bit double word of the ICR. This “time-out”
mechanism can be implemented through an external interrupt, if
interrupts are enabled on the processor, or through execution of
an instruction or time-stamp counter spin loop."

Intel's documentation suggests the implementation of a time-out
mechanism, which, by the way, is already being open-coded in some parts
of the kernel that tinker with ICR.

Create an apic_wait_icr_idle replacement that implements the time-out
mechanism and that can be used to solve the aforementioned problem.
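
A minimal user-space sketch of such a bounded wait follows.  It is not the function added by this patch: the sketch_* names, the reader callback that stands in for apic_read(APIC_ICR), the busy-bit value and the poll limit are all assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_ICR_BUSY 0x01000u  /* assumed position of the delivery-status bit */

/* Hypothetical accessor type standing in for apic_read(APIC_ICR). */
typedef uint32_t (*sketch_icr_reader)(void);

/* Illustrative stand-ins: an ICR that is idle, and one whose busy bit never clears. */
static uint32_t sketch_icr_idle(void)  { return 0; }
static uint32_t sketch_icr_stuck(void) { return SKETCH_ICR_BUSY; }

/* Poll the busy bit at most max_polls times (at least once).
 * Returns 0 once the ICR is idle, or the busy bit if the wait timed out. */
static uint32_t sketch_safe_wait_icr_idle(sketch_icr_reader read_icr,
                                          unsigned int max_polls)
{
    uint32_t busy;
    unsigned int polls = 0;

    do {
        busy = read_icr() & SKETCH_ICR_BUSY;
        if (!busy)
            break;
        /* A kernel version would insert a short delay here. */
    } while (++polls < max_polls);

    return busy;
}
```

Unlike the unbounded loop quoted above, the caller gets a status back and can decide whether to abort or retry the IPI when the bit never clears.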

AK: moved both functions out of line
AK: Added improved loop from Keith Owens

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:17 +02:00
Fernando Luis VazquezCao f2b218dd61 [PATCH] i386: safe_apic_wait_icr_idle - i386
apic_wait_icr_idle looks like this:

static __inline__ void apic_wait_icr_idle(void)
{
  while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
    cpu_relax();
}

The busy loop in this function would not be problematic if the
corresponding status bit in the ICR were always updated, but that does
not seem to be the case under certain crash scenarios. Kdump uses an IPI
to stop the other CPUs in the event of a crash, but when any of the
other CPUs are locked-up inside the NMI handler the CPU that sends the
IPI will end up looping forever in the ICR check, effectively
hard-locking the whole system.

Quoting from Intel's "MultiProcessor Specification" (Version 1.4), B-3:

"A local APIC unit indicates successful dispatch of an IPI by
resetting the Delivery Status bit in the Interrupt Command
Register (ICR). The operating system polls the delivery status
bit after sending an INIT or STARTUP IPI until the command has
been dispatched.

A period of 20 microseconds should be sufficient for IPI dispatch
to complete under normal operating conditions. If the IPI is not
successfully dispatched, the operating system can abort the
command. Alternatively, the operating system can retry the IPI by
writing the lower 32-bit double word of the ICR. This “time-out”
mechanism can be implemented through an external interrupt, if
interrupts are enabled on the processor, or through execution of
an instruction or time-stamp counter spin loop."

Intel's documentation suggests the implementation of a time-out
mechanism, which, by the way, is already being open-coded in some parts
of the kernel that tinker with ICR.

Create an apic_wait_icr_idle replacement that implements the time-out
mechanism and that can be used to solve the aforementioned problem.

AK: moved both functions out of line
AK: added improved loop from Keith Owens

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:17 +02:00
Bernhard Kaindl de938c51d5 [PATCH] i386: Enable support for fixed-range IORRs to keep RdMem & WrMem in sync
If our copy of the MTRRs of the BSP has RdMem or WrMem set, and
we are running on an AMD64/K8 system, the boot CPU must have had
MtrrFixDramEn and MtrrFixDramModEn set (otherwise our RDMSR would
have copied these bits cleared), so we set them on this CPU as well.

This allows us to keep the AMD64/K8 RdMem and WrMem bits in sync
across the CPUs of SMP systems in order to fulfill the duty of
system software to "initialize and maintain MTRR consistency
across all processors." as written in the AMD and Intel manuals.

If a WRMSR instruction fails because MtrrFixDramModEn is not
set, I expect that also the Intel-style MTRR bits are not updated.

AK: minor cleanup, moved MSR defines around

Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
2007-05-02 19:27:17 +02:00
Bernhard Kaindl 2b1f6278d7 [PATCH] x86: Save the MTRRs of the BSP before booting an AP
Applied fix by Andrew Morton:
http://lkml.org/lkml/2007/4/8/88 - Fix `make headers_check'.

AMD and Intel x86 CPU manuals state that it is the responsibility of
system software to initialize and maintain MTRR consistency across
all processors in Multi-Processing Environments.

Quote from page 188 of the AMD64 System Programming manual (Volume 2):

7.6.5 MTRRs in Multi-Processing Environments

"In multi-processing environments, the MTRRs located in all processors must
characterize memory in the same way. Generally, this means that identical
values are written to the MTRRs used by the processors." (short omission here)
"Failure to do so may result in coherency violations or loss of atomicity.
Processor implementations do not check the MTRR settings in other processors
to ensure consistency. It is the responsibility of system software to
initialize and maintain MTRR consistency across all processors."

Current Linux MTRR code already implements the above in the case that the
BIOS does not properly initialize MTRRs on the secondary processors,
but the case where the fixed-range MTRRs of the boot processor are changed
after Linux started to boot, before the initialisation of a secondary
processor, is not handled yet.

In this case, secondary processors are currently initialized by Linux
with MTRRs which the boot processor had very early, when mtrr_bp_init()
did run, but not with the MTRRs which the boot processor uses at the
time when that secondary processor is actually booted,
causing differing MTRR contents on the secondary processors.

Such a situation happens on Acer Ferrari 1000 and 5000 notebooks where the
BIOS enables and sets AMD-specific IORR bits in the fixed-range MTRRs
of the boot processor when it transitions the system into ACPI mode.
The SMI handler of the BIOS does this in SMM, entered while Linux ACPI
code runs acpi_enable().

Other occasions where the SMI handler of the BIOS may change bits in
the MTRRs could occur as well. To initialize newly booted secondary
processors with the fixed-range MTRRs which the boot processor uses
at that time, this patch saves the fixed-range MTRRs of the boot
processor before new secondary processors are started. When the
secondary processors run their Linux initialisation code, their
fixed-range MTRRs will be updated with the saved fixed-range MTRRs.

If CONFIG_MTRR is not set, we define mtrr_save_state
as an empty statement because there is nothing to do.

Possible TODOs:

*) CPU-hotplugging outside of SMP suspend/resume is not yet tested
   with this patch.

*) If, even in this case, an AP never runs i386/do_boot_cpu or x86_64/cpu_up,
   then the calls to mtrr_save_state() could be replaced by calls to
   mtrr_save_fixed_ranges(NULL) and mtrr_save_state() would not be
   needed.

   That would need either verification of the CPU-hotplug code or
   at least a test on a >2 CPU machine.

*) The MTRRs of other running processors are not yet checked at this
   time but it might be interesting to synchronize the MTRRs of all
   processors before booting. That would be an incremental patch,
   but of rather low priority since there is no machine known so
   far which would require this.

AK: moved prototypes on x86-64 around to fix warnings

Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
2007-05-02 19:27:17 +02:00
Bernhard Kaindl 2b3b4835c9 [PATCH] x86: Adds mtrr_save_fixed_ranges() for use in two later patches.
mtrr_save_fixed_ranges() is used by two later patches.  It accepts a
dummy void pointer because, in the current implementation of one of
these patches, it may be called from smp_call_function_single(), which
requires a function that takes a void pointer argument.

This function calls get_fixed_ranges(), passing mtrr_state.fixed_ranges
which is the element of the static struct which stores our current
backup of the fixed-range MTRR values which all CPUs shall be
using.

Because mtrr_save_fixed_ranges calls get_fixed_ranges after
kernel initialisation time, __init needs to be removed from
the declaration of get_fixed_ranges().

If CONFIG_MTRR is not set, we define mtrr_save_fixed_ranges
as an empty statement because there is nothing to do.
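
The callback shape described above can be sketched in user-space C.  Everything here is an assumption made for illustration: the sketch_* names, the byte arrays standing in for the fixed-range MSRs and the saved state, the assumed size of 88 bytes, and sketch_call_function(), which only mimics the calling convention of a cross-call helper and is not smp_call_function_single().

```c
#include <string.h>

#define SKETCH_NUM_FIXED_RANGES 88  /* assumed: 11 fixed-range MSRs x 8 bytes */

/* Stand-ins for the hardware registers and for the kernel's saved copy. */
static unsigned char sketch_hw_fixed[SKETCH_NUM_FIXED_RANGES];
static unsigned char sketch_saved_fixed[SKETCH_NUM_FIXED_RANGES];

/* Stand-in for get_fixed_ranges(): copy the current hardware values out. */
static void sketch_get_fixed_ranges(unsigned char *dst)
{
    memcpy(dst, sketch_hw_fixed, sizeof(sketch_hw_fixed));
}

/* Takes a void pointer only to match the callback type; the argument is unused. */
static void sketch_mtrr_save_fixed_ranges(void *info)
{
    (void)info;
    sketch_get_fixed_ranges(sketch_saved_fixed);
}

/* Hypothetical cross-call helper: accepts any void (*)(void *) callback. */
static void sketch_call_function(void (*func)(void *info), void *info)
{
    func(info);
}
```

A caller would pass the function and a NULL argument, for example sketch_call_function(sketch_mtrr_save_fixed_ranges, NULL).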

AK: Moved prototypes for x86-64 around to fix warnings

Signed-off-by: Bernhard Kaindl <bk@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
2007-05-02 19:27:17 +02:00
Andi Kleen 856f44ff4a [PATCH] x86-64: Move mtrr prototypes from proto.h to mtrr.h
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:17 +02:00
Jeremy Fitzhardinge 03df4f6ee9 [PATCH] i386: Clean up ELF note generation
Three cleanups:

1: ELF notes are never mapped, so there's no need to have any access
flags in their phdr.

2: When generating them from asm, tell the assembler to use a SHT_NOTE
section type.  There doesn't seem to be a way to do this from C.

3: Use ANSI rather than traditional cpp behaviour to stringify the
macro argument.
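
For readers unfamiliar with point 3, ANSI stringification uses the # operator.  The macro names below are hypothetical and are not the ones used in the kernel's ELF note header; this is only a sketch of the preprocessor behaviour.

```c
/* ANSI C: '#x' turns the macro argument into a string literal. */
#define SKETCH_STR(x)  #x
/* Extra level of expansion so that macro arguments are expanded first. */
#define SKETCH_XSTR(x) SKETCH_STR(x)

#define SKETCH_NOTE_OWNER Xen   /* hypothetical note owner name */

/* Stringifies the macro name itself. */
static const char sketch_literal[]  = SKETCH_STR(SKETCH_NOTE_OWNER);
/* Stringifies the macro's expansion. */
static const char sketch_expanded[] = SKETCH_XSTR(SKETCH_NOTE_OWNER);
```

Traditional (pre-ANSI) cpp instead relied on substituting macro arguments inside string literals, which ANSI preprocessors do not do.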

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
2007-05-02 19:27:17 +02:00
Jeremy Fitzhardinge 441d40dca0 [PATCH] x86: PARAVIRT: Jeremy Fitzhardinge <jeremy@goop.org>
The other symbols used to delineate the alt-instructions sections have the
form __foo/__foo_end.  Rename parainstructions to match.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:16 +02:00
Zachary Amsden e0bb864397 [PATCH] i386: Convert VMI timer to use clock events
Convert VMI timer to use clock events, making it properly able to use the NO_HZ
infrastructure.  On UP systems, with no local APIC, we just continue to route
these events through the PIT.  On systems with a local APIC, or SMP, we provide
a single source interrupt chip which creates the local timer IRQ.  It actually
gets delivered by the APIC hardware, but we don't want to use the same local
APIC clocksource processing, so we create our own handler here.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Dan Hecht <dhecht@vmware.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge 57decbda6a [PATCH] x86: update for i386 and x86-64 check_bugs
Remove spurious comments, headers and keywords from x86-64 bugs.[ch].

Use identify_boot_cpu()

AK: merged with other patch

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge c5413fbe89 [PATCH] i386: Fix UP gdt bugs
Fixes two problems with the GDT when compiling for uniprocessor:
 - There's no percpu segment, so trying to load its selector into %fs fails.
   Use a null selector instead.
 - The real gdt needs to be loaded at some point.  Do it in cpu_init().

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge 1956c73bb5 [PATCH] i386: Define per_cpu_offset
Define per_cpu_offset in asm-i386/percpu.h when SMP defined, like
asm-generic/percpu.h does for UP.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge 978c038ec9 [PATCH] i386: cleanups to help using per-cpu variables from asm
This patch does a few small cleanups:
 - use PER_CPU_NAME to generate the names of per-cpu variables
 - use lea to add the per_cpu offset in PER_CPU(), because it doesn't
   affect condition flags
 - add PER_CPU_VAR which allows direct access to per-cpu variables
   with the %fs: prefix on SMP.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge 7c3576d261 [PATCH] i386: Convert PDA into the percpu section
Currently x86 (similar to x86-64) has a special per-cpu structure
called "i386_pda" which can be easily and efficiently referenced via
the %fs register.  An ELF section is more flexible than a structure,
allowing any piece of code to use this area.  Indeed, such a section
already exists: the per-cpu area.

So this patch:
(1) Removes the PDA and uses per-cpu variables for each current member.
(2) Replaces the __KERNEL_PDA segment with __KERNEL_PERCPU.
(3) Creates a per-cpu mirror of __per_cpu_offset called this_cpu_off, which
    can be used to calculate addresses for this CPU's variables.
(4) Simplifies startup, because %fs doesn't need to be loaded with a
    special segment at early boot; it can be deferred until the first
    percpu area is allocated (or never for UP).

The result is less code and one less x86-specific concept.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:16 +02:00
Jeremy Fitzhardinge 7a61d35d4b [PATCH] i386: Page-align the GDT
Xen wants a dedicated page for the GDT.  I believe VMI likes it too.
lguest, KVM and native don't care.

Simple transformation to page-aligned "struct gdt_page".

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge 4cdd9c8931 [PATCH] i386: PARAVIRT: drop unused ptep_get_and_clear
In shadow mode hypervisors, ptep_get_and_clear achieves the desired
purpose of keeping the shadows in sync by issuing a native_get_and_clear,
followed by a call to pte_update, which indicates the PTE has been
modified.

Direct mode hypervisors (Xen) have no need for this anyway, and will trap
the update using writable pagetables.

This means no hypervisor makes use of ptep_get_and_clear; there is no
reason to have it in the paravirt-ops structure.  Change confusing
terminology about raw vs. native functions into consistent use of
native_pte_xxx for operations which do not invoke paravirt-ops.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge 1a45b7aaa5 [PATCH] i386: PARAVIRT: Clean up paravirt patchable wrappers
Replace all the open-coded macros for generating calls with a pair of
more general macros (__PVOP_CALL/VCALL), and redefine all the
PVOP_V?CALL[0-4] in terms of them.

[ Andrew, Andi: this should slot in immediately after "Document asm-i386/paravirt.h"
  (paravirt_ops-document-asm-i386-paravirth.patch) ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge 4e0fa85602 [PATCH] i386: PARAVIRT: Use enums for paravirt lazy flush modi
Remove #defines, add enum for PARAVIRT_LAZY_FLUSH.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge ce6234b529 [PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pages
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

[ Zach - vmi.c will need some attention after this patch.  It wasn't
  immediately obvious to me what needs to be done. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge a27fe809b8 [PATCH] i386: PARAVIRT: revert map_pt_hook.
Back out the map_pt_hook to clear the way for kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge d4c104771a [PATCH] i386: PARAVIRT: add flush_tlb_others paravirt_op
This patch adds a pv_op for flush_tlb_others.  Linux running on native
hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
have a particular mm's pagetable entries cached in its TLB.  This is
inefficient in a paravirtualized environment, since the hypervisor
knows which real CPUs actually contain cached mappings, which may be a
small subset of a guest's VCPUs.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:15 +02:00
Jeremy Fitzhardinge 63f70270cc [PATCH] i386: PARAVIRT: add common patching machinery
Implement the actual patching machinery.  paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:

 - if the paravirt_op function is paravirt_nop, then patch nops
 - if the paravirt_op function is a jmp target, then jmp to it
 - if the paravirt_op function is callable and doesn't clobber too much
    for the callsite, call it directly

paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch, and will remove most of the expensive indirect calls
in favour of either a direct call or a pile of nops.
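
The three rules above can be sketched as a small decision function.  This is not the kernel's paravirt_patch_default(): the struct, the enum and the clobber bitmasks are hypothetical, and falling back to leaving the callsite untouched when no rule applies is an assumption of this sketch.

```c
/* Hypothetical outcome of patching one callsite. */
enum sketch_patch_action {
    SKETCH_PATCH_NOPS,   /* replace the call with nops */
    SKETCH_PATCH_JMP,    /* jump straight to the target */
    SKETCH_PATCH_CALL,   /* replace the indirect call with a direct call */
    SKETCH_PATCH_LEAVE   /* assumed fallback: keep the indirect call */
};

/* Hypothetical description of a callsite and the op it invokes. */
struct sketch_site {
    int op_is_nop;                 /* op is the shared no-op stub */
    int op_is_jmp_target;          /* op must be reached with a jmp */
    unsigned int callee_clobbers;  /* registers the op may modify */
    unsigned int site_clobbers;    /* registers the callsite tolerates */
};

static enum sketch_patch_action
sketch_patch_default(const struct sketch_site *s)
{
    if (s->op_is_nop)
        return SKETCH_PATCH_NOPS;
    if (s->op_is_jmp_target)
        return SKETCH_PATCH_JMP;
    /* Direct call only if the callee clobbers nothing the site must keep. */
    if ((s->callee_clobbers & ~s->site_clobbers) == 0)
        return SKETCH_PATCH_CALL;
    return SKETCH_PATCH_LEAVE;
}
```

The rules are checked in the order listed, so a no-op operation is always nopped out regardless of its clobber set.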

Backends may implement their own patcher, however.  There are several
helper functions to help with this:

paravirt_patch_nop	nop out a callsite
paravirt_patch_ignore	leave the callsite as-is
paravirt_patch_call	patch a call if the caller and callee
			have compatible clobbers
paravirt_patch_jmp	patch in a jmp
paravirt_patch_insns	patch some literal instructions over
			the callsite, if they fit

This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge 294688c028 [PATCH] i386: PARAVIRT: Document asm-i386/paravirt.h
Clean things up, and broadly document:
 - the paravirt_ops functions themselves
 - the patching mechanism

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge f8822f4201 [PATCH] i386: PARAVIRT: Consistently wrap paravirt ops callsites to make them patchable
Wrap a set of interesting paravirt_ops calls in a wrapper which makes
the callsites available for patching.  Unfortunately this is pretty
ugly because there's no way to get gcc to generate a function call,
but also wrap just the callsite itself with the necessary labels.

This patch supports functions with 0-4 arguments, and either void or
returning a value.  64-bit arguments must be split into a pair of
32-bit arguments (lower word first).  Small structures are returned in
registers.
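
The 64-bit argument convention can be illustrated with a few helpers.  The sketch_* names are hypothetical and only show the split and rejoin, lower word first; they are not the wrapper macros from this patch.

```c
#include <stdint.h>

/* Lower 32 bits: passed as the first argument of the pair. */
static inline uint32_t sketch_lo32(uint64_t v)
{
    return (uint32_t)v;
}

/* Upper 32 bits: passed as the second argument of the pair. */
static inline uint32_t sketch_hi32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* What a callee would do to rebuild the original 64-bit value. */
static inline uint64_t sketch_join64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

A 64-bit value therefore consumes two of the four argument slots a wrapped call supports.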

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge 42c24fa22e [PATCH] i386: PARAVIRT: Fix patch site clobbers to include return register
Fix a few clobbers to include the return register.  The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.

Also, make sure that callsites which are used in contexts which don't
allow clobbers actually save and restore all clobberable registers.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge d582203578 [PATCH] i386: PARAVIRT: Use patch site IDs computed from offset in paravirt_ops structure
Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

Also, since the identifier is derived from the offset into
paravirt_ops, the offset can be derived from the identifier.  This is
used to remove replicated information in the various callsite macros,
which has been a source of bugs in the past.
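
The identifier/offset relationship can be sketched as follows.  The struct below is a hypothetical three-entry stand-in for paravirt_ops, and the macro and function names are invented for this example; the sketch assumes every member is a function pointer of the same size.

```c
#include <stddef.h>
#include <string.h>

typedef void (*sketch_fn)(void);

/* Hypothetical cut-down ops table; the real structure has many more members. */
struct sketch_paravirt_ops {
    sketch_fn irq_disable;
    sketch_fn irq_enable;
    sketch_fn flush_tlb_others;
};

/* Patch type identifier: the member's index, derived from its offset. */
#define SKETCH_PATCH_TYPE(op) \
    (offsetof(struct sketch_paravirt_ops, op) / sizeof(sketch_fn))

/* Illustrative operations to populate a table with. */
static void sketch_op_a(void) { }
static void sketch_op_b(void) { }

/* Reverse direction: recover the operation from the identifier alone. */
static sketch_fn sketch_op_from_type(const struct sketch_paravirt_ops *ops,
                                     size_t type)
{
    sketch_fn fn;

    memcpy(&fn, (const char *)ops + type * sizeof(sketch_fn), sizeof(fn));
    return fn;
}
```

Because the mapping is one-to-one, a callsite only has to record the identifier; no separate enum has to be kept in sync with the structure layout.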

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex - specifically because
it breaks the 1:1 relationship between identifiers and offsets.  If
this operation turns out to be particularly beneficial, then the right
answer is to define a new entrypoint for it.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge 98de032b68 [PATCH] i386: PARAVIRT: rename struct paravirt_patch to paravirt_patch_site for clarity
Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge d6dd61c831 [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction
Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge 5311ab62cd [PATCH] i386: PARAVIRT: Allow paravirt backend to choose kernel PMD sharing
Normally when running in PAE mode, the 4th PMD maps the kernel address space,
which can be shared among all processes (since they all need the same kernel
mappings).

Xen, however, does not allow guests to have the kernel pmd shared between page
tables, so parameterize pgtable.c to allow both modes of operation.

There are several side-effects of this.  One is that vmalloc will update the
kernel address space mappings, and those updates need to be propagated into
all processes if the kernel mappings are not intrinsically shared.  In the
non-PAE case, this is done by maintaining a pgd_list of all processes; this
list is used when all process pagetables must be updated.  pgd_list is
threaded via otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.

Normally the PAE pgd is only 4 x 64-bit entries (32 bytes) large, but Xen
requires the PAE pgd to be page aligned anyway, so this patch forces the pgd to
be page aligned+sized when the kernel pmd is unshared, to accommodate both
these requirements.

Also, since there may be several distinct kernel pmds (if the user/kernel
split is below 3G), there's no point in allocating them from a slab cache;
they're just allocated with get_free_page and initialized appropriately.  (Of
course they could be cached if there is just a single kernel pmd - which is the
default with a 3G user/kernel split - but it doesn't seem worthwhile to add
yet another case into this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:13 +02:00
Jeremy Fitzhardinge 90caccb975 [PATCH] i386: PARAVIRT: Allocate a fixmap slot
Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:13 +02:00
Jeremy Fitzhardinge b239fb2501 [PATCH] i386: PARAVIRT: Hooks to set up initial pagetable
This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the construction of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, the kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

And:

From: "Eric W. Biederman" <ebiederm@xmission.com>

If we don't set the leaf page table entries it is quite possible that
we will inherit an incorrect page table entry from the initial boot
page table setup in head.S.  So we need to redo the effort here,
so we pick up PSE, PGE and the like.

Hypervisors like Xen require that their page tables be read-only,
which is slightly incompatible with our low identity mappings, however
I discussed this with Jeremy, and he has modified the Xen early set_pte
function to avoid problems in this area.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:13 +02:00
Jeremy Fitzhardinge 3dc494e86d [PATCH] i386: PARAVIRT: Add pagetable accessors to pack and unpack pagetable entries
Add a set of accessors to pack, unpack and modify page table entries
(at all levels).  This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries.  For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and converting back to pphys address
when an entry is read.
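
The Xen example can be sketched with a toy translation table.  All names, the 32-bit entry format and the four-entry pseudo-physical-to-machine table are assumptions for illustration; they are not the accessors added by this patch or Xen's real p2m structures.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_FLAGS_MASK 0xfffu
#define SKETCH_NR_PAGES   4u

/* Hypothetical pseudo-physical frame -> machine frame table. */
static const uint32_t sketch_p2m[SKETCH_NR_PAGES] = { 7, 3, 9, 5 };

/* Reverse lookup: machine frame -> pseudo-physical frame. */
static uint32_t sketch_m2p(uint32_t mfn)
{
    uint32_t pfn;

    for (pfn = 0; pfn < SKETCH_NR_PAGES; pfn++)
        if (sketch_p2m[pfn] == mfn)
            return pfn;
    return mfn;  /* unknown frame: pass through unchanged */
}

/* Pack: translate the frame number to a machine frame, keep the flag bits. */
static uint32_t sketch_make_pte(uint32_t val)
{
    uint32_t pfn = val >> SKETCH_PAGE_SHIFT;
    uint32_t mfn = pfn < SKETCH_NR_PAGES ? sketch_p2m[pfn] : pfn;

    return (mfn << SKETCH_PAGE_SHIFT) | (val & SKETCH_FLAGS_MASK);
}

/* Unpack: translate back to a pseudo-physical frame when the entry is read. */
static uint32_t sketch_pte_val(uint32_t pte)
{
    uint32_t pfn = sketch_m2p(pte >> SKETCH_PAGE_SHIFT);

    return (pfn << SKETCH_PAGE_SHIFT) | (pte & SKETCH_FLAGS_MASK);
}
```

On native hardware both accessors would simply be the identity, which is why routing them through paravirt_ops costs nothing conceptually for the non-virtualized case.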

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:13 +02:00
Jeremy Fitzhardinge 4587623360 [PATCH] i386: PARAVIRT: use paravirt_nop to consistently mark no-op operations
Add a _paravirt_nop function for use as a stub for no-op operations,
and paravirt_nop #defined void * version to make using it easier
(since all its uses are as a void *).

This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
[mingo] but only as a cleanup of the current open-coded (void *) casts.
My problem with this is that it loses the types. Not that there is much
to check for, but still, this adds some assumptions about how function
calls look like
2007-05-02 19:27:13 +02:00
Rusty Russell a75c54f933 [PATCH] i386: i386 separate hardware-defined TSS from Linux additions
On Thu, 2007-03-29 at 13:16 +0200, Andi Kleen wrote:
> Please clean it up properly with two structs.

Not sure about this, now I've done it.  Running it here.

If you like it, I can do x86-64 as well.

==
lguest defines its own TSS struct because the "struct tss_struct"
contains linux-specific additions.  Andi asked me to split the struct
in processor.h.

Unfortunately it makes usage a little awkward.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:13 +02:00
Jeremy Fitzhardinge d0175ab644 [PATCH] i386: Remove smp_alt_instructions
The .smp_altinstructions section and its corresponding symbols are
completely unused, so remove them.

Also, remove stray #ifdef __KENREL__ in asm-i386/alternative.h

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:13 +02:00
H. Peter Anvin 4bc5aa91fb [PATCH] x86: Clean up x86 control register and MSR macros (corrected)
This patch is based on Rusty's recent cleanup of the EFLAGS-related
macros; it extends the same kind of cleanup to control registers and
MSRs.

It also unifies these between i386 and x86-64; at least with regards
to MSRs, the two had definitely gotten out of sync.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:12 +02:00
Andi Kleen f039b75471 [PATCH] x86: Don't use MWAIT on AMD Family 10
It doesn't put the CPU into deeper sleep states, so it's better to use the standard
idle loop to save power. But allow it to be re-enabled anyway for benchmarking.

I also removed the obsolete idle=halt on i386

Cc: andreas.herrmann@amd.com

Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge c169859d6d [PATCH] x86-64: Clean up asm-x86_64/bugs.h
Most of asm-x86_64/bugs.h is code which should be in a C file, so put it there.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge 1dbf527c51 [PATCH] i386: Make COMPAT_VDSO runtime selectable.
Now that relocation of the VDSO for COMPAT_VDSO users is done at
runtime rather than compile time, it is possible to enable/disable
compat mode at runtime.

This patch allows you to enable COMPAT_VDSO mode with "vdso=2" on the
kernel command line, or via sysctl.  (Switching on a running system
shouldn't be done lightly; any process which was relying on the compat
VDSO will be upset if it goes away.)

The COMPAT_VDSO config option still exists, but if enabled it just
makes vdso_enabled default to VDSO_COMPAT.

+From: Hugh Dickins <hugh@veritas.com>

Fix oops from i386-make-compat_vdso-runtime-selectable.patch.

Even mingetty at system startup finds it easy to trigger an oops
while reading /proc/PID/maps: though it has a good hold on the mm
itself, that cannot stop exit_mm() from resetting tsk->mm to NULL.

(It is usually show_map()'s call to get_gate_vma() which oopses,
and I expect we could change that to check priv->tail_vma instead;
but no matter, even m_start()'s call just after get_task_mm() is racy.)

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge d4f7a2c18e [PATCH] i386: Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO
Some versions of libc can't deal with a VDSO which doesn't have its
ELF headers matching its mapped address.  COMPAT_VDSO maps the VDSO at
a specific system-wide fixed address.  Previously this was all done at
build time, on the grounds that the fixed VDSO address is always at
the top of the address space.  However, a hypervisor may reserve some
of that address space, pushing the fixmap address down.

This patch does the adjustment dynamically at runtime, depending on
the runtime location of the VDSO fixmap.

[ Patch has been through several hands: Jan Beulich wrote the original
  version; Zach reworked it, and Jeremy converted it to relocate phdrs
  as well as sections. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge a6c4e076ee [PATCH] i386: clean up identify_cpu
identify_cpu() is used to identify both the boot CPU and secondary
CPUs, but it performs some actions which only apply to the boot CPU.
Those functions are therefore really __init functions, but because
they're called by identify_cpu(), they must be marked __cpuinit.

This patch splits identify_cpu() into identify_boot_cpu() and
identify_secondary_cpu(), and calls the appropriate init functions
from each.  Also, identify_boot_cpu() and all the functions it
dominates are marked __init.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge 1353ebb4b4 [PATCH] i386: Clean up asm-i386/bugs.h
Most of asm-i386/bugs.h is code which should be in a C file, so put it there.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-02 19:27:12 +02:00
Avi Kivity bbf30a1650 [PATCH] x86-64: fix arithmetic in comment
The xmm space on x86_64 is 256 bytes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:12 +02:00
Andi Kleen 5d02d7ae73 [PATCH] x86-64: Use X86_EFLAGS_IF in x86-64/irqflags.h.
As per i386 patch: move X86_EFLAGS_IF et al out to a new header:
processor-flags.h, so we can include it from irqflags.h and use it in
raw_irqs_disabled_flags().

As a side-effect, we could now use these flags in .S files.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:11 +02:00
Jan Beulich b92e9fac40 [PATCH] x86: fix amd64-agp aperture validation
Under CONFIG_DISCONTIGMEM, assuming that a !pfn_valid() implies all
subsequent pfn-s are also invalid is wrong. Thus replace this by
explicitly checking against the E820 map.

AK: make e820 on x86-64 not initdata

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
2007-05-02 19:27:11 +02:00
Jeremy Fitzhardinge b00742d399 [PATCH] x86-64: Account for module percpu space separately from kernel percpu
Rather than using a single constant PERCPU_ENOUGH_ROOM, compute it as
the sum of kernel_percpu + PERCPU_MODULE_RESERVE.  This is now common
to all architectures; if an architecture wants to set
PERCPU_ENOUGH_ROOM to something special, then it may do so (ia64 is
the only one which does).

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:11 +02:00
Jeremy Fitzhardinge 07f3331c6b [PATCH] i386: Add machine_ops interface to abstract halting and rebooting
machine_ops is an interface for the machine_* functions defined in
<linux/reboot.h>.  This is intended to allow hypervisors to intercept
the reboot process, but it could be used to implement other x86
subarchitecture reboots.
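
As an illustration of the interface idea, here is a small sketch of a table of function pointers that generic entry points dispatch through. The struct layout and every name in it (sketch_machine_ops, hv_restart, and so on) are assumptions for illustration, not the fields added by the patch.

```c
#include <assert.h>

/* Records which handler ran, so the dispatch can be observed. */
enum sketch_action { ACT_NONE, ACT_NATIVE_RESTART, ACT_NATIVE_HALT, ACT_HV_RESTART };
static enum sketch_action last_action = ACT_NONE;

/* Hypothetical ops table: one slot per machine_* style operation. */
struct sketch_machine_ops {
	void (*restart)(const char *cmd);
	void (*halt)(void);
};

static void native_restart(const char *cmd) { (void)cmd; last_action = ACT_NATIVE_RESTART; }
static void native_halt(void)               { last_action = ACT_NATIVE_HALT; }

/* A hypervisor port would install its own handler in place of the native one. */
static void hv_restart(const char *cmd)     { (void)cmd; last_action = ACT_HV_RESTART; }

/* Defaults to the native handlers. */
static struct sketch_machine_ops sketch_machine_ops = { native_restart, native_halt };

/* Generic entry points simply dispatch through the table. */
static void sketch_machine_restart(const char *cmd) { sketch_machine_ops.restart(cmd); }
static void sketch_machine_halt(void)               { sketch_machine_ops.halt(); }
```

Overriding a single slot is enough to intercept that operation; the remaining slots keep their native behaviour.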

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:11 +02:00
Jeremy Fitzhardinge 01a2f43556 [PATCH] i386: Add smp_ops interface
Add a smp_ops interface.  This abstracts the API defined by
<linux/smp.h> for use within arch/i386.  The primary intent is that it
be used by a paravirtualizing hypervisor to implement SMP, but it
could also be used by non-APIC-using sub-architectures.

This is related to CONFIG_PARAVIRT, but is implemented unconditionally
since it is simpler that way and not a highly performance-sensitive
interface.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
2007-05-02 19:27:11 +02:00
Rusty Russell 4fbb596881 [PATCH] i386: cleanup GDT Access
Now we have an explicit per-cpu GDT variable, we don't need to keep the
descriptors around to use them to find the GDT: expose cpu_gdt directly.

We could go further and make load_gdt() pack the descriptor for us, or even
assume it means "load the current cpu's GDT" which is what it always does.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:11 +02:00
Adrian Bunk ca906e4231 [PATCH] x86: sys_ioperm() prototype cleanup
- there's no reason for duplicating the prototype from
  include/linux/syscalls.h in include/asm-x86_64/unistd.h
- every file should #include the headers containing the prototypes for
  its global functions

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:10 +02:00
Christoph Lameter 2bff73830c [PATCH] x86-64: use lru instead of page->index and page->private for pgd lists management.
x86_64 currently simulates a list using the index and private fields of the
page struct.  Seems that the code was inherited from i386.  But x86_64 does
not use the slab to allocate pgds and pmds etc.  So the lru field is not
used by the slab and therefore available.

This patch uses standard list operations on page->lru to realize pgd
tracking.
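
For readers unfamiliar with the pattern, the following is a self-contained sketch of tracking objects through an embedded list node, in the spirit of linking pages via page->lru. The types and helpers (sketch_page, sketch_list_add, sketch_entry) are simplified stand-ins, not the kernel's struct page or <linux/list.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct list_head { struct list_head *next, *prev; };

static void sketch_list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert entry right after head. */
static void sketch_list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry and leave it pointing at itself. */
static void sketch_list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = entry;
}

/* Simplified stand-in for struct page: the list node is embedded. */
struct sketch_page {
	unsigned long flags;
	struct list_head lru;
};

/* Recover the containing page from a pointer to its lru member. */
#define sketch_entry(ptr) \
	((struct sketch_page *)((char *)(ptr) - offsetof(struct sketch_page, lru)))

/* Walk the list and count its entries. */
static size_t sketch_list_count(const struct list_head *head)
{
	size_t n = 0;
	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```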

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:10 +02:00
Andi Kleen b4531e863d [PATCH] i386: Use X86_EFLAGS_IF in irqflags.h.
Move X86_EFLAGS_IF et al out to a new header: processor-flags.h, so we
can include it from irqflags.h and use it in raw_irqs_disabled_flags().

As a side-effect, we could now use these flags in .S files.
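
A small sketch of what using the named flag looks like. X86_EFLAGS_IF is bit 9 of EFLAGS; the helper name sketch_irqs_disabled_flags is made up here and only mirrors the role of raw_irqs_disabled_flags().

```c
#include <assert.h>

/* Interrupt Flag: bit 9 of the EFLAGS register. */
#define X86_EFLAGS_IF 0x00000200UL

/* Interrupts are disabled exactly when IF is clear in the saved flags. */
static inline int sketch_irqs_disabled_flags(unsigned long flags)
{
	return !(flags & X86_EFLAGS_IF);
}
```

Because the constant is a plain preprocessor definition, the same header can in principle be shared with assembly sources.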

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:10 +02:00
Jan Beulich 6fb14755a6 [PATCH] x86: tighten kernel image page access rights
On x86-64, kernel memory freed after init can be entirely unmapped instead
of just getting 'poisoned' by overwriting with a debug pattern.

On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
can also be write-protected.

Compared to the first version, this one prevents re-creating deleted
mappings in the kernel image range on x86-64, if those got removed
previously. This, together with the original changes, prevents temporarily
having inconsistent mappings when cacheability attributes are being
changed on such pages (e.g. from AGP code). While on i386 such duplicate
mappings don't exist, the same change is done there, too, both for
consistency and because checking pte_present() before using various other
pte_XXX functions is a requirement anyway. At once, i386 code gets
adjusted to use pte_huge() instead of open coding this.

AK: split out cpa() changes

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:10 +02:00
Jan Beulich d01ad8dd56 [PATCH] x86: Improve handling of kernel mappings in change_page_attr
Fix various broken corner cases in i386 and x86-64 change_page_attr.

AK: split off from tighten kernel image access rights

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:10 +02:00
Rusty Russell 90a0a06aa8 [PATCH] i386: rationalize paravirt wrappers
paravirt.c used to implement native versions of all low-level
functions.  Far cleaner is to have the native versions exposed in the
headers and as inline native_XXX, and if !CONFIG_PARAVIRT, then simply
#define XXX native_XXX.

There are several nice side effects:

1) write_dt_entry() now takes the correct "struct Xgt_desc_struct *"
   not "void *".

2) load_TLS is reintroduced to the for loop, not manually unrolled
   with a #error in case the bounds ever change.

3) Macros become inlines, with type checking.

4) Access to the native versions is trivial for KVM, lguest, Xen and
   others who might want it.
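
The native_XXX arrangement described at the top of this message can be sketched roughly as follows. The register, the SKETCH_PARAVIRT switch and all function names are invented for illustration; only the shape (inline native versions, plus a plain #define when paravirt is off) follows the description.

```c
#include <assert.h>

/* Stands in for a piece of hardware state (assumption for this sketch). */
static unsigned long fake_reg;

/* Native versions live in the header as typed inline functions. */
static inline void native_write_reg(unsigned long v) { fake_reg = v; }
static inline unsigned long native_read_reg(void)    { return fake_reg; }

#ifdef SKETCH_PARAVIRT
/* With paravirt enabled, calls go through a replaceable ops table. */
struct sketch_pv_ops {
	void (*write_reg)(unsigned long);
	unsigned long (*read_reg)(void);
};
static struct sketch_pv_ops sketch_pv = { native_write_reg, native_read_reg };
static inline void write_reg(unsigned long v) { sketch_pv.write_reg(v); }
static inline unsigned long read_reg(void)    { return sketch_pv.read_reg(); }
#else
/* Without paravirt, the generic names are just aliases for the native ones. */
#define write_reg native_write_reg
#define read_reg  native_read_reg
#endif
```

Either way the native_ functions stay directly callable, which is what makes them easy to reuse from a hypervisor backend.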

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:10 +02:00
Rusty Russell d2cbcc49e2 [PATCH] i386: clean up cpu_init()
We now have cpu_init() and secondary_cpu_init() doing nothing but calling
_cpu_init() with the same arguments.  Rename _cpu_init() to cpu_init() and use
it as a replacement for secondary_cpu_init().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:10 +02:00
Rusty Russell bf50467204 [PATCH] i386: Use per-cpu GDT immediately upon boot
Now we are no longer dynamically allocating the GDT, we don't need the
"cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" to the
per-cpu GDT.  This means initializing the cpu_gdt array in C.

The boot CPU uses the per-cpu var directly, then in smp_prepare_cpus() it
switches to the per-cpu copy just allocated.  For secondary CPUs, the
early_gdt_descr is set to point directly to their per-cpu copy.

For UP the code is very simple: it keeps using the "per-cpu" GDT as per SMP,
but we never have to move.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:10 +02:00
Rusty Russell ae1ee11be7 [PATCH] i386: Use per-cpu variables for GDT, PDA
Allocating PDA and GDT at boot is a pain.  Using simple per-cpu variables adds
happiness (although we need the GDT page-aligned for Xen, which we do in a
followup patch).

[akpm@linux-foundation.org: build fix]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:10 +02:00
Ian Campbell 79e030114a [PATCH] i386: Allow i386 crash kernels to handle x86_64 dumps
The specific case I am encountering is kdump under Xen with a 64 bit
hypervisor and 32 bit kernel/userspace.  The dump created is 64 bit due to
the hypervisor but the dump kernel is 32 bit for maximum compatibility.

It's possibly less likely to be useful in a purely native scenario but I
see no reason to disallow it.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Ian Campbell <ian.campbell@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Horms <horms@verge.net.au>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
Rusty Russell eab0c72aec [PATCH] x86-64: Introduce load_TLS to the "for" loop.
GCC (4.1 at least) unrolls it anyway, but I can't believe this code
was ever justifiable.  (I've also submitted a patch which cleans up
i386, which is even uglier).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
Rusty Russell 692174b97d [PATCH] i386: Initialize esp0 properly all the time
Whenever we schedule, __switch_to calls load_esp0 which does:

	tss->esp0 = thread->esp0;

This is never initialized for the initial thread (ie "swapper"), so when we're
scheduling that, we end up setting esp0 to 0.  This is fine: the swapper never
leaves ring 0, so this field is never used.

lguest, however, gets upset that we're trying to use an unmapped page as our
kernel stack.  Rather than work around it there, let's initialize it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
David Rientjes 8b8ca80e19 [PATCH] x86-64: configurable fake numa node sizes
Extends the numa=fake x86_64 command-line option to allow for configurable
node sizes.  These nodes can be used in conjunction with cpusets for coarse
memory resource management.

The old command-line option is still supported:
  numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
		actual machine.

But now you may configure your system for the node sizes of your choice:
  numa=fake=2*512,1024,2*256
		gives two 512M nodes, one 1024M node, two 256M nodes, and
		the rest of system memory to a sixth node.
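
To make the size-list syntax concrete, here is a hedged sketch of a parser for strings such as "2*512,1024,2*256". It is not the kernel's parser: the function name and return convention are invented, the plain node-count form (numa=fake=32) is not handled, and the implicit final node that receives the remaining memory is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Expand a "count*size,size,..." list into sizes_mb[].
 * Returns the number of nodes written, or -1 on malformed input
 * or when more than max nodes would be produced.
 */
static int sketch_parse_fake_nodes(const char *s, unsigned long *sizes_mb, int max)
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long count = 1;
		unsigned long val = strtoul(s, &end, 10);

		if (end == s)
			return -1;
		if (*end == '*') {	/* "count*size" form */
			count = val;
			s = end + 1;
			val = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		while (count--) {
			if (n == max)
				return -1;
			sizes_mb[n++] = val;
		}
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return n;
}
```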

The existing hash function is maintained to support the various node sizes
that are possible with this implementation.

Each node of the same size receives roughly the same amount of available
pages, regardless of any reserved memory within its address range.  The total
available pages on the system is calculated and divided by the number of equal
nodes to allocate.  These nodes are then dynamically allocated and their
borders extended until such time as their number of available pages reaches
the required size.

Configurable node sizes are recommended when used in conjunction with cpusets
for memory control because it eliminates the overhead associated with scanning
the zonelists of many smaller full nodes on page_alloc().

Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
john stultz 5a90cf205c [PATCH] x86: Log reason why TSC was marked unstable
Change mark_tsc_unstable() so it takes a string argument, which holds the
reason the TSC was marked unstable.

This is then displayed the first time mark_tsc_unstable is called.

This should help us better debug why the TSC was marked unstable on certain
systems and allow us to make sure we're not being overly paranoid when
throwing out this troublesome clocksource.
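
A minimal userspace sketch of the behaviour described: the reason is passed in as a string and reported only on the first call. The function name, the message text and the use of printf instead of printk are assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>

static int tsc_unstable;      /* set once the TSC has been distrusted */
static int messages_printed;  /* counts how often the reason was logged */

/* Mark the TSC unstable; only the first caller's reason is reported. */
static void sketch_mark_tsc_unstable(const char *reason)
{
	if (!tsc_unstable) {
		tsc_unstable = 1;
		printf("Marking TSC unstable due to: %s.\n", reason);
		messages_printed++;
	}
}
```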

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:08 +02:00
Vivek Goyal 1833d6bc72 [PATCH] i386: modpost apic related warning fixes
o Modpost generates warnings for i386 if compiled with CONFIG_RELOCATABLE=y

WARNING: vmlinux - Section mismatch: reference to .init.text:find_unisys_acpi_oem_table from .text between 'acpi_madt_oem_check' (at offset 0xc0101eda) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:acpi_get_table_header_early from .text between 'acpi_madt_oem_check' (at offset 0xc0101ef0) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'acpi_madt_oem_check' (at offset 0xc0101f2e) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:setup_unisys from .text between 'acpi_madt_oem_check' (at offset 0xc0101f37) and 'enable_apic_mode'
WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'mps_oem_check' (at offset 0xc0101ec7) and 'acpi_madt_oem_check'
WARNING: vmlinux - Section mismatch: reference to .init.text:es7000_sw_apic from .text between 'enable_apic_mode' (at offset 0xc0101f48) and 'check_apicid_present'

o Some functions which are inline (acpi_madt_oem_check) are not inlined by
  compiler as these functions are accessed using function pointer. These
  functions are put in .text section and they in-turn access __init type
  functions hence modpost generates warnings.

o Do not inline acpi_madt_oem_check, instead make it __init.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:08 +02:00
Ravikiran G Thirumalai e073ae1b34 [PATCH] x86-64: Set HASHDIST_DEFAULT to 1 for x86_64 NUMA
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA

Forcing the kernel to use node interleaved vmalloc instead of bootmem for
the system hashtable memory (alloc_large_system_hash) reduces the memory
imbalance on node 0 by around 40MB on a 8 node x86_64 NUMA box:

Before the following patch, on bootup of a 8 node box:

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3206296 kB
Node 0 MemUsed:        201192 kB
Node 0 Active:           7012 kB
Node 0 Inactive:          512 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        1912 kB
Node 0 Mapped:            420 kB
Node 0 AnonPages:        5612 kB
Node 0 PageTables:        468 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             5408 kB
Node 0 SReclaimable:      644 kB
Node 0 SUnreclaim:       4764 kB

After the patch (or using hashdist=1 on the kernel command line):

Node 0 MemTotal:      3407488 kB
Node 0 MemFree:       3247608 kB
Node 0 MemUsed:        159880 kB
Node 0 Active:           3012 kB
Node 0 Inactive:          616 kB
Node 0 Dirty:               0 kB
Node 0 Writeback:           0 kB
Node 0 FilePages:        2424 kB
Node 0 Mapped:            380 kB
Node 0 AnonPages:        1200 kB
Node 0 PageTables:        396 kB
Node 0 NFS_Unstable:        0 kB
Node 0 Bounce:              0 kB
Node 0 Slab:             6304 kB
Node 0 SReclaimable:     1596 kB
Node 0 SUnreclaim:       4708 kB

I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA
since x86_64 has no dearth of vmalloc space?  Or maybe enable hash
distribution for all 64bit NUMA arches?  The following patch does it only
for x86_64.

I ran a HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a bit of
memory and uses up tlb entries.  This was on a 4 way, 2 socket
Tyan AMD box (non vsmp), with 8G total memory (4G pernode).

The results with and without hash distribution are:

1. Vanilla - runtime of 1188.000s
2. With hashdist=1 runtime of 1154.000s

Oprofile output for the duration of run is:

1. Vanilla:
CPU: AMD64 processors, speed 2411.16 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples  %        app name                 symbol name
163054    6.5513  libansys1.so             MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
162061    6.5114  libansys3.so             blockSaxpy6L_fd
162042    6.5107  libansys3.so             blockInnerProduct6L_fd
156286    6.2794  libansys3.so             maxb33_
87879     3.5309  libansys1.so             elmatrixmultpcg_
84857     3.4095  libansys4.so             saxpy_pcg
58637     2.3560  libansys4.so             .st4560
46612     1.8728  libansys4.so             .st4282
43043     1.7294  vmlinux-t                copy_user_generic_string
41326     1.6604  libansys3.so             blockSaxpyBackSolve6L_fd
41288     1.6589  libansys3.so             blockInnerProductBackSolve6L_fd

2. With hashdist=1
CPU: AMD64 processors, speed 2411.13 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples  %        app name                 symbol name
162993    6.9814  libansys1.so             MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
160799    6.8874  libansys3.so             blockInnerProduct6L_fd
160459    6.8729  libansys3.so             blockSaxpy6L_fd
156018    6.6826  libansys3.so             maxb33_
84700     3.6279  libansys4.so             saxpy_pcg
83434     3.5737  libansys1.so             elmatrixmultpcg_
58074     2.4875  libansys4.so             .st4560
46000     1.9703  libansys4.so             .st4282
41166     1.7632  libansys3.so             blockSaxpyBackSolve6L_fd
41033     1.7575  libansys3.so             blockInnerProductBackSolve6L_fd
35762     1.5318  libansys1.so             inner_product_sub
35591     1.5245  libansys1.so             inner_product_sub2
28259     1.2104  libansys4.so             addVectors

Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:08 +02:00
Andrew Morton 184c44d204 [PATCH] x86-64: fix x86_64-mm-sched-clock-share
Fix for the following patch. Provide dummy cpufreq functions when
CPUFREQ is not compiled in.

Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:08 +02:00
Vivek Goyal 6a50a664ca [PATCH] x86-64: build-time checking
o X86_64 kernel should run from 2MB aligned address for two reasons.
	- Performance.
	- For relocatable kernels, page tables are updated based on difference
	  between compile time address and load time physical address.
	  This difference should be multiple of 2MB as kernel text and data
	  is mapped using 2MB pages and PMD should be pointing to a 2MB
	  aligned address. Life is simpler if both compile time and load time
	  kernel addresses are 2MB aligned.

o Flag the error at compile time if one is trying to build a kernel which
  does not meet alignment restrictions.
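
One way such a check can be expressed is a preprocessor test that aborts the build. This is a sketch under assumptions: SKETCH_PHYSICAL_START is a hypothetical stand-in for the configured start address, and the patch itself may implement the check differently.

```c
#include <assert.h>

#define SKETCH_ALIGN          0x200000UL  /* 2MB */
#define SKETCH_PHYSICAL_START 0x200000UL  /* hypothetical configured value */

/* Refuse to build if the configured start address is not 2MB aligned. */
#if (SKETCH_PHYSICAL_START % SKETCH_ALIGN) != 0
#error "SKETCH_PHYSICAL_START must be a multiple of 2MB"
#endif

/* Run-time equivalent, e.g. for checking a load address. */
static int sketch_is_2mb_aligned(unsigned long addr)
{
	return (addr & (SKETCH_ALIGN - 1)) == 0;
}
```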

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:08 +02:00
Vivek Goyal 1ab60e0f72 [PATCH] x86-64: Relocatable Kernel Support
This patch modifies the x86_64 kernel so that it can be loaded and run
at any 2M aligned address, below 512G.  The technique used is to
compile the decompressor with -fPIC and modify it so the decompressor
is fully relocatable.  For the main kernel the page tables are
modified so the kernel remains at the same virtual address.  In
addition a variable phys_base is kept that holds the physical address
the kernel is loaded at.  __pa_symbol is modified to add that when
we take the address of a kernel symbol.
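
The address arithmetic can be sketched like this. The constant and function names are illustrative, and treating phys_base as a relocation offset added on top of the link-time layout is an assumption of this sketch rather than something the message spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual base of the kernel text mapping (illustrative value). */
#define SKETCH_START_KERNEL_MAP 0xffffffff80000000ULL

/* Relocation offset recorded by early boot code; 0 when not relocated. */
static uint64_t sketch_phys_base;

/* Virtual address of a kernel symbol to its physical address. */
static uint64_t sketch_pa_symbol(uint64_t vaddr)
{
	return vaddr - SKETCH_START_KERNEL_MAP + sketch_phys_base;
}
```

The virtual address never changes; only the value added at the end moves with the load location.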

When loaded with a normal bootloader the decompressor will decompress
the kernel to 2M and it will run there.  This both ensures the
relocation code is always working, and makes it easier to use 2M
pages for the kernel and the cpu.

AK: changed to not make RELOCATABLE default in Kconfig

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal 0dbf7028c0 [PATCH] x86: __pa and __pa_symbol address space separation
Currently __pa_symbol is for use with symbols in the kernel address
map and __pa is for use with pointers into the physical memory map.
But the code is implemented so you can usually interchange the two.

__pa, which is much more common, can be implemented much more cheaply
if it doesn't have to worry about any other kernel address
spaces.  This is especially true with a relocatable kernel, as
__pa_symbol needs to perform an extra variable read to resolve
the address.

There is a third macro that is added for the vsyscall data,
__pa_vsymbol, for finding the physical addresses of vsyscall pages.

Most of this patch is simply sorting through the references to
__pa or __pa_symbol and using the proper one.  A little of
it is continuing to use a physical address when we have it
instead of recalculating it several times.

swapper_pgd is now NULL.  leave_mm now uses init_mm.pgd
and init_mm.pgd is initialized at boot (instead of compile time)
to the physmem virtual mapping of init_level4_pgd.  The
physical address changed.

Except for the EMPTY_ZERO page all of the remaining references
to __pa_symbol appear to be during kernel initialization.  So this
should reduce the cost of __pa in the common case, even on a relocated
kernel.

As this is technically a semantic change we need to be on the lookout
for anything I missed.  But it works for me (tm).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal cfd243d4af [PATCH] x86-64: Remove the identity mapping as early as possible
With the rewrite of the SMP trampoline and the early page
allocator there is nothing that needs identity mapped pages,
once we start executing C code.

So add zap_identity_mappings into head64.c and remove
zap_low_mappings() from much later in the code.  The functions
are subtly different, thus the name change.

This also kills boot_level4_pgt which was from an earlier
attempt to move the identity mappings as early as possible,
and is now no longer needed.  Essentially I have replaced
boot_level4_pgt with trampoline_level4_pgt in trampoline.S

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal 7db681d7e4 [PATCH] x86-64: wakeup.S rename registers to reflect right names
o Use appropriate names for 64bit registers.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal 3c321bceb4 [PATCH] x86-64: Add EFER to the register set saved by save_processor_state
EFER varies like %cr4 depending on the cpu capabilities, and which cpu
capabilities we want to make use of.  So save/restore it to make certain
we have the same EFER value when we are done.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal 30f4728954 [PATCH] x86-64: cleanup segments
Move __KERNEL32_CS up into the unused gdt entry.  __KERNEL32_CS is
used when entering the kernel so putting it first is useful when
trying to keep boot gdt sizes to a minimum.

Set the accessed bit on all gdt entries.  We don't care
so there is no need for the cpu to burn the extra cycles,
and it potentially allows the pages to be immutable.  Plus
it is confusing when debugging and your gdt entries mysteriously
change.
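
For reference, the accessed bit is bit 40 of an 8-byte segment descriptor (the lowest bit of the type field). The sketch below pre-sets it on a raw descriptor value; the macro and function names are made up, and the sample descriptors are just typical flat code and data segments.

```c
#include <assert.h>
#include <stdint.h>

/* Accessed bit: bit 0 of the type field, i.e. bit 40 of the descriptor. */
#define SKETCH_DESC_ACCESSED (1ULL << 40)

/* Return the descriptor with the accessed bit already set, so the CPU
 * has no reason to write to the GDT when the segment is first loaded. */
static uint64_t sketch_set_accessed(uint64_t desc)
{
	return desc | SKETCH_DESC_ACCESSED;
}
```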

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal 67dcbb6bc6 [PATCH] x86-64: Clean up the early boot page table
- Merge physmem_pgt and ident_pgt, removing physmem_pgt.  The merge
  is broken as soon as mm/init.c:init_memory_mapping is run.
- As physmem_pgt is gone don't export it in pgtable.h.
- Use defines from pgtable.h for page permissions.
- Fix the physical memory identity mapping so it is at the correct
  address.
- Remove the physical memory mapping from wakeup_level4_pgt; it
  is at the wrong address, so we can't possibly be using it.
- Simplify NEXT_PAGE.  The work to calculate the phys_ alias
  of the labels was very cool.  Unfortunately it was a brittle
  special purpose hack that makes maintenance more difficult.
  Instead just use label - __START_KERNEL_map like we do
  everywhere else in assembly.
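
The "label - __START_KERNEL_map" idiom from the last item can be sketched in
C as follows.  This is only an illustration: the helper name is hypothetical,
and the base value is the usual x86-64 kernel text mapping address, assumed
here rather than quoted from the patch.

```c
#include <stdint.h>

/* Assumed base of the x86-64 kernel text mapping (illustrative value). */
#define START_KERNEL_MAP UINT64_C(0xffffffff80000000)

/* Hypothetical helper mirroring "label - __START_KERNEL_map" in assembly:
 * a label linked inside the kernel mapping is turned into the physical
 * address it is loaded at by subtracting the mapping base. */
static inline uint64_t kernel_label_to_phys(uint64_t label_vaddr)
{
    return label_vaddr - START_KERNEL_MAP;
}
```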

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:06 +02:00
Vivek Goyal 9d291e787b [PATCH] x86-64: Assembly safe page.h and pgtable.h
This patch makes pgtable.h and page.h safe to include
in assembly files like head.S, allowing us to use
symbolic constants instead of hard coded numbers when
referring to the page tables.

This patch copies asm-sparc64/const.h to asm-x86_64 to
get a definition of _AC(), a very convenient macro that
allows us to force the type when we are compiling the
code in C and to drop all of the type information when
we are using the constant in assembly.  Previously this
was done with multiple definitions of the same constant.
const.h was modified slightly so that it works when given
CONFIG options as arguments.

This patch adds #ifndef __ASSEMBLY__ ... #endif
and _AC(1,UL) where appropriate so the assembler won't
choke on the header files.  Otherwise nothing
should have changed.
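
A minimal sketch of how an _AC()-style macro can behave, for illustration
only.  The two-level expansion is an assumption about how const.h is written,
and the PAGE_* definitions and the page_align_down() helper are simplified
examples rather than the exact text of page.h.

```c
/* In assembly the suffix is dropped; in C it is pasted onto the constant. */
#ifdef __ASSEMBLY__
#define _AC(X, Y) X
#else
#define __AC(X, Y) (X##Y)
#define _AC(X, Y) __AC(X, Y)
#endif

/* One definition now serves both C and assembly. */
#define PAGE_SHIFT 12
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* C-only declarations stay hidden from the assembler. */
#ifndef __ASSEMBLY__
static inline unsigned long page_align_down(unsigned long addr)
{
    return addr & PAGE_MASK;
}
#endif
```

When compiled as C, PAGE_SIZE expands to (1UL) << 12 and has type unsigned
long; with __ASSEMBLY__ defined it expands to 1 << 12, which the assembler
accepts.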

AK: added const.h to exported headers to fix headers_check

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:06 +02:00
Stephen Hemminger e658450455 [PATCH] x86-64: dma_ops as const
The dma_ops structure can be const since it never changes
after boot.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:06 +02:00