OpenCloudOS-Kernel/arch/powerpc/include/asm/percpu.h

#ifndef _ASM_POWERPC_PERCPU_H_
#define _ASM_POWERPC_PERCPU_H_
#ifdef __powerpc64__

/*
 * Same as asm-generic/percpu.h, except that we store the per cpu offset
 * in the paca. Based on the x86-64 implementation.
 */

#ifdef CONFIG_SMP

#include <asm/paca.h>

#define __my_cpu_offset local_paca->data_offset

#endif /* CONFIG_SMP */
#endif /* __powerpc64__ */

#include <asm-generic/percpu.h>

#endif /* _ASM_POWERPC_PERCPU_H_ */
[PATCH] powerpc/64: per cpu data optimisations The current ppc64 per cpu data implementation is quite slow. eg: lhz 11,18(13) /* smp_processor_id() / ld 9,.LC63-.LCTOC1(30) / per_cpu__variable_name / ld 8,.LC61-.LCTOC1(30) / __per_cpu_offset / sldi 11,11,3 / form index into __per_cpu_offset / mr 10,9 ldx 9,11,8 / __per_cpu_offset[smp_processor_id()] / ldx 0,10,9 / load per cpu data / 5 loads for something that is supposed to be fast, pretty awful. One reason for the large number of loads is that we have to synthesize 2 64bit constants (per_cpu__variable_name and __per_cpu_offset). By putting __per_cpu_offset into the paca we can avoid the 2 loads associated with it: ld 11,56(13) / paca->data_offset / ld 9,.LC59-.LCTOC1(30) / per_cpu__variable_name / ldx 0,9,11 / load per cpu data Longer term we can should be able to do even better than 3 loads. If per_cpu__variable_name wasnt a 64bit constant and paca->data_offset was in a register we could cut it down to one load. A suggestion from Rusty is to use gcc's __thread extension here. In order to do this we would need to free up r13 (the __thread register and where the paca currently is). So far Ive had a few unsuccessful attempts at doing that :) The patch also allocates per cpu memory node local on NUMA machines. This patch from Rusty has been sitting in my queue _forever_ but stalled when I hit the compiler bug. Sorry about that. Finally I also only allocate per cpu data for possible cpus, which comes straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128) and 4 possible cpus we see some nice gains: total used free shared buffers cached Mem: 4012228 212860 3799368 0 0 162424 total used free shared buffers cached Mem: 4016200 212984 3803216 0 0 162424 A saving of 3.75MB. Quite nice for smaller machines. Note: we now have to be careful of per cpu users that touch data for !possible cpus. At this stage it might be worth making the NUMA and possible cpu optimisations generic, but per cpu init is done so early we have to be careful that all architectures have their possible map setup correctly. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> 2006-01-11 10:16:44 +08:00			`#ifndef _ASM_POWERPC_PERCPU_H_`
			`#define _ASM_POWERPC_PERCPU_H_`
			`#ifdef __powerpc64__`

			`/*`
			`* Same as asm-generic/percpu.h, except that we store the per cpu offset`
			`* in the paca. Based on the x86-64 implementation.`
			`*/`

			`#ifdef CONFIG_SMP`

			`#include <asm/paca.h>`

percpu: fix DEBUG_PREEMPT per_cpu checking 2.6.25-rc1 percpu changes broke CONFIG_DEBUG_PREEMPT's per_cpu checking on several architectures. On s390, sparc64 and x86 it's been weakened to not checking at all; whereas on powerpc64 it's become too strict, issuing warnings from __raw_get_cpu_var in io_schedule and init_timer for example. Fix this by weakening powerpc's __my_cpu_offset to use the non-checking local_paca instead of get_paca (which itself contains such a check); and strengthening the generic my_cpu_offset to go the old slow way via smp_processor_id when CONFIG_DEBUG_PREEMPT (debug_smp_processor_id is where all the knowledge of what's correct when lives). Signed-off-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: Mike Travis <travis@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 2008-02-24 03:40:17 +08:00			`#define __my_cpu_offset local_paca->data_offset`
[PATCH] powerpc/64: per cpu data optimisations The current ppc64 per cpu data implementation is quite slow. eg: lhz 11,18(13) /* smp_processor_id() / ld 9,.LC63-.LCTOC1(30) / per_cpu__variable_name / ld 8,.LC61-.LCTOC1(30) / __per_cpu_offset / sldi 11,11,3 / form index into __per_cpu_offset / mr 10,9 ldx 9,11,8 / __per_cpu_offset[smp_processor_id()] / ldx 0,10,9 / load per cpu data / 5 loads for something that is supposed to be fast, pretty awful. One reason for the large number of loads is that we have to synthesize 2 64bit constants (per_cpu__variable_name and __per_cpu_offset). By putting __per_cpu_offset into the paca we can avoid the 2 loads associated with it: ld 11,56(13) / paca->data_offset / ld 9,.LC59-.LCTOC1(30) / per_cpu__variable_name / ldx 0,9,11 / load per cpu data Longer term we can should be able to do even better than 3 loads. If per_cpu__variable_name wasnt a 64bit constant and paca->data_offset was in a register we could cut it down to one load. A suggestion from Rusty is to use gcc's __thread extension here. In order to do this we would need to free up r13 (the __thread register and where the paca currently is). So far Ive had a few unsuccessful attempts at doing that :) The patch also allocates per cpu memory node local on NUMA machines. This patch from Rusty has been sitting in my queue _forever_ but stalled when I hit the compiler bug. Sorry about that. Finally I also only allocate per cpu data for possible cpus, which comes straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128) and 4 possible cpus we see some nice gains: total used free shared buffers cached Mem: 4012228 212860 3799368 0 0 162424 total used free shared buffers cached Mem: 4016200 212984 3803216 0 0 162424 A saving of 3.75MB. Quite nice for smaller machines. Note: we now have to be careful of per cpu users that touch data for !possible cpus. At this stage it might be worth making the NUMA and possible cpu optimisations generic, but per cpu init is done so early we have to be careful that all architectures have their possible map setup correctly. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> 2006-01-11 10:16:44 +08:00
POWERPC: use generic per cpu Powerpc has a way to determine the address of the per cpu area of the currently executing processor via the paca and the array of per cpu offsets is avoided by looking up the per cpu area from the remote paca's (copying x86_64). Cc: Paul Mackerras <paulus@samba.org> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Olof Johansson <olof@lixom.net> Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> 2008-01-31 06:27:58 +08:00			`#endif /* CONFIG_SMP */`
			`#endif /* __powerpc64__ */`
[PATCH] powerpc/64: per cpu data optimisations The current ppc64 per cpu data implementation is quite slow. eg: lhz 11,18(13) /* smp_processor_id() / ld 9,.LC63-.LCTOC1(30) / per_cpu__variable_name / ld 8,.LC61-.LCTOC1(30) / __per_cpu_offset / sldi 11,11,3 / form index into __per_cpu_offset / mr 10,9 ldx 9,11,8 / __per_cpu_offset[smp_processor_id()] / ldx 0,10,9 / load per cpu data / 5 loads for something that is supposed to be fast, pretty awful. One reason for the large number of loads is that we have to synthesize 2 64bit constants (per_cpu__variable_name and __per_cpu_offset). By putting __per_cpu_offset into the paca we can avoid the 2 loads associated with it: ld 11,56(13) / paca->data_offset / ld 9,.LC59-.LCTOC1(30) / per_cpu__variable_name / ldx 0,9,11 / load per cpu data Longer term we can should be able to do even better than 3 loads. If per_cpu__variable_name wasnt a 64bit constant and paca->data_offset was in a register we could cut it down to one load. A suggestion from Rusty is to use gcc's __thread extension here. In order to do this we would need to free up r13 (the __thread register and where the paca currently is). So far Ive had a few unsuccessful attempts at doing that :) The patch also allocates per cpu memory node local on NUMA machines. This patch from Rusty has been sitting in my queue _forever_ but stalled when I hit the compiler bug. Sorry about that. Finally I also only allocate per cpu data for possible cpus, which comes straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128) and 4 possible cpus we see some nice gains: total used free shared buffers cached Mem: 4012228 212860 3799368 0 0 162424 total used free shared buffers cached Mem: 4016200 212984 3803216 0 0 162424 A saving of 3.75MB. Quite nice for smaller machines. Note: we now have to be careful of per cpu users that touch data for !possible cpus. At this stage it might be worth making the NUMA and possible cpu optimisations generic, but per cpu init is done so early we have to be careful that all architectures have their possible map setup correctly. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> 2006-01-11 10:16:44 +08:00
[PATCH] Move all the very similar files to asm-powerpc They differed in either simple comments or in the protecting ifdefs. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Mackerras <paulus@samba.org> 2005-08-29 12:08:11 +08:00			`#include <asm-generic/percpu.h>`
[PATCH] powerpc/64: per cpu data optimisations The current ppc64 per cpu data implementation is quite slow. eg: lhz 11,18(13) /* smp_processor_id() / ld 9,.LC63-.LCTOC1(30) / per_cpu__variable_name / ld 8,.LC61-.LCTOC1(30) / __per_cpu_offset / sldi 11,11,3 / form index into __per_cpu_offset / mr 10,9 ldx 9,11,8 / __per_cpu_offset[smp_processor_id()] / ldx 0,10,9 / load per cpu data / 5 loads for something that is supposed to be fast, pretty awful. One reason for the large number of loads is that we have to synthesize 2 64bit constants (per_cpu__variable_name and __per_cpu_offset). By putting __per_cpu_offset into the paca we can avoid the 2 loads associated with it: ld 11,56(13) / paca->data_offset / ld 9,.LC59-.LCTOC1(30) / per_cpu__variable_name / ldx 0,9,11 / load per cpu data Longer term we can should be able to do even better than 3 loads. If per_cpu__variable_name wasnt a 64bit constant and paca->data_offset was in a register we could cut it down to one load. A suggestion from Rusty is to use gcc's __thread extension here. In order to do this we would need to free up r13 (the __thread register and where the paca currently is). So far Ive had a few unsuccessful attempts at doing that :) The patch also allocates per cpu memory node local on NUMA machines. This patch from Rusty has been sitting in my queue _forever_ but stalled when I hit the compiler bug. Sorry about that. Finally I also only allocate per cpu data for possible cpus, which comes straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128) and 4 possible cpus we see some nice gains: total used free shared buffers cached Mem: 4012228 212860 3799368 0 0 162424 total used free shared buffers cached Mem: 4016200 212984 3803216 0 0 162424 A saving of 3.75MB. Quite nice for smaller machines. Note: we now have to be careful of per cpu users that touch data for !possible cpus. At this stage it might be worth making the NUMA and possible cpu optimisations generic, but per cpu init is done so early we have to be careful that all architectures have their possible map setup correctly. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> 2006-01-11 10:16:44 +08:00
			`#endif /* _ASM_POWERPC_PERCPU_H_ */`