The current ARC VM code has 13 flags in the Page Table entry: some software
(accessed/dirty/non-linear-maps) and the rest hardware specific. With an 8k MMU
page, we need 19 bits for addressing the page frame, so the remaining 13 bits
are just about enough to accommodate the current flags.
In MMUv4 there are 2 additional flags, SZ (normal or super page) and WT
(cache access mode write-thru) - and additionally PFN is 20 bits (vs. 19
before for 8k). Thus these can't be held in the current PTE w/o making each
entry 64-bit wide.
It seems there is some scope for compressing the current PTE flags (and
freeing up a few bits). Currently the PTE contains fully orthogonal, distinct
access permissions for kernel and user mode (Kr, Kw, Kx; Ur, Uw, Ux)
which can be folded into one set (R, W, X). The translation of 3 PTE
bits into 6 TLB bits (when programming the MMU) can be done based on the
following prerequisites/assumptions (a code sketch follows the list):
1. For kernel-mode-only translations (vmalloc: 0x7000_0000 to
0x7FFF_FFFF), PTE additionally has PAGE_GLOBAL flag set (and user
space entries can never be global). Thus such a PTE can translate
to Kr, Kw, Kx (as appropriate) and zero for User mode counterparts.
2. For non global entries, the PTE flags can be used to create mirrored
K and U TLB bits. This is true after commit a950549c67
"ARC: copy_(to|from)_user() to honor usermode-access permissions"
which ensured that user-space translations _MUST_ have same access
permissions for both U/K mode accesses so that copy_{to,from}_user()
play fair with fault based CoW break and such...
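That translation, as a rough C sketch (the flag and TLB bit names/positions
below are assumptions for illustration, not the actual ARC definitions; the
real expansion happens in the asm TLB-Miss handlers):

/* illustrative bit positions only - not the real ARC PTE/TLB layout */
#define _PAGE_READ	(1 << 0)
#define _PAGE_WRITE	(1 << 1)
#define _PAGE_EXECUTE	(1 << 2)
#define _PAGE_GLOBAL	(1 << 8)

#define _TLB_UREAD	(1 << 0)
#define _TLB_UWRITE	(1 << 1)
#define _TLB_UEXEC	(1 << 2)
#define _TLB_KREAD	(1 << 3)
#define _TLB_KWRITE	(1 << 4)
#define _TLB_KEXEC	(1 << 5)

static unsigned int pte_to_tlb_perms(unsigned long pte)
{
	unsigned int kperm = 0, uperm = 0;

	/* the folded R/W/X bits expand 1:1 into the kernel-mode TLB bits */
	if (pte & _PAGE_READ)
		kperm |= _TLB_KREAD;
	if (pte & _PAGE_WRITE)
		kperm |= _TLB_KWRITE;
	if (pte & _PAGE_EXECUTE)
		kperm |= _TLB_KEXEC;

	/* global (kernel-only, e.g. vmalloc) entries get no user bits;
	 * non-global entries get mirrored user-mode permissions */
	if (!(pte & _PAGE_GLOBAL))
		uperm = kperm >> 3;	/* U bits sit 3 below K bits here */

	return kperm | uperm;
}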
There is no such thing as a free lunch - the cost is slightly inflated
TLB-Miss Handlers.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* reduce editor lines taken by pt_regs
* ARCompact ISA specific part of TLB Miss handlers clubbed together
* cleanup some comments
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
In the exception return path, for both U/K cases, intr are already
disabled (for various existing reasons). So when we drop down to
@restore_regs, we need not redo that.
There was a subtle issue - intr were NOT being disabled for the
ret-to-kernel-but-no-preemption case - now fixed by moving the
IRQ_DISABLE further up in @resume_kernel_mode.
So what do we gain:
* Shaves off a few insns in the return path.
* Eliminates the need for IRQ_DISABLE_SAVE assembler macro for ARCv2
hence allows for entry code sharing.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
After the recent cleanups, all the exception handlers now have the same
boilerplate prologue code. Move that into a common macro.
This reduces readability but helps greatly with sharing / duplicating
entry code with ARCv2 ISA where the handlers are pretty much the same,
just the entry prologue is different (due to hardware assist).
Also while at it, add the missing FAKE_RET_FROM_EXCPN calls in a couple of
places to drop down to pure kernel mode (from exception mode) before
jumping off into "C" code.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Highlights of changes:
-Continuation of ARC MM changes from 3.10 including
zero page optimization;
Setting pagecache pages dirty by default;
Non executable stack by default;
Reducing dcache flushes for aliasing VIPT config
-Long overdue rework of pt_regs machinery - removing the unused word gutters
and adding ECR register to baseline (helps clean up a lot of low level code)
-Support for ARC gcc 4.8
-Few other preventive fixes, cosmetics, usage of Kconfig helper..
The diffstat is larger than normal primarily because of arcregs.h header split
as well as beautification of macros in entry.h
Merge tag 'arc-v3.11-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull first batch of ARC changes from Vineet Gupta:
"There's a second bunch to follow next week - which depends on commits
on other trees (irq/net). I'd have preferred the accompanying ARC
change via respective trees, but it didn't work out somehow.
Highlights of changes:
- Continuation of ARC MM changes from 3.10 including
zero page optimization
Setting pagecache pages dirty by default
Non executable stack by default
Reducing dcache flushes for aliasing VIPT config
- Long overdue rework of pt_regs machinery - removing the unused word
gutters and adding ECR register to baseline (helps clean up a lot of
low level code)
- Support for ARC gcc 4.8
- Few other preventive fixes, cosmetics, usage of Kconfig helper..
The diffstat is larger than normal primarily because of arcregs.h
header split as well as beautification of macros in entry.h"
* tag 'arc-v3.11-rc1-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (32 commits)
ARC: warn on improper stack unwind FDE entries
arc: delete __cpuinit usage from all arc files
ARC: [tlb-miss] Fix bug with CONFIG_ARC_DBG_TLB_MISS_COUNT
ARC: [tlb-miss] Extraneous PTE bit testing/setting
ARC: Adjustments for gcc 4.8
ARC: Setup Vector Table Base in early boot
ARC: Remove explicit passing around of ECR
ARC: pt_regs update #5: Use real ECR for pt_regs->event vs. synth values
ARC: stop using pt_regs->orig_r8
ARC: pt_regs update #4: r25 saved/restored unconditionally
ARC: K/U SP saved from one location in stack switching macro
ARC: Entry Handler tweaks: Simplify branch for in-kernel preemption
ARC: Entry Handler tweaks: Avoid hardcoded LIMMS for ECR values
ARC: Increase readability of entry handlers
ARC: pt_regs update #3: Remove unused gutter at start of callee_regs
ARC: pt_regs update #2: Remove unused gutter at start of pt_regs
ARC: pt_regs update #1: Align pt_regs end with end of kernel stack page
ARC: pt_regs update #0: remove kernel stack canary
ARC: [mm] Remove @write argument to do_page_fault()
ARC: [mm] Make stack/heap Non-executable by default
...
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
and are flagged as __cpuinit -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings. In any case, they are temporary and harmless.
This removes all the arch/arc uses of the __cpuinit macros from
all C files. Currently arc does not have any __CPUINIT used in
assembly files.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
With ECR now part of pt_regs
* No need to propagate from lowest asm handlers as arg
* No need to save it in tsk->thread.cause_code
* Avoid bit chopping to access the bit-fields
More code consolidation, cleanup
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
pt_regs->event was set with artificial values to identify the low level
system event (syscall trap / breakpoint trap / exceptions / interrupts).
With r8 saving out of the way, the full word can be used to save the real
ECR (Exception Cause Register) which helps identify the event naturally,
including additional info such as cause code and param.
Only for Interrupts, where ECR is not applicable, do we resort to
synthetic non ECR values.
SAVE_ALL_TRAP/EXCEPTIONS can now be merged as they both use ECR with
different runtime values.
The ptrace helpers now use the sub-fields of ECR to distinguish the
events (e.g. vector 0x25 is trap, param 0 is syscall...)
This brings the following benefits:
(1) This centralizes the location of where ECR is saved and will allow
the cleanup of the task->thread.cause_code ECR placeholder which is set
in a non-uniform way. Then ARC VM code can safely rely on it being
there for purpose of finer grained VM_EXEC dcache flush (based on
exec fault: I-TLB Miss)
(2) Further, ECR being passed around from low level handlers as arg can
be eliminated as it is part of standard reg-file in pt_regs
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Historically, pt_regs has had orig_r8, an overloaded container for
(1) a backup copy of r8 (the syscall number, for Trap Exceptions)
(2) additional system state: (syscall/Exception/Interrupt)
There is no point in keeping (1) since the syscall number is never clobbered
in-place, in pt_regs, unlike r0 which doubles as the first syscall arg as well
as the syscall return value; in case of syscall restart, the orig arg0
needs restoring (from orig_r0) after having been updated in-place with the
syscall ret value.
This further paves the way to convert (2) to contain ECR itself (rather than
the current made-up values).
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
(This is a VERY IMP change for low level interrupt/exception handling)
-----------------------------------------------------------------------
WHAT
-----------------------------------------------------------------------
* User r25 now saved in pt_regs->user_r25 (vs. tsk->thread_info.user_r25)
* This allows Low level interrupt code to unconditionally save r25
(vs. the prev version which would only do it for U->K transition).
Of course for nested interrupts, only the pt_regs->user_r25 of the
bottom-most frame is useful.
* simplifies the interrupt prologue/epilogue
* Needed for ARCv2 ISA code and done here to keep design similar with
ARCompact event handling
-----------------------------------------------------------------------
WHY
-------------------------------------------------------------------------
With CONFIG_ARC_CURR_IN_REG, r25 is used to cache "current" task pointer
in kernel mode. So when entering kernel mode from User Mode
- user r25 is specially safe-kept (being a callee reg, it is NOT part of
pt_regs which are saved by default on each interrupt/trap/exception)
- r25 loaded with current task pointer.
Further, if the interrupt was taken in kernel mode, this is skipped since we
know that r25 already has a valid "current" pointer.
With 2 levels of interrupts in the ARCompact ISA, detecting this is difficult
but still possible, since we could be in kernel mode but r25 not already saved
(in fact the stack itself might not have been switched).
A. User mode
B. L1 IRQ taken
C. L2 IRQ taken (while on 1st line of L1 ISR)
So in #C, although in kernel mode, r25 is not saved (in fact SP is not
switched at all).
Given that ARCompact has manual stack switching, we could use a bit of
trickery - the low level code would make sure that SP is only set to the
kernel mode value at the very end (after saving r25). So a non kernel mode
SP, even if in kernel mode, meant r25 was NOT saved.
The same paradigm won't work in the ARCv2 ISA since SP is auto-switched so
its setting can't be delayed/constrained.
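For reference, the r25 "current" caching mentioned above boils down to
something like the following sketch (the exact declaration in the ARC
headers may differ):

#ifdef CONFIG_ARC_CURR_IN_REG
/* dedicate r25 to hold the current task pointer while in kernel mode */
register struct task_struct *curr_arc asm("r25");
#define current		(curr_arc)
#else
#include <asm-generic/current.h>
#endif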
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This paves the way for further simplifications.
There's an overhead of 1 insn for the non-common case of interrupt taken
from kernel mode.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* use artificial PUSH/POP constructs for CORE Reg save/restore to stack
* use artificial PUSHAX/POPAX constructs for Auxiliary Space regs
* macro'ize multiple copies of callee-reg-save/restore (SAVE_R13_TO_R24)
* use BIC insn for inverse-and operation
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This is trickier than the prev two:
* context switching code saves kernel mode callee regs in the format of
struct callee_regs, thus needs adjustment. This also reduces the height
of the topmost kernel stack frame by 1 word.
* Since the kernel stack unwinder is sensitive to the height of the topmost
kernel stack frame, that needs a word of adjustment too.
ptrace needs a bit of updating since pt_regs now diverges from
user_regs_struct.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Historically, pt_regs would end at an offset of 1 word from the end of the
stack page.
----------------- -> START of page (task->stack)
| |
| thread_info |
-----------------
| |
^ ~ ~
| ~ ~
| | |
| | | <---- pt_regs used to END here
-----------------
| 1 word GUTTER |
----------------- -> End of page (START of kernel stack)
This required special "one-off" considerations in low level code.
The root cause is very likely an assumption of "empty" SP by the original
ARC kernel hackers, despite ARC700 always having been "full" SP.
So finally RIP one word gutter !
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
1. For VM_EXEC based delayed dcache/icache flush, reduces the number of
flushes.
2. Makes this security feature ON by default rather than OFF as before.
3. Applications can use mprotect() to selectively override this.
4. ELF binaries have a GNU_STACK segment which can easily override the
kernel default permissions.
For nested-functions/trampolines, gcc already auto-enables executable
stack in elf. Others needing this can use -Wl,-z,execstack option.
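As a usage illustration of point 3 (a hypothetical userspace snippet, not
part of this patch): a program that really needs an executable buffer can
opt back in with mprotect(), or link with -Wl,-z,execstack.

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

/* flip an existing buffer to read/write/execute; page size queried at
 * runtime since ARC commonly uses 8K pages */
static int make_executable(void *buf, size_t len)
{
	uintptr_t pgsz  = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)buf & ~(pgsz - 1);

	return mprotect((void *)start, len + ((uintptr_t)buf - start),
			PROT_READ | PROT_WRITE | PROT_EXEC);
}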
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* Number of (i|d)cache ways can be retrieved from BCRs and hence no need
to cross check with built-in constants
* Use of IS_ENABLED() to check for a Kconfig option
* is_not_cache_aligned() not used anymore
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* Move the various sub-system defines/types into relevant files/functions
(reduces compilation time)
* move CPU specific stuff out of asm/tlb.h into asm/mmu.h
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
gdbserver inserting a breakpoint ends up calling copy_user_page() for a
code page. The generic version (for the non-aliasing config) didn't set
the PG_arch_1 bit, hence update_mmu_cache() didn't sync dcache/icache for the
corresponding dynamic loader code page - causing garbage to be executed.
So now aliasing versions of copy_user_highpage()/clear_page() are made
default. There is no significant overhead since all of the special alias
handling code is compiled out for a non-aliasing build.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The current code uses 2 bits for determining page's dcache color, thus
sorting pages into 4 bins, whereas the aliasing dcache really has 2 bins
(8k page, 64k dcache - 4 way-set-assoc).
This can cause extraneous flushes - e.g. color 0 and 2.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The VM_EXEC check in update_mmu_cache() was getting optimized away
because of a stupid error in the definition of the macro
addr_not_cache_congruent().
The intention was to have the equivalent of the following:
if (a || (1 ? b : 0))
but we ended up with the following:
if (a || 1 ? b : 0)
And because the precedence of '||' is higher than that of '?:', gcc was
optimizing away the evaluation of <a>
Nasty Repercussions:
1. For non-aliasing configs it would mean some extraneous dcache flushes
for non-code pages if U/K mappings were not congruent.
2. For aliasing config, some needed dcache flush for code pages might
be missed if U/K mappings were congruent.
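A minimal standalone illustration of the precedence pitfall (generic macro
names, not the actual addr_not_cache_congruent() body):

#include <stdio.h>

#define GUARD_BAD(a, b)   (a || 1 ? b : 0)    /* parses as ((a || 1) ? b : 0) */
#define GUARD_GOOD(a, b)  (a || (1 ? b : 0))  /* what was actually intended   */

int main(void)
{
	int a = 1, b = 0;

	/* prints "bad=0 good=1": in the bad form the result is simply 'b',
	 * so the check on 'a' effectively disappears */
	printf("bad=%d good=%d\n", GUARD_BAD(a, b), GUARD_GOOD(a, b));
	return 0;
}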
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This manifested as grep failing pseudo-randomly:
-------------->8---------------------
[ARCLinux]$ ip address show lo | grep inet
[ARCLinux]$ ip address show lo | grep inet
[ARCLinux]$ ip address show lo | grep inet
[ARCLinux]$
[ARCLinux]$ ip address show lo | grep inet
inet 127.0.0.1/8 scope host lo
-------------->8---------------------
ARC700 MMU provides fully orthogonal permission bits per page:
Ur, Uw, Ux, Kr, Kw, Kx
The user mode page permission templates used to have all Kernel mode
access bits enabled.
This caused a tricky race condition observed with uClibc buffered file
read and UNIX pipes.
1. Read access to an anon mapped page in libc .bss: write-protected
zero_page mapped: TLB Entry installed with Ur + K[rwx]
2. grep calls libc:getc() -> buffered read layer calls read(2) with the
internal read buffer in same .bss page.
The read() call is on STDIN which has been redirected to a pipe.
read(2) => sys_read() => pipe_read() => copy_to_user()
3. Since the page has Kernel-write permission (despite being user-mode
write-protected), copy_to_user() succeeds w/o taking an MMU TLB-Miss
Exception (page-fault for ARC). core-MM is unaware that the kernel
erroneously wrote to the reserved read-only zero-page (BUG #1)
4. Control returns to userspace which now does a write to the same .bss page.
Since Linux MM is not aware that the page has been modified by the kernel, it
simply reassigns a new writable zero-init page to the mapping, losing the
prior write by the kernel - effectively zero'ing out the libc read buffer
under the hood - hence grep doesn't see the right data (BUG #2)
The fix is to make all kernel-mode access permissions mirror the
user-mode ones. Note that the kernel still has full access to pages,
when accessed directly (w/o MMU) - this fix ensures that kernel-mode
access in the copy_{to,from}_user() path uses the same faulting access model
as for pure user accesses, to keep MM fully aware of page state.
The issue is pseudo-random because it only shows up if the TLB entry
installed in #1 is present at the time of #3. If it is evicted out, due
to TLB pressure or some-such, then copy_to_user() does take a TLB Miss
Exception, with a routine write-to-anon COW processing installing a
fresh page for kernel writes and also usable as it is in userspace.
Further the issue was dormant for so long as it depends on where the
libc internal read buffer (in .bss) is mapped at runtime.
If it happens to reside in the file-backed data mapping of libc (in the
page-aligned slack space trailing the file backed data), the loader, zero
padding the slack space, does the early COW page replacement, setting
things up at the very beginning itself.
With gcc 4.8 based builds, the libc buffer got pushed out to a real
anon mapping which triggers the issue.
Reported-by: Anton Kolesov <akolesov@synopsys.com>
Cc: <stable@vger.kernel.org> # 3.9
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This is the meat of the series which prevents any dcache alias creation
by always keeping the U and K mapping of a page congruent.
If a mapping already exists, and another tries to access the page, the prev
one is flushed to the physical page (wback+inv).
Essentially flush_dcache_page()/copy_user_highpage() create a K-mapping
of a page, but try to defer flushing, unless a U-mapping exists.
When the page is actually mapped to userspace, update_mmu_cache() flushes
the K-mapping (in certain cases this can be optimised out).
Additionally flush_cache_mm(), flush_cache_range(), flush_cache_page()
handle the purging of stale userspace mappings on exit/munmap...
flush_anon_page() handles the existing U-mapping for an anon page before
the kernel reads it via the GUP path.
Note that while not complete, this is enough to boot a simple
dynamically linked Busybox based rootfs
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This preps the low level dcache flush helpers to take a vaddr argument in
addition to the existing paddr, to properly flush the VIPT dcache.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
flush_dcache_page( ) is the MM hook to ensure that a page has consistent
views between kernel and userspace. Thus it is called when
* kernel writes to a page which at some later point could get mapped to
userspace (so kernel mapping needs to be flushed-n-inv)
* kernel is about to read from a page with possible userspace mappings
(so userspace mappings needs to be made coherent with kernel ones)
However for a non-aliasing VIPT dcache, any userspace mapping will always
be congruent to the kernel mapping. Thus the d-cache need not be flushed at
all (or the flush can be delayed indefinitely).
The only reason it does need to be flushed is when mapping code pages.
Since the icache doesn't snoop the dcache, those dirty dcache lines need to be
written back to memory and the icache lines invalidated so that an icache
fetch will get the right data.
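A hedged sketch of that deferral (the page-flag name here is an assumption
used purely for illustration):

/* non-aliasing D$: don't flush now, just note that the page may be stale
 * w.r.t. the caches; the wback+inv / I$ inv happens later, and only if the
 * page ends up mapped executable into userspace */
void flush_dcache_page(struct page *page)
{
	clear_bit(PG_dc_clean, &page->flags);	/* assumed "dcache clean" marker */
}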
Decent gains on LMBench fork/exec/sh and File I/O micro-benchmarks.
(1) FPGA @ 80 MHZ
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
3.9-rc6-a Linux 3.9.0-r 80 4.79 8.72 66.7 116. 239. 8.39 30.4 4798 14.K 34.K
3.9-rc6-b Linux 3.9.0-r 80 4.79 8.62 65.4 111. 239. 8.35 29.0 3995 12.K 30.K
3.9-rc7-c Linux 3.9.0-r 80 4.79 9.00 66.1 106. 239. 8.61 30.4 2858 10.K 24.K
^^^^ ^^^^ ^^^
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
3.9-rc6-a Linux 3.9.0-r 317.8 204.2 1122.3 375.1 3522.0 4.288 20.7 126.8
3.9-rc6-b Linux 3.9.0-r 298.7 223.0 1141.6 367.8 3531.0 4.866 20.9 126.4
3.9-rc7-c Linux 3.9.0-r 278.4 179.2 862.1 339.3 3705.0 3.223 20.3 126.6
^^^^^ ^^^^^ ^^^^^ ^^^^
(2) Customer Silicon @ 500 MHz (166 MHz mem)
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
abilis-ba Linux 3.9.0-r 497 0.71 1.38 4.58 12.0 35.5 1.40 3.89 2070 5525 13.K
abilis-ca Linux 3.9.0-r 497 0.71 1.40 4.61 11.8 35.6 1.37 3.92 1411 4317 10.K
^^^^ ^^^^ ^^^
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Now that we have the same helper used for all icache invalidates (i.e.
vaddr+paddr based exact line invalidate), consolidate the open coded
calls into one place.
Also rename flush_icache_range_vaddr => __sync_icache_dcache
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
ARC icache doesn't snoop dcache, thus executable pages need to be made
coherent before mapping into userspace in flush_icache_page().
However ARC700 CDU (hardware cache flush module) requires both vaddr
(index in cache) as well as paddr (tag match) to correctly identify a
line in the VIPT cache. A typical ARC700 SoC has aliasing icache, thus
the paddr-only based flush_icache_page() API couldn't be implemented
efficiently. It had to loop thru all possible alias indexes and perform
the invalidate operation (of course the cache op would only succeed at
the index(es) where the tag matches - typically only 1, but the cost of
visiting all the cache-bins needs to be paid nevertheless).
Turns out however that the vaddr (along with paddr) is available in
update_mmu_cache() hence better suits ARC icache flush semantics.
With both vaddr+paddr, exactly one flush operation per line is done.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
munmap ends up calling tlb_flush() which for ARC was flushing the entire
TLB unconditionally (by moving the MMU to a new ASID)
do_munmap
unmap_region
unmap_vmas
unmap_single_vma
unmap_page_range
tlb_start_vma
zap_pud_range
tlb_end_vma()
tlb_finish_mmu
tlb_flush() ---> unconditional flush_tlb_mm()
So even a single page munmap, a frequent operation when uClibc dynamic
linker (ldso) is loading the dependent shared libraries, would move the
ASID multiple times - needlessly invalidating the pre-faulted TLB
entries (and increasing the rate of ASID wraparound + full TLB flush).
This is now optimised to flush the whole TLB only for the tlb->fullmm
case (i.e. exit/execve). And for those cases, flush_tlb_mm() is
already optimised to be a no-op for mm->mm_users == 0.
So essentially there are no more full mm flushes - except for fork, which
anyhow needs it for properly COW'ing the parent address space.
munmap now needs to do TLB range flush, which is implemented with
tlb_end_vma()
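The gist of the change, as a sketch (not the verbatim ARC diff):

/* only nuke the whole address space when the mmu_gather says the full mm
 * is going away (exit/execve); munmap is handled by per-vma range flushes */
#define tlb_flush(tlb)				\
do {						\
	if ((tlb)->fullmm)			\
		flush_tlb_mm((tlb)->mm);	\
} while (0)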
Results
-------
1. ASID now consistently moves by 4 during a simple ls (as opposed to 5 or
7 before).
2. LMBench microbenchmark also shows improvements
Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
3.9-rc5-0 Linux 3.9.0-r 3.9-rc5-0404-gcc-4.4-ba 80 8 64 1.1000 1
3.9-rc5-0 Linux 3.9.0-r 3.9-rc5-0405-avoid-full 80 8 64 1.1200 1
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
3.9-rc5-0 Linux 3.9.0-r 80 4.81 8.69 68.6 118. 239. 8.53 31.6 4839 13.K 34.K
3.9-rc5-0 Linux 3.9.0-r 80 4.46 8.36 53.8 91.3 223. 8.12 24.2 4725 13.K 33.K
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
3.9-rc5-0 Linux 3.9.0-r 314.7 223.2 1054.9 390.2 3615.0 1.590 20.1 126.6
3.9-rc5-0 Linux 3.9.0-r 265.8 183.8 1014.2 314.1 3193.0 6.910 18.8 110.4
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Infrastructure required to make the Linux kernel compile and boot on the
Abilis Systems TB10x series of SOCs based on ARC700 CPUs:
- Kmake related files (Kconfig, Makefile, tb10x_defconfig)
- TB10x platform initialisation
Signed-off-by: Christian Ruppert <christian.ruppert@abilis.com>
Signed-off-by: Pierrick Hascoet <pierrick.hascoet@abilis.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This patch adds some room for CPU-external interrupt controllers in the
Linux interrupt space. Until now, only the 32 CPU internal interrupt lines
were supported which does not allow for external interrupt controllers such
as GPIO modules etc.
Signed-off-by: Christian Ruppert <christian.ruppert@abilis.com>
Signed-off-by: Pierrick Hascoet <pierrick.hascoet@abilis.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
ARC irqsave/restore macros were missing the compiler barrier, causing a
stale load in the irq-enabled region to be used in the irq-safe region,
despite being changed, because the register holding the value was still live.
The problem manifested as random crashes in timer code when stress
testing ARCLinux (3.9-rc3) on a !SMP && !PREEMPT_COUNT build.
Here's the exact sequence which caused this:
(0). tv1[x] <----> t1 <---> t2
(1). mod_timer(t1) interrupted after it calls timer_pending()
(2). mod_timer(t2) completes
(3). mod_timer(t1) resumes but messes up the list
(4). __run_timers( ) uses bogus timer_list entry / crashes in
timer->function
Essentially mod_timer() was racing against itself and while the spinlock
serialized the tv1[] timer link list, timer_pending(), called outside the
spinlock, cached a timer link list element in a register.
With low register pressure (and a deep register file), the lack of a barrier
in raw_local_irqsave() as well as preempt_disable() (the !PREEMPT_COUNT
version), there was nothing to force gcc to reload across the spinlock,
causing a stale value in a reg to be used for link list manipulation -
resulting in corruption.
ARCompact disassembly which shows the culprit generated code:
mod_timer:
push_s blink
mov_s r13,r0 # timer, timer
..
###### timer_pending( )
ld_s r3,[r13] # <------ <variable>.entry.next LOADED
brne r3, 0, @.L163
.L163:
..
###### spin_lock_irq( )
lr r5, [status32] # flags
bic r4, r5, 6 # temp, flags,
and.f 0, r5, 6 # flags,
flag.nz r4
###### detach_if_pending( ) begins
tst_s r3,r3 <--------------
# timer_pending( ) checks timer->entry.next
# r3 is NOT reloaded by gcc, using stale value
beq.d @.L169
mov.eq r0,0
##### detach_timer( ): __list_del( )
ld r4,[r13,4] # <variable>.entry.prev, D.31439
st r4,[r3,4] # <variable>.prev, D.31439
st r3,[r4] # <variable>.next, D.30246
We initially tried to fix this by adding barrier() to preempt_* macros
for !PREEMPT_COUNT but Linus clarified that this was anything but correct.
http://www.spinics.net/lists/kernel/msg1512709.html
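The shape of the eventual fix is an explicit "memory" clobber in the arch
irqsave/restore inline asm, roughly as below (a sketch modelled on the
disassembly above; the constant 6 is the E1|E2 mask, other details are
illustrative):

static inline unsigned long arch_local_irq_save(void)
{
	unsigned long temp, flags;

	__asm__ __volatile__(
	"	lr	%1, [status32]	\n"	/* flags = old STATUS32       */
	"	bic	%0, %1, 6	\n"	/* temp  = flags & ~(E1|E2)   */
	"	and.f	0,  %1, 6	\n"
	"	flag.nz	%0		\n"	/* write back only if enabled */
	: "=r" (temp), "=r" (flags)
	:
	: "memory");	/* <-- the compiler barrier that was missing */

	return flags;
}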
[vgupta: updated commitlog]
Reported-by/Signed-off-by: Christian Ruppert <christian.ruppert@abilis.com>
Cc: Christian Ruppert <christian.ruppert@abilis.com>
Cc: Pierrick Hascoet <pierrick.hascoet@abilis.com>
Debugged-by/Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
orig_r8_IS_EXCPN and orig_r8_IS_BRKPT were the same value due to a
copy/paste error. Although it looks bad and is wrong, it really doesn't
affect gdb working.
orig_r8_IS_BRKPT is the one relevant to debugging (breakpoints), since
it is used to provide EFA vs. ERET to a ptrace "stop_pc" request.
So when gdb has inserted a breakpoint, orig_r8_IS_BRKPT is already set,
and anything else (i.e. orig_r8_IS_EXCPN) becoming the same as it really
doesn't hurt gdb. The corollary case could be nasty, but nobody uses the
ptrace "stop_pc" request in that case.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
CC drivers/mmc/host/mmc_spi.o
drivers/mmc/host/mmc_spi.c:118: error: redefinition of 'struct scratch'
make[3]: *** [drivers/mmc/host/mmc_spi.o] Error 1
make[2]: *** [drivers/mmc/host] Error 2
make[1]: *** [drivers/mmc] Error 2
make: *** [drivers] Error 2
CC arch/arc/kernel/kgdb.o
In file included from include/linux/kgdb.h:20,
from arch/arc/kernel/kgdb.c:11:
/home/vineetg/arc/k.org/arc-port/arch/arc/include/asm/kgdb.h:34:
warning: 'struct pt_regs' declared inside parameter list
/home/vineetg/arc/k.org/arc-port/arch/arc/include/asm/kgdb.h:34:
warning: its scope is only this definition or declaration, which is
probably not what you want
arch/arc/kernel/kgdb.c:172: error: conflicting types for 'kgdb_trap'
CC arch/arc/kernel/kgdb.o
arch/arc/kernel/kgdb.c: In function 'pt_regs_to_gdb_regs':
arch/arc/kernel/kgdb.c:62: error: dereferencing pointer to incomplete
type
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The upstream kernel ABI (v3) is different from current out-of-tree (v2):
* no-legacy-syscalls
* user_regs_struct layout has changed
So we rev up the ABI version
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The ptrace regset interface relies on ELF_NGREG as a ceiling for the size of
a user request. So any larger request (even if legit) would be clipped.
The existing def of ELF_NGREG didn't use user_regs_struct and was
technically one placeholder short (stop_pc) - although the current code
would still work because pt_regs includes a bunch of extra fields,
making
ELF_NGREG >= sizeof(struct user_regs_struct)/sizeof(long)
But we need to remove this ambiguity, especially since pt_regs should NOT
be directly associated with anything userspace-ish.
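i.e. size the regset from the exported structure, along these lines (sketch):

typedef unsigned long elf_greg_t;

/* derive the count from what userspace actually sees, not from pt_regs */
#define ELF_NGREG	(sizeof(struct user_regs_struct) / sizeof(elf_greg_t))

typedef elf_greg_t elf_gregset_t[ELF_NGREG];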
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The flat DT (currently embedded in vmlinux) is in the .init section.
The unflattened/binary tree doesn't copy strings through and references
them from the orig flat DT - which could cause a catastrophe if of_* APIs are
called post init, say from a driver which is a loadable module.
Reported-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This again is for the switch from the singleton platform SMP API to the
multi-platform paradigm.
Platform code is not yet set up to populate the callbacks; that happens
in the next commit.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
All the current platforms can work with 0x8000_0000 based dma_addr_t
since the Bus Bridges typically ignore the top bit (the only exception
was the Angel4 PCI-AHB bridge which we no longer care for).
That way we don't need plat-specific cpu-addr to bus-addr conversion.
Hooks are still provided - just in case a platform has an obscure device
which, say, needs a 0 based bus address.
That way <asm/dma_mapping.h> no longer needs to unconditionally include
<plat/dma_addr.h>.
Also verified that on the Angel4 board, other peripherals (IDE-disk / EMAC)
work fine with 0x8000_0000 based dma addresses.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
For now this will suffice for all platforms; later, exotic ones need to
get this from DeviceTree.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
-platform API is retired and instead callbacks are used
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
The orig platform code organization was a singleton design pattern - only
one platform (and board thereof) would build at a time.
Thus any platform/board specific code (e.g. irq init, early init ...)
expected by ARC common code was exported as a well defined set of APIs,
with only ONE instance ever building.
Now with the multiple-platform build requirement, that design no
longer holds - multiple board specific calls need to build at the same
time - so ARC common code can't use the API approach; it needs a
callback based design where each board registers its specific set of
functions, and at runtime, depending on board detection, the callbacks
are used from the registry.
This commit adds all the infrastructure, where board specific callbacks
are specified as a "machine description".
All the hooks are placed in the right spots, but no board callbacks are
registered yet (with MACHINE_START/END constructs) so the hooks will not run.
The next commit will actually convert the platform to this infrastructure.
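Once a board does hook in, the registration looks roughly like this (a
hedged sketch; the field names are illustrative of the machine-description
pattern rather than the exact ARC structure):

static void __init myboard_early_init(void)
{
	/* board specific early setup */
}

static void __init myboard_irq_init(void)
{
	/* board specific interrupt controller setup */
}

MACHINE_START(MYBOARD, "my-arc-board")
	.init_early	= myboard_early_init,
	.init_irq	= myboard_irq_init,
MACHINE_END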
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Implement ioremap_prot() to allow mapping IO memory with variable
protection via TLB.
Implementing this allows the /dev/mem driver to use its generic access()
VMA callback, which in turn allows ptrace to examine data in memory
mapped regions mapped via /dev/mem, such as the ARC DCCM.
The end result is that it is possible to examine values of variables
placed into DCCM in user space programs via GDB.
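A hedged usage sketch (the physical address, size and protection chosen
here are made up for illustration):

static void __iomem *map_dccm_example(void)
{
	/* map an assumed DCCM-like region with caller-chosen protection */
	return ioremap_prot(0xc0000000, SZ_8K,
			    pgprot_val(pgprot_noncached(PAGE_KERNEL)));
}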
CC: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
CC: Noam Camus <noamc@ezchip.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
1. ./genfilelist.pl arch/arc/include/asm/
2. Create arch/arc/include/uapi/asm/Kbuild as follows
+# UAPI Header export list
+include include/uapi/asm-generic/Kbuild.asm
3. ./disintegrate-one.pl arch/arc/include/{,uapi/}asm/<above-list>
4. Edit arch/arc/include/asm/Kbuild to remove ref to
asm-generic/Kbuild.asm
- To work around an empty uapi/asm/setup.h, added a placeholder comment.
- Also a manual #ifdef __ASSEMBLY__ for a late ptrace change
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: David Howells <dhowells@redhat.com>
* Includes mapping of CCMs in address space
* Annotations to move arbitrary code/data into CCM
* Moving some of the critical code/data into CCM
* Runtime detection/reporting
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
ARC700 doesn't natively support unaligned access, but it can be emulated:
-Unaligned Access Exception
-Disassembly at the Fault address to find the exact insn (long/short)
Also per Arnd's comment, we runtime control it using 2 sysctl knobs:
* SYSCTL_ARCH_UNALIGN_ALLOW: Runtime enable/disable
* SYSCTL_ARCH_UNALIGN_NO_WARN: Warn on each emulation attempt
Originally contributed by Tim Yao <tim.yao@amlogic.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Tim Yao <tim.yao@amlogic.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
-Originally written by Rajeshwar Ranga
-Derived off of generic unwinder in 2.6.19 and adapted to ARC
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Rajeshwar Ranga <rajeshwar.ranga@gmail.com>
ARC common code to enable an SMP system + ISS provided SMP extensions.
ARC700 natively lacks SMP support, hence some of the core features are
only enabled if SoCs have the necessary h/w pixie-dust. This
includes:
-Inter Processor Interrupts (IPI)
-Cache coherency
-load-locked/store-conditional
...
The low level exception handling would be completely broken in SMP
because we don't have hardware assisted stack switching. Thus a fair bit
of this code is repurposing the MMU_SCRATCH reg for event handler
prologues to keep them re-entrant.
Many thanks to Rajeshwar Ranga for his initial "major" contributions to
SMP Port (back in 2008), and to Noam Camus and Gilad Ben-Yossef for help
with resurrecting that in 3.2 kernel (2012).
Note that this platform code is again a singleton design pattern - so
multiple SMP platforms won't build at the moment - this deficiency is
addressed in subsequent patches within this series.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rajeshwar Ranga <rajeshwar.ranga@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
There is a bit of a hack/kludge right now where we disable preemption if an
L2 (High prio) IRQ is taken while L1 (Low prio) is active.
Need to revisit this.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This is the minimal infrastructure needed for devicetree work.
It uses a sample "skeleton" devicetree - embedded in the kernel image -
to print the board and manufacturer by parsing the top-level "compatible"
string.
As of now we don't need any additional "board" specific "machine_desc".
TODO: support interpreting the command line as boot-loader passed dtb
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: James Hogan <james.hogan@imgtec.com>
Reviewed-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
N.B. This is the old style of hardcoding platform device specific info
in code and instantiating it using platform_add_devices().
Subsequent patches replace this with a DeviceTree based runtime probe.
This patch has been retained just as an example of "don't-do-this" for
newer kernel ports.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
The ARC700 MMU provides for tagging TLB entries with an 8-bit ASID to avoid
having to flush the TLB on every task switch.
It also allows for a quick way to invalidate all the TLB entries for a
task, useful for:
* COW semantics during fork()
* task exit()ing
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
* ARC700 has VIPT L1 Caches
* Caches don't snoop and are not coherent
* Given the PAGE_SIZE and Cache associativity, we don't support aliasing
D$ configurations (yet), but do allow aliasing I$ configs
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Per Al Viro's "signals for dummies" https://lkml.org/lkml/2012/12/6/366
there are 3 golden rules for (not) restarting syscalls:
" What we need to guarantee is
* restarts do not happen on signals caught in interrupts or exceptions
* restarts do not happen on signals caught in sigreturn()
* restart should happen only once, even if we get through do_signal()
many times."
The ARC port already handled #1; this patch fixes #2 and #3.
We use the additional state in pt_regs->orig_r8 to check if restarting
has already been done once.
Thanks to Al Viro for spotting this.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
To avoid multiple syscall restarts (multiple signals) or no restart at
all (sigreturn), we need just an extra bit of state "literally 1 bit" in
struct pt_regs. orig_r8 is the best place to do this, however given the
way it is currently encoded, we can't simply add anything to it.
Current orig_r8:
* syscalls -> 1 to NR_SYSCALLS
* Exceptions -> NR_SYSCALLS + 1
* Break-point-> NR_SYSCALLS + 2
In the new scheme it is a bit-field:
* lower short word contains the exact event type (and a new bit to represent
restart semantics : if syscall was already / can't be restarted)
* upper short word optionally contains the syscall num - needed by the
likes of tracehooks etc.
This patch only changes how orig_r8 is organised; nothing should
change behaviour-wise.
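Conceptually the new layout is something along these lines (a sketch; the
actual bit values and the name of the restart bit may differ in the patch):

/* lower 16 bits: event type + restart state; upper 16 bits: syscall num */
#define orig_r8_IS_SCALL		0x0001
#define orig_r8_IS_SCALL_RESTARTED	0x0002	/* the new "1 bit" of state */
#define orig_r8_IS_BRKPT		0x0004
#define orig_r8_IS_EXCPN		0x0008
#define orig_r8_IS_IRQ1			0x0010
#define orig_r8_IS_IRQ2			0x0020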
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Includes the following fixes, courtesy of review by Al Viro:
* Tracer pokes to Callee-regs were lost
Before going off into do_signal( ) we save the user-mode callee regs
(as they are not saved by default as part of pt_regs). This is to make
sure that a Tracer (for a tracing related signal) is able to do the likes
of PEEKUSR(callee-reg).
However in the return path we were simply discarding the user-mode callee
regs, which would break a POKEUSR(callee-reg) from a tracer.
* Issue related to multiple syscall restarts are addressed in next patch
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Jonas Bonn <jonas@southpole.se>
ARC700 includes 2 in-core 32-bit timers, TIMER0 and TIMER1.
Both have exactly the same capabilities:
* programmable to count from TIMER<n>_CNT to TIMER<n>_LIMIT
* for count 0 and LIMIT ~1, provides a free-running counter by
auto-wrapping when limit is reached.
* optionally interrupt when LIMIT is reached (oneshot event semantics)
* rearming the interrupt provides periodic semantics
* run at CPU clk
ARC Linux uses TIMER0 for clockevent (periodic/oneshot) and TIMER1 for
clocksource (free-running clock).
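As an illustration of the clockevent side, arming TIMER0 to fire after a
given number of CPU clocks looks roughly like this (the aux register
numbers and control bits are assumptions for the sketch; write_aux_reg()
is the usual ARC aux-space accessor):

#define ARC_REG_TIMER0_CNT	0x21	/* assumed aux reg: current count */
#define ARC_REG_TIMER0_CTRL	0x22	/* assumed aux reg: control       */
#define ARC_REG_TIMER0_LIMIT	0x23	/* assumed aux reg: limit/reload  */
#define TIMER_CTRL_IE		(1 << 0)	/* interrupt when CNT == LIMIT     */
#define TIMER_CTRL_NH		(1 << 1)	/* count only while CPU is running */

static void arc_timer0_arm_oneshot(unsigned int cycles)
{
	write_aux_reg(ARC_REG_TIMER0_LIMIT, cycles);
	write_aux_reg(ARC_REG_TIMER0_CNT, 0);
	write_aux_reg(ARC_REG_TIMER0_CTRL, TIMER_CTRL_IE | TIMER_CTRL_NH);
}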
Newer cores provide the RTSC insn, which gives a 64-bit cpu clk snapshot and
hence is more apt for clocksource when available.
SMP poses a bit of a challenge for the global timekeeping clocksource /
sched_clock() backend:
-TIMER1 based local clocks are out-of-sync hence can't be used
(thus we default to a jiffies based cs as well as sched_clock(), one/both
of which the platform can override with its specific hardware assist)
-RTSC is only allowed in SMP if it is cross-core-synced (Kconfig glue
ensures that) and thus usable for both requirements.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
This includes support for generic clone/fork/vfork/execve
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Hand optimised asm code for ARC700 pipeline.
Originally written/optimized by Joern Rennecke
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Joern Rennecke <joern.rennecke@embecosm.com>
* L1_CACHE_SHIFT
* PAGE_SIZE, PAGE_OFFSET
* struct pt_regs, struct user_regs_struct
* struct thread_struct, cpu_relax(), task_pt_regs(), start_thread(), ...
* struct thread_info, THREAD_SIZE, INIT_THREAD_INFO(), TIF_*, ...
* BUG()
* ELF_*
* Elf_*
To disallow user-space visibility into some of the core kernel data-types
such as struct pt_regs, wrap them in #ifdef __KERNEL__, which also makes the
UAPI header split (a further patch in the series) NOT export them to
uapi/asm/ptrace.h.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Jonas Bonn <jonas.bonn@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Override the asm-generic implementations. We basically gain on 2 fronts:
* checks for alignment are no longer needed as we are only doing "unit"
sized copies.
(A careful observer could argue that while the kernel buffers are aligned,
the user buffer in theory might not be - however in that case the
user space is already broken when it tries to deref a hword/word
straddling a word boundary - so we are not making it any worse).
* __copy_{to,from}_user( ) returns the bytes that couldn't be copied,
whereas get_user() returns 0 for success or -EFAULT (not a size). Thus
the code to do the leftover bytes calculation can be avoided as well.
The savings were significant: ~17k of code.
bloat-o-meter vmlinux_uaccess_pre vmlinux_uaccess_post
add/remove: 0/4 grow/shrink: 8/118 up/down: 1262/-18758 (-17496)
^^^^^^^^^
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
This covers the UP / SMP (with no hardware assist for atomic r-m-w) cases
as well as the ARC700 LLOCK/SCOND insn based variant.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
ARC700 has an in-core intc which provides 2 priorities (a.k.a. "levels")
of interrupts (per IRQ), henceforth referred to as L1/L2 interrupts.
CPU flags register STATUS32 has Interrupt Enable bits per level (E1/E2)
to globally enable (or disable) all IRQs at a level. Hence the
implementation of arch_local_irq_{save,restore,enable,disable}( )
The STATUS32 reg can be r/w only using the AUX Interface of ARC, hence
the use of LR/SR instructions. Further, E1/E2 bits in there can only be
updated using the FLAG insn.
The intc supports 32 interrupts - and per IRQ enabling is controlled by
a bit in the AUX_IENABLE register, hence the implementation of the
arch_{,un}mask_irq( ) routines.
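i.e. something along these lines (a sketch; assumes the read_aux_reg()/
write_aux_reg() accessors and a 1:1 mapping of irq number to AUX_IENABLE bit):

static void arc_mask_irq(struct irq_data *data)
{
	unsigned int ienb = read_aux_reg(AUX_IENABLE);

	write_aux_reg(AUX_IENABLE, ienb & ~(1U << data->irq));
}

static void arc_unmask_irq(struct irq_data *data)
{
	unsigned int ienb = read_aux_reg(AUX_IENABLE);

	write_aux_reg(AUX_IENABLE, ienb | (1U << data->irq));
}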
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Arnd in his review pointed out that the arch Kconfig organisation has several
deficiencies:
* Build time entries for things which can be extracted at runtime from DT
(e.g. SDRAM size, core clk frequency..)
* Not multi-platform-image-build friendly (choice .. endchoice constructs)
* CPU variant support (750/770) is exclusive.
The first 2 have been fixed in subsequent patches.
Due to the nature of the 750 and 770, it is not possible to build for
both together, w/o special runtime glue code which would hurt
performance.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>