utf8data.h contains a large database table which is an auto-generated
decodification trie for the unicode normalization functions.
Allow building it into a separate module.
Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
In old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.
It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.
This commit renames like follows:
always -> always-y
hostprogs-y -> hostprogs
So, scripts/Makefile will look like this:
always-$(CONFIG_BUILD_BIN2C) += ...
always-$(CONFIG_KALLSYMS) += ...
...
hostprogs := $(always-y) $(always-m)
I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.
The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
scripts/mkutf8data is used only when regenerating utf8data.h,
which never happens in the normal kernel build. However, it is
irrespectively built if CONFIG_UNICODE is enabled.
Moreover, there is no good reason for it to reside in the scripts/
directory since it is only used in fs/unicode/.
Hence, move it from scripts/ to fs/unicode/.
In some cases, we bypass build artifacts in the normal build. The
conventional way to do so is to surround the code with ifdef REGENERATE_*.
For example,
- 7373f4f83c ("kbuild: add implicit rules for parser generation")
- 6aaf49b495 ("crypto: arm,arm64 - Fix random regeneration of S_shipped")
I rewrote the rule in a more kbuild'ish style.
In the normal build, utf8data.h is just shipped from the check-in file.
$ make
[ snip ]
SHIPPED fs/unicode/utf8data.h
CC fs/unicode/utf8-norm.o
CC fs/unicode/utf8-core.o
CC fs/unicode/utf8-selftest.o
AR fs/unicode/built-in.a
If you want to generate utf8data.h based on UCD, put *.txt files into
fs/unicode/, then pass REGENERATE_UTF8DATA=1 from the command line.
The mkutf8data tool will be automatically compiled to generate the
utf8data.h from the *.txt files.
$ make REGENERATE_UTF8DATA=1
[ snip ]
HOSTCC fs/unicode/mkutf8data
GEN fs/unicode/utf8data.h
CC fs/unicode/utf8-norm.o
CC fs/unicode/utf8-core.o
CC fs/unicode/utf8-selftest.o
AR fs/unicode/built-in.a
I renamed the check-in utf8data.h to utf8data.h_shipped so that this
will work for the out-of-tree build.
You can update it based on the latest UCD like this:
$ make REGENERATE_UTF8DATA=1 fs/unicode/
$ cp fs/unicode/utf8data.h fs/unicode/utf8data.h_shipped
Also, I added entries to .gitignore and dontdiff.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This implements a in-kernel sanity test module for the utf8
normalization core. At probe time, it will run basic sequences through
the utf8n core, to identify problems will equivalent sequences and
normalization/casefold code. This is supposed to be useful for
regression testing when adding support for a new version of utf8 to
linux.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch integrates the utf8n patches with some higher level API to
perform UTF-8 string comparison, normalization and casefolding
operations. Implemented is a variation of NFD, and casefold is
performed by doing full casefold on top of NFD. These algorithms are
based on the core implemented by Olaf Weber from SGI.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfdi and
nfdicf.
nfdi:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.
nfdicf:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFFF are not encoded.
- The shortest possible encoding is used for all values.
The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
From the original SGI patch and for conformity with coding standards,
the utf8data_t typedef was dropped, since it was just masking the struct
keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I
decided to keep it, since they are simple pointers to memory buffers,
and using uchars here wouldn't provide any more meaningful information.
From the original submission, we also converted from the compatibility
form to canonical.
Changes made by Gabriel:
Rebase to Mainline
Fix up checkpatch.pl warnings
Drop typedefs
move out of libxfs
Convert from NFKD to NFD
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The decomposition and casefolding of UTF-8 characters are described in a
prefix tree in utf8data.h, which is a generate from the Unicode
Character Database (UCD), published by the Unicode Consortium, and
should not be edited by hand. The structures in utf8data.h are meant to
be used for lookup operations by the unicode subsystem, when decoding a
utf-8 string.
mkutf8data.c is the source for a program that generates utf8data.h. It
was written by Olaf Weber from SGI and originally proposed to be merged
into Linux in 2014. The original proposal performed the compatibility
decomposition, NFKD, but the current version was modified by me to do
canonical decomposition, NFD, as suggested by the community. The
changes from the original submission are:
* Rebase to mainline.
* Fix out-of-tree-build.
* Update makefile to build 11.0.0 ucd files.
* drop references to xfs.
* Convert NFKD to NFD.
* Merge back robustness fixes from original patch. Requested by
Dave Chinner.
The original submission is archived at:
<https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>
The utf8data.h file can be regenerated using the instructions in
fs/unicode/README.utf8data.
- Notes on the update from 8.0.0 to 11.0:
The structure of the ucd files and special cases have not experienced
any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition
of Cherokee LC characters, which is an interesting case for
case-folding. The update is accompanied by new tests on the test_ucd
module to catch specific cases. No changes to mkutf8data script were
required for the updates.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>