llvm-project

Commit Graph

Author	SHA1	Message	Date
Corentin Jabot	c92056d038	[Clang][C++23] P2071 Named universal character escapes Implements P2071 Named Universal Character Escapes (https://wg21.link/p2071r1) as an extension in all language modes; the patch to not warn in C++23 mode will be done later, once this paper is plenary approved (in July). We add * A code generator that transforms `UnicodeData.txt` and `NameAliases.txt` to a space-efficient data structure that can be queried in `O(NameLength)` * A set of functions in `Unicode.h` to query that data, including * A function to find an exact match of a given Unicode character name * A function to perform a loose (ignoring case, space, underscore, medial hyphen) matching * A function returning the best matching codepoint for a given string per edit distance * Support of `\N{}` escape sequences in string and character literals, with loose-matching and typo diagnostics/fixits * Support of `\N{}` as UCN with loose-matching diagnostics/fixits. Loose matching is considered an error to match closely the semantics of P2071. The generated data contributes 280kB of data to the binaries. `UnicodeData.txt` and `NameAliases.txt` are not committed to the repository in this patch, and regenerating the data is a manual process. Reviewed By: tahonermann Differential Revision: https://reviews.llvm.org/D123064	2022-06-25 19:03:33 +02:00
Sam McCall	817550919e	[Lex] Don't assert when decoding invalid UCNs. Currently if a lexically-valid UCN encodes an invalid codepoint, then we diagnose that, and then hit an assertion while trying to decode it. Since there isn't anything preventing us reaching this state, remove the assertion. expandUCNs("X\UAAAAAAAAY") will produce "XY". Differential Revision: https://reviews.llvm.org/D125059	2022-05-06 08:51:42 +02:00
Aaron Ballman	7de7161304	Use functions with prototypes when appropriate; NFC A significant number of our tests in C accidentally use functions without prototypes. This patch converts the function signatures to have a prototype for the situations where the test is not specific to K&R C declarations. e.g., void func(); becomes void func(void); This is the sixth batch of tests being updated (there are a significant number of other tests left to be updated).	2022-02-09 17:16:10 -05:00
Corentin Jabot	afb6223bc5	Support Unicode 14 identifiers This updates the UAX tables to support new Unicode 14 identifiers.	2021-09-16 13:21:27 -04:00
Aaron Ballman	9f27364377	Use a more general test here. The interesting bit about that triple isn't the architecture, it's the fact that ps4 implies C99 as the standard rather than a newer C mode. Specify the language standard rather than the triple so the test is a bit more general.	2021-08-18 09:32:05 -04:00
Corentin Jabot	2715c4da50	Do not emit diagnostics for invalid unicode characters in preprocessing mode This amends `4e80636db7` with a fix for https://lab.llvm.org/buildbot/#/builders/139/builds/8943	2021-08-18 09:12:36 -04:00
Corentin Jabot	4e80636db7	Implement P1949 This adds the Unicode 13 data for XID_Start and XID_Continue. The definition of a valid identifier is changed in all C++ modes as P1949 (https://wg21.link/p1949) was accepted by WG21 as a defect report.	2021-08-18 07:33:14 -04:00
Richard Smith	4e966e8135	Don't emit "will be treated as an identifier character" warning for UTF-8 characters that aren't identifier characters in the current language mode. llvm-svn: 343040	2018-09-25 22:34:45 +00:00
Richard Smith	8ed7776bc4	PR38870: Add warning for zero-width unicode characters appearing in identifiers. llvm-svn: 341700	2018-09-07 19:25:39 +00:00
Richard Smith	77091b167f	Warn if we find a Unicode homoglyph for a symbol in an identifier. Specifically, warn if: * we find a character that the language standard says we must treat as an identifier, and * that character is not reasonably an identifier character (it's a punctuation character or similar), and * it renders identically to a valid non-identifier character in common fixed-width fonts. Some tools "helpfully" substitute the surprising characters for the expected characters, and replacing semicolons with Greek question marks is a common "prank". llvm-svn: 320697	2017-12-14 13:15:08 +00:00
Richard Smith	664798c034	Add test that we correctly allow some non-letter unicode characters in identifiers, and extend existing test to also cover C++. llvm-svn: 248079	2015-09-19 02:14:12 +00:00
Jordan Rose	cc538345be	Lexer: Don't warn about Unicode in preprocessor directives. This allows people to use Unicode in their #pragma mark and in macros that exist only to be string-ized. <rdar://problem/13107323&13121362> llvm-svn: 174081	2013-01-31 19:48:48 +00:00
Jordan Rose	17441589c3	Don't warn about Unicode characters in -E mode. People use the C preprocessor for things other than C files. Some of them have Unicode characters. We shouldn't warn about Unicode characters appearing outside of identifiers in this case. There's not currently a way for the preprocessor to tell if it's in -E mode, so I added a new flag, derived from the PreprocessorOutputOptions. This is only used by the Unicode warnings for now, but could conceivably be used by other warnings or even behavioral differences later. <rdar://problem/13107323> llvm-svn: 173881	2013-01-30 01:52:57 +00:00
Jordan Rose	4246ae0089	As an extension, treat Unicode whitespace characters as whitespace. llvm-svn: 173370	2013-01-24 20:50:50 +00:00

14 Commits