llvm-project/llvm/unittests/Support/DJBTest.cpp

Resubmit r325107 (case folding DJB hash)

The issue was that the hash function was generating different results depending on the signedness of char on the host platform. This commit fixes the issue by explicitly using an unsigned char type to prevent sign extension and adds some extra tests.

The original commit message was:

This patch implements a variant of the DJB hash function which folds the input according to the algorithm in the DWARF 5 specification (Section 6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18, "Case Mappings").

To achieve this, I have added a llvm::sys::unicode::foldCharSimple function, which performs this mapping. The implementation of this function was generated from the CaseFolding.txt file from the Unicode spec using a Python script (which is also included in this patch). The script tries to optimize the function by coalescing adjacent mappings with the same shift and stride (terms I made up). Theoretically, it could be made a bit smarter and merge adjacent blocks that were interrupted by only one or two characters with exceptional mapping, but this would save only a couple of branches while greatly complicating the implementation, so I deemed it not worth it.

Since we assume that the vast majority of the input characters will be US-ASCII, the folding hash function has a fast path for handling these, and only falls back to the full decode+fold+encode logic if we encounter a character outside of this range. It might be possible to implement the folding directly on UTF-8 sequences, but this would also bring a lot of complexity for the few cases where we will actually need to process non-ASCII characters.

Reviewers: JDevlieghere, aprantl, probinson, dblaikie
Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits
Differential Revision: https://reviews.llvm.org/D42740
llvm-svn: 325732
2018-02-22 06:36:31 +08:00
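The sign-extension problem described in the commit message can be sketched in a few lines. This is a minimal illustration, not LLVM's implementation: the names djbSketch and asciiFoldingDjbSketch are hypothetical, and the folding variant handles only ASCII letters, whereas the real caseFoldingDjbHash also decodes and folds non-ASCII code points. The expected values are the ones listed in the tests in this file.

```cpp
#include <cstdint>

// Hypothetical sketch of the classic DJB hash: h = h * 33 + byte, seeded
// with 5381. Each byte is cast to unsigned char so that bytes >= 0x80 are
// not sign-extended on hosts where plain char is signed.
constexpr std::uint32_t djbSketch(const char *S, std::uint32_t H = 5381) {
  for (; *S; ++S)
    H = H * 33 + static_cast<unsigned char>(*S);
  return H;
}

// Hypothetical ASCII-only folding variant: 'A'..'Z' are lowered before
// hashing. Bytes outside that range are hashed unchanged, so this does NOT
// perform the Unicode case folding that the real function implements.
constexpr std::uint32_t asciiFoldingDjbSketch(const char *S,
                                              std::uint32_t H = 5381) {
  for (; *S; ++S) {
    unsigned char C = static_cast<unsigned char>(*S);
    if (C >= 'A' && C <= 'Z')
      C = static_cast<unsigned char>(C + ('a' - 'A'));
    H = H * 33 + C;
  }
  return H;
}

// Values taken from the knownValuesLowerCase and knownValuesUnicode tests.
static_assert(djbSketch("") == 5381u, "seed value");
static_assert(djbSketch("foo") == 193491849u, "ASCII input");
// U+0130 encoded as UTF-8 is the byte pair 0xC4 0xB0.
static_assert(djbSketch("\xc4\xb0") == 5866553u, "bytes >= 0x80");
static_assert(asciiFoldingDjbSketch("FOO") == djbSketch("foo"),
              "ASCII folding");
```

Without the cast to unsigned char, a host with signed char would add negative values for the bytes 0xC4 and 0xB0 and produce a different hash, which is the platform dependence the resubmitted patch removes.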
//===---------- llvm/unittests/Support/DJBTest.cpp ------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "llvm/Support/DJB.h"
#include "llvm/ADT/Twine.h"
#include "gtest/gtest.h"

using namespace llvm;

TEST(DJBTest, caseFolding) {
  struct TestCase {
    StringLiteral One;
    StringLiteral Two;
  };

  static constexpr TestCase Tests[] = {
      {{"ASDF"}, {"asdf"}},
      {{"qWeR"}, {"QwEr"}},
      {{"qqqqqqqqqqqqqqqqqqqq"}, {"QQQQQQQQQQQQQQQQQQQQ"}},

      {{"I"}, {"i"}},
      // Latin Capital Letter I With Dot Above
      {{u8"\u0130"}, {"i"}},
      // Latin Small Letter Dotless I
      {{u8"\u0131"}, {"i"}},

      // Latin Capital Letter A With Grave
      {{u8"\u00c0"}, {u8"\u00e0"}},
      // Latin Capital Letter A With Macron
      {{u8"\u0100"}, {u8"\u0101"}},
      // Latin Capital Letter L With Acute
      {{u8"\u0139"}, {u8"\u013a"}},
      // Cyrillic Capital Letter Ie
      {{u8"\u0415"}, {u8"\u0435"}},
      // Latin Capital Letter A With Circumflex And Grave
      {{u8"\u1ea6"}, {u8"\u1ea7"}},
      // Kelvin Sign
      {{u8"\u212a"}, {u8"\u006b"}},
      // Glagolitic Capital Letter Chrivi
      {{u8"\u2c1d"}, {u8"\u2c4d"}},
      // Fullwidth Latin Capital Letter M
      {{u8"\uff2d"}, {u8"\uff4d"}},
      // Old Hungarian Capital Letter Ej
      {{u8"\U00010c92"}, {u8"\U00010cd2"}},
  };

  for (const TestCase &T : Tests) {
    SCOPED_TRACE("Comparing '" + T.One + "' and '" + T.Two + "'");
    EXPECT_EQ(caseFoldingDjbHash(T.One), caseFoldingDjbHash(T.Two));
  }
}

TEST(DJBTest, knownValuesLowerCase) {
  struct TestCase {
    StringLiteral Text;
    uint32_t Hash;
  };

  static constexpr TestCase Tests[] = {
      {{""}, 5381u},
      {{"f"}, 177675u},
      {{"fo"}, 5863386u},
      {{"foo"}, 193491849u},
      {{"foob"}, 2090263819u},
      {{"fooba"}, 259229388u},
      {{"foobar"}, 4259602622u},
      {{"pneumonoultramicroscopicsilicovolcanoconiosis"}, 3999417781u},
  };

  for (const TestCase &T : Tests) {
    SCOPED_TRACE("Text: '" + T.Text + "'");
    EXPECT_EQ(T.Hash, djbHash(T.Text));
    EXPECT_EQ(T.Hash, caseFoldingDjbHash(T.Text));
    EXPECT_EQ(T.Hash, caseFoldingDjbHash(T.Text.upper()));
  }
}

TEST(DJBTest, knownValuesUnicode) {
  EXPECT_EQ(5866553u, djbHash(u8"\u0130"));
  EXPECT_EQ(177678u, caseFoldingDjbHash(u8"\u0130"));
  EXPECT_EQ(
      1302161417u,
      djbHash(
          u8"\u0130\u0131\u00c0\u00e0\u0100\u0101\u0139\u013a\u0415\u0435\u1ea6"
          u8"\u1ea7\u212a\u006b\u2c1d\u2c4d\uff2d\uff4d\U00010c92\U00010cd2"));
  EXPECT_EQ(
      1145571043u,
      caseFoldingDjbHash(
          u8"\u0130\u0131\u00c0\u00e0\u0100\u0101\u0139\u013a\u0415\u0435\u1ea6"
          u8"\u1ea7\u212a\u006b\u2c1d\u2c4d\uff2d\uff4d\U00010c92\U00010cd2"));
}