llvm-project/llvm/lib/Support/DJB.cpp

//===-- Support/DJB.cpp ---DJB Hash -----------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file contains support for the DJ Bernstein hash function.
//
//===----------------------------------------------------------------------===//

#include "llvm/Support/DJB.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/ConvertUTF.h"
#include "llvm/Support/Unicode.h"

using namespace llvm;

static UTF32 chopOneUTF32(StringRef &Buffer) {
  UTF32 C;
  const UTF8 *const Begin8Const =
      reinterpret_cast<const UTF8 *>(Buffer.begin());
  const UTF8 *Begin8 = Begin8Const;
  UTF32 *Begin32 = &C;

  // In lenient mode we will always end up with a "reasonable" value in C for
  // non-empty input.
  assert(!Buffer.empty());
  ConvertUTF8toUTF32(&Begin8, reinterpret_cast<const UTF8 *>(Buffer.end()),
                     &Begin32, &C + 1, lenientConversion);
  Buffer = Buffer.drop_front(Begin8 - Begin8Const);
  return C;
}

static StringRef toUTF8(UTF32 C, MutableArrayRef<UTF8> Storage) {
  const UTF32 *Begin32 = &C;
  UTF8 *Begin8 = Storage.begin();

  // The case-folded output should always be a valid unicode character, so use
  // strict mode here.
  ConversionResult CR = ConvertUTF32toUTF8(&Begin32, &C + 1, &Begin8,
                                           Storage.end(), strictConversion);
  assert(CR == conversionOK && "Case folding produced invalid char?");
  (void)CR;
  return StringRef(reinterpret_cast<char *>(Storage.begin()),
                   Begin8 - Storage.begin());
}

static UTF32 foldCharDwarf(UTF32 C) {
  // DWARF v5 addition to the unicode folding rules.
  // Fold "Latin Small Letter Dotless I" and "Latin Capital Letter I With Dot
  // Above" into "i".
  if (C == 0x130 || C == 0x131)
    return 'i';
  return sys::unicode::foldCharSimple(C);
}

static Optional<uint32_t> fastCaseFoldingDjbHash(StringRef Buffer, uint32_t H) {
  bool AllASCII = true;
  for (unsigned char C : Buffer) {
    H = H * 33 + ('A' <= C && C <= 'Z' ? C - 'A' + 'a' : C);
    AllASCII &= C <= 0x7f;
  }
  if (AllASCII)
    return H;
  return None;
}

uint32_t llvm::caseFoldingDjbHash(StringRef Buffer, uint32_t H) {
  if (Optional<uint32_t> Result = fastCaseFoldingDjbHash(Buffer, H))
    return *Result;

  std::array<UTF8, UNI_MAX_UTF8_BYTES_PER_CODE_POINT> Storage;
  while (!Buffer.empty()) {
    UTF32 C = foldCharDwarf(chopOneUTF32(Buffer));
    StringRef Folded = toUTF8(C, Storage);
    H = djbHash(Folded, H);
  }
  return H;
}
[Support] Move DJB hash to support. NFC This patch moves the DJB hash to support. This is consistent with other hashing algorithms living there. The hash is used by the DWARF accelerator tables. We're doing this now because the hashing function is needed by dsymutil and we don't want to link against libBinaryFormat. Differential revision: https://reviews.llvm.org/D42594 llvm-svn: 323616 2018-01-28 19:05:10 +08:00			`//===-- Support/DJB.cpp ---DJB Hash ------------------------------ C++ --===//`
			`//`
Update the file headers across all of the LLVM projects in the monorepo to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636 2019-01-19 16:50:56 +08:00			`// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.`
			`// See https://llvm.org/LICENSE.txt for license information.`
			`// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception`
[Support] Move DJB hash to support. NFC This patch moves the DJB hash to support. This is consistent with other hashing algorithms living there. The hash is used by the DWARF accelerator tables. We're doing this now because the hashing function is needed by dsymutil and we don't want to link against libBinaryFormat. Differential revision: https://reviews.llvm.org/D42594 llvm-svn: 323616 2018-01-28 19:05:10 +08:00			`//`
			`//===----------------------------------------------------------------------===//`
			`//`
			`// This file contains support for the DJ Bernstein hash function.`
			`//`
			`//===----------------------------------------------------------------------===//`

			`#include "llvm/Support/DJB.h"`
Resubmit r325107 (case folding DJB hash) The issue was that the has function was generating different results depending on the signedness of char on the host platform. This commit fixes the issue by explicitly using an unsigned char type to prevent sign extension and adds some extra tests. The original commit message was: This patch implements a variant of the DJB hash function which folds the input according to the algorithm in the Dwarf 5 specification (Section 6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18, "Case Mappings"). To achieve this, I have added a llvm::sys::unicode::foldCharSimple function, which performs this mapping. The implementation of this function was generated from the CaseMatching.txt file from the Unicode spec using a python script (which is also included in this patch). The script tries to optimize the function by coalescing adjecant mappings with the same shift and stride (terms I made up). Theoretically, it could be made a bit smarter and merge adjecant blocks that were interrupted by only one or two characters with exceptional mapping, but this would save only a couple of branches, while it would greatly complicate the implementation, so I deemed it was not worth it. Since we assume that the vast majority of the input characters will be US-ASCII, the folding hash function has a fast-path for handling these, and only whips out the full decode+fold+encode logic if we encounter a character outside of this range. It might be possible to implement the folding directly on utf8 sequences, but this would also bring a lot of complexity for the few cases where we will actually need to process non-ascii characters. Reviewers: JDevlieghere, aprantl, probinson, dblaikie Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits Differential Revision: https://reviews.llvm.org/D42740 llvm-svn: 325732 2018-02-22 06:36:31 +08:00			`#include "llvm/ADT/ArrayRef.h"`
			`#include "llvm/Support/Compiler.h"`
			`#include "llvm/Support/ConvertUTF.h"`
			`#include "llvm/Support/Unicode.h"`

			`using namespace llvm;`

			`static UTF32 chopOneUTF32(StringRef &Buffer) {`
			`UTF32 C;`
			`const UTF8 *const Begin8Const =`
			`reinterpret_cast<const UTF8 *>(Buffer.begin());`
			`const UTF8 *Begin8 = Begin8Const;`
			`UTF32 *Begin32 = &C;`

			`// In lenient mode we will always end up with a "reasonable" value in C for`
			`// non-empty input.`
			`assert(!Buffer.empty());`
			`ConvertUTF8toUTF32(&Begin8, reinterpret_cast<const UTF8 *>(Buffer.end()),`
			`&Begin32, &C + 1, lenientConversion);`
			`Buffer = Buffer.drop_front(Begin8 - Begin8Const);`
			`return C;`
			`}`

			`static StringRef toUTF8(UTF32 C, MutableArrayRef<UTF8> Storage) {`
			`const UTF32 *Begin32 = &C;`
			`UTF8 *Begin8 = Storage.begin();`

			`// The case-folded output should always be a valid unicode character, so use`
			`// strict mode here.`
			`ConversionResult CR = ConvertUTF32toUTF8(&Begin32, &C + 1, &Begin8,`
			`Storage.end(), strictConversion);`
			`assert(CR == conversionOK && "Case folding produced invalid char?");`
			`(void)CR;`
			`return StringRef(reinterpret_cast<char *>(Storage.begin()),`
			`Begin8 - Storage.begin());`
			`}`

			`static UTF32 foldCharDwarf(UTF32 C) {`
			`// DWARF v5 addition to the unicode folding rules.`
			`// Fold "Latin Small Letter Dotless I" and "Latin Capital Letter I With Dot`
			`// Above" into "i".`
			`if (C == 0x130 \|\| C == 0x131)`
			`return 'i';`
			`return sys::unicode::foldCharSimple(C);`
			`}`

caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`static Optional<uint32_t> fastCaseFoldingDjbHash(StringRef Buffer, uint32_t H) {`
[DJB] Fix variable case after D61178 llvm-svn: 359381 2019-04-27 23:33:22 +08:00			`bool AllASCII = true;`
caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`for (unsigned char C : Buffer) {`
			`H = H * 33 + ('A' <= C && C <= 'Z' ? C - 'A' + 'a' : C);`
[DJB] Fix variable case after D61178 llvm-svn: 359381 2019-04-27 23:33:22 +08:00			`AllASCII &= C <= 0x7f;`
caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`}`
[DJB] Fix variable case after D61178 llvm-svn: 359381 2019-04-27 23:33:22 +08:00			`if (AllASCII)`
caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`return H;`
			`return None;`
Resubmit r325107 (case folding DJB hash) The issue was that the has function was generating different results depending on the signedness of char on the host platform. This commit fixes the issue by explicitly using an unsigned char type to prevent sign extension and adds some extra tests. The original commit message was: This patch implements a variant of the DJB hash function which folds the input according to the algorithm in the Dwarf 5 specification (Section 6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18, "Case Mappings"). To achieve this, I have added a llvm::sys::unicode::foldCharSimple function, which performs this mapping. The implementation of this function was generated from the CaseMatching.txt file from the Unicode spec using a python script (which is also included in this patch). The script tries to optimize the function by coalescing adjecant mappings with the same shift and stride (terms I made up). Theoretically, it could be made a bit smarter and merge adjecant blocks that were interrupted by only one or two characters with exceptional mapping, but this would save only a couple of branches, while it would greatly complicate the implementation, so I deemed it was not worth it. Since we assume that the vast majority of the input characters will be US-ASCII, the folding hash function has a fast-path for handling these, and only whips out the full decode+fold+encode logic if we encounter a character outside of this range. It might be possible to implement the folding directly on utf8 sequences, but this would also bring a lot of complexity for the few cases where we will actually need to process non-ascii characters. Reviewers: JDevlieghere, aprantl, probinson, dblaikie Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits Differential Revision: https://reviews.llvm.org/D42740 llvm-svn: 325732 2018-02-22 06:36:31 +08:00			`}`

			`uint32_t llvm::caseFoldingDjbHash(StringRef Buffer, uint32_t H) {`
caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`if (Optional<uint32_t> Result = fastCaseFoldingDjbHash(Buffer, H))`
			`return *Result;`

			`std::array<UTF8, UNI_MAX_UTF8_BYTES_PER_CODE_POINT> Storage;`
Resubmit r325107 (case folding DJB hash) The issue was that the has function was generating different results depending on the signedness of char on the host platform. This commit fixes the issue by explicitly using an unsigned char type to prevent sign extension and adds some extra tests. The original commit message was: This patch implements a variant of the DJB hash function which folds the input according to the algorithm in the Dwarf 5 specification (Section 6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18, "Case Mappings"). To achieve this, I have added a llvm::sys::unicode::foldCharSimple function, which performs this mapping. The implementation of this function was generated from the CaseMatching.txt file from the Unicode spec using a python script (which is also included in this patch). The script tries to optimize the function by coalescing adjecant mappings with the same shift and stride (terms I made up). Theoretically, it could be made a bit smarter and merge adjecant blocks that were interrupted by only one or two characters with exceptional mapping, but this would save only a couple of branches, while it would greatly complicate the implementation, so I deemed it was not worth it. Since we assume that the vast majority of the input characters will be US-ASCII, the folding hash function has a fast-path for handling these, and only whips out the full decode+fold+encode logic if we encounter a character outside of this range. It might be possible to implement the folding directly on utf8 sequences, but this would also bring a lot of complexity for the few cases where we will actually need to process non-ascii characters. Reviewers: JDevlieghere, aprantl, probinson, dblaikie Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits Differential Revision: https://reviews.llvm.org/D42740 llvm-svn: 325732 2018-02-22 06:36:31 +08:00			`while (!Buffer.empty()) {`
caseFoldingDjbHash: simplify and make the US-ASCII fast path faster The slow path (with at least one non US-ASCII) will be slower but that doesn't matter. Differential Revision: https://reviews.llvm.org/D61178 llvm-svn: 359294 2019-04-26 18:56:10 +08:00			`UTF32 C = foldCharDwarf(chopOneUTF32(Buffer));`
			`StringRef Folded = toUTF8(C, Storage);`
			`H = djbHash(Folded, H);`
Resubmit r325107 (case folding DJB hash) The issue was that the has function was generating different results depending on the signedness of char on the host platform. This commit fixes the issue by explicitly using an unsigned char type to prevent sign extension and adds some extra tests. The original commit message was: This patch implements a variant of the DJB hash function which folds the input according to the algorithm in the Dwarf 5 specification (Section 6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18, "Case Mappings"). To achieve this, I have added a llvm::sys::unicode::foldCharSimple function, which performs this mapping. The implementation of this function was generated from the CaseMatching.txt file from the Unicode spec using a python script (which is also included in this patch). The script tries to optimize the function by coalescing adjecant mappings with the same shift and stride (terms I made up). Theoretically, it could be made a bit smarter and merge adjecant blocks that were interrupted by only one or two characters with exceptional mapping, but this would save only a couple of branches, while it would greatly complicate the implementation, so I deemed it was not worth it. Since we assume that the vast majority of the input characters will be US-ASCII, the folding hash function has a fast-path for handling these, and only whips out the full decode+fold+encode logic if we encounter a character outside of this range. It might be possible to implement the folding directly on utf8 sequences, but this would also bring a lot of complexity for the few cases where we will actually need to process non-ascii characters. Reviewers: JDevlieghere, aprantl, probinson, dblaikie Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits Differential Revision: https://reviews.llvm.org/D42740 llvm-svn: 325732 2018-02-22 06:36:31 +08:00			`}`
[Support] Move DJB hash to support. NFC This patch moves the DJB hash to support. This is consistent with other hashing algorithms living there. The hash is used by the DWARF accelerator tables. We're doing this now because the hashing function is needed by dsymutil and we don't want to link against libBinaryFormat. Differential revision: https://reviews.llvm.org/D42594 llvm-svn: 323616 2018-01-28 19:05:10 +08:00			`return H;`
			`}`