llvm-project/libcxx/include/__format/buffer.h

// -*- C++ -*-
//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#ifndef _LIBCPP___FORMAT_BUFFER_H
#define _LIBCPP___FORMAT_BUFFER_H
#include <__algorithm/copy_n.h>
[libc++][format] Improve format buffer. Allow bulk output operations on the buffer instead of adding one code unit at a time. This has a huge performance benefit at the cost of larger binary. This doesn't implement @vitaut's earlier suggestion to avoid buffering for std::string when writing a strings. That can be done in a follow-up patch. There are some minor complications for the non-buffered format_to_n. When writing one character at a time it's easy to detect when reaching the limit n. This is solved by adding a small overhead for format_to_n. When the next write would overflow it stores the data in the internal buffer and copies that up-to n code units. The overhead isn't measured, but it's expected to only be an issue for small values of n; for larger values the general improvements will outweight the new overhead. ``` text data bss dec hex filename 349081 6096 440 355617 56d21 format.libcxx.out-baseline 344442 6088 440 350970 55afa formatted_size.libcxx.out-baseline 4567980 57272 424 4625676 46950c formatter_float.libcxx.out-baseline 718800 12472 488 731760 b2a70 formatter_int.libcxx.out-baseline 376341 6096 552 382989 5d80d format_to.libcxx.out-beaseline 370169 6096 440 376705 5bf81 format.libcxx.out 365530 6088 440 372058 5ad5a formatted_size.libcxx.out 4575116 57272 424 4632812 46b0ec formatter_float.libcxx.out 725936 12472 488 738896 b4650 formatter_int.libcxx.out 397429 6096 552 404077 62a6d format_to.libcxx.out ``` For very small strings the new method is slower, from 4 characters there's already a small gain. 
``` Comparing ./format.libcxx.out-baseline to ./format.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New -------------------------------------------------------------------------------------------------------------------------------- BM_format_string<char>/1 +0.0268 +0.0268 43 44 43 44 BM_format_string<char>/2 +0.0133 +0.0133 22 22 22 22 BM_format_string<char>/4 -0.0248 -0.0248 12 11 12 11 BM_format_string<char>/8 -0.0831 -0.0831 6 6 6 6 BM_format_string<char>/16 -0.2976 -0.2976 4 3 4 3 BM_format_string<char>/32 -0.4369 -0.4369 3 2 3 2 BM_format_string<char>/64 -0.6375 -0.6375 3 1 3 1 BM_format_string<char>/128 -0.7685 -0.7685 2 1 2 1 ``` The int benchmark has benefits for the simple formatting, but shines for the complex formatting: ``` Comparing ./formatter_int.libcxx.out-baseline to ./formatter_int.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------------------- BM_Basic<uint32_t> -0.2307 -0.2307 60 46 60 46 BM_Basic<int32_t> -0.1985 -0.1985 61 49 61 49 BM_Basic<uint64_t> -0.3478 -0.3479 81 53 81 53 BM_Basic<int64_t> -0.3475 -0.3475 81 53 81 53 BM_BasicLow<__uint128_t> -0.3388 -0.3388 86 57 86 57 BM_BasicLow<__int128_t> -0.3431 -0.3431 86 57 86 57 BM_Basic<__uint128_t> -0.2822 -0.2822 236 170 236 170 BM_Basic<__int128_t> -0.3107 -0.3107 219 151 219 151 Integral_LocFalse_BaseBin_AlignNone_Int64 -0.5781 -0.5781 178 75 178 75 Integral_LocFalse_BaseBin_AlignmentLeft_Int64 -0.9231 -0.9231 1156 89 1156 89 Integral_LocFalse_BaseBin_AlignmentCenter_Int64 -0.9179 -0.9179 1107 91 1107 91 Integral_LocFalse_BaseBin_AlignmentRight_Int64 -0.9238 -0.9238 1147 87 1147 87 Integral_LocFalse_BaseBin_ZeroPadding_Int64 -0.9170 -0.9170 1137 94 1137 94 Integral_LocFalse_BaseBin_AlignNone_Uint64 -0.5923 -0.5923 175 71 175 71 Integral_LocFalse_BaseBin_AlignmentLeft_Uint64 -0.9251 -0.9251 1154 86 1154 86 
Integral_LocFalse_BaseBin_AlignmentCenter_Uint64 -0.9204 -0.9204 1105 88 1105 88 Integral_LocFalse_BaseBin_AlignmentRight_Uint64 -0.9242 -0.9242 1125 85 1125 85 Integral_LocFalse_BaseBin_ZeroPadding_Uint64 -0.9232 -0.9232 1139 88 1139 88 Integral_LocFalse_BaseOct_AlignNone_Int64 -0.3241 -0.3241 100 67 100 67 Integral_LocFalse_BaseOct_AlignmentLeft_Int64 -0.9322 -0.9322 1166 79 1166 79 Integral_LocFalse_BaseOct_AlignmentCenter_Int64 -0.9251 -0.9251 1108 83 1108 83 Integral_LocFalse_BaseOct_AlignmentRight_Int64 -0.9303 -0.9303 1136 79 1136 79 Integral_LocFalse_BaseOct_ZeroPadding_Int64 -0.9264 -0.9264 1156 85 1156 85 Integral_LocFalse_BaseOct_AlignNone_Uint64 -0.3116 -0.3116 96 66 96 66 Integral_LocFalse_BaseOct_AlignmentLeft_Uint64 -0.9310 -0.9310 1168 81 1168 81 Integral_LocFalse_BaseOct_AlignmentCenter_Uint64 -0.9281 -0.9281 1128 81 1128 81 Integral_LocFalse_BaseOct_AlignmentRight_Uint64 -0.9299 -0.9299 1148 80 1148 80 Integral_LocFalse_BaseOct_ZeroPadding_Uint64 -0.9288 -0.9288 1153 82 1153 82 Integral_LocFalse_BaseDec_AlignNone_Int64 -0.3342 -0.3342 95 63 95 63 Integral_LocFalse_BaseDec_AlignmentLeft_Int64 -0.9360 -0.9360 1157 74 1157 74 Integral_LocFalse_BaseDec_AlignmentCenter_Int64 -0.9303 -0.9303 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Int64 -0.9369 -0.9369 1164 73 1164 73 Integral_LocFalse_BaseDec_ZeroPadding_Int64 -0.9323 -0.9323 1157 78 1157 78 Integral_LocFalse_BaseDec_AlignNone_Uint64 -0.3198 -0.3198 93 63 93 63 Integral_LocFalse_BaseDec_AlignmentLeft_Uint64 -0.9351 -0.9351 1158 75 1158 75 Integral_LocFalse_BaseDec_AlignmentCenter_Uint64 -0.9298 -0.9298 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseDec_ZeroPadding_Uint64 -0.9333 -0.9333 1151 77 1151 77 Integral_LocFalse_BaseHex_AlignNone_Int64 -0.3020 -0.3020 89 62 89 62 Integral_LocFalse_BaseHex_AlignmentLeft_Int64 -0.9357 -0.9357 1174 75 1174 75 Integral_LocFalse_BaseHex_AlignmentCenter_Int64 -0.9319 -0.9319 1129 
77 1129 77 Integral_LocFalse_BaseHex_AlignmentRight_Int64 -0.9350 -0.9350 1161 75 1161 75 Integral_LocFalse_BaseHex_ZeroPadding_Int64 -0.9293 -0.9293 1150 81 1150 81 Integral_LocFalse_BaseHex_AlignNone_Uint64 -0.3056 -0.3057 86 59 86 59 Integral_LocFalse_BaseHex_AlignmentLeft_Uint64 -0.9378 -0.9378 1174 73 1174 73 Integral_LocFalse_BaseHex_AlignmentCenter_Uint64 -0.9341 -0.9341 1129 74 1130 74 Integral_LocFalse_BaseHex_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseHex_ZeroPadding_Uint64 -0.9315 -0.9315 1147 79 1147 79 Integral_LocFalse_BaseHexUpper_AlignNone_Int64 -0.0019 -0.0019 91 90 91 90 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Int64 -0.9099 -0.9099 1162 105 1162 105 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Int64 -0.9041 -0.9041 1121 108 1121 108 Integral_LocFalse_BaseHexUpper_AlignmentRight_Int64 -0.9086 -0.9086 1162 106 1162 106 Integral_LocFalse_BaseHexUpper_ZeroPadding_Int64 -0.9057 -0.9057 1164 110 1164 110 Integral_LocFalse_BaseHexUpper_AlignNone_Uint64 +0.0110 +0.0110 86 87 86 87 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Uint64 -0.9136 -0.9136 1161 100 1161 100 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Uint64 -0.9078 -0.9078 1133 104 1133 104 Integral_LocFalse_BaseHexUpper_AlignmentRight_Uint64 -0.9132 -0.9132 1177 102 1177 102 Integral_LocFalse_BaseHexUpper_ZeroPadding_Uint64 -0.9091 -0.9091 1160 105 1160 105 ``` Other benchmarks give similar results. Reviewed By: #libc, ldionne Differential Revision: https://reviews.llvm.org/D129964
2022-07-16 23:03:27 +08:00
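The `format_to_n` limit described in the commit message can be illustrated with a minimal sketch. It only shows the general idea of capping bulk writes at n code units while still counting the full output length, which is what `std::format_to_n` reports in its result. The `limited_writer` type and its members are hypothetical names for illustration and are not part of libc++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical illustration type, not the libc++ implementation.
struct limited_writer {
  char* out;            // destination with room for at least max_size code units
  std::size_t max_size; // the limit "n"
  std::size_t size = 0; // total code units requested so far, may exceed max_size

  // Bulk write: copy only what still fits, but always count the full chunk.
  void write(std::string_view chunk) {
    assert(out != nullptr);
    if (size < max_size) {
      std::size_t n = std::min(chunk.size(), max_size - size);
      std::copy_n(chunk.data(), n, out + size);
    }
    size += chunk.size();
  }

  // Number of code units actually stored in out.
  std::size_t stored() const { return std::min(size, max_size); }
};
```

With a single-character interface the limit check is one comparison per character. With bulk writes the chunk that crosses the limit has to be split, which is the extra work the commit message refers to.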
#include <__algorithm/fill_n.h>
#include <__algorithm/max.h>
#include <__algorithm/min.h>
#include <__algorithm/transform.h>
#include <__algorithm/unwrap_iter.h>
#include <__config>
#include <__format/concepts.h>
#include <__format/enable_insertable.h>
#include <__format/format_to_n_result.h>
#include <__iterator/back_insert_iterator.h>
#include <__iterator/concepts.h>
#include <__iterator/incrementable_traits.h>
#include <__iterator/iterator_traits.h>
#include <__iterator/wrap_iter.h>
#include <__utility/move.h>
#include <concepts>
#include <cstddef>
#include <string_view>
#include <type_traits>
#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
# pragma GCC system_header
#endif
_LIBCPP_PUSH_MACROS
#include <__undef_macros>
_LIBCPP_BEGIN_NAMESPACE_STD
#if _LIBCPP_STD_VER > 17
namespace __format {
/// A "buffer" that handles writing to the proper iterator.
///
/// This helper is used together with the @ref back_insert_iterator to offer
/// type-erasure for the formatting functions. This reduces the number of
/// template instantiations.
template <__fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __output_buffer {
public:
  using value_type = _CharT;

  template <class _Tp>
  _LIBCPP_HIDE_FROM_ABI explicit __output_buffer(_CharT* __ptr, size_t __capacity, _Tp* __obj)
      : __ptr_(__ptr),
        __capacity_(__capacity),
        __flush_([](_CharT* __p, size_t __n, void* __o) { static_cast<_Tp*>(__o)->__flush(__p, __n); }),
        __obj_(__obj) {}

  _LIBCPP_HIDE_FROM_ABI void __reset(_CharT* __ptr, size_t __capacity) {
    __ptr_      = __ptr;
    __capacity_ = __capacity;
  }

  _LIBCPP_HIDE_FROM_ABI auto __make_output_iterator() { return std::back_insert_iterator{*this}; }

  // Used in std::back_insert_iterator.
  _LIBCPP_HIDE_FROM_ABI void push_back(_CharT __c) {
    __ptr_[__size_++] = __c;

    // Profiling showed flushing after adding is more efficient than flushing
    // when entering the function.
    if (__size_ == __capacity_)
      __flush();
  }
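The type-erasure and the flush-after-add behaviour of `push_back` can be tried in isolation with a simplified sketch. It is a stand-alone approximation restricted to `char`, not the libc++ class: `output_buffer`, `string_sink`, and `write_through_buffer` are hypothetical names, and the reset of the size after a flush is an assumption about what the part of the class not shown here does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified stand-in for __output_buffer (char only).
class output_buffer {
public:
  template <class T>
  output_buffer(char* ptr, std::size_t capacity, T* obj)
      : ptr_(ptr),
        capacity_(capacity),
        // A captureless lambda converts to a plain function pointer, so the
        // concrete sink type T is only visible inside this constructor.
        flush_([](char* p, std::size_t n, void* o) { static_cast<T*>(o)->flush(p, n); }),
        obj_(obj) {
    assert(capacity_ > 0 && "push_back needs room for at least one code unit");
  }

  void push_back(char c) {
    ptr_[size_++] = c;
    // Flush after adding, so there is always room on entry.
    if (size_ == capacity_)
      flush();
  }

  // Assumption: flushing hands the stored code units to the sink and
  // restarts at the beginning of the storage.
  void flush() {
    flush_(ptr_, size_, obj_);
    size_ = 0;
  }

private:
  char* ptr_;
  std::size_t capacity_;
  std::size_t size_ = 0;
  void (*flush_)(char*, std::size_t, void*);
  void* obj_;
};

// Hypothetical sink that appends flushed data to a std::string.
struct string_sink {
  std::string out;
  int flushes = 0;
  void flush(char* p, std::size_t n) {
    out.append(p, n);
    ++flushes;
  }
};

// Writes text one code unit at a time and reports how often the sink was flushed.
inline std::string write_through_buffer(const std::string& text, std::size_t capacity, int& flushes) {
  std::string storage(capacity, '\0');
  string_sink sink;
  output_buffer buf(storage.data(), capacity, &sink);
  for (char c : text)
    buf.push_back(c);
  buf.flush(); // hand over whatever is left
  flushes = sink.flushes;
  return sink.out;
}
```

Because the buffer itself is not a template over the sink type, formatting code written against it is instantiated once per character type rather than once per output iterator.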
  /// Copies the input __str to the buffer.
  ///
  /// Since some of the input is generated by std::to_chars, there needs to be a
  /// conversion when _CharT is wchar_t.
  template <__fmt_char_type _InCharT>
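The conversion mentioned in the comment above can be sketched as follows. `std::to_chars` only produces `char` output, so a wide-character destination needs each code unit converted while copying. This is a minimal sketch and not the libc++ member function: `append_converted` and `to_wide_digits` are hypothetical names, the destination is a `std::basic_string` instead of the internal buffer, and it assumes the `char` input is the plain character output of `std::to_chars`, for which a per-code-unit `static_cast` is sufficient.

```cpp
#include <algorithm>
#include <cassert>
#include <charconv>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>

// Hypothetical helper: append str to out, converting the code unit type if needed.
template <class OutCharT, class InCharT>
void append_converted(std::basic_string<OutCharT>& out, std::basic_string_view<InCharT> str) {
  if constexpr (std::is_same_v<OutCharT, InCharT>) {
    // Same code unit type: a plain bulk copy.
    out.append(str);
  } else {
    // Different code unit type: convert each code unit while copying.
    out.reserve(out.size() + str.size());
    std::transform(str.begin(), str.end(), std::back_inserter(out), [](InCharT c) {
      return static_cast<OutCharT>(c);
    });
  }
}

// Hypothetical usage: format an int with std::to_chars and store it as wchar_t.
inline std::wstring to_wide_digits(int value) {
  char digits[16]; // enough for a 32-bit int including the sign
  auto result = std::to_chars(digits, digits + sizeof(digits), value);
  assert(result.ec == std::errc{});
  std::wstring out;
  append_converted(out, std::string_view(digits, static_cast<std::size_t>(result.ptr - digits)));
  return out;
}
```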
_LIBCPP_HIDE_FROM_ABI void __copy(basic_string_view<_InCharT> __str) {
// When the underlying iterator is a simple iterator the __capacity_ is
// infinite. For a string or container back_inserter it isn't. This means
// adding a large string to the buffer can cause some overhead. In that
// case a better approach could be:
// - flush the buffer
// - container.append(__str.begin(), __str.end());
// The same holds true for the fill.
// For transform it might be slightly harder, however the use case for
// transform is slightly less common; it converts hexadecimal values to
// upper case. For integral values these strings are short.
// TODO FMT Look at the improvements above.
size_t __n = __str.size();
__flush_on_overflow(__n);
if (__n <= __capacity_) {
_VSTD::copy_n(__str.data(), __n, _VSTD::addressof(__ptr_[__size_]));
__size_ += __n;
return;
}
// The output doesn't fit in the internal buffer.
// Copy the data in "__capacity_" sized chunks.
_LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
const _InCharT* __first = __str.data();
do {
size_t __chunk = _VSTD::min(__n, __capacity_);
_VSTD::copy_n(__first, __chunk, _VSTD::addressof(__ptr_[__size_]));
__size_ = __chunk;
__first += __chunk;
__n -= __chunk;
__flush();
} while (__n);
}
/// A std::transform wrapper.
///
/// Like @ref __copy it may need to do type conversion.
template <__fmt_char_type _InCharT, class _UnaryOperation>
_LIBCPP_HIDE_FROM_ABI void __transform(const _InCharT* __first, const _InCharT* __last, _UnaryOperation __operation) {
_LIBCPP_ASSERT(__first <= __last, "not a valid range");
size_t __n = static_cast<size_t>(__last - __first);
__flush_on_overflow(__n);
if (__n <= __capacity_) {
_VSTD::transform(__first, __last, _VSTD::addressof(__ptr_[__size_]), _VSTD::move(__operation));
__size_ += __n;
return;
}
// The output doesn't fit in the internal buffer.
// Transform the data in "__capacity_" sized chunks.
_LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
do {
size_t __chunk = _VSTD::min(__n, __capacity_);
_VSTD::transform(__first, __first + __chunk, _VSTD::addressof(__ptr_[__size_]), __operation);
__size_ = __chunk;
__first += __chunk;
__n -= __chunk;
__flush();
} while (__n);
}
/// A \c fill_n wrapper.
_LIBCPP_HIDE_FROM_ABI void __fill(size_t __n, _CharT __value) {
__flush_on_overflow(__n);
if (__n <= __capacity_) {
_VSTD::fill_n(_VSTD::addressof(__ptr_[__size_]), __n, __value);
__size_ += __n;
return;
}
// The output doesn't fit in the internal buffer.
// Fill the buffer in "__capacity_" sized chunks.
_LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
do {
size_t __chunk = _VSTD::min(__n, __capacity_);
_VSTD::fill_n(_VSTD::addressof(__ptr_[__size_]), __chunk, __value);
__size_ = __chunk;
__n -= __chunk;
__flush();
``` Comparing ./format.libcxx.out-baseline to ./format.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New -------------------------------------------------------------------------------------------------------------------------------- BM_format_string<char>/1 +0.0268 +0.0268 43 44 43 44 BM_format_string<char>/2 +0.0133 +0.0133 22 22 22 22 BM_format_string<char>/4 -0.0248 -0.0248 12 11 12 11 BM_format_string<char>/8 -0.0831 -0.0831 6 6 6 6 BM_format_string<char>/16 -0.2976 -0.2976 4 3 4 3 BM_format_string<char>/32 -0.4369 -0.4369 3 2 3 2 BM_format_string<char>/64 -0.6375 -0.6375 3 1 3 1 BM_format_string<char>/128 -0.7685 -0.7685 2 1 2 1 ``` The int benchmark has benefits for the simple formatting, but shines for the complex formatting: ``` Comparing ./formatter_int.libcxx.out-baseline to ./formatter_int.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------------------- BM_Basic<uint32_t> -0.2307 -0.2307 60 46 60 46 BM_Basic<int32_t> -0.1985 -0.1985 61 49 61 49 BM_Basic<uint64_t> -0.3478 -0.3479 81 53 81 53 BM_Basic<int64_t> -0.3475 -0.3475 81 53 81 53 BM_BasicLow<__uint128_t> -0.3388 -0.3388 86 57 86 57 BM_BasicLow<__int128_t> -0.3431 -0.3431 86 57 86 57 BM_Basic<__uint128_t> -0.2822 -0.2822 236 170 236 170 BM_Basic<__int128_t> -0.3107 -0.3107 219 151 219 151 Integral_LocFalse_BaseBin_AlignNone_Int64 -0.5781 -0.5781 178 75 178 75 Integral_LocFalse_BaseBin_AlignmentLeft_Int64 -0.9231 -0.9231 1156 89 1156 89 Integral_LocFalse_BaseBin_AlignmentCenter_Int64 -0.9179 -0.9179 1107 91 1107 91 Integral_LocFalse_BaseBin_AlignmentRight_Int64 -0.9238 -0.9238 1147 87 1147 87 Integral_LocFalse_BaseBin_ZeroPadding_Int64 -0.9170 -0.9170 1137 94 1137 94 Integral_LocFalse_BaseBin_AlignNone_Uint64 -0.5923 -0.5923 175 71 175 71 Integral_LocFalse_BaseBin_AlignmentLeft_Uint64 -0.9251 -0.9251 1154 86 1154 86 
Integral_LocFalse_BaseBin_AlignmentCenter_Uint64 -0.9204 -0.9204 1105 88 1105 88 Integral_LocFalse_BaseBin_AlignmentRight_Uint64 -0.9242 -0.9242 1125 85 1125 85 Integral_LocFalse_BaseBin_ZeroPadding_Uint64 -0.9232 -0.9232 1139 88 1139 88 Integral_LocFalse_BaseOct_AlignNone_Int64 -0.3241 -0.3241 100 67 100 67 Integral_LocFalse_BaseOct_AlignmentLeft_Int64 -0.9322 -0.9322 1166 79 1166 79 Integral_LocFalse_BaseOct_AlignmentCenter_Int64 -0.9251 -0.9251 1108 83 1108 83 Integral_LocFalse_BaseOct_AlignmentRight_Int64 -0.9303 -0.9303 1136 79 1136 79 Integral_LocFalse_BaseOct_ZeroPadding_Int64 -0.9264 -0.9264 1156 85 1156 85 Integral_LocFalse_BaseOct_AlignNone_Uint64 -0.3116 -0.3116 96 66 96 66 Integral_LocFalse_BaseOct_AlignmentLeft_Uint64 -0.9310 -0.9310 1168 81 1168 81 Integral_LocFalse_BaseOct_AlignmentCenter_Uint64 -0.9281 -0.9281 1128 81 1128 81 Integral_LocFalse_BaseOct_AlignmentRight_Uint64 -0.9299 -0.9299 1148 80 1148 80 Integral_LocFalse_BaseOct_ZeroPadding_Uint64 -0.9288 -0.9288 1153 82 1153 82 Integral_LocFalse_BaseDec_AlignNone_Int64 -0.3342 -0.3342 95 63 95 63 Integral_LocFalse_BaseDec_AlignmentLeft_Int64 -0.9360 -0.9360 1157 74 1157 74 Integral_LocFalse_BaseDec_AlignmentCenter_Int64 -0.9303 -0.9303 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Int64 -0.9369 -0.9369 1164 73 1164 73 Integral_LocFalse_BaseDec_ZeroPadding_Int64 -0.9323 -0.9323 1157 78 1157 78 Integral_LocFalse_BaseDec_AlignNone_Uint64 -0.3198 -0.3198 93 63 93 63 Integral_LocFalse_BaseDec_AlignmentLeft_Uint64 -0.9351 -0.9351 1158 75 1158 75 Integral_LocFalse_BaseDec_AlignmentCenter_Uint64 -0.9298 -0.9298 1128 79 1128 79 Integral_LocFalse_BaseDec_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseDec_ZeroPadding_Uint64 -0.9333 -0.9333 1151 77 1151 77 Integral_LocFalse_BaseHex_AlignNone_Int64 -0.3020 -0.3020 89 62 89 62 Integral_LocFalse_BaseHex_AlignmentLeft_Int64 -0.9357 -0.9357 1174 75 1174 75 Integral_LocFalse_BaseHex_AlignmentCenter_Int64 -0.9319 -0.9319 1129 
77 1129 77 Integral_LocFalse_BaseHex_AlignmentRight_Int64 -0.9350 -0.9350 1161 75 1161 75 Integral_LocFalse_BaseHex_ZeroPadding_Int64 -0.9293 -0.9293 1150 81 1150 81 Integral_LocFalse_BaseHex_AlignNone_Uint64 -0.3056 -0.3057 86 59 86 59 Integral_LocFalse_BaseHex_AlignmentLeft_Uint64 -0.9378 -0.9378 1174 73 1174 73 Integral_LocFalse_BaseHex_AlignmentCenter_Uint64 -0.9341 -0.9341 1129 74 1130 74 Integral_LocFalse_BaseHex_AlignmentRight_Uint64 -0.9361 -0.9361 1157 74 1157 74 Integral_LocFalse_BaseHex_ZeroPadding_Uint64 -0.9315 -0.9315 1147 79 1147 79 Integral_LocFalse_BaseHexUpper_AlignNone_Int64 -0.0019 -0.0019 91 90 91 90 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Int64 -0.9099 -0.9099 1162 105 1162 105 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Int64 -0.9041 -0.9041 1121 108 1121 108 Integral_LocFalse_BaseHexUpper_AlignmentRight_Int64 -0.9086 -0.9086 1162 106 1162 106 Integral_LocFalse_BaseHexUpper_ZeroPadding_Int64 -0.9057 -0.9057 1164 110 1164 110 Integral_LocFalse_BaseHexUpper_AlignNone_Uint64 +0.0110 +0.0110 86 87 86 87 Integral_LocFalse_BaseHexUpper_AlignmentLeft_Uint64 -0.9136 -0.9136 1161 100 1161 100 Integral_LocFalse_BaseHexUpper_AlignmentCenter_Uint64 -0.9078 -0.9078 1133 104 1133 104 Integral_LocFalse_BaseHexUpper_AlignmentRight_Uint64 -0.9132 -0.9132 1177 102 1177 102 Integral_LocFalse_BaseHexUpper_ZeroPadding_Uint64 -0.9091 -0.9091 1160 105 1160 105 ``` Other benchmarks give similar results. Reviewed By: #libc, ldionne Differential Revision: https://reviews.llvm.org/D129964
2022-07-16 23:03:27 +08:00
    } while (__n);
  }

  _LIBCPP_HIDE_FROM_ABI void __flush() {
    __flush_(__ptr_, __size_, __obj_);
    __size_ = 0;
  }

private:
  _CharT* __ptr_;
  size_t __capacity_;
  size_t __size_{0};
  void (*__flush_)(_CharT*, size_t, void*);
  void* __obj_;
  /// Flushes the buffer when the output operation would overflow the buffer.
  ///
  /// A simple approach for the overflow detection would be something along
  /// these lines:
  /// \code
  /// // The internal buffer is large enough.
  /// if (__n <= __capacity_) {
  ///   // Flush when we really would overflow.
  ///   if (__size_ + __n >= __capacity_)
  ///     __flush();
  ///   ...
/// }
/// \endcode
///
/// This approach works for all cases but one:
/// A __format_to_n_buffer_base where \ref __enable_direct_output is true.
/// In that case the \ref __capacity_ of the buffer changes during the first
/// \ref __flush. During that operation the output buffer switches from its
  /// __writer_ to its __storage_. The \ref __capacity_ of the former depends
  /// on the value of n, while that of the latter is a fixed size. For example:
  /// - a format_to_n call with a 10'000 char buffer,
  /// - the buffer is filled with 9'500 chars,
  /// - adding 1'000 elements would overflow the buffer so the buffer gets
  ///   changed and the \ref __capacity_ decreases from 10'000 to
  ///   __buffer_size (256 at the time of writing).
///
/// This means that the \ref __flush for this class may need to copy a part of
/// the internal buffer to the proper output. In this example there will be
/// 500 characters that need this copy operation.
///
/// Note it would be more efficient to write 500 chars directly and then swap
/// the buffers. This would make the code more complex and \ref format_to_n is
/// not the most common use case. Therefore the optimization isn't done.
_LIBCPP_HIDE_FROM_ABI void __flush_on_overflow(size_t __n) {
if (__size_ + __n >= __capacity_)
__flush();
}
};
/// A storage using an internal buffer.
///
/// This storage is used when writing a single element to the output iterator
/// is expensive.
template <__fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __internal_storage {
public:
_LIBCPP_HIDE_FROM_ABI _CharT* __begin() { return __buffer_; }
static constexpr size_t __buffer_size = 256 / sizeof(_CharT);
private:
_CharT __buffer_[__buffer_size];
};
/// A storage that writes directly to the underlying output.
///
/// This requires the storage to be a contiguous buffer of \a _CharT.
/// Since the output is directly written to the underlying storage this class
/// is just an empty class.
template <__fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __direct_storage {};
template <class _OutIt, class _CharT>
concept __enable_direct_output = __fmt_char_type<_CharT> &&
(same_as<_OutIt, _CharT*>
#ifndef _LIBCPP_ENABLE_DEBUG_MODE
|| same_as<_OutIt, __wrap_iter<_CharT*>>
#endif
);
/// Write policy for directly writing to the underlying output.
template <class _OutIt, __fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __writer_direct {
public:
_LIBCPP_HIDE_FROM_ABI explicit __writer_direct(_OutIt __out_it)
: __out_it_(__out_it) {}
_LIBCPP_HIDE_FROM_ABI _OutIt __out_it() { return __out_it_; }
_LIBCPP_HIDE_FROM_ABI void __flush(_CharT*, size_t __n) {
// _OutIt can be a __wrap_iter<_CharT*>. Therefore the original iterator
// is adjusted.
__out_it_ += __n;
}
private:
_OutIt __out_it_;
};
/// Write policy for copying the buffer to the output.
template <class _OutIt, __fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __writer_iterator {
public:
_LIBCPP_HIDE_FROM_ABI explicit __writer_iterator(_OutIt __out_it)
: __out_it_{_VSTD::move(__out_it)} {}
_LIBCPP_HIDE_FROM_ABI _OutIt __out_it() { return __out_it_; }
_LIBCPP_HIDE_FROM_ABI void __flush(_CharT* __ptr, size_t __n) {
__out_it_ = _VSTD::copy_n(__ptr, __n, _VSTD::move(__out_it_));
}
private:
_OutIt __out_it_;
};
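// Illustration only: a minimal sketch of the two write policies above, using
// public standard-library facilities. The names direct_writer, iterator_writer
// and demo_writers are hypothetical and are not part of libc++. The direct
// policy assumes the data was already written in place, so a flush only
// advances the pointer; the iterator policy copies the buffered code units.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical stand-in for a direct write policy: the code units are already
// in the destination, so flushing only moves the stored pointer forward.
template <class CharT>
class direct_writer {
public:
  explicit direct_writer(CharT* out) : out_(out) {}
  CharT* out() const { return out_; }
  void flush(const CharT*, std::size_t n) { out_ += n; }

private:
  CharT* out_;
};

// Hypothetical stand-in for a copying write policy: flushing copies the
// buffered code units to an arbitrary output iterator.
template <class OutIt, class CharT>
class iterator_writer {
public:
  explicit iterator_writer(OutIt out) : out_(std::move(out)) {}
  OutIt out() const { return out_; }
  void flush(const CharT* ptr, std::size_t n) {
    out_ = std::copy_n(ptr, n, std::move(out_));
  }

private:
  OutIt out_;
};

inline void demo_writers() {
  // Direct policy: three code units are already in place.
  char target[8] = {'a', 'b', 'c'};
  direct_writer<char> direct(target);
  direct.flush(target, 3);
  assert(direct.out() == target + 3);

  // Copying policy: the buffered data is appended through the iterator.
  std::string s;
  iterator_writer<std::back_insert_iterator<std::string>, char> copying(
      std::back_inserter(s));
  const char buffered[] = "xyz";
  copying.flush(buffered, 3);
  assert(s == "xyz");
}
```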
/// Concept to see whether a \a _Container is insertable.
///
/// The concept is used to validate whether multiple calls to a
/// \ref back_insert_iterator can be replaced by a call to \c _Container::insert.
///
/// \note a \a _Container needs to opt-in to the concept by specializing
/// \ref __enable_insertable.
template <class _Container>
concept __insertable =
__enable_insertable<_Container> && __fmt_char_type<typename _Container::value_type> &&
requires(_Container& __t, add_pointer_t<typename _Container::value_type> __first,
add_pointer_t<typename _Container::value_type> __last) { __t.insert(__t.end(), __first, __last); };
/// Extract the container type of a \ref back_insert_iterator.
template <class _It>
struct _LIBCPP_TEMPLATE_VIS __back_insert_iterator_container {
using type = void;
};
template <__insertable _Container>
struct _LIBCPP_TEMPLATE_VIS __back_insert_iterator_container<back_insert_iterator<_Container>> {
using type = _Container;
};
/// Write policy for inserting the buffer in a container.
template <class _Container>
class _LIBCPP_TEMPLATE_VIS __writer_container {
public:
using _CharT = typename _Container::value_type;
_LIBCPP_HIDE_FROM_ABI explicit __writer_container(back_insert_iterator<_Container> __out_it)
: __container_{__out_it.__get_container()} {}
_LIBCPP_HIDE_FROM_ABI auto __out_it() { return std::back_inserter(*__container_); }
_LIBCPP_HIDE_FROM_ABI void __flush(_CharT* __ptr, size_t __n) {
__container_->insert(__container_->end(), __ptr, __ptr + __n);
}
private:
_Container* __container_;
};
/// Selects the type of the writer used for the output iterator.
template <class _OutIt, class _CharT>
class _LIBCPP_TEMPLATE_VIS __writer_selector {
using _Container = typename __back_insert_iterator_container<_OutIt>::type;
public:
using type = conditional_t<!same_as<_Container, void>, __writer_container<_Container>,
conditional_t<__enable_direct_output<_OutIt, _CharT>, __writer_direct<_OutIt, _CharT>,
__writer_iterator<_OutIt, _CharT>>>;
};
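// Illustration only: a reduced sketch of the nested conditional_t selection
// above. The tag types and the names back_insert_container and
// writer_selector are hypothetical. Unlike the real selector, this sketch
// treats every back_insert_iterator as insertable (no opt-in) and enables the
// direct path only for a raw CharT*.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical tags standing in for the three writer policies.
struct container_writer_tag {};
struct direct_writer_tag {};
struct iterator_writer_tag {};

// Yields the container type of a std::back_insert_iterator, void otherwise.
template <class It>
struct back_insert_container {
  using type = void;
};
template <class C>
struct back_insert_container<std::back_insert_iterator<C>> {
  using type = C;
};

// Picks the container policy first, then the direct policy, and falls back
// to the generic iterator policy.
template <class OutIt, class CharT>
struct writer_selector {
  using container = typename back_insert_container<OutIt>::type;
  using type = std::conditional_t<
      !std::is_same<container, void>::value, container_writer_tag,
      std::conditional_t<std::is_same<OutIt, CharT*>::value,
                         direct_writer_tag, iterator_writer_tag>>;
};

static_assert(
    std::is_same<writer_selector<std::back_insert_iterator<std::string>, char>::type,
                 container_writer_tag>::value,
    "back_insert_iterator selects the container policy");
static_assert(std::is_same<writer_selector<char*, char>::type,
                           direct_writer_tag>::value,
              "a raw pointer selects the direct policy");
static_assert(
    std::is_same<writer_selector<std::ostreambuf_iterator<char>, char>::type,
                 iterator_writer_tag>::value,
    "any other iterator selects the copying policy");
```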
/// The generic formatting buffer.
template <class _OutIt, __fmt_char_type _CharT>
requires(output_iterator<_OutIt, const _CharT&>) class _LIBCPP_TEMPLATE_VIS
__format_buffer {
using _Storage =
conditional_t<__enable_direct_output<_OutIt, _CharT>,
__direct_storage<_CharT>, __internal_storage<_CharT>>;
public:
_LIBCPP_HIDE_FROM_ABI explicit __format_buffer(_OutIt __out_it)
requires(same_as<_Storage, __internal_storage<_CharT>>)
: __output_(__storage_.__begin(), __storage_.__buffer_size, this), __writer_(_VSTD::move(__out_it)) {}
_LIBCPP_HIDE_FROM_ABI explicit __format_buffer(_OutIt __out_it) requires(
same_as<_Storage, __direct_storage<_CharT>>)
: __output_(_VSTD::__unwrap_iter(__out_it), size_t(-1), this),
__writer_(_VSTD::move(__out_it)) {}
_LIBCPP_HIDE_FROM_ABI auto __make_output_iterator() { return __output_.__make_output_iterator(); }
_LIBCPP_HIDE_FROM_ABI void __flush(_CharT* __ptr, size_t __n) { __writer_.__flush(__ptr, __n); }
_LIBCPP_HIDE_FROM_ABI _OutIt __out_it() && {
__output_.__flush();
return _VSTD::move(__writer_).__out_it();
}
private:
_LIBCPP_NO_UNIQUE_ADDRESS _Storage __storage_;
__output_buffer<_CharT> __output_;
typename __writer_selector<_OutIt, _CharT>::type __writer_;
};
/// A buffer that counts the number of insertions.
///
/// Since \ref formatted_size only needs to know the size, the output itself is
/// discarded.
template <__fmt_char_type _CharT>
class _LIBCPP_TEMPLATE_VIS __formatted_size_buffer {
public:
_LIBCPP_HIDE_FROM_ABI auto __make_output_iterator() { return __output_.__make_output_iterator(); }
_LIBCPP_HIDE_FROM_ABI void __flush(const _CharT*, size_t __n) { __size_ += __n; }
_LIBCPP_HIDE_FROM_ABI size_t __result() && {
__output_.__flush();
return __size_;
}
private:
__internal_storage<_CharT> __storage_;
__output_buffer<_CharT> __output_{__storage_.__begin(), __storage_.__buffer_size, this};
size_t __size_{0};
};
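// Illustration only: a simplified model of a counting buffer. The class
// counting_buffer, its 16-unit capacity and demo_counting_buffer are
// hypothetical; the real class delegates the buffering to __output_buffer.
// The sketch shows the idea that a flush discards the data and only adds the
// number of buffered code units to the total.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class counting_buffer {
public:
  // Bulk write: copy into the small internal buffer, flushing when full.
  void write(const char* data, std::size_t n) {
    while (n != 0) {
      if (used_ == capacity)
        flush();
      const std::size_t room = capacity - used_;
      const std::size_t chunk = n < room ? n : room;
      std::memcpy(storage_ + used_, data, chunk);
      used_ += chunk;
      data += chunk;
      n -= chunk;
    }
  }

  // Flush what is still buffered, then report the total.
  std::size_t result() {
    flush();
    return size_;
  }

private:
  // The output is discarded; only its length is recorded.
  void flush() {
    size_ += used_;
    used_ = 0;
  }

  static constexpr std::size_t capacity = 16;
  char storage_[capacity];
  std::size_t used_ = 0;
  std::size_t size_ = 0;
};

inline void demo_counting_buffer() {
  counting_buffer buf;
  const char text[] = "The answer is 42, and then some padding.";
  buf.write(text, std::strlen(text));
  assert(buf.result() == std::strlen(text));
}
```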
/// The base of a buffer that counts and limits the number of insertions.
template <class _OutIt, __fmt_char_type _CharT, bool>
requires(output_iterator<_OutIt, const _CharT&>)
struct _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base {
using _Size = iter_difference_t<_OutIt>;
public:
_LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __max_size)
: __writer_(_VSTD::move(__out_it)), __max_size_(_VSTD::max(_Size(0), __max_size)) {}
_LIBCPP_HIDE_FROM_ABI void __flush(_CharT* __ptr, size_t __n) {
if (_Size(__size_) <= __max_size_)
__writer_.__flush(__ptr, _VSTD::min(_Size(__n), __max_size_ - __size_));
__size_ += __n;
}
protected:
__internal_storage<_CharT> __storage_;
__output_buffer<_CharT> __output_{__storage_.__begin(), __storage_.__buffer_size, this};
typename __writer_selector<_OutIt, _CharT>::type __writer_;
  _Size __max_size_;
  _Size __size_{0};
};

/// The base of a buffer that counts and limits the number of insertions.
///
/// This version is used when \c __enable_direct_output<_OutIt, _CharT> == true.
///
/// This class limits the size available to the direct writer so it will not
/// exceed the maximum number of code units.
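///
/// Illustrative sketch, not part of the upstream header: the limit as seen by
/// a caller of the public \c std::format_to_n. It assumes a plain \c char*
/// output iterator; whether that iterator selects this specialization is an
/// assumption and depends on \c __enable_direct_output.
/// \code
///   char buf[4];
///   auto r = std::format_to_n(buf, 4, "{}", 123456);
///   // r.size == 6: the number of code units the full output requires.
///   // r.out == buf + 4: only the first 4 code units ("1234") were written.
/// \endcode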
template <class _OutIt, __fmt_char_type _CharT>
  requires(output_iterator<_OutIt, const _CharT&>)
class _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base<_OutIt, _CharT, true> {
  using _Size = iter_difference_t<_OutIt>;

public:
  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __max_size)
      : __output_(_VSTD::__unwrap_iter(__out_it), __max_size, this),
        __writer_(_VSTD::move(__out_it)),
        __max_size_(__max_size) {
    if (__max_size <= 0) [[unlikely]]
      __output_.__reset(__storage_.__begin(), __storage_.__buffer_size);
  }

  _LIBCPP_HIDE_FROM_ABI void __flush(_CharT* __ptr, size_t __n) {
    // A __flush to the direct writer happens on the following occasions:
    // - The format function has written the maximum number of allowed code
    //   units. At this point it's no longer valid to write to this writer. So
    //   switch to the internal storage. This internal storage doesn't need to
    //   be written anywhere so the __flush for that storage writes no output.
    // - Like above, but the next "mass write" operation would overflow the
    //   buffer. In that case the buffer is pre-emptively switched. The still
    //   valid code units will be written separately.
    // - The format_to_n function is finished. In this case there's no need to
    //   switch the buffer, but for simplicity the buffers are still switched.
    // When __max_size <= 0 the constructor has already switched the buffers.
    if (__size_ == 0 && __ptr != __storage_.__begin()) {
      __writer_.__flush(__ptr, __n);
      __output_.__reset(__storage_.__begin(), __storage_.__buffer_size);
    } else if (__size_ < __max_size_) {
      // Copies a part of the internal buffer to the output up to n characters.
      // See __output_buffer<_CharT>::__flush_on_overflow for more information.
      _Size __s = _VSTD::min(_Size(__n), __max_size_ - __size_);
      std::copy_n(__ptr, __s, __writer_.__out_it());
      __writer_.__flush(__ptr, __s);
    }

    __size_ += __n;
  }

protected:
  __internal_storage<_CharT> __storage_;
  __output_buffer<_CharT> __output_;
  __writer_direct<_OutIt, _CharT> __writer_;
_Size __max_size_;
_Size __size_{0};
};
/// The buffer that counts and limits the number of insertions.
template <class _OutIt, __fmt_char_type _CharT>
requires(output_iterator<_OutIt, const _CharT&>)
struct _LIBCPP_TEMPLATE_VIS __format_to_n_buffer final
: public __format_to_n_buffer_base< _OutIt, _CharT, __enable_direct_output<_OutIt, _CharT>> {
using _Base = __format_to_n_buffer_base<_OutIt, _CharT, __enable_direct_output<_OutIt, _CharT>>;
using _Size = iter_difference_t<_OutIt>;
public:
_LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer(_OutIt __out_it, _Size __max_size)
: _Base(_VSTD::move(__out_it), __max_size) {}
_LIBCPP_HIDE_FROM_ABI auto __make_output_iterator() { return this->__output_.__make_output_iterator(); }
_LIBCPP_HIDE_FROM_ABI format_to_n_result<_OutIt> __result() && {
this->__output_.__flush();
return {_VSTD::move(this->__writer_).__out_it(), this->__size_};
}
};
} // namespace __format
#endif //_LIBCPP_STD_VER > 17
_LIBCPP_END_NAMESPACE_STD
_LIBCPP_POP_MACROS
#endif // _LIBCPP___FORMAT_BUFFER_H