//===--- Quality.cpp ---------------------------------------------*- C++-*-===//
[clangd] Extract scoring/ranking logic, and shave yaks.
Summary:
Code completion scoring was embedded in CodeComplete.cpp, which is bad:
- awkward to test. The mechanisms (extracting info from index/sema) can be
unit-tested well, the policy (scoring) should be quantitatively measured.
Neither was easily possible, and debugging was hard.
The intermediate signal struct makes this easier.
- hard to reuse. This is a bug in workspaceSymbols: it just presents the
results in the index order, which is not sorted in practice, it needs to rank
them!
Also, index implementations care about scoring (both query-dependent and
independent) in order to truncate result lists appropriately.
The main yak shaved here is the build() function that had 3 variants across
unit tests is unified in TestTU.h (rather than adding a 4th variant).
Reviewers: ilya-biryukov
Subscribers: klimek, mgorny, ioeric, MaskRay, jkorous, mgrang, cfe-commits
Differential Revision: https://reviews.llvm.org/D46524
llvm-svn: 332378
2018-05-16 01:43:27 +08:00
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "Quality.h"
#include "AST.h"
[clangd] Use Decision Forest to score code completions.
By default, clangd scores a code completion item using a heuristics model.
Scoring can instead be done by a Decision Forest model by passing `--ranking_model=decision_forest` to
clangd.
Features omitted from the model:
- `NameMatch` is excluded because the final score must be multiplicative in `NameMatch` to allow rescoring by the editor.
- `NeedsFixIts` is excluded because generating a dataset that needs 'fixits' is non-trivial.
There are multiple ways (heuristics) to combine the above two features with the prediction of the DF:
- `NeedsFixIts` is used as is with a penalty of `0.5`.
Various alternatives for combining NameMatch `N` and Decision Forest prediction `P`:
- N * scale(P, 0, 1): Linearly scale the output of the model to the range [0, 1]
- N * a^P:
- More natural: the prediction of each Decision Tree can be considered a multiplicative boost (like NameMatch)
- Ordering is independent of the absolute value of P. The score ratio of two items is proportional to `a^{difference in model prediction score}`. A higher `a` gives more weight to the model output relative to the NameMatch score.
Baseline MRR = 0.619
MRR for various combinations:
N * P = 0.6346, advantage%=2.5768
N * 1.1^P = 0.6600, advantage%=6.6853
N * **1.2**^P = 0.6669, advantage%=**7.8005**
N * **1.3**^P = 0.6668, advantage%=**7.7795**
N * **1.4**^P = 0.6659, advantage%=**7.6270**
N * 1.5^P = 0.6646, advantage%=7.4200
N * 1.6^P = 0.6636, advantage%=7.2671
N * 1.7^P = 0.6629, advantage%=7.1450
N * 2^P = 0.6612, advantage%=6.8673
N * 2.5^P = 0.6598, advantage%=6.6491
N * 3^P = 0.6590, advantage%=6.5242
N * scaled[0, 1] = 0.6465, advantage%=4.5054
Differential Revision: https://reviews.llvm.org/D88281
2020-09-22 13:56:08 +08:00
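The `N * a^P` combination described in the commit message above can be sketched in isolation. This is a minimal illustration, not clangd's code: the function name `combineScores`, its parameters, and the treatment of the `NeedsFixIts` penalty as a plain multiplication by `0.5` are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of clangd): combines the fuzzy-match score
// NameMatch (N) with a Decision Forest prediction (P) as N * Base^P.
// Assumption: the NeedsFixIts penalty is applied as a multiplication by 0.5.
inline float combineScores(float NameMatch, float Prediction, float Base,
                           bool NeedsFixIts) {
  float Score = NameMatch * std::pow(Base, Prediction);
  if (NeedsFixIts)
    Score *= 0.5f;
  return Score;
}
```

For two candidates with the same `NameMatch`, the ratio of their combined scores is `Base^(P1 - P2)`, so adding a constant to both predictions leaves their relative order unchanged.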
#include "CompletionModel.h"
#include "FileDistance.h"
#include "SourceCode.h"
#include "URI.h"
#include "index/Symbol.h"
#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"
#include "clang/AST/DeclCXX.h"
#include "clang/AST/DeclTemplate.h"
#include "clang/AST/DeclVisitor.h"
#include "clang/Basic/CharInfo.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Sema/CodeCompleteConsumer.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/FormatVariadic.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Support/raw_ostream.h"
#include <algorithm>
#include <cmath>

namespace clang {
namespace clangd {
static bool isReserved(llvm::StringRef Name) {
  // FIXME: Should we exclude _Bool and others recognized by the standard?
  return Name.size() >= 2 && Name[0] == '_' &&
         (isUppercase(Name[1]) || Name[1] == '_');
}
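
The reserved-name rule above (a leading underscore followed by an uppercase letter or a second underscore) can be tried outside of clangd with standard library types only. The sketch below is an illustration under stated assumptions: `isReservedName` is a hypothetical name, `std::string_view` (C++17) stands in for `llvm::StringRef`, and an explicit ASCII range test stands in for clang's `isUppercase()`.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical standalone variant of the check, using std::string_view in
// place of llvm::StringRef and an explicit ASCII range test in place of
// clang's isUppercase().
inline bool isReservedName(std::string_view Name) {
  if (Name.size() < 2 || Name[0] != '_')
    return false;
  const char Second = Name[1];
  return Second == '_' || (Second >= 'A' && Second <= 'Z');
}
```

Names such as `_Foo` and `__detail` match, while `_foo`, `_`, and `foo_Bar` do not.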

static bool hasDeclInMainFile(const Decl &D) {
  auto &SourceMgr = D.getASTContext().getSourceManager();
  for (auto *Redecl : D.redecls()) {
    if (isInsideMainFile(Redecl->getLocation(), SourceMgr))
      return true;
  }
  return false;
}

static bool hasUsingDeclInMainFile(const CodeCompletionResult &R) {
  const auto &Context = R.Declaration->getASTContext();
  const auto &SourceMgr = Context.getSourceManager();
  if (R.ShadowDecl) {
    if (isInsideMainFile(R.ShadowDecl->getLocation(), SourceMgr))
      return true;
  }
  return false;
}

static SymbolQualitySignals::SymbolCategory categorize(const NamedDecl &ND) {
  if (const auto *FD = dyn_cast<FunctionDecl>(&ND)) {
    if (FD->isOverloadedOperator())
      return SymbolQualitySignals::Operator;
  }
  class Switch
      : public ConstDeclVisitor<Switch, SymbolQualitySignals::SymbolCategory> {
  public:
#define MAP(DeclType, Category)                                                \
  SymbolQualitySignals::SymbolCategory Visit##DeclType(const DeclType *) {     \
    return SymbolQualitySignals::Category;                                     \
  }
    MAP(NamespaceDecl, Namespace);
    MAP(NamespaceAliasDecl, Namespace);
    MAP(TypeDecl, Type);
    MAP(TypeAliasTemplateDecl, Type);
    MAP(ClassTemplateDecl, Type);
    MAP(CXXConstructorDecl, Constructor);
    MAP(CXXDestructorDecl, Destructor);
    MAP(ValueDecl, Variable);
    MAP(VarTemplateDecl, Variable);
    MAP(FunctionDecl, Function);
    MAP(FunctionTemplateDecl, Function);
    MAP(Decl, Unknown);
#undef MAP
  };
  return Switch().Visit(&ND);
}

static SymbolQualitySignals::SymbolCategory
categorize(const CodeCompletionResult &R) {
  if (R.Declaration)
    return categorize(*R.Declaration);
  if (R.Kind == CodeCompletionResult::RK_Macro)
    return SymbolQualitySignals::Macro;
  // Everything else is a keyword or a pattern. Patterns are mostly keywords
  // too, except a few which we recognize by cursor kind.
  switch (R.CursorKind) {
  case CXCursor_CXXMethod:
    return SymbolQualitySignals::Function;
  case CXCursor_ModuleImportDecl:
    return SymbolQualitySignals::Namespace;
  case CXCursor_MacroDefinition:
    return SymbolQualitySignals::Macro;
  case CXCursor_TypeRef:
    return SymbolQualitySignals::Type;
  case CXCursor_MemberRef:
    return SymbolQualitySignals::Variable;
  case CXCursor_Constructor:
    return SymbolQualitySignals::Constructor;
  default:
    return SymbolQualitySignals::Keyword;
  }
}

static SymbolQualitySignals::SymbolCategory
categorize(const index::SymbolInfo &D) {
  switch (D.Kind) {
  case index::SymbolKind::Namespace:
  case index::SymbolKind::NamespaceAlias:
    return SymbolQualitySignals::Namespace;
  case index::SymbolKind::Macro:
    return SymbolQualitySignals::Macro;
  case index::SymbolKind::Enum:
  case index::SymbolKind::Struct:
  case index::SymbolKind::Class:
  case index::SymbolKind::Protocol:
  case index::SymbolKind::Extension:
  case index::SymbolKind::Union:
  case index::SymbolKind::TypeAlias:
  case index::SymbolKind::TemplateTypeParm:
  case index::SymbolKind::TemplateTemplateParm:
    return SymbolQualitySignals::Type;
  case index::SymbolKind::Function:
  case index::SymbolKind::ClassMethod:
  case index::SymbolKind::InstanceMethod:
  case index::SymbolKind::StaticMethod:
  case index::SymbolKind::InstanceProperty:
  case index::SymbolKind::ClassProperty:
  case index::SymbolKind::StaticProperty:
  case index::SymbolKind::ConversionFunction:
    return SymbolQualitySignals::Function;
  case index::SymbolKind::Destructor:
    return SymbolQualitySignals::Destructor;
  case index::SymbolKind::Constructor:
    return SymbolQualitySignals::Constructor;
  case index::SymbolKind::Variable:
  case index::SymbolKind::Field:
  case index::SymbolKind::EnumConstant:
  case index::SymbolKind::Parameter:
  case index::SymbolKind::NonTypeTemplateParm:
    return SymbolQualitySignals::Variable;
  case index::SymbolKind::Using:
  case index::SymbolKind::Module:
  case index::SymbolKind::Unknown:
    return SymbolQualitySignals::Unknown;
  }
  llvm_unreachable("Unknown index::SymbolKind");
}

static bool isInstanceMember(const NamedDecl *ND) {
  if (!ND)
    return false;
  if (const auto *TP = dyn_cast<FunctionTemplateDecl>(ND))
    ND = TP->TemplateDecl::getTemplatedDecl();
  if (const auto *CM = dyn_cast<CXXMethodDecl>(ND))
    return !CM->isStatic();
  return isa<FieldDecl>(ND); // Note that static fields are VarDecl.
}

static bool isInstanceMember(const index::SymbolInfo &D) {
  switch (D.Kind) {
  case index::SymbolKind::InstanceMethod:
  case index::SymbolKind::InstanceProperty:
  case index::SymbolKind::Field:
    return true;
  default:
    return false;
  }
}

void SymbolQualitySignals::merge(const CodeCompletionResult &SemaCCResult) {
  Deprecated |= (SemaCCResult.Availability == CXAvailability_Deprecated);
  Category = categorize(SemaCCResult);

  if (SemaCCResult.Declaration) {
    ImplementationDetail |= isImplementationDetail(SemaCCResult.Declaration);
    if (auto *ID = SemaCCResult.Declaration->getIdentifier())
      ReservedName = ReservedName || isReserved(ID->getName());
  } else if (SemaCCResult.Kind == CodeCompletionResult::RK_Macro)
    ReservedName = ReservedName || isReserved(SemaCCResult.Macro->getName());
}

void SymbolQualitySignals::merge(const Symbol &IndexResult) {
  Deprecated |= (IndexResult.Flags & Symbol::Deprecated);
  ImplementationDetail |= (IndexResult.Flags & Symbol::ImplementationDetail);
  References = std::max(IndexResult.References, References);
  Category = categorize(IndexResult.SymInfo);
  ReservedName = ReservedName || isReserved(IndexResult.Name);
}

float SymbolQualitySignals::evaluateHeuristics() const {
  float Score = 1;

  // This avoids a sharp gradient for tail symbols, and also neatly avoids the
  // question of whether 0 references means a bad symbol or missing data.
  if (References >= 10) {
    // Use a sigmoid style boosting function, which flattens out nicely for
    // large numbers (e.g. 2.94 for 1M references).
    // The following boosting function is equivalent to:
    //   m = 0.06
    //   f = 12.0
    //   boost = f * sigmoid(m * std::log(References)) - 0.5 * f + 0.59
    // Sample data points: (10, 1.00), (100, 1.41), (1000, 1.82),
    //                     (10K, 2.21), (100K, 2.58), (1M, 2.94)
    float S = std::pow(References, -0.06);
    Score *= 6.0 * (1 - S) / (1 + S) + 0.59;
  }

  if (Deprecated)
    Score *= 0.1f;
  if (ReservedName)
    Score *= 0.1f;
  if (ImplementationDetail)
    Score *= 0.2f;

  switch (Category) {
  case Keyword: // Often relevant, but misses most signals.
    Score *= 4; // FIXME: important keywords should have specific boosts.
    break;
  case Type:
  case Function:
  case Variable:
    Score *= 1.1f;
    break;
  case Namespace:
    Score *= 0.8f;
    break;
  case Macro:
  case Destructor:
  case Operator:
    Score *= 0.5f;
    break;
  case Constructor: // No boost for constructors, so they rank after class types.
  case Unknown:
    break;
  }

  return Score;
}
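
The comment in `evaluateHeuristics()` states that the closed form `6.0 * (1 - S) / (1 + S) + 0.59` is equivalent to a scaled sigmoid of `log(References)`. The sketch below checks that numerically. It is an illustration only: the function names `boostClosedForm` and `boostSigmoidForm` are hypothetical, and `double` is used instead of the `float` in the original, so the last digits may differ slightly from clangd's own results.

```cpp
#include <cassert>
#include <cmath>

// Closed form used in evaluateHeuristics(): 6 * (1 - S) / (1 + S) + 0.59,
// with S = References^-0.06. Computed in double here for readability.
inline double boostClosedForm(double References) {
  const double S = std::pow(References, -0.06);
  return 6.0 * (1.0 - S) / (1.0 + S) + 0.59;
}

// The sigmoid form from the comment, with m = 0.06 and f = 12.0.
inline double boostSigmoidForm(double References) {
  const double M = 0.06, F = 12.0;
  const double Sigmoid = 1.0 / (1.0 + std::exp(-M * std::log(References)));
  return F * Sigmoid - 0.5 * F + 0.59;
}
```

The two forms agree because `sigmoid(m * log(References))` equals `1 / (1 + S)`, so `12 / (1 + S) - 6` simplifies to `6 * (1 - S) / (1 + S)`.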

llvm::raw_ostream &operator<<(llvm::raw_ostream &OS,
                              const SymbolQualitySignals &S) {
  OS << llvm::formatv("=== Symbol quality: {0}\n", S.evaluateHeuristics());
  OS << llvm::formatv("\tReferences: {0}\n", S.References);
  OS << llvm::formatv("\tDeprecated: {0}\n", S.Deprecated);
  OS << llvm::formatv("\tReserved name: {0}\n", S.ReservedName);
  OS << llvm::formatv("\tImplementation detail: {0}\n", S.ImplementationDetail);
  OS << llvm::formatv("\tCategory: {0}\n", static_cast<int>(S.Category));
  return OS;
}

static SymbolRelevanceSignals::AccessibleScope
computeScope(const NamedDecl *D) {
  // Injected "Foo" within the class "Foo" has file scope, not class scope.
  const DeclContext *DC = D->getDeclContext();
  if (auto *R = dyn_cast_or_null<RecordDecl>(D))
    if (R->isInjectedClassName())
      DC = DC->getParent();
  // Class constructor should have the same scope as the class.
  if (isa<CXXConstructorDecl>(D))
    DC = DC->getParent();
  bool InClass = false;
  for (; !DC->isFileContext(); DC = DC->getParent()) {
    if (DC->isFunctionOrMethod())
      return SymbolRelevanceSignals::FunctionScope;
    InClass = InClass || DC->isRecord();
  }
  if (InClass)
    return SymbolRelevanceSignals::ClassScope;
  // ExternalLinkage threshold could be tweaked, e.g. module-visible as global.
  // Avoid caching linkage if it may change after enclosing code completion.
  if (hasUnstableLinkage(D) || D->getLinkageInternal() < ExternalLinkage)
    return SymbolRelevanceSignals::FileScope;
  return SymbolRelevanceSignals::GlobalScope;
}
|
|
|
|
|
2018-06-06 00:30:25 +08:00
|
|
|
void SymbolRelevanceSignals::merge(const Symbol &IndexResult) {
|
2018-06-15 16:58:12 +08:00
|
|
|
SymbolURI = IndexResult.CanonicalDeclaration.FileURI;
|
2021-01-29 21:59:16 +08:00
|
|
|
SymbolScope = IndexResult.Scope;
|
2018-07-23 18:56:37 +08:00
|
|
|
IsInstanceMember |= isInstanceMember(IndexResult.SymInfo);
|
2019-02-01 21:07:37 +08:00
|
|
|
if (!(IndexResult.Flags & Symbol::VisibleOutsideFile)) {
|
2021-01-29 21:59:16 +08:00
|
|
|
Scope = AccessibleScope::FileScope;
|
2019-02-01 21:07:37 +08:00
|
|
|
}
|
2021-01-10 23:32:00 +08:00
|
|
|
if (MainFileSignals) {
|
|
|
|
MainFileRefs =
|
|
|
|
std::max(MainFileRefs,
|
|
|
|
MainFileSignals->ReferencedSymbols.lookup(IndexResult.ID));
|
|
|
|
ScopeRefsInFile =
|
|
|
|
std::max(ScopeRefsInFile,
|
|
|
|
MainFileSignals->RelatedNamespaces.lookup(IndexResult.Scope));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void SymbolRelevanceSignals::computeASTSignals(
|
|
|
|
const CodeCompletionResult &SemaResult) {
|
|
|
|
if (!MainFileSignals)
|
|
|
|
return;
|
|
|
|
if ((SemaResult.Kind != CodeCompletionResult::RK_Declaration) &&
|
|
|
|
(SemaResult.Kind != CodeCompletionResult::RK_Pattern))
|
|
|
|
return;
|
|
|
|
if (const NamedDecl *ND = SemaResult.getDeclaration()) {
|
|
|
|
auto ID = getSymbolID(ND);
|
|
|
|
if (!ID)
|
|
|
|
return;
|
|
|
|
MainFileRefs =
|
|
|
|
std::max(MainFileRefs, MainFileSignals->ReferencedSymbols.lookup(ID));
|
|
|
|
if (const auto *NSD = dyn_cast<NamespaceDecl>(ND->getDeclContext())) {
|
|
|
|
if (NSD->isAnonymousNamespace())
|
|
|
|
return;
|
|
|
|
std::string Scope = printNamespaceScope(*NSD);
|
|
|
|
if (!Scope.empty())
|
|
|
|
ScopeRefsInFile = std::max(
|
|
|
|
ScopeRefsInFile, MainFileSignals->RelatedNamespaces.lookup(Scope));
|
|
|
|
}
|
|
|
|
}
|
2018-06-06 00:30:25 +08:00
|
|
|
}
|
|
|
|
|
2018-05-16 01:43:27 +08:00
|
|
|
void SymbolRelevanceSignals::merge(const CodeCompletionResult &SemaCCResult) {
|
|
|
|
if (SemaCCResult.Availability == CXAvailability_NotAvailable ||
|
|
|
|
SemaCCResult.Availability == CXAvailability_NotAccessible)
|
|
|
|
Forbidden = true;
|
2018-06-04 22:50:59 +08:00
|
|
|
|
|
|
|
if (SemaCCResult.Declaration) {
|
2018-10-17 19:19:02 +08:00
|
|
|
SemaSaysInScope = true;
|
2018-06-15 16:58:12 +08:00
|
|
|
// We boost things that have decls in the main file. We give a fixed score
|
|
|
|
// for all other declarations in sema as they are already included in the
|
|
|
|
// translation unit.
|
2018-07-11 22:49:49 +08:00
|
|
|
float DeclProximity = (hasDeclInMainFile(*SemaCCResult.Declaration) ||
|
|
|
|
hasUsingDeclInMainFile(SemaCCResult))
|
|
|
|
? 1.0
|
|
|
|
: 0.6;
|
2018-10-17 19:19:02 +08:00
|
|
|
SemaFileProximityScore = std::max(DeclProximity, SemaFileProximityScore);
|
2018-07-23 18:56:37 +08:00
|
|
|
IsInstanceMember |= isInstanceMember(SemaCCResult.Declaration);
|
2018-10-24 21:45:17 +08:00
|
|
|
InBaseClass |= SemaCCResult.InBaseClass;
|
2018-06-04 22:50:59 +08:00
|
|
|
}
|
2018-06-06 00:30:25 +08:00
|
|
|
|
2021-01-10 23:32:00 +08:00
|
|
|
computeASTSignals(SemaCCResult);
|
2018-06-06 00:30:25 +08:00
|
|
|
// Declarations are scoped, others (like macros) are assumed global.
|
2018-06-06 01:58:12 +08:00
|
|
|
if (SemaCCResult.Declaration)
|
2021-01-29 21:59:16 +08:00
|
|
|
Scope = std::min(Scope, computeScope(SemaCCResult.Declaration));
|
2018-08-08 16:59:29 +08:00
|
|
|
|
|
|
|
NeedsFixIts = !SemaCCResult.FixIts.empty();
|
2018-05-16 01:43:27 +08:00
|
|
|
}
|
|
|
|
|
2020-09-23 20:37:07 +08:00
|
|
|
static float fileProximityScore(unsigned FileDistance) {
|
|
|
|
// Range: [0, 1]
|
|
|
|
// FileDistance = [0, 1, 2, 3, 4, .., FileDistance::Unreachable]
|
|
|
|
// Score = [1, 0.82, 0.67, 0.55, 0.45, .., 0]
|
|
|
|
if (FileDistance == FileDistance::Unreachable)
|
|
|
|
return 0;
|
2018-07-03 16:09:29 +08:00
|
|
|
// Assume approximately default options are used for sensible scoring.
|
2020-09-23 20:37:07 +08:00
|
|
|
return std::exp(FileDistance * -0.4f / FileDistanceOptions().UpCost);
|
2018-07-03 16:09:29 +08:00
|
|
|
}
|
|
|
|
|
2020-09-23 20:37:07 +08:00
|
|
|
static float scopeProximityScore(unsigned ScopeDistance) {
|
|
|
|
// Range: [0.6, 2].
|
|
|
|
// ScopeDistance = [0, 1, 2, 3, 4, 5, 6, 7, .., FileDistance::Unreachable]
|
|
|
|
// Score = [2.0, 1.55, 1.2, 0.93, 0.72, 0.65, 0.65, 0.65, .., 0.6]
|
|
|
|
if (ScopeDistance == FileDistance::Unreachable)
|
2018-11-28 21:45:25 +08:00
|
|
|
return 0.6f;
|
2020-09-23 20:37:07 +08:00
|
|
|
return std::max(0.65, 2.0 * std::pow(0.6, ScopeDistance / 2.0));
|
2018-10-17 19:19:02 +08:00
|
|
|
}
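The two proximity curves above can be reproduced in isolation to check the score tables in their comments. This is a minimal sketch, not clangd code: `kUnreachable` and `kUpCost` are hypothetical stand-ins for `FileDistance::Unreachable` and `FileDistanceOptions().UpCost`, and the value 2 for the up-cost is an assumption inferred from the documented score of 0.82 at distance 1.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical stand-ins (assumptions, not clangd's definitions).
constexpr unsigned kUnreachable = std::numeric_limits<unsigned>::max();
constexpr unsigned kUpCost = 2; // assumed default, inferred from the table

// Exponential decay in [0, 1]: 1 at distance 0, 0 when unreachable.
float fileProximitySketch(unsigned Distance) {
  if (Distance == kUnreachable)
    return 0.0f;
  return std::exp(Distance * -0.4f / kUpCost);
}

// Geometric decay starting at 2.0, floored at 0.65 for reachable scopes,
// and 0.6 for unreachable ones.
float scopeProximitySketch(unsigned Distance) {
  if (Distance == kUnreachable)
    return 0.6f;
  return static_cast<float>(
      std::max(0.65, 2.0 * std::pow(0.6, Distance / 2.0)));
}
```

Under these assumptions the file score drops by a factor of about 0.82 per unit of distance, while the scope score is multiplied by 0.6 for every two units until it reaches the 0.65 floor.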
|
|
|
|
|
2019-05-06 18:25:10 +08:00
|
|
|
static llvm::Optional<llvm::StringRef>
|
|
|
|
wordMatching(llvm::StringRef Name, const llvm::StringSet<> *ContextWords) {
|
|
|
|
if (ContextWords)
|
2020-09-23 20:37:07 +08:00
|
|
|
for (const auto &Word : ContextWords->keys())
|
2019-05-06 18:25:10 +08:00
|
|
|
if (Name.contains_lower(Word))
|
|
|
|
return Word;
|
|
|
|
return llvm::None;
|
|
|
|
}
|
|
|
|
|
2020-09-23 20:37:07 +08:00
|
|
|
SymbolRelevanceSignals::DerivedSignals
|
|
|
|
SymbolRelevanceSignals::calculateDerivedSignals() const {
|
|
|
|
DerivedSignals Derived;
|
|
|
|
Derived.NameMatchesContext = wordMatching(Name, ContextWords).hasValue();
|
|
|
|
Derived.FileProximityDistance = !FileProximityMatch || SymbolURI.empty()
|
|
|
|
? FileDistance::Unreachable
|
|
|
|
: FileProximityMatch->distance(SymbolURI);
|
|
|
|
if (ScopeProximityMatch) {
|
|
|
|
// For a global symbol, the distance is 0.
|
|
|
|
Derived.ScopeProximityDistance =
|
2021-01-29 21:59:16 +08:00
|
|
|
SymbolScope ? ScopeProximityMatch->distance(*SymbolScope) : 0;
|
2020-09-23 20:37:07 +08:00
|
|
|
}
|
|
|
|
return Derived;
|
|
|
|
}
|
|
|
|
|
2020-09-29 01:19:51 +08:00
|
|
|
float SymbolRelevanceSignals::evaluateHeuristics() const {
|
2020-09-23 20:37:07 +08:00
|
|
|
DerivedSignals Derived = calculateDerivedSignals();
|
2018-06-06 00:30:25 +08:00
|
|
|
float Score = 1;
|
|
|
|
|
2018-05-16 01:43:27 +08:00
|
|
|
if (Forbidden)
|
|
|
|
return 0;
|
2018-06-04 22:50:59 +08:00
|
|
|
|
2018-06-06 00:30:25 +08:00
|
|
|
Score *= NameMatch;
|
|
|
|
|
2018-10-17 19:19:02 +08:00
|
|
|
// File proximity scores are [0,1] and we translate them into a multiplier in
|
|
|
|
// the range from 1 to 3.
|
2020-09-23 20:37:07 +08:00
|
|
|
Score *= 1 + 2 * std::max(fileProximityScore(Derived.FileProximityDistance),
|
2018-10-17 19:19:02 +08:00
|
|
|
SemaFileProximityScore);
|
|
|
|
|
|
|
|
if (ScopeProximityMatch)
|
|
|
|
// Use a constant scope boost for sema results, as scopes of sema results
|
|
|
|
// can be tricky (e.g. class/function scope). Set to the max boost as we
|
|
|
|
// don't load top-level symbols from the preamble and sema results are
|
|
|
|
// always in the accessible scope.
|
2020-09-23 20:37:07 +08:00
|
|
|
Score *= SemaSaysInScope
|
|
|
|
? 2.0
|
|
|
|
: scopeProximityScore(Derived.ScopeProximityDistance);
|
2018-06-06 00:30:25 +08:00
|
|
|
|
2020-09-23 20:37:07 +08:00
|
|
|
if (Derived.NameMatchesContext)
|
2019-05-06 18:25:10 +08:00
|
|
|
Score *= 1.5;
|
|
|
|
|
2018-06-06 00:30:25 +08:00
|
|
|
// Symbols like local variables may only be referenced within their scope.
|
|
|
|
// Conversely if we're in that scope, it's likely we'll reference them.
|
|
|
|
if (Query == CodeComplete) {
|
|
|
|
// The narrower the scope where a symbol is visible, the more likely it is
|
|
|
|
// to be relevant when it is available.
|
2021-01-29 21:59:16 +08:00
|
|
|
switch (Scope) {
|
|
|
|
case GlobalScope:
|
2018-06-06 00:30:25 +08:00
|
|
|
break;
|
2021-01-29 21:59:16 +08:00
|
|
|
case FileScope:
|
2019-02-01 21:07:37 +08:00
|
|
|
Score *= 1.5f;
|
2018-06-07 16:16:36 +08:00
|
|
|
break;
|
2021-01-29 21:59:16 +08:00
|
|
|
case ClassScope:
|
2018-06-06 00:30:25 +08:00
|
|
|
Score *= 2;
|
2018-06-07 16:16:36 +08:00
|
|
|
break;
|
2021-01-29 21:59:16 +08:00
|
|
|
case FunctionScope:
|
2018-06-06 00:30:25 +08:00
|
|
|
Score *= 4;
|
2018-06-07 16:16:36 +08:00
|
|
|
break;
|
2018-06-06 00:30:25 +08:00
|
|
|
}
|
2019-02-01 21:07:37 +08:00
|
|
|
} else {
|
|
|
|
// For non-completion queries, the wider the scope where a symbol is
|
|
|
|
// visible, the more likely it is to be relevant.
|
2021-01-29 21:59:16 +08:00
|
|
|
switch (Scope) {
|
|
|
|
case GlobalScope:
|
2019-02-01 21:07:37 +08:00
|
|
|
break;
|
2021-01-29 21:59:16 +08:00
|
|
|
case FileScope:
|
2019-02-01 21:07:37 +08:00
|
|
|
Score *= 0.5f;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
// TODO: Handle other scopes as we start to use them for index results.
|
|
|
|
break;
|
|
|
|
}
|
2018-06-06 00:30:25 +08:00
|
|
|
}
|
|
|
|
|
2018-11-26 23:38:01 +08:00
|
|
|
if (TypeMatchesPreferred)
|
|
|
|
Score *= 5.0;
|
|
|
|
|
2018-07-23 18:56:37 +08:00
|
|
|
// Penalize non-instance members when they are accessed via a class instance.
|
|
|
|
if (!IsInstanceMember &&
|
|
|
|
(Context == CodeCompletionContext::CCC_DotMemberAccess ||
|
|
|
|
Context == CodeCompletionContext::CCC_ArrowMemberAccess)) {
|
2018-10-25 03:31:24 +08:00
|
|
|
Score *= 0.2f;
|
2018-07-23 18:56:37 +08:00
|
|
|
}
|
|
|
|
|
2018-10-24 21:45:17 +08:00
|
|
|
if (InBaseClass)
|
2018-10-25 03:31:24 +08:00
|
|
|
Score *= 0.5f;
|
2018-10-24 21:45:17 +08:00
|
|
|
|
2018-08-08 16:59:29 +08:00
|
|
|
// Penalize for FixIts.
|
|
|
|
if (NeedsFixIts)
|
2018-10-25 03:31:24 +08:00
|
|
|
Score *= 0.5f;
|
2018-08-08 16:59:29 +08:00
|
|
|
|
2021-01-19 04:01:46 +08:00
|
|
|
// Use a sigmoid-style boosting function similar to `References`, which flattens
|
|
|
|
// out nicely for large values. This avoids a sharp gradient for heavily
|
|
|
|
// referenced symbols. Use smaller gradient for ScopeRefsInFile since ideally
|
|
|
|
// MainFileRefs <= ScopeRefsInFile.
|
|
|
|
if (MainFileRefs >= 2) {
|
|
|
|
// E.g.: (2, 1.12), (9, 2.0), (48, 3.0).
|
|
|
|
float S = std::pow(MainFileRefs, -0.11);
|
|
|
|
Score *= 11.0 * (1 - S) / (1 + S) + 0.7;
|
|
|
|
}
|
|
|
|
if (ScopeRefsInFile >= 2) {
|
|
|
|
// E.g.: (2, 1.04), (14, 2.0), (109, 3.0), (400, 3.6).
|
|
|
|
float S = std::pow(ScopeRefsInFile, -0.10);
|
|
|
|
Score *= 10.0 * (1 - S) / (1 + S) + 0.7;
|
|
|
|
}
|
|
|
|
|
2018-06-04 22:50:59 +08:00
|
|
|
return Score;
|
2018-05-16 01:43:27 +08:00
|
|
|
}
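The reference-count boosts at the end of `evaluateHeuristics()` share one shape, which can be sketched as a standalone helper to check the example points in the comments. The function name and parameters below are hypothetical; only the formula is taken from the code above. Since `S = Refs^(-Exponent)`, the term `(1 - S) / (1 + S)` equals `tanh(Exponent * ln(Refs) / 2)`, which is why the boost flattens out for large counts.

```cpp
#include <cmath>

// Hypothetical helper mirroring the formula used for MainFileRefs
// (Gain = 11, Exponent = 0.11) and ScopeRefsInFile (Gain = 10,
// Exponent = 0.10). Counts below 2 leave the score unchanged.
float refsBoostSketch(unsigned Refs, float Gain, float Exponent) {
  if (Refs < 2)
    return 1.0f;
  float S = std::pow(static_cast<float>(Refs), -Exponent);
  return Gain * (1 - S) / (1 + S) + 0.7f;
}
```

For instance, `refsBoostSketch(2, 11.0f, 0.11f)` comes out near 1.12 and `refsBoostSketch(14, 10.0f, 0.10f)` near 2.0, in line with the example pairs listed in the comments.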
|
2018-06-15 16:58:12 +08:00
|
|
|
|
2019-01-07 23:45:19 +08:00
|
|
|
llvm::raw_ostream &operator<<(llvm::raw_ostream &OS,
|
|
|
|
const SymbolRelevanceSignals &S) {
|
2020-09-29 01:19:51 +08:00
|
|
|
OS << llvm::formatv("=== Symbol relevance: {0}\n", S.evaluateHeuristics());
|
2019-05-06 18:25:10 +08:00
|
|
|
OS << llvm::formatv("\tName: {0}\n", S.Name);
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tName match: {0}\n", S.NameMatch);
|
2019-05-06 18:25:10 +08:00
|
|
|
if (S.ContextWords)
|
|
|
|
OS << llvm::formatv(
|
|
|
|
"\tMatching context word: {0}\n",
|
|
|
|
wordMatching(S.Name, S.ContextWords).getValueOr("<none>"));
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tForbidden: {0}\n", S.Forbidden);
|
|
|
|
OS << llvm::formatv("\tNeedsFixIts: {0}\n", S.NeedsFixIts);
|
|
|
|
OS << llvm::formatv("\tIsInstanceMember: {0}\n", S.IsInstanceMember);
|
2020-11-05 15:47:21 +08:00
|
|
|
OS << llvm::formatv("\tInBaseClass: {0}\n", S.InBaseClass);
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tContext: {0}\n", getCompletionKindString(S.Context));
|
|
|
|
OS << llvm::formatv("\tQuery type: {0}\n", static_cast<int>(S.Query));
|
2021-01-29 21:59:16 +08:00
|
|
|
OS << llvm::formatv("\tScope: {0}\n", static_cast<int>(S.Scope));
|
2019-01-07 23:45:19 +08:00
|
|
|
|
|
|
|
OS << llvm::formatv("\tSymbol URI: {0}\n", S.SymbolURI);
|
|
|
|
OS << llvm::formatv("\tSymbol scope: {0}\n",
|
2021-01-29 21:59:16 +08:00
|
|
|
S.SymbolScope ? *S.SymbolScope : "<None>");
|
2018-10-17 19:19:02 +08:00
|
|
|
|
2020-09-23 20:37:07 +08:00
|
|
|
SymbolRelevanceSignals::DerivedSignals Derived = S.calculateDerivedSignals();
|
2018-06-15 16:58:12 +08:00
|
|
|
if (S.FileProximityMatch) {
|
2020-09-23 20:37:07 +08:00
|
|
|
float Score = fileProximityScore(Derived.FileProximityDistance);
|
|
|
|
OS << llvm::formatv("\tIndex URI proximity: {0} (distance={1})\n", Score,
|
|
|
|
Derived.FileProximityDistance);
|
2018-06-15 16:58:12 +08:00
|
|
|
}
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tSema file proximity: {0}\n", S.SemaFileProximityScore);
|
2018-10-17 19:19:02 +08:00
|
|
|
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tSema says in scope: {0}\n", S.SemaSaysInScope);
|
2018-10-17 19:19:02 +08:00
|
|
|
if (S.ScopeProximityMatch)
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv("\tIndex scope boost: {0}\n",
|
2020-09-23 20:37:07 +08:00
|
|
|
scopeProximityScore(Derived.ScopeProximityDistance));
|
2018-10-17 19:19:02 +08:00
|
|
|
|
2019-01-07 23:45:19 +08:00
|
|
|
OS << llvm::formatv(
|
2018-11-26 23:38:01 +08:00
|
|
|
"\tType matched preferred: {0} (Context type: {1}, Symbol type: {2})\n",
|
|
|
|
S.TypeMatchesPreferred, S.HadContextType, S.HadSymbolType);
|
|
|
|
|
2018-05-16 01:43:27 +08:00
|
|
|
return OS;
|
|
|
|
}
|
|
|
|
|
|
|
|
float evaluateSymbolAndRelevance(float SymbolQuality, float SymbolRelevance) {
|
|
|
|
return SymbolQuality * SymbolRelevance;
|
|
|
|
}
|
|
|
|
|
2020-10-23 16:19:53 +08:00
|
|
|
DecisionForestScores
|
|
|
|
evaluateDecisionForest(const SymbolQualitySignals &Quality,
|
|
|
|
const SymbolRelevanceSignals &Relevance, float Base) {
|
[clangd] Use Decision Forest to score code completions.
By default clangd will score a code completion item using heuristics model.
Scoring can be done by Decision Forest model by passing `--ranking_model=decision_forest` to
clangd.
Features omitted from the model:
- `NameMatch` is excluded because the final score must be multiplicative in `NameMatch` to allow rescoring by the editor.
- `NeedsFixIts` is excluded because the generating dataset that needs 'fixits' is non-trivial.
There are multiple ways (heuristics) to combine the above two features with the prediction of the DF:
- `NeedsFixIts` is used as is with a penalty of `0.5`.
Various alternatives of combining NameMatch `N` and Decision forest Prediction `P`
- N * scale(P, 0, 1): Linearly scale the output of model to range [0, 1]
- N * a^P:
- More natural: Prediction of each Decision Tree can be considered as a multiplicative boost (like NameMatch)
- Ordering is independent of the absolute value of P. Order of two items is proportional to `a^{difference in model prediction score}`. Higher `a` gives higher weightage to model output as compared to NameMatch score.
Baseline MRR = 0.619
MRR for various combinations:
N * P = 0.6346, advantage%=2.5768
N * 1.1^P = 0.6600, advantage%=6.6853
N * **1.2**^P = 0.6669, advantage%=**7.8005**
N * **1.3**^P = 0.6668, advantage%=**7.7795**
N * **1.4**^P = 0.6659, advantage%=**7.6270**
N * 1.5^P = 0.6646, advantage%=7.4200
N * 1.6^P = 0.6636, advantage%=7.2671
N * 1.7^P = 0.6629, advantage%=7.1450
N * 2^P = 0.6612, advantage%=6.8673
N * 2.5^P = 0.6598, advantage%=6.6491
N * 3^P = 0.6590, advantage%=6.5242
N * scaled[0, 1] = 0.6465, advantage%=4.5054
Differential Revision: https://reviews.llvm.org/D88281
2020-09-22 13:56:08 +08:00
|
|
|
Example E;
|
|
|
|
E.setIsDeprecated(Quality.Deprecated);
|
|
|
|
E.setIsReservedName(Quality.ReservedName);
|
|
|
|
E.setIsImplementationDetail(Quality.ImplementationDetail);
|
|
|
|
E.setNumReferences(Quality.References);
|
|
|
|
E.setSymbolCategory(Quality.Category);
|
|
|
|
|
|
|
|
SymbolRelevanceSignals::DerivedSignals Derived =
|
|
|
|
Relevance.calculateDerivedSignals();
|
2021-01-15 01:01:25 +08:00
|
|
|
int NumMatch = 0;
|
|
|
|
if (Relevance.ContextWords) {
|
|
|
|
for (const auto &Word : Relevance.ContextWords->keys()) {
|
|
|
|
if (Relevance.Name.contains_lower(Word)) {
|
|
|
|
++NumMatch;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
E.setIsNameInContext(NumMatch > 0);
|
|
|
|
E.setNumNameInContext(NumMatch);
|
|
|
|
E.setFractionNameInContext(
|
2021-01-17 22:26:40 +08:00
|
|
|
Relevance.ContextWords && !Relevance.ContextWords->empty()
|
2021-01-15 01:01:25 +08:00
|
|
|
? NumMatch * 1.0 / Relevance.ContextWords->size()
|
|
|
|
: 0);
|
2020-09-22 13:56:08 +08:00
|
|
|
E.setIsInBaseClass(Relevance.InBaseClass);
|
2021-01-15 01:01:25 +08:00
|
|
|
E.setFileProximityDistanceCost(Derived.FileProximityDistance);
|
2020-09-22 13:56:08 +08:00
|
|
|
E.setSemaFileProximityScore(Relevance.SemaFileProximityScore);
|
2021-01-15 01:01:25 +08:00
|
|
|
E.setSymbolScopeDistanceCost(Derived.ScopeProximityDistance);
|
2020-09-22 13:56:08 +08:00
|
|
|
E.setSemaSaysInScope(Relevance.SemaSaysInScope);
|
2021-01-29 21:59:16 +08:00
|
|
|
E.setScope(Relevance.Scope);
|
2020-09-22 13:56:08 +08:00
|
|
|
E.setContextKind(Relevance.Context);
|
|
|
|
E.setIsInstanceMember(Relevance.IsInstanceMember);
|
|
|
|
E.setHadContextType(Relevance.HadContextType);
|
|
|
|
E.setHadSymbolType(Relevance.HadSymbolType);
|
|
|
|
E.setTypeMatchesPreferred(Relevance.TypeMatchesPreferred);
|
2020-10-23 16:19:53 +08:00
|
|
|
|
|
|
|
DecisionForestScores Scores;
|
|
|
|
// Exponentiating DecisionForest prediction makes the score of each tree a
|
|
|
|
// multiplicative boost (like NameMatch). This allows us to weigh the
|
|
|
|
// prediction score and NameMatch appropriately.
|
|
|
|
Scores.ExcludingName = pow(Base, Evaluate(E));
|
2021-03-02 23:36:11 +08:00
|
|
|
// The following cases are not part of the generated training dataset:
|
|
|
|
// - Symbols with `NeedsFixIts`.
|
|
|
|
// - Forbidden symbols.
|
|
|
|
// - Keywords: Dataset contains only macros and decls.
|
2020-10-23 16:19:53 +08:00
|
|
|
if (Relevance.NeedsFixIts)
|
|
|
|
Scores.ExcludingName *= 0.5;
|
2021-01-15 01:01:25 +08:00
|
|
|
if (Relevance.Forbidden)
|
|
|
|
Scores.ExcludingName *= 0;
|
2021-03-02 23:36:11 +08:00
|
|
|
if (Quality.Category == SymbolQualitySignals::Keyword)
|
|
|
|
Scores.ExcludingName *= 4;
|
2021-01-15 01:01:25 +08:00
|
|
|
|
2020-10-23 16:19:53 +08:00
|
|
|
// NameMatch should be a multiplier on total score to support rescoring.
|
|
|
|
Scores.Total = Relevance.NameMatch * Scores.ExcludingName;
|
|
|
|
return Scores;
|
2020-09-22 13:56:08 +08:00
|
|
|
}
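The `N * a^P` combination that `evaluateDecisionForest()` applies can be illustrated without the generated model. This is a rough sketch under stated assumptions: `ScoresSketch` and `combineSketch` are hypothetical names, and `Prediction` stands in for the value returned by `Evaluate(E)`. The post-hoc multipliers for fix-its, forbidden symbols and keywords are left out.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical mirror of the two fields filled in above.
struct ScoresSketch {
  float ExcludingName; // Base^Prediction, independent of the name match
  float Total;         // NameMatch * ExcludingName
};

// Exponentiating the prediction turns it into a multiplicative boost, so
// NameMatch remains a plain multiplier on the total score.
ScoresSketch combineSketch(float NameMatch, float Prediction, float Base) {
  assert(Base > 0 && "the base of the exponent must be positive");
  ScoresSketch S;
  S.ExcludingName = std::pow(Base, Prediction);
  S.Total = NameMatch * S.ExcludingName;
  return S;
}
```

With equal name matches, the ratio between two totals depends only on the difference of the predictions: with `Base = 1.3`, predictions of 3 and 1 differ by a factor of `1.3^2 = 1.69`, whatever their absolute values are.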
|
|
|
|
|
2018-05-16 01:43:27 +08:00
|
|
|
// Produces an integer that sorts in the same order as F.
|
|
|
|
// That is: a < b <==> encodeFloat(a) < encodeFloat(b).
|
|
|
|
static uint32_t encodeFloat(float F) {
|
|
|
|
static_assert(std::numeric_limits<float>::is_iec559, "");
|
|
|
|
constexpr uint32_t TopBit = ~(~uint32_t{0} >> 1);
|
|
|
|
|
|
|
|
// Get the bits of the float. Endianness is the same as for integers.
|
2019-01-07 23:45:19 +08:00
|
|
|
uint32_t U = llvm::FloatToBits(F);
|
[clangd] Extract scoring/ranking logic, and shave yaks.
Summary:
Code completion scoring was embedded in CodeComplete.cpp, which is bad:
- awkward to test. The mechanisms (extracting info from index/sema) can be
unit-tested well, the policy (scoring) should be quantitatively measured.
Neither was easily possible, and debugging was hard.
The intermediate signal struct makes this easier.
- hard to reuse. This is a bug in workspaceSymbols: it just presents the
results in the index order, which is not sorted in practice, it needs to rank
them!
Also, index implementations care about scoring (both query-dependent and
independent) in order to truncate result lists appropriately.
The main yak shaved here is the build() function that had 3 variants across
unit tests is unified in TestTU.h (rather than adding a 4th variant).
Reviewers: ilya-biryukov
Subscribers: klimek, mgorny, ioeric, MaskRay, jkorous, mgrang, cfe-commits
Differential Revision: https://reviews.llvm.org/D46524
llvm-svn: 332378
2018-05-16 01:43:27 +08:00
|
|
|
// IEEE 754 floats compare like sign-magnitude integers.
|
|
|
|
if (U & TopBit) // Negative float.
|
|
|
|
return 0 - U; // Map onto the low half of integers, order reversed.
|
|
|
|
return U + TopBit; // Positive floats map onto the high half of integers.
|
|
|
|
}
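The ordering property of `encodeFloat` can be checked in isolation. The sketch below is a standalone variant that assumes a 32-bit IEEE 754 `float` and uses `std::memcpy` in place of `llvm::FloatToBits` so that it builds without LLVM; the names `encodeFloatSketch` and `checkOrderPreserved` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Standalone variant of the encoding above (hypothetical name).
uint32_t encodeFloatSketch(float F) {
  static_assert(std::numeric_limits<float>::is_iec559, "IEEE 754 required");
  static_assert(sizeof(float) == sizeof(uint32_t), "32-bit float required");
  constexpr uint32_t TopBit = ~(~uint32_t{0} >> 1); // 0x80000000
  uint32_t U;
  std::memcpy(&U, &F, sizeof(U)); // Stand-in for llvm::FloatToBits.
  if (U & TopBit)    // Negative float: sign bit is set.
    return 0 - U;    // Larger magnitude maps to a smaller integer.
  return U + TopBit; // Non-negative floats land in the upper half.
}

// Asserts that a few hand-picked ascending floats encode to ascending
// integers.
void checkOrderPreserved() {
  const float Ascending[] = {-2.0f, -0.5f, 0.0f, 0.5f, 2.0f};
  for (int I = 0; I + 1 < 5; ++I)
    assert(encodeFloatSketch(Ascending[I]) < encodeFloatSketch(Ascending[I + 1]));
}
```

Under these assumptions `-0.0f` and `0.0f` encode to the same integer, which matches how the two values compare as floats.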

std::string sortText(float Score, llvm::StringRef Name) {
  // We convert -Score to an integer, and hex-encode for readability.
  // Example: [0.5, "foo"] -> "41000000foo"
  std::string S;
  llvm::raw_string_ostream OS(S);
  llvm::write_hex(OS, encodeFloat(-Score), llvm::HexPrintStyle::Lower,
                  /*Width=*/2 * sizeof(Score));
  OS << Name;
  OS.flush();
  return S;
}
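The effect of `sortText` is that a plain lexicographic comparison of the resulting strings ranks higher scores first and breaks ties by name. A minimal standalone sketch of the same idea follows; it assumes a 32-bit IEEE 754 `float` and a 32-bit `unsigned`, and substitutes `std::memcpy` and `std::snprintf` for the LLVM helpers. The names `orderedBits` and `sortTextSketch` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Same order-preserving float-to-integer mapping as encodeFloat, with
// std::memcpy standing in for llvm::FloatToBits.
uint32_t orderedBits(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t), "32-bit float required");
  constexpr uint32_t TopBit = 0x80000000u;
  uint32_t U;
  std::memcpy(&U, &F, sizeof(U));
  return (U & TopBit) ? 0 - U : U + TopBit;
}

// Hypothetical counterpart of sortText: eight lowercase hex digits of the
// encoded negated score, followed by the name.
std::string sortTextSketch(float Score, const std::string &Name) {
  char Hex[9]; // 8 hex digits plus the terminating NUL.
  int Written = std::snprintf(Hex, sizeof(Hex), "%08x",
                              static_cast<unsigned>(orderedBits(-Score)));
  assert(Written == 8 && "expected exactly eight hex digits");
  (void)Written; // Silence unused-variable warnings when NDEBUG is set.
  return std::string(Hex) + Name;
}
```

For the example in the source comment, `sortTextSketch(0.5f, "foo")` yields `"41000000foo"`: the bits of `-0.5f` are `0xbf000000`, and `0 - 0xbf000000` wraps to `0x41000000`.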

llvm::raw_ostream &operator<<(llvm::raw_ostream &OS,
                              const SignatureQualitySignals &S) {
  OS << llvm::formatv("=== Signature Quality:\n");
  OS << llvm::formatv("\tNumber of parameters: {0}\n", S.NumberOfParameters);
  OS << llvm::formatv("\tNumber of optional parameters: {0}\n",
                      S.NumberOfOptionalParameters);
  OS << llvm::formatv("\tKind: {0}\n", S.Kind);
  return OS;
}

} // namespace clangd
} // namespace clang