Docu: RAG (#846)

LangChain4j 2024-03-27 17:44:48 +01:00 committed by GitHub
parent 7aa75324ed
commit 0bdade2865
3 changed files with 309 additions and 38 deletions


@@ -1,13 +1,12 @@
---
-sidebar_position: 18
+sidebar_position: 7
---
# Tools (Function Calling)
-Similar to RAG, in that LLMs by default do not have access to _external knowledge_,
-LLMs by default also lack the ability to perform _external actions_.
+Some LLMs, in addition to generating text, can also trigger actions.
-To solve this problem, there is a concept known as "tools," or "function calling."
+There is a concept known as "tools," or "function calling".
It allows the LLM to call, when necessary, one or more available tools, usually defined by the developer.
A tool can be anything: a web search, a call to an external API, or the execution of a specific piece of code, etc.
LLMs cannot actually call the tool themselves; instead, they express the intent


@@ -0,0 +1,306 @@
---
sidebar_position: 8
---
# RAG (Retrieval-Augmented Generation)
[Great tutorial on RAG](https://www.sivalabs.in/langchain4j-retrieval-augmented-generation-tutorial/)
by [Siva](https://www.sivalabs.in/).
An LLM's knowledge is limited to the data it has been trained on.
To enable an LLM to "know" your private data, such as internal company documentation, you can:
- Use RAG, which we will cover here
- Fine-tune the LLM on your data
- [Combine RAG and fine-tuning](https://gorilla.cs.berkeley.edu/blogs/9_raft.html)
Simply put, RAG is a way to find relevant pieces of information in your private knowledge base and inject them
into the prompt before it is sent to the LLM.
This way, the LLM receives (hopefully) relevant information and can use it when replying.
## Easy RAG
LangChain4j has an "Easy RAG" feature that makes it as easy as possible to get started with RAG.
You don't need to learn about embeddings, choose a vector store, find the right embedding model,
figure out how to parse and split documents, etc.
Just point to your document(s), and LangChain4j will work its magic.
:::note
The quality of such "Easy RAG" will, of course, be lower than that of a tailored RAG setup.
However, this is the easiest way to start learning about RAG and/or make a proof of concept.
Later, you will be able to transition smoothly from Easy RAG to more advanced RAG,
adjusting and customizing more aspects.
:::
First, import the `langchain4j-easy-rag` dependency, which contains everything we need inside:
```xml
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>0.29.0</version>
</dependency>
```
Then, let's load our documents:
```java
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j/documents");
```
This will load all documents from the specified directory.
<details>
<summary>What is happening under the hood?</summary>
The Apache Tika library, which supports a wide variety of document types,
is used to detect document types and parse them.
Since we did not explicitly specify which `DocumentParser` to use,
the `FileSystemDocumentLoader` will load an `ApacheTikaDocumentParser`,
provided by the `langchain4j-easy-rag` dependency through SPI.
</details>
<details>
<summary>How to customize loading documents?</summary>
If you want to load documents from all subdirectories, you can use the `loadDocumentsRecursively` method:
```java
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively("/home/langchain4j/documents");
```
Additionally, you can filter documents by using a glob or regex:
```java
PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j/documentation", pathMatcher);
```
:::note
When using the `loadDocumentsRecursively` method, you might want to use a double asterisk (instead of a single one)
in the glob pattern: `glob:**.pdf`.
:::
</details>
Now, once we have loaded our documents into memory,
we need to store them in a specialized embedding (vector) store to enable semantic search.
We can use any of our 15+ [integrations](/category/embedding-stores) with various embedding stores,
but for simplicity, we will use an in-memory one:
```java
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
```
Now, let's ingest our documents into the store:
```java
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
```
<details>
<summary>What is happening under the hood?</summary>
1. Through SPI, the `EmbeddingStoreIngestor` loads a `DocumentSplitter` from the `langchain4j-easy-rag` dependency.
Each `Document` is split into smaller pieces (`TextSegment`s), each consisting of no more than 300 tokens, with a 30-token overlap.
2. Through SPI, the `EmbeddingStoreIngestor` loads an `EmbeddingModel` from the `langchain4j-easy-rag` dependency.
For Easy RAG, a [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) embedding model is used.
Each `TextSegment` is converted into an `Embedding` using the `EmbeddingModel`.
:::note
This embedding model has achieved an impressive score
on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), and its quantized version
occupies only 24 megabytes of space.
Therefore, we can easily load it into memory and run it in the same process using [ONNX Runtime](https://onnxruntime.ai/).
Yes, that's right, you can convert text into embeddings entirely offline, without any external services,
in the same JVM process.
LangChain4j offers 5 popular embedding models
[out-of-the-box](https://github.com/langchain4j/langchain4j-embeddings).
:::
3. All `TextSegment`-`Embedding` pairs are stored in the `EmbeddingStore`.
</details>
The last step is to create an [AI Service](/tutorials/ai-services) that will serve as our API to the LLM:
```java
interface Assistant {

    String chat(String userMessage);
}

Assistant assistant = AiServices.builder(Assistant.class)
        .chatLanguageModel(OpenAiChatModel.withApiKey(OPENAI_API_KEY))
        .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
        .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
        .build();
```
Here, we configure our AI Service to use an OpenAI LLM, remember the 10 latest messages in the conversation,
and retrieve relevant content from the `EmbeddingStore` that holds our documents.
And now we are ready to chat with it!
```java
String answer = assistant.chat("How to do RAG with LangChain4j?");
```
## RAG APIs
LangChain4j offers a broad set of APIs to make it easier for you to build your custom RAG pipelines.
### Document
LangChain4j's domain model includes a `Document` class, which represents an entire document,
such as a single PDF file.
At the moment, the `Document` can only represent textual information,
but future updates will enable it to support images and tables as well.
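For illustration, here is a minimal sketch of constructing a `Document` by hand; the text and the `source` metadata key below are made up for the example:
```java
// A minimal sketch: creating a Document directly from a String.
// The "source" metadata key is just an illustrative choice.
Document document = Document.from(
        "LangChain4j is a Java library for building LLM-powered applications.",
        Metadata.from("source", "example"));

String text = document.text();           // the full textual content
Metadata metadata = document.metadata(); // arbitrary key-value metadata
```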
### Document Loader
You can create a `Document` from a `String`, but it is simpler to use one of the `DocumentLoader`s included in the library (a short example follows the list):
- `FileSystemDocumentLoader` from the main (`langchain4j`) module
- `UrlDocumentLoader` from the main (`langchain4j`) module
- `AmazonS3DocumentLoader` from the `langchain4j-document-loader-amazon-s3` module
- `AzureBlobStorageDocumentLoader` from the `langchain4j-document-loader-azure-storage-blob` module
- `GitHubDocumentLoader` from the `langchain4j-document-loader-github` module
- `TencentCosDocumentLoader` from the `langchain4j-document-loader-tencent-cos` module
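For example, here is a sketch of loading a single web page with the `UrlDocumentLoader`; the URL is only a placeholder:
```java
// A sketch: load a single web page as a Document (the URL is a placeholder).
// TextDocumentParser keeps the raw HTML; see HtmlTextExtractor below for cleaning it up.
Document document = UrlDocumentLoader.load("https://docs.langchain4j.dev/", new TextDocumentParser());
```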
### Document Parser
`Document`s can represent files in various formats, such as PDF, DOC, TXT, etc.
To parse each of these formats, there's a `DocumentParser` interface with several implementations included in the library:
- `TextDocumentParser` from the main (`langchain4j`) module, which can parse files in plain text format (e.g. TXT, HTML, MD, etc.)
- `ApachePdfBoxDocumentParser` from the `langchain4j-document-parser-apache-pdfbox` module, which can parse PDF files
- `ApachePoiDocumentParser` from the `langchain4j-document-parser-apache-poi` module, which can parse MS Office file formats
(e.g. DOC, DOCX, PPT, PPTX, XLS, XLSX, etc.)
- `ApacheTikaDocumentParser` from the `langchain4j-document-parser-apache-tika` module,
which can automatically detect and parse almost all existing file formats
Here is an example of how to load one or multiple `Document`s from the file system:
```java
// Load a single document
Document document = FileSystemDocumentLoader.loadDocument("/home/langchain4j/file.txt", new TextDocumentParser());

// Load all documents from a directory
List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/langchain4j", new TextDocumentParser());

// Load all *.txt documents from a directory
PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:*.txt");
List<Document> txtDocuments = FileSystemDocumentLoader.loadDocuments("/home/langchain4j", pathMatcher, new TextDocumentParser());

// Load all documents from a directory and its subdirectories
List<Document> allDocuments = FileSystemDocumentLoader.loadDocumentsRecursively("/home/langchain4j", new TextDocumentParser());
```
You can also load documents without explicitly specifying a `DocumentParser`.
In this case, a default `DocumentParser` will be used.
The default one is loaded through SPI (e.g. from `langchain4j-document-parser-apache-tika` or `langchain4j-easy-rag`).
If no `DocumentParser`s are found through SPI, a `TextDocumentParser` is used as a fallback.
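For instance, with the `langchain4j-easy-rag` (or Apache Tika parser) dependency on the classpath, a sketch of loading a file without specifying a parser looks like this:
```java
// No DocumentParser specified: the default one is resolved through SPI,
// or TextDocumentParser is used as a fallback.
Document document = FileSystemDocumentLoader.loadDocument("/home/langchain4j/file.txt");
```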
### Document Transformer
More details are coming soon.
In the meantime, see `HtmlTextExtractor` in the main (`langchain4j`) module.
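As a quick sketch, assuming `htmlDocument` is a `Document` containing raw HTML (e.g. one loaded with `UrlDocumentLoader`):
```java
// A sketch: strip HTML markup and keep only the readable text.
DocumentTransformer transformer = new HtmlTextExtractor();
Document cleanedDocument = transformer.transform(htmlDocument);
```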
### Text Segment
Once your `Document`s are loaded, it is time to split (chunk) them into smaller segments (pieces).
LangChain4j's domain model includes a `TextSegment` class that represents a segment of a `Document`.
As the name suggests, `TextSegment` can represent only textual information.
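Like a `Document`, a `TextSegment` can also be created by hand, which is occasionally useful for tests or manual ingestion; a sketch, with a made-up metadata key:
```java
// A minimal sketch: a TextSegment holds text plus metadata, just like a Document.
TextSegment segment = TextSegment.from(
        "LangChain4j supports RAG out of the box.",
        Metadata.from("index", "0"));
```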
<details>
<summary>To split or not to split?</summary>
There are several reasons why you might want to include only a few relevant segments
instead of the entire knowledge base in the prompt:
- LLMs have a limited context window, so the entire knowledge base might not fit
- The more information you provide in the prompt, the longer it takes for the LLM to process it and respond
- The more information you provide in the prompt, the more you pay
- Irrelevant information in the prompt might confuse or distract the LLM and increase the chance of hallucinations
RAG addresses these concerns by splitting your knowledge base into smaller, more digestible segments.
How big should those segments be? That is a good question. As always, it depends.
There are currently two widely used approaches:
1. Each document (e.g., a PDF file, a web page, etc.) is atomic and indivisible.
During retrieval in the RAG pipeline, the N most relevant documents are retrieved and injected into the prompt.
You will most probably need to use a long-context LLM for this since documents can be quite long.
This approach is suitable if processing complete documents is important,
such as when you can't afford to miss some details.
- Pros: No context is lost.
- Cons:
- More tokens are consumed.
- Sometimes, documents can contain multiple sections/topics, and not all of them are relevant to the query.
- Vector search quality suffers because complete documents of various sizes are compressed into a single vector.
2. Documents are split into smaller segments, such as chapters, paragraphs, or sometimes even sentences.
During retrieval in the RAG pipeline, the N most relevant segments are retrieved and injected into the prompt.
The challenge lies in ensuring each segment provides sufficient context/information for the LLM to understand it.
Missing context can lead to the LLM misinterpreting the given segment and hallucinating.
A common strategy is to split documents into segments with overlap, but this doesn't completely solve the problem.
Several advanced techniques can help here, for example, "sentence window retrieval", "auto-merging retrieval",
and "parent document retrieval".
We won't go into details here, but essentially, these methods help to fetch more context around the retrieved segments,
providing the LLM with additional information before and after the retrieved segment.
- Pros:
- Better quality of vector search.
- Reduced token consumption.
- Cons: Some context may still be lost.
</details>
### Document Splitter
LangChain4j has a `DocumentSplitter` interface with several out-of-the-box implementations:
- `DocumentByParagraphSplitter`
- `DocumentByLineSplitter`
- `DocumentBySentenceSplitter`
- `DocumentByWordSplitter`
- `DocumentByCharacterSplitter`
- `DocumentByRegexSplitter`
- Recursive: `DocumentSplitters.recursive(...)`
They all work as follows:
1. You instantiate a `DocumentSplitter`, specifying the desired size of `TextSegment`s and,
optionally, an overlap in characters or tokens.
2. You use the `split(Document)` or `splitAll(List<Document>)` methods of the `DocumentSplitter`.
3. The `DocumentSplitter` splits the given `Document`s into smaller units,
the nature of which varies with the splitter. For instance, `DocumentByParagraphSplitter` divides
a document into paragraphs (defined by two or more consecutive newline characters (`\n`)),
while `DocumentBySentenceSplitter` uses the OpenNLP library's sentence detector to split
a document into sentences, and so on.
4. The `DocumentSplitter` then processes these smaller units (paragraphs, sentences, words, etc.),
recombining them into a `TextSegment` and trying to include as many units as possible
into a single `TextSegment`, without exceeding the limit set in step 1.
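Putting these steps together, here is a sketch of splitting documents into segments of at most 300 tokens with a 30-token overlap; the `OpenAiTokenizer` from the `langchain4j-open-ai` module is just one possible `Tokenizer`:
```java
// A sketch: token-based recursive splitting (300-token segments, 30-token overlap).
// GPT_3_5_TURBO is a constant from OpenAiModelName (static import).
DocumentSplitter splitter = DocumentSplitters.recursive(300, 30, new OpenAiTokenizer(GPT_3_5_TURBO));

List<TextSegment> segments = splitter.splitAll(documents);
```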
### Text Segment Transformer
More details are coming soon.
### Embedding
More details are coming soon.
### Embedding Model
More details are coming soon.
Currently supported embedding models can be found [here](/category/embedding-models).
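As a small sketch in the meantime, using `AllMiniLmL6V2EmbeddingModel`, one of the in-process models from the `langchain4j-embeddings-all-minilm-l6-v2` module:
```java
// A sketch: embed a piece of text entirely in-process, without external services.
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();

Embedding embedding = embeddingModel.embed("What is LangChain4j?").content();
float[] vector = embedding.vector(); // the numeric representation stored in an EmbeddingStore
```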
### Embedding Store
More details are coming soon.
Currently supported embedding stores can be found [here](/category/embedding-stores).
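In the meantime, here is a sketch of the basic store-and-search cycle with the in-memory store (API as of 0.29.0; `embeddingModel` is assumed from the previous section):
```java
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

// Store a segment together with its embedding
TextSegment segment = TextSegment.from("LangChain4j is a Java library for building LLM-powered applications.");
embeddingStore.add(embeddingModel.embed(segment).content(), segment);

// Search for the segments most relevant to a query
Embedding queryEmbedding = embeddingModel.embed("What is LangChain4j?").content();
List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(queryEmbedding, 1);
System.out.println(matches.get(0).embedded().text());
```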
### Embedding Store Ingestor
More details are coming soon.
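In the meantime, here is a sketch of the explicit configuration that the Easy RAG one-liner `EmbeddingStoreIngestor.ingest(documents, embeddingStore)` hides; the splitter and model choices below are only examples:
```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(300, 30)) // sizes in characters when no Tokenizer is given
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .build();

ingestor.ingest(documents);
```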
## RAG Stages
### Ingestion
[![](/img/rag-ingestion.png)](/tutorials/rag)
### Retrieval
[![](/img/rag-retrieval.png)](/tutorials/rag)
## Advanced RAG
More details are coming soon.
In the meantime, please read [this](https://github.com/langchain4j/langchain4j/pull/538).
[![](/img/advanced-rag.png)](/tutorials/rag)
## Examples
- [Easy RAG](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_1_easy/Easy_RAG_Example.java)
- [Naive RAG](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_2_naive/Naive_RAG_Example.java)
- [Advanced RAG with Query Compression](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_01_Advanced_RAG_with_Query_Compression_Example.java)
- [Advanced RAG with Query Routing](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_02_Advanced_RAG_with_Query_Routing_Example.java)
- [Advanced RAG with Re-Ranking](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_03_Advanced_RAG_with_ReRanking_Example.java)
- [Advanced RAG with Including Metadata](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_04_Advanced_RAG_with_Metadata_Example.java)
- [RAG + Tools](https://github.com/langchain4j/langchain4j-examples/blob/main/spring-boot-example/src/test/java/dev/example/CustomerSupportApplicationTest.java)
- [Loading Documents](https://github.com/langchain4j/langchain4j-examples/blob/main/other-examples/src/main/java/DocumentLoaderExamples.java)
- [ConversationalRetrievalChain](https://github.com/langchain4j/langchain4j-examples/blob/main/other-examples/src/main/java/ChatWithDocumentsExamples.java)


@@ -1,34 +0,0 @@
---
sidebar_position: 15
---
# RAG (Retrieval-Augmented Generation)
[Great tutorial on RAG](https://www.sivalabs.in/langchain4j-retrieval-augmented-generation-tutorial/)
by [Siva](https://www.sivalabs.in/).
## Ingestion
[![](/img/rag-ingestion.png)](/tutorials/rag)
## Retrieval
[![](/img/rag-retrieval.png)](/tutorials/rag)
## Advanced Retrieval
More info on advanced RAG can be found [here](https://github.com/langchain4j/langchain4j/pull/538).
[![](/img/advanced-rag.png)](/tutorials/rag)
## Examples
- [Easy RAG](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_1_easy/Easy_RAG_Example.java)
- [Naive RAG](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_2_naive/Naive_RAG_Example.java)
- [Advanced RAG with Query Compression](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_01_Advanced_RAG_with_Query_Compression_Example.java)
- [Advanced RAG with Query Routing](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_02_Advanced_RAG_with_Query_Routing_Example.java)
- [Advanced RAG with Re-Ranking](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_03_Advanced_RAG_with_ReRanking_Example.java)
- [Advanced RAG with Including Metadata](https://github.com/langchain4j/langchain4j-examples/blob/main/rag-examples/src/main/java/_3_advanced/_04_Advanced_RAG_with_Metadata_Example.java)
- [RAG + Tools](https://github.com/langchain4j/langchain4j-examples/blob/main/spring-boot-example/src/test/java/dev/example/CustomerSupportApplicationTest.java)
- [Loading Documents](https://github.com/langchain4j/langchain4j-examples/blob/main/other-examples/src/main/java/DocumentLoaderExamples.java)
- [ConversationalRetrievalChain](https://github.com/langchain4j/langchain4j-examples/blob/main/other-examples/src/main/java/ChatWithDocumentsExamples.java)