langchain4j/langchain4j-cassandra/pom.xml

Cassandra and Astra (dbaas) as VectorStore and ChatMemoryStore (#162)

#### Context

Apache Cassandra is a popular open-source database created back in 2008. This year, with [CEP30](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes), support for vector and similarity search was introduced. Cassandra is very fast at both reads and writes and is used as a cache by many companies, which makes it a good fit for implementing the ChatMemoryStore. Vector search is expected in Cassandra 5 at the end of the year, but some Docker images are already available.

DataStax AstraDb is a distribution of Apache Cassandra available as SaaS, with a free tier (free forever) of 80 million queries/month. [Registration](https://astra.datastax.com). The vector capability is production ready there.

#### Data Modelling

With the proper data model in Cassandra we can perform similarity search, keyword search, and metadata search.

```sql
CREATE TABLE sample_vector_table (
    row_id text PRIMARY KEY,
    attributes_blob text,
    body_blob text,
    metadata_s map<text, text>,
    vector vector<float, 1536>
);
```

#### Implementation Thoughts

- The **configuration** needed to connect to Astra and to Cassandra is not exactly the same, so 2 different classes with associated builders are provided: [Astra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/AstraDbEmbeddingConfiguration.java) and [OSS Cassandra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingConfiguration.java). A couple of fields are shared, but creating a common superclass led to the use of Lombok `@SuperBuilder`, which the Javadoc tool could not handle.
- Instead of passing a large number of arguments like other stores, I prefer to wrap them in a bean. With this trick you can add or remove attributes and make them optional or mandatory at will. If you need to add a new attribute to the configuration, you do not have to change the implementation of `XXXStore` and `XXXStoreImpl`.
- I created an [AbstractEmbeddingStore<T>](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/AbstractEmbeddingStore.java) that could very well become the superclass for any store. It forwards each call to the real concrete implementation (_delegate pattern_). Some default implementations can be provided:

```java
/**
 * Add a list of embeddings to the store.
 *
 * @param embeddings
 *      list of embeddings (hold vector)
 * @return
 *      list of ids
 */
@Override
public List<String> addAll(List<Embedding> embeddings) {
    Objects.requireNonNull(embeddings, "embeddings must not be null");
    return embeddings.stream().map(this::add).collect(Collectors.toList());
}
```

The only method to implement at the Store level is:

```java
/**
 * Initialize the concrete implementation.
 * @return create implementation class for the store
 */
protected abstract EmbeddingStore<T> loadImplementation()
        throws ClassNotFoundException, NoSuchMethodException, InstantiationException,
               IllegalAccessException, InvocationTargetException;
```

- [CassandraEmbeddingStore](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingStore.java#L30) offers 2 constructors; one of them lets you override the implementation class (extension point).

#### Tests

- Test classes are provided, including some long-form examples based on classes found in `langchain4j-examples`, but the tests are disabled.
- To start a local Cassandra, use Docker and the [docker-compose](https://github.com/clun/langchain4j/blob/main/langchain4j-cassandra/src/test/resources/docker-compose.yml) file:

```
docker compose up -d
```

- To run the tests with Astra, sign in with your GitHub account and create a token (API key) with the role `Organization Administrator` following this [procedure](https://awesome-astra.github.io/docs/pages/astra/create-token/#c-procedure).

<img width="926" alt="Screenshot 2023-09-06 at 18 14 12" src="https://github.com/langchain4j/langchain4j/assets/726536/dfd2d9e5-09c9-4504-bfaa-31cfd87704a1">

- Pick the full value of the `token` from the JSON.

<img width="713" alt="Screenshot 2023-09-06 at 18 15 53" src="https://github.com/langchain4j/langchain4j/assets/726536/1be56234-dd98-4f59-af71-03df42ed6997">

- Create the environment variable `ASTRA_DB_APPLICATION_TOKEN`:

```console
export ASTRA_DB_APPLICATION_TOKEN=AstraCS:....<your_token>
```
2023-09-27 21:50:04 +08:00
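The delegate pattern described in the commit message above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the module's actual code: `SimpleStore`, `AbstractDelegatingStore`, `InMemoryStore` and `FloatVectorStore` are hypothetical names introduced here, and the real `EmbeddingStore` interface has more methods. Only the idea of a single `loadImplementation()` hook plus an `addAll` default built on `add` is taken from the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class DelegateStoreSketch {

    /** Hypothetical, reduced stand-in for the store contract (not the real LangChain4j interface). */
    interface SimpleStore<T> {
        String add(T item);
        List<String> addAll(List<T> items);
    }

    /** Base class: forwards calls to a concrete implementation that is loaded on first use. */
    abstract static class AbstractDelegatingStore<T> implements SimpleStore<T> {
        private SimpleStore<T> delegate;

        /** The only method a concrete store has to provide. */
        protected abstract SimpleStore<T> loadImplementation();

        private SimpleStore<T> delegate() {
            if (delegate == null) {
                delegate = Objects.requireNonNull(loadImplementation(), "implementation must not be null");
            }
            return delegate;
        }

        @Override
        public String add(T item) {
            return delegate().add(item);
        }

        /** Default bulk insert expressed in terms of add(), as in the commit message. */
        @Override
        public List<String> addAll(List<T> items) {
            Objects.requireNonNull(items, "items must not be null");
            return items.stream().map(this::add).collect(Collectors.toList());
        }
    }

    /** In-memory implementation, used only for this illustration. */
    static class InMemoryStore<T> implements SimpleStore<T> {
        final List<T> rows = new ArrayList<>();

        @Override
        public String add(T item) {
            rows.add(item);
            return "id-" + rows.size();
        }

        @Override
        public List<String> addAll(List<T> items) {
            return items.stream().map(this::add).collect(Collectors.toList());
        }
    }

    /** Concrete store: its only job is to choose the implementation. */
    static class FloatVectorStore extends AbstractDelegatingStore<float[]> {
        @Override
        protected SimpleStore<float[]> loadImplementation() {
            return new InMemoryStore<>();
        }
    }

    public static void main(String[] args) {
        SimpleStore<float[]> store = new FloatVectorStore();
        List<String> ids = store.addAll(List.of(new float[] {0.1f, 0.2f}, new float[] {0.3f, 0.4f}));
        System.out.println(ids);
    }
}
```

With this shape, swapping the backing implementation (for example Astra versus OSS Cassandra) would only require a different `loadImplementation()`, while callers keep using the same store type.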
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>langchain4j-cassandra</artifactId>
    <name>LangChain4j :: Integration :: Cassandra and AstraDb</name>
    <description>Some dependencies have a "Public Domain" license</description>
    <parent>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-parent</artifactId>
        <version>0.35.0-SNAPSHOT</version>
        <relativePath>../langchain4j-parent/pom.xml</relativePath>
    </parent>
    <properties>
Rework support of AstraDB and Cassandra (#548)

In the DataStax Astra DB SaaS solution, a new way to integrate with vector databases has been introduced: using an HTTP API instead of the Cassandra cluster. It is called the DataAPI and follows MongoDB principles, with collections. The pull request includes the following:

### Update on previous implementations

- Previous implementations of embedding stores have been grouped into a single `CassandraEmbeddingStore`. It can be instantiated for Astra or OSS Cassandra through 2 different builders, but everything else is the same.
- Previous implementations of chat memory stores have been grouped into a single `CassandraChatMemoryStore`. It can be instantiated for Astra or OSS Cassandra through 2 different builders, but everything else is the same.
- Integration tests for OSS Cassandra now use Testcontainers (as the Cassandra 5-alpha2 image is out).
- Usage:

```java
// Using with Astra (Cassandra as a service in the cloud)
CassandraEmbeddingStore.builderAstra()
    .token(token)
    .databaseId(dbId)
    .databaseRegion(TEST_REGION)
    .keyspace(KEYSPACE)
    .table(TEST_INDEX)
    .dimension(11)
    .metric(CassandraSimilarityMetric.COSINE)
    .build();

// Using OSS Cassandra
CassandraEmbeddingStore.builder()
    .contactPoints(Arrays.asList(contactPoint.getHostName()))
    .port(contactPoint.getPort())
    .localDataCenter(DATACENTER)
    .keyspace(KEYSPACE)
    .table(TEST_INDEX)
    .dimension(11)
    .metric(CassandraSimilarityMetric.COSINE)
    .build();
```

- Adding JDK 11 in the pom:

```
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
```

- Introducing `insertMany()`, used for all bulk loading.
- Extending the variables of `EmbeddingStoreIT`.
- Using `MessageWindowChatMemory` for the tests.
2024-02-08 22:54:53 +08:00
        <astra-db-client.version>1.2.4</astra-db-client.version>
        <jackson.version>2.16.1</jackson.version>
        <maven.compiler.release>11</maven.compiler.release>
    </properties>
    <dependencies>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-core</artifactId>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${jackson.version}</version>
        </dependency>
        <dependency>
            <groupId>com.datastax.astra</groupId>
            <artifactId>astra-db-client</artifactId>
            <version>${astra-db-client.version}</version>
        </dependency>
Cassandra and Astra (dbaas) as VectorStore and ChatMemoryStore (#162) #### Context Apache Cassandra is a popular open-source database created back in 2008. This year with [CEP30](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes) support for vector and similarity searches have been introduced. Cassandra is very fast in read and write and is used as a cache by many companies, it as an opportunity to implement the ChatMemoryStore. This feature is expected for Cassandra 5 at the end of the year but some docker images are already available. DataStax AstraDb is a distribution of Apache Cassandra available as Saas providing a free tier (free forever) of 80 millions queries/month. [Registration](https://astra.datastax.com). The vector capability is there production ready. #### Data Modelling With the proper data model in Cassandra we can perform both similarity search, keyword search, metadata search. ```sql CREATE TABLE sample_vector_table ( row_id text PRIMARY KEY, attributes_blob text, body_blob text, metadata_s map<text, text>, vector vector<float, 1536> ); ``` #### Implementation Throughts - The **configuration** to connect to Astra and Cassandra are not exactly the same so 2 different classes with associated builder are provided: [Astra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/AstraDbEmbeddingConfiguration.java) and [OSS Cassandra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingConfiguration.java). A couple of fields are mutualized but creating a superclass to inherit from lead to the use of Lombok `@SuperBuilder` and the Javadoc was not able to found out what to do. - Instead of passing a large number of arguments like other stores I prefer to wrap them as a bean. 
With this trick you can add or remove attributes, make then optional or mandatory at will. If you need to add a new attribute in the configuration you do not have to change the implementation of `XXXStore` and `XXXStoreImpl` - I create an [AstractEmbeddedStore<T>](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/AbstractEmbeddingStore.java) that could very well become the super class for any store. It handles the different call of the real concrete implementation. (_delegate pattern_). Some default implementation can be implemented ```java /** * Add a list of embeddings to the store. * * @param embeddings * list of embeddings (hold vector) * @return * list of ids */ @Override public List<String> addAll(List<Embedding> embeddings) { Objects.requireNonNull(embeddings, "embeddings must not be null"); return embeddings.stream().map(this::add).collect(Collectors.toList()); } ``` The only method to implement at the Store level is: ```java /** * Initialize the concrete implementation. * @return create implementation class for the store */ protected abstract EmbeddingStore<T> loadImplementation() throws ClassNotFoundException, NoSuchMethodException, InstantiationException, IllegalAccessException, InvocationTargetException; ``` - [CassandraEmbeddedStore](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingStore.java#L30) proposes 2 constructors, one could override the implementation class if they want (extension point) #### Tests - Test classes are provided including some long form examples based on classed found in `langchain4j-examples` but test are disabled. 
- To start a local cassandra use docker and the [docker-compose](https://github.com/clun/langchain4j/blob/main/langchain4j-cassandra/src/test/resources/docker-compose.yml) ``` docker compose up -d ``` - To run Test with Astra signin with your github account, create a token (api Key) with role `Organization Administrator` following this [procedure](https://awesome-astra.github.io/docs/pages/astra/create-token/#c-procedure) <img width="926" alt="Screenshot 2023-09-06 at 18 14 12" src="https://github.com/langchain4j/langchain4j/assets/726536/dfd2d9e5-09c9-4504-bfaa-31cfd87704a1"> - Pick the full value of the `token` from the json <img width="713" alt="Screenshot 2023-09-06 at 18 15 53" src="https://github.com/langchain4j/langchain4j/assets/726536/1be56234-dd98-4f59-af71-03df42ed6997"> - Create the environment variable `ASTRA_DB_APPLICATION_TOKEN` ```console export ASTRA_DB_APPLICATION_TOKEN=AstraCS:....<your_token> ```
2023-09-27 21:50:04 +08:00
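The delegate pattern described in the commit message above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical and simplified for illustration: `SimpleStore`, `AbstractSimpleStore`, `InMemoryStoreImpl`, and `InMemoryStore` are invented names, vectors are plain `float[]` arrays, and the real `AbstractEmbeddingStore<T>` may differ in its signatures and loading logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.Collectors;

public class DelegateStoreSketch {

    /** Hypothetical, stripped-down stand-in for an embedding store contract. */
    public interface SimpleStore {
        String add(float[] vector);
        List<String> addAll(List<float[]> vectors);
        int size();
    }

    /** Base class: forwards every call to a lazily loaded concrete implementation. */
    public abstract static class AbstractSimpleStore implements SimpleStore {
        private SimpleStore delegate;

        /** The only method a concrete store has to provide. */
        protected abstract SimpleStore loadImplementation();

        private synchronized SimpleStore delegate() {
            if (delegate == null) {
                delegate = Objects.requireNonNull(loadImplementation(), "implementation must not be null");
            }
            return delegate;
        }

        @Override
        public String add(float[] vector) {
            Objects.requireNonNull(vector, "vector must not be null");
            return delegate().add(vector);
        }

        /** Default implementation shared by all stores: add the items one by one. */
        @Override
        public List<String> addAll(List<float[]> vectors) {
            Objects.requireNonNull(vectors, "vectors must not be null");
            return vectors.stream().map(this::add).collect(Collectors.toList());
        }

        @Override
        public int size() {
            return delegate().size();
        }
    }

    /** Hypothetical concrete implementation that keeps rows in memory. */
    public static class InMemoryStoreImpl implements SimpleStore {
        private final List<float[]> rows = new CopyOnWriteArrayList<>();

        @Override
        public String add(float[] vector) {
            rows.add(vector.clone());
            return UUID.randomUUID().toString();
        }

        @Override
        public List<String> addAll(List<float[]> vectors) {
            return vectors.stream().map(this::add).collect(Collectors.toList());
        }

        @Override
        public int size() {
            return rows.size();
        }
    }

    /** Public-facing store: only chooses which implementation to load. */
    public static class InMemoryStore extends AbstractSimpleStore {
        @Override
        protected SimpleStore loadImplementation() {
            return new InMemoryStoreImpl();
        }
    }

    public static void main(String[] args) {
        SimpleStore store = new InMemoryStore();
        List<String> ids = store.addAll(Arrays.asList(new float[]{0.1f, 0.2f}, new float[]{0.3f, 0.4f}));
        System.out.println(ids.size() + " ids generated, " + store.size() + " rows stored");
    }
}
```

In this sketch, swapping the backend only requires a new subclass that overrides `loadImplementation()`; the shared `addAll` default stays in the base class.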
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
Rework support of AstraDB and Cassandra (#548)

In the DataStax Astra DB SaaS solution, a new way to integrate with vector databases has been introduced: using an HTTP API instead of the Cassandra cluster. It is called the DataAPI and uses the MongoDB principles with collections. The pull request includes the following:

### Update on previous implementations

- Previous implementations of embedding stores have been grouped in a single `CassandraEmbeddingStore`. It can be instantiated for Astra or OSS Cassandra based on 2 different constructor builders, but everything else is the same.
- Previous implementations of chat memory stores have been grouped in a single `CassandraChatMemoryStore`. It can be instantiated for Astra or OSS Cassandra based on 2 different constructor builders, but everything else is the same.
- Integration tests for OSS Cassandra now use Testcontainers (as the Cassandra 5-alpha2 image is out).
- Usage:

```java
// Using with Astra (Cassandra as a service in the cloud)
CassandraEmbeddingStore.builderAstra()
    .token(token)
    .databaseId(dbId)
    .databaseRegion(TEST_REGION)
    .keyspace(KEYSPACE)
    .table(TEST_INDEX)
    .dimension(11)
    .metric(CassandraSimilarityMetric.COSINE)
    .build();

// Using OSS Cassandra
CassandraEmbeddingStore.builder()
    .contactPoints(Arrays.asList(contactPoint.getHostName()))
    .port(contactPoint.getPort())
    .localDataCenter(DATACENTER)
    .keyspace(KEYSPACE)
    .table(TEST_INDEX)
    .dimension(11)
    .metric(CassandraSimilarityMetric.COSINE)
    .build();
```

- Adding JDK 11 in the pom:

```
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
```

- Introducing `insertMany()`, applied to all bulk loading
- Extending the variables of `EmbeddingStoreIT`
- Using `MessageWindowChatMemory` for the tests.
2024-02-08 22:54:53 +08:00
<!-- TESTS -->
<!-- Visibility for EmbeddingStoreIT -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-core</artifactId>
<classifier>tests</classifier>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<!-- Same embeddings model to keep the 1% -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<scope>test</scope>
</dependency>
<!-- junit-jupiter-params should be declared explicitly
to run parameterized tests inherited from EmbeddingStore*IT -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-params</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<version>${assertj.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>cassandra</artifactId>
<version>${testcontainers.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>junit-jupiter</artifactId>
<version>${testcontainers.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.awaitility</groupId>
<artifactId>awaitility</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.honton.chas</groupId>
<artifactId>license-maven-plugin</artifactId>
<configuration>
<!-- org.json:json has a "Public Domain" license -->
<skipCompliance>true</skipCompliance>
</configuration>
</plugin>
</plugins>
</build>
</project>