Commit Graph

10 Commits

Author SHA1 Message Date
deep-learning-dynamo 16f60dbef9 reducing duplication of *EmbeddingStoreIT 2023-11-18 16:23:29 +01:00
deep-learning-dynamo 9897d65a54 cleanup 2023-11-18 16:23:26 +01:00
deep-learning-dynamo 21dfc8b317 released 0.24.0 2023-11-12 18:58:31 +01:00
deep-learning-dynamo f8871900be *EmbeddingStoreTest -> *EmbeddingStoreIT 2023-11-10 13:48:32 +01:00
Simon Verhoeven d0399d023b
fix: remove unused imports (#209) 2023-10-09 09:48:00 +02:00
Cedrick Lunven 2a3b5406de
Astra and Cassandra Store fixes (#201)
- Change the order of the returned list for chat
- Improve `AstraDbEmbeddingStoreTest.java`, which could fail when the
environment variable was not set; the test should not fail in that case
2023-10-09 09:22:32 +02:00
deep-learning-dynamo 1c7eb6edd1 skipping compliance check for langchain4j-cassandra due to "Public Domain" license of org.json:json 2023-10-09 09:13:00 +02:00
deep-learning-dynamo 315eab8641 released 0.23.0 2023-09-29 14:27:51 +02:00
deep-learning-dynamo ef8f04015b Removed dynamic loading from AstraDB/Cassandra 2023-09-27 17:11:01 +02:00
Cedrick Lunven c632322493
Cassandra and Astra (dbaas) as VectorStore and ChatMemoryStore (#162)
#### Context

Apache Cassandra is a popular open-source database created back in 2008.
This year, with
[CEP-30](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes),
support for vector and similarity search has been introduced.
Cassandra is very fast at both reads and writes and is used as a cache by many
companies, which makes it a good fit for implementing the ChatMemoryStore.
Vector search is expected in Cassandra 5 at the end of the year, but some
Docker images are already available.

DataStax AstraDb is a distribution of Apache Cassandra available as SaaS,
with a free tier (free forever) of 80 million queries/month.
[Registration](https://astra.datastax.com). The vector capability is
already production-ready there.

#### Data Modelling

With the proper data model in Cassandra we can perform similarity
search, keyword search, and metadata search.

```sql
CREATE TABLE sample_vector_table (
    row_id text PRIMARY KEY,
    attributes_blob text,
    body_blob text,
    metadata_s map<text, text>,
    vector vector<float, 1536>
);
```
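
As a rough illustration of how such a table could be queried, here is a small Java sketch that assembles a CQL similarity query string. The `ORDER BY ... ANN OF` clause follows the vector search syntax introduced with CEP-30 and assumes a storage-attached index exists on the `vector` column. The class `AnnQuerySketch` and its methods are hypothetical names used for illustration and are not part of this PR.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of langchain4j: builds a CQL string for illustration only.
final class AnnQuerySketch {

    private AnnQuerySketch() {}

    /** Renders a float array as a CQL vector literal, e.g. [0.5, 0.25]. */
    static String vectorLiteral(float[] vector) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (float value : vector) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    /**
     * Builds an approximate-nearest-neighbour query against the table above.
     * Assumption: a storage-attached index exists on the vector column.
     */
    static String annQuery(String table, float[] queryVector, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return "SELECT row_id, body_blob, metadata_s FROM " + table
                + " ORDER BY vector ANN OF " + vectorLiteral(queryVector)
                + " LIMIT " + limit;
    }
}
```

In real code the vector would more likely be bound as a statement parameter through the driver than concatenated into the query text, and the table name should never come from untrusted input.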

#### Implementation Thoughts

- The **configuration** needed to connect to Astra and to Cassandra is not
exactly the same, so two different classes, each with an associated builder,
are provided:
[Astra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/AstraDbEmbeddingConfiguration.java)
and [OSS
Cassandra](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingConfiguration.java).
A couple of fields are shared, but creating a superclass to inherit
from led to the use of Lombok `@SuperBuilder`, and Javadoc was not
able to figure out what to do with it.

- Instead of passing a large number of arguments like other stores, I
prefer to wrap them in a bean. With this trick you can add or remove
attributes and make them optional or mandatory at will. If you need to add
a new attribute to the configuration, you do not have to change the
implementation of `XXXStore` and `XXXStoreImpl`.

- I created an
[AbstractEmbeddingStore<T>](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/AbstractEmbeddingStore.java)
that could very well become the superclass for any store. It forwards
each call to the real concrete implementation (_delegate
pattern_). Some default implementations can be provided there:

```java
/**
 * Add a list of embeddings to the store.
 *
 * @param embeddings
 *      list of embeddings (hold vector)
 * @return
 *      list of ids
 */
@Override
public List<String> addAll(List<Embedding> embeddings) {
   Objects.requireNonNull(embeddings, "embeddings must not be null");
   return embeddings.stream().map(this::add).collect(Collectors.toList());
}
```

The only method to implement at the Store level is:

```java
/**
 * Initialize the concrete implementation.
 *
 * @return the implementation class created for the store
 */
protected abstract EmbeddingStore<T> loadImplementation()
throws ClassNotFoundException, NoSuchMethodException, InstantiationException,
       IllegalAccessException, InvocationTargetException;
```
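
The delegate arrangement described above can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the code from the PR: the types `SketchStore`, `AbstractSketchStore`, `InMemorySketchStore` and `DelegatingSketchStore` are hypothetical, vectors are plain `float[]` values, and the delegate is created lazily on first use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for an embedding store interface.
interface SketchStore {
    String add(float[] vector);

    // Default implementation shared by every store: add the vectors one by one.
    default List<String> addAll(List<float[]> vectors) {
        Objects.requireNonNull(vectors, "vectors must not be null");
        return vectors.stream().map(this::add).collect(Collectors.toList());
    }
}

// Forwards every call to a concrete implementation that is loaded on first use.
abstract class AbstractSketchStore implements SketchStore {
    private SketchStore delegate;

    /** The only method a concrete store has to provide. */
    protected abstract SketchStore loadImplementation();

    protected synchronized SketchStore delegate() {
        if (delegate == null) {
            delegate = Objects.requireNonNull(loadImplementation(), "implementation must not be null");
        }
        return delegate;
    }

    @Override
    public String add(float[] vector) {
        return delegate().add(vector);
    }
}

// Toy implementation that keeps vectors in memory and returns sequential ids.
final class InMemorySketchStore implements SketchStore {
    private final List<float[]> rows = new ArrayList<>();

    @Override
    public String add(float[] vector) {
        rows.add(vector.clone());
        return "row-" + rows.size();
    }
}

// Concrete store: only decides which implementation to load.
final class DelegatingSketchStore extends AbstractSketchStore {
    int loads = 0; // counts how often the implementation was created

    @Override
    protected SketchStore loadImplementation() {
        loads++;
        return new InMemorySketchStore();
    }
}
```

With this shape, `new DelegatingSketchStore().addAll(...)` goes through the inherited default method, which calls `add`, which in turn forwards to the lazily created delegate.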

-
[CassandraEmbeddingStore](https://github.com/clun/langchain4j/blob/main/langchain4j/src/main/java/dev/langchain4j/store/embedding/cassandra/CassandraEmbeddingStore.java#L30)
offers two constructors; one of them lets you override the implementation
class if you want to (extension point).
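
The configuration-bean idea from the list above can be sketched with a hand-written builder. This is only an illustration under assumptions: `SketchCassandraConfig`, `SketchConfiguredStore` and all field names are hypothetical and do not mirror the actual configuration classes linked above.

```java
import java.util.Objects;

// Hypothetical configuration bean; field names are illustrative only.
final class SketchCassandraConfig {
    final String contactPoint;
    final int port;
    final String keyspace;
    final String table;
    final int dimension;

    private SketchCassandraConfig(Builder b) {
        // Mandatory attributes are validated once, in one place.
        this.contactPoint = Objects.requireNonNull(b.contactPoint, "contactPoint is mandatory");
        this.keyspace = Objects.requireNonNull(b.keyspace, "keyspace is mandatory");
        this.table = Objects.requireNonNull(b.table, "table is mandatory");
        if (b.dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        this.dimension = b.dimension;
        this.port = b.port;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String contactPoint;
        private int port = 9042; // optional; 9042 is Cassandra's default native port
        private String keyspace;
        private String table;
        private int dimension;

        Builder contactPoint(String value) { this.contactPoint = value; return this; }
        Builder port(int value) { this.port = value; return this; }
        Builder keyspace(String value) { this.keyspace = value; return this; }
        Builder table(String value) { this.table = value; return this; }
        Builder dimension(int value) { this.dimension = value; return this; }

        SketchCassandraConfig build() {
            return new SketchCassandraConfig(this);
        }
    }
}

// Hypothetical store: its constructor signature stays the same when the bean gains attributes.
final class SketchConfiguredStore {
    private final SketchCassandraConfig config;

    SketchConfiguredStore(SketchCassandraConfig config) {
        this.config = Objects.requireNonNull(config, "config must not be null");
    }

    String describe() {
        return config.keyspace + "." + config.table + "@" + config.contactPoint + ":" + config.port;
    }
}
```

Adding a new optional attribute then only touches the bean and its builder; callers that do not set it keep compiling unchanged.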

#### Tests

- Test classes are provided, including some long-form examples based on
classes found in `langchain4j-examples`, but the tests are disabled.

- To start a local Cassandra, use Docker and the
[docker-compose](https://github.com/clun/langchain4j/blob/main/langchain4j-cassandra/src/test/resources/docker-compose.yml) file:

```
docker compose up -d
```

- To run the tests with Astra, sign in with your GitHub account and create a token
(API key) with the role `Organization Administrator`, following this
[procedure](https://awesome-astra.github.io/docs/pages/astra/create-token/#c-procedure)

<img width="926" alt="Screenshot 2023-09-06 at 18 14 12"
src="https://github.com/langchain4j/langchain4j/assets/726536/dfd2d9e5-09c9-4504-bfaa-31cfd87704a1">

- Copy the full value of `token` from the JSON

<img width="713" alt="Screenshot 2023-09-06 at 18 15 53"
src="https://github.com/langchain4j/langchain4j/assets/726536/1be56234-dd98-4f59-af71-03df42ed6997">

- Create the environment variable `ASTRA_DB_APPLICATION_TOKEN`

```console
export ASTRA_DB_APPLICATION_TOKEN=AstraCS:....<your_token>
```
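
A test can then read this variable and skip itself when it is missing instead of failing. The sketch below is a minimal, hypothetical helper (`AstraTokenSketch` is not part of the PR); it assumes that valid tokens start with the `AstraCS:` prefix shown in the export example above.

```java
import java.util.Optional;

// Hypothetical helper for integration tests; not part of the PR.
final class AstraTokenSketch {
    static final String ENV_NAME = "ASTRA_DB_APPLICATION_TOKEN";
    private static final String PREFIX = "AstraCS:"; // assumption taken from the export example

    private AstraTokenSketch() {}

    /** Returns the trimmed token if it looks usable, otherwise an empty Optional. */
    static Optional<String> usableToken(String raw) {
        return Optional.ofNullable(raw)
                .map(String::trim)
                .filter(t -> t.startsWith(PREFIX) && t.length() > PREFIX.length());
    }

    /** Reads the token from the environment variable, if it is set. */
    static Optional<String> fromEnvironment() {
        return usableToken(System.getenv(ENV_NAME));
    }
}
```

In a JUnit 5 test, one option would be to call `Assumptions.assumeTrue(AstraTokenSketch.fromEnvironment().isPresent())` at the start of the test so that it is reported as skipped when no token is configured.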
2023-09-27 15:50:04 +02:00