Fix: provide valid Prompt and Completion Token usage counts from create_stream (#3972)

* Fix: `create_stream` to return valid usage token counts
* documentation

---------

Co-authored-by: Eric Zhu <ekzhu@users.noreply.github.com>
Anthony Uphof 2024-10-30 12:20:03 +13:00 committed by GitHub
parent bd9c371605
commit 87bd1de396
4 changed files with 341 additions and 43 deletions
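
At a glance, this is what the fix enables on the caller side: requesting usage in a streamed completion and reading real token counts from the final result. The following is a minimal sketch condensed from the notebook example in this commit, assuming an OpenAI-compatible endpoint and an API key available in the environment:

import asyncio

from autogen_core.components.models import UserMessage
from autogen_ext.models import OpenAIChatCompletionClient


async def main() -> None:
    # Assumes OPENAI_API_KEY is set in the environment.
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]

    # Without stream_options the final CreateResult reports zero token counts;
    # with include_usage the API emits a final usage chunk that this fix surfaces.
    stream = model_client.create_stream(
        messages=messages,
        extra_create_args={"stream_options": {"include_usage": True}},
    )
    async for response in stream:
        if isinstance(response, str):
            print(response, end="", flush=True)  # partial content
        else:
            print("\n", response.usage)  # e.g. RequestUsage(prompt_tokens=17, completion_tokens=146)


asyncio.run(main())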

View File

@ -74,6 +74,24 @@
"print(response.content)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RequestUsage(prompt_tokens=15, completion_tokens=7)\n"
]
}
],
"source": [
"# Print the response token usage\n",
"print(response.usage)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -86,7 +104,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@ -94,24 +112,26 @@
"output_type": "stream",
"text": [
"Streamed responses:\n",
"In a secluded valley where the sun painted the sky with hues of gold, a solitary dragon named Bremora stood guard. Her emerald scales shimmered with an ancient light as she watched over the village below. Unlike her fiery kin, Bremora had no desire for destruction; her soul was bound by a promise to protect.\n",
"In the heart of an ancient forest, beneath the shadow of snow-capped peaks, a dragon named Elara lived secretly for centuries. Elara was unlike any dragon from the old tales; her scales shimmered with a deep emerald hue, each scale engraved with symbols of lost wisdom. The villagers in the nearby valley spoke of mysterious lights dancing across the night sky, but none dared venture close enough to solve the enigma.\n",
"\n",
"Generations ago, a wise elder had befriended Bremora, offering her companionship instead of fear. In gratitude, she vowed to shield the village from calamity. Years passed, and children grew up believing in the legends of a watchful dragon who brought them prosperity and peace.\n",
"One cold winter's eve, a young girl named Lira, brimming with curiosity and armed with the innocence of youth, wandered into Elaras domain. Instead of fire and fury, she found warmth and a gentle gaze. The dragon shared stories of a world long forgotten and in return, Lira gifted her simple stories of human life, rich in laughter and scent of earth.\n",
"\n",
"One summer, an ominous storm threatened the valley, with ravenous winds and torrents of rain. Bremora rose into the tempest, her mighty wings defying the chaos. She channeled her breath—not of fire, but of warmth and tranquility—calming the storm and saving her cherished valley.\n",
"\n",
"When dawn broke and the village emerged unscathed, the people looked to the sky. There, Bremora soared gracefully, a guardian spirit woven into their lives, silently promising her eternal vigilance.\n",
"From that night on, the villagers noticed subtle changes—the crops grew taller, and the air seemed sweeter. Elara had infused the valley with ancient magic, a guardian of balance, watching quietly as her new friend thrived under the stars. And so, Lira and Elaras bond marked the beginning of a timeless friendship that spun tales of hope whispered through the leaves of the ever-verdant forest.\n",
"\n",
"------------\n",
"\n",
"The complete response:\n",
"In a secluded valley where the sun painted the sky with hues of gold, a solitary dragon named Bremora stood guard. Her emerald scales shimmered with an ancient light as she watched over the village below. Unlike her fiery kin, Bremora had no desire for destruction; her soul was bound by a promise to protect.\n",
"In the heart of an ancient forest, beneath the shadow of snow-capped peaks, a dragon named Elara lived secretly for centuries. Elara was unlike any dragon from the old tales; her scales shimmered with a deep emerald hue, each scale engraved with symbols of lost wisdom. The villagers in the nearby valley spoke of mysterious lights dancing across the night sky, but none dared venture close enough to solve the enigma.\n",
"\n",
"Generations ago, a wise elder had befriended Bremora, offering her companionship instead of fear. In gratitude, she vowed to shield the village from calamity. Years passed, and children grew up believing in the legends of a watchful dragon who brought them prosperity and peace.\n",
"One cold winter's eve, a young girl named Lira, brimming with curiosity and armed with the innocence of youth, wandered into Elaras domain. Instead of fire and fury, she found warmth and a gentle gaze. The dragon shared stories of a world long forgotten and in return, Lira gifted her simple stories of human life, rich in laughter and scent of earth.\n",
"\n",
"One summer, an ominous storm threatened the valley, with ravenous winds and torrents of rain. Bremora rose into the tempest, her mighty wings defying the chaos. She channeled her breath—not of fire, but of warmth and tranquility—calming the storm and saving her cherished valley.\n",
"From that night on, the villagers noticed subtle changes—the crops grew taller, and the air seemed sweeter. Elara had infused the valley with ancient magic, a guardian of balance, watching quietly as her new friend thrived under the stars. And so, Lira and Elaras bond marked the beginning of a timeless friendship that spun tales of hope whispered through the leaves of the ever-verdant forest.\n",
"\n",
"When dawn broke and the village emerged unscathed, the people looked to the sky. There, Bremora soared gracefully, a guardian spirit woven into their lives, silently promising her eternal vigilance.\n"
"\n",
"------------\n",
"\n",
"The token usage was:\n",
"RequestUsage(prompt_tokens=0, completion_tokens=0)\n"
]
}
],
@ -133,7 +153,10 @@
" # The last response is a CreateResult object with the complete message.\n",
" print(\"\\n\\n------------\\n\")\n",
" print(\"The complete response:\", flush=True)\n",
" print(response.content, flush=True)"
" print(response.content, flush=True)\n",
" print(\"\\n\\n------------\\n\")\n",
" print(\"The token usage was:\", flush=True)\n",
" print(response.usage, flush=True)"
]
},
{
@ -143,7 +166,86 @@
"```{note}\n",
"The last response in the streaming response is always the final response\n",
"of the type {py:class}`~autogen_core.components.models.CreateResult`.\n",
"```"
"```\n",
"\n",
"**NB the default usage response is to return zero values**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Note on Token usage counts with streaming example\n",
"Comparing usage returns in the above Non Streaming `model_client.create(messages=messages)` vs streaming `model_client.create_stream(messages=messages)` we see differences.\n",
"The non streaming response by default returns valid prompt and completion token usage counts. \n",
"The streamed response by default returns zero values.\n",
"\n",
"as documented in the OPENAI API Reference an additional parameter `stream_options` can be specified to return valid usage counts. see [stream_options](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options)\n",
"\n",
"Only set this when you using streaming ie , using `create_stream` \n",
"\n",
"to enable this in `create_stream` set `extra_create_args={\"stream_options\": {\"include_usage\": True}},`\n",
"\n",
"- **Note whilst other API's like LiteLLM also support this, it is not always guarenteed that it is fully supported or correct**\n",
"\n",
"#### Streaming example with token usage\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Streamed responses:\n",
"In a lush, emerald valley hidden by towering peaks, there lived a dragon named Ember. Unlike others of her kind, Ember cherished solitude over treasure, and the songs of the stream over the roar of flames. One misty dawn, a young shepherd stumbled into her sanctuary, lost and frightened. \n",
"\n",
"Instead of fury, he was met with kindness as Ember extended a wing, guiding him back to safety. In gratitude, the shepherd visited yearly, bringing tales of his world beyond the mountains. Over time, a friendship blossomed, binding man and dragon in shared stories and laughter.\n",
"\n",
"As the years passed, the legend of Ember the gentle-hearted spread far and wide, forever changing the way dragons were seen in the hearts of many.\n",
"\n",
"------------\n",
"\n",
"The complete response:\n",
"In a lush, emerald valley hidden by towering peaks, there lived a dragon named Ember. Unlike others of her kind, Ember cherished solitude over treasure, and the songs of the stream over the roar of flames. One misty dawn, a young shepherd stumbled into her sanctuary, lost and frightened. \n",
"\n",
"Instead of fury, he was met with kindness as Ember extended a wing, guiding him back to safety. In gratitude, the shepherd visited yearly, bringing tales of his world beyond the mountains. Over time, a friendship blossomed, binding man and dragon in shared stories and laughter.\n",
"\n",
"As the years passed, the legend of Ember the gentle-hearted spread far and wide, forever changing the way dragons were seen in the hearts of many.\n",
"\n",
"\n",
"------------\n",
"\n",
"The token usage was:\n",
"RequestUsage(prompt_tokens=17, completion_tokens=146)\n"
]
}
],
"source": [
"messages = [\n",
" UserMessage(content=\"Write a very short story about a dragon.\", source=\"user\"),\n",
"]\n",
"\n",
"# Create a stream.\n",
"stream = model_client.create_stream(messages=messages, extra_create_args={\"stream_options\": {\"include_usage\": True}})\n",
"\n",
"# Iterate over the stream and print the responses.\n",
"print(\"Streamed responses:\")\n",
"async for response in stream: # type: ignore\n",
" if isinstance(response, str):\n",
" # A partial response is a string.\n",
" print(response, flush=True, end=\"\")\n",
" else:\n",
" # The last response is a CreateResult object with the complete message.\n",
" print(\"\\n\\n------------\\n\")\n",
" print(\"The complete response:\", flush=True)\n",
" print(response.content, flush=True)\n",
" print(\"\\n\\n------------\\n\")\n",
" print(\"The token usage was:\", flush=True)\n",
" print(response.usage, flush=True)"
]
},
{
@ -234,7 +336,8 @@
"from autogen_core.application import SingleThreadedAgentRuntime\n",
"from autogen_core.base import MessageContext\n",
"from autogen_core.components import RoutedAgent, message_handler\n",
"from autogen_core.components.models import ChatCompletionClient, OpenAIChatCompletionClient, SystemMessage, UserMessage\n",
"from autogen_core.components.models import ChatCompletionClient, SystemMessage, UserMessage\n",
"from autogen_ext.models import OpenAIChatCompletionClient\n",
"\n",
"\n",
"@dataclass\n",

View File

@ -39,6 +39,7 @@ from openai.types.chat import (
completion_create_params,
)
from openai.types.chat.chat_completion import Choice
from openai.types.chat.chat_completion_chunk import Choice as ChunkChoice
from openai.types.shared_params import FunctionDefinition, FunctionParameters
from pydantic import BaseModel
from typing_extensions import Unpack
@ -555,6 +556,31 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
extra_create_args: Mapping[str, Any] = {},
cancellation_token: Optional[CancellationToken] = None,
) -> AsyncGenerator[Union[str, CreateResult], None]:
"""
Creates an AsyncGenerator that will yield a stream of chat completions based on the provided messages and tools.
Args:
messages (Sequence[LLMMessage]): A sequence of messages to be processed.
tools (Sequence[Tool | ToolSchema], optional): A sequence of tools to be used in the completion. Defaults to `[]`.
json_output (Optional[bool], optional): If True, the output will be in JSON format. Defaults to None.
extra_create_args (Mapping[str, Any], optional): Additional arguments for the creation process. Defaults to `{}`.
cancellation_token (Optional[CancellationToken], optional): A token to cancel the operation. Defaults to None.
Yields:
AsyncGenerator[Union[str, CreateResult], None]: A generator yielding the completion results as they are produced.
In streaming, the default behaviour is not to return token usage counts. See: [OpenAI API reference for possible args](https://platform.openai.com/docs/api-reference/chat/create).
However, `extra_create_args={"stream_options": {"include_usage": True}}` will (if supported by the accessed API)
return a final chunk with usage set to a RequestUsage object containing prompt and completion token counts,
while all preceding chunks will have usage as None. See: [stream_options](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options).
Other examples of OpenAI-supported arguments that can be included in `extra_create_args`:
- `temperature` (float): Controls the randomness of the output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
- `max_tokens` (int): The maximum number of tokens to generate in the completion.
- `top_p` (float): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
- `frequency_penalty` (float): A value between -2.0 and 2.0 that penalizes new tokens based on their existing frequency in the text so far, decreasing the likelihood of repeated phrases.
- `presence_penalty` (float): A value between -2.0 and 2.0 that penalizes new tokens based on whether they appear in the text so far, encouraging the model to talk about new topics.
"""
# Make sure all extra_create_args are valid
extra_create_args_keys = set(extra_create_args.keys())
if not create_kwargs.issuperset(extra_create_args_keys):
@ -601,7 +627,8 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
if cancellation_token is not None:
cancellation_token.link_future(stream_future)
stream = await stream_future
choice: Union[ParsedChoice[Any], ParsedChoice[BaseModel], ChunkChoice] = cast(ChunkChoice, None)
chunk = None
stop_reason = None
maybe_model = None
content_deltas: List[str] = []
@ -614,8 +641,23 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
if cancellation_token is not None:
cancellation_token.link_future(chunk_future)
chunk = await chunk_future
choice = chunk.choices[0]
stop_reason = choice.finish_reason
# To process the usage chunk in streaming situations,
# add stream_options={"include_usage": True} in the initialization of OpenAIChatCompletionClient(...).
# However, the APIs differ:
# the OpenAI API usage chunk produces no choices, so we need to check whether there is a choice,
# while the LiteLLM API usage chunk does produce choices.
choice = (
chunk.choices[0]
if len(chunk.choices) > 0
else choice
if chunk.usage is not None and stop_reason is not None
else cast(ChunkChoice, None)
)
# For the LiteLLM usage chunk, apply the following hack, keeping the previous chunk's stop_reason (if set):
# set the stop_reason for the usage chunk to the prior stop_reason.
stop_reason = choice.finish_reason if chunk.usage is None and stop_reason is None else stop_reason
maybe_model = chunk.model
# First try get content
if choice.delta.content is not None:
@ -657,17 +699,21 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
model = maybe_model or create_args["model"]
model = model.replace("gpt-35", "gpt-3.5") # hack for Azure API
# TODO fix count token
prompt_tokens = 0
# prompt_tokens = count_token(messages, model=model)
if chunk and chunk.usage:
prompt_tokens = chunk.usage.prompt_tokens
else:
prompt_tokens = 0
if stop_reason is None:
raise ValueError("No stop reason found")
content: Union[str, List[FunctionCall]]
if len(content_deltas) > 1:
content = "".join(content_deltas)
completion_tokens = 0
# completion_tokens = count_token(content, model=model)
if chunk and chunk.usage:
completion_tokens = chunk.usage.completion_tokens
else:
completion_tokens = 0
else:
completion_tokens = 0
# TODO: fix assumption that dict values were added in order and actually order by int index

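The usage-chunk handling introduced above has to cope with two different shapes of the final chunk, as the inline comments note. Below is an illustrative sketch of those two shapes, built with the same openai types the updated tests construct; the token values are placeholders:

from openai.types.chat.chat_completion_chunk import ChatCompletionChunk, ChoiceDelta
from openai.types.chat.chat_completion_chunk import Choice as ChunkChoice
from openai.types.completion_usage import CompletionUsage

usage = CompletionUsage(prompt_tokens=3, completion_tokens=3, total_tokens=6)

# OpenAI API: the usage chunk carries no choices, so chunk.choices is empty.
openai_usage_chunk = ChatCompletionChunk(
    id="id", created=0, model="gpt-4o", object="chat.completion.chunk",
    choices=[],
    usage=usage,
)

# LiteLLM proxy: the usage chunk still carries a content-less choice.
litellm_usage_chunk = ChatCompletionChunk(
    id="id", created=0, model="gpt-4o", object="chat.completion.chunk",
    choices=[ChunkChoice(finish_reason=None, index=0, delta=ChoiceDelta(content=None, role="assistant"))],
    usage=usage,
)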
View File

@ -60,6 +60,7 @@ from openai.types.chat import (
completion_create_params,
)
from openai.types.chat.chat_completion import Choice
from openai.types.chat.chat_completion_chunk import Choice as ChunkChoice
from openai.types.shared_params import FunctionDefinition, FunctionParameters
from pydantic import BaseModel
from typing_extensions import Unpack
@ -556,6 +557,31 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
extra_create_args: Mapping[str, Any] = {},
cancellation_token: Optional[CancellationToken] = None,
) -> AsyncGenerator[Union[str, CreateResult], None]:
"""
Creates an AsyncGenerator that will yield a stream of chat completions based on the provided messages and tools.
Args:
messages (Sequence[LLMMessage]): A sequence of messages to be processed.
tools (Sequence[Tool | ToolSchema], optional): A sequence of tools to be used in the completion. Defaults to `[]`.
json_output (Optional[bool], optional): If True, the output will be in JSON format. Defaults to None.
extra_create_args (Mapping[str, Any], optional): Additional arguments for the creation process. Defaults to `{}`.
cancellation_token (Optional[CancellationToken], optional): A token to cancel the operation. Defaults to None.
Yields:
AsyncGenerator[Union[str, CreateResult], None]: A generator yielding the completion results as they are produced.
In streaming, the default behaviour is not to return token usage counts. See: [OpenAI API reference for possible args](https://platform.openai.com/docs/api-reference/chat/create).
However, `extra_create_args={"stream_options": {"include_usage": True}}` will (if supported by the accessed API)
return a final chunk with usage set to a RequestUsage object containing prompt and completion token counts,
while all preceding chunks will have usage as None. See: [stream_options](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options).
Other examples of OpenAI-supported arguments that can be included in `extra_create_args`:
- `temperature` (float): Controls the randomness of the output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic.
- `max_tokens` (int): The maximum number of tokens to generate in the completion.
- `top_p` (float): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
- `frequency_penalty` (float): A value between -2.0 and 2.0 that penalizes new tokens based on their existing frequency in the text so far, decreasing the likelihood of repeated phrases.
- `presence_penalty` (float): A value between -2.0 and 2.0 that penalizes new tokens based on whether they appear in the text so far, encouraging the model to talk about new topics.
"""
# Make sure all extra_create_args are valid
extra_create_args_keys = set(extra_create_args.keys())
if not create_kwargs.issuperset(extra_create_args_keys):
@ -602,7 +628,8 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
if cancellation_token is not None:
cancellation_token.link_future(stream_future)
stream = await stream_future
choice: Union[ParsedChoice[Any], ParsedChoice[BaseModel], ChunkChoice] = cast(ChunkChoice, None)
chunk = None
stop_reason = None
maybe_model = None
content_deltas: List[str] = []
@ -615,8 +642,23 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
if cancellation_token is not None:
cancellation_token.link_future(chunk_future)
chunk = await chunk_future
choice = chunk.choices[0]
stop_reason = choice.finish_reason
# To process the usage chunk in streaming situations,
# add stream_options={"include_usage": True} in the initialization of OpenAIChatCompletionClient(...).
# However, the APIs differ:
# the OpenAI API usage chunk produces no choices, so we need to check whether there is a choice,
# while the LiteLLM API usage chunk does produce choices.
choice = (
chunk.choices[0]
if len(chunk.choices) > 0
else choice
if chunk.usage is not None and stop_reason is not None
else cast(ChunkChoice, None)
)
# For the LiteLLM usage chunk, apply the following hack, keeping the previous chunk's stop_reason (if set):
# set the stop_reason for the usage chunk to the prior stop_reason.
stop_reason = choice.finish_reason if chunk.usage is None and stop_reason is None else stop_reason
maybe_model = chunk.model
# First try get content
if choice.delta.content is not None:
@ -658,17 +700,21 @@ class BaseOpenAIChatCompletionClient(ChatCompletionClient):
model = maybe_model or create_args["model"]
model = model.replace("gpt-35", "gpt-3.5") # hack for Azure API
# TODO fix count token
prompt_tokens = 0
# prompt_tokens = count_token(messages, model=model)
if chunk and chunk.usage:
prompt_tokens = chunk.usage.prompt_tokens
else:
prompt_tokens = 0
if stop_reason is None:
raise ValueError("No stop reason found")
content: Union[str, List[FunctionCall]]
if len(content_deltas) > 1:
content = "".join(content_deltas)
completion_tokens = 0
# completion_tokens = count_token(content, model=model)
if chunk and chunk.usage:
completion_tokens = chunk.usage.completion_tokens
else:
completion_tokens = 0
else:
completion_tokens = 0
# TODO: fix assumption that dict values were added in order and actually order by int index

View File

@ -11,6 +11,7 @@ from autogen_core.components.models import (
FunctionExecutionResult,
FunctionExecutionResultMessage,
LLMMessage,
RequestUsage,
SystemMessage,
UserMessage,
)
@ -24,28 +25,83 @@ from openai.types.chat.chat_completion_chunk import ChatCompletionChunk, ChoiceD
from openai.types.chat.chat_completion_chunk import Choice as ChunkChoice
from openai.types.chat.chat_completion_message import ChatCompletionMessage
from openai.types.completion_usage import CompletionUsage
from pydantic import BaseModel
class MockChunkDefinition(BaseModel):
# defining elements for differentiating mock chunks
chunk_choice: ChunkChoice
usage: CompletionUsage | None
async def _mock_create_stream(*args: Any, **kwargs: Any) -> AsyncGenerator[ChatCompletionChunk, None]:
model = resolve_model(kwargs.get("model", "gpt-4o"))
chunks = ["Hello", " Another Hello", " Yet Another Hello"]
for chunk in chunks:
mock_chunks_content = ["Hello", " Another Hello", " Yet Another Hello"]
# The OpenAI API implementations (OpenAI and LiteLLM) stream chunks of tokens
# with content as a string, then at the end a chunk with the stop reason set, and finally,
# if usage was requested with `"stream_options": {"include_usage": True}`, a chunk with the usage data.
mock_chunks = [
# generate the list of mock chunk content
MockChunkDefinition(
chunk_choice=ChunkChoice(
finish_reason=None,
index=0,
delta=ChoiceDelta(
content=mock_chunk_content,
role="assistant",
),
),
usage=None,
)
for mock_chunk_content in mock_chunks_content
] + [
# generate the stop chunk
MockChunkDefinition(
chunk_choice=ChunkChoice(
finish_reason="stop",
index=0,
delta=ChoiceDelta(
content=None,
role="assistant",
),
),
usage=None,
)
]
# generate the usage chunk if configured
if kwargs.get("stream_options", {}).get("include_usage") is True:
mock_chunks = mock_chunks + [
# ---- API differences
# OPENAI API does NOT create a choice
# LITELLM (proxy) DOES create a choice
# Not simulating all the API options, just implementing the LITELLM variant
MockChunkDefinition(
chunk_choice=ChunkChoice(
finish_reason=None,
index=0,
delta=ChoiceDelta(
content=None,
role="assistant",
),
),
usage=CompletionUsage(prompt_tokens=3, completion_tokens=3, total_tokens=6),
)
]
elif kwargs.get("stream_options", {}).get("include_usage") is False:
pass
else:
pass
for mock_chunk in mock_chunks:
await asyncio.sleep(0.1)
yield ChatCompletionChunk(
id="id",
choices=[
ChunkChoice(
finish_reason="stop",
index=0,
delta=ChoiceDelta(
content=chunk,
role="assistant",
),
)
],
choices=[mock_chunk.chunk_choice],
created=0,
model=model,
object="chat.completion.chunk",
usage=mock_chunk.usage,
)
@ -95,17 +151,64 @@ async def test_openai_chat_completion_client_create(monkeypatch: pytest.MonkeyPa
@pytest.mark.asyncio
async def test_openai_chat_completion_client_create_stream(monkeypatch: pytest.MonkeyPatch) -> None:
async def test_openai_chat_completion_client_create_stream_with_usage(monkeypatch: pytest.MonkeyPatch) -> None:
monkeypatch.setattr(AsyncCompletions, "create", _mock_create)
client = OpenAIChatCompletionClient(model="gpt-4o", api_key="api_key")
chunks: List[str | CreateResult] = []
async for chunk in client.create_stream(messages=[UserMessage(content="Hello", source="user")]):
async for chunk in client.create_stream(
messages=[UserMessage(content="Hello", source="user")],
# include_usage is not the default of the OpenAI API and must be explicitly set
extra_create_args={"stream_options": {"include_usage": True}},
):
chunks.append(chunk)
assert chunks[0] == "Hello"
assert chunks[1] == " Another Hello"
assert chunks[2] == " Yet Another Hello"
assert isinstance(chunks[-1], CreateResult)
assert chunks[-1].content == "Hello Another Hello Yet Another Hello"
assert chunks[-1].usage == RequestUsage(prompt_tokens=3, completion_tokens=3)
@pytest.mark.asyncio
async def test_openai_chat_completion_client_create_stream_no_usage_default(monkeypatch: pytest.MonkeyPatch) -> None:
monkeypatch.setattr(AsyncCompletions, "create", _mock_create)
client = OpenAIChatCompletionClient(model="gpt-4o", api_key="api_key")
chunks: List[str | CreateResult] = []
async for chunk in client.create_stream(
messages=[UserMessage(content="Hello", source="user")],
# include_usage is not the default of the OpenAI API;
# it can be explicitly set,
# or simply not declared, which is the default.
# extra_create_args={"stream_options": {"include_usage": False}},
):
chunks.append(chunk)
assert chunks[0] == "Hello"
assert chunks[1] == " Another Hello"
assert chunks[2] == " Yet Another Hello"
assert isinstance(chunks[-1], CreateResult)
assert chunks[-1].content == "Hello Another Hello Yet Another Hello"
assert chunks[-1].usage == RequestUsage(prompt_tokens=0, completion_tokens=0)
@pytest.mark.asyncio
async def test_openai_chat_completion_client_create_stream_no_usage_explicit(monkeypatch: pytest.MonkeyPatch) -> None:
monkeypatch.setattr(AsyncCompletions, "create", _mock_create)
client = OpenAIChatCompletionClient(model="gpt-4o", api_key="api_key")
chunks: List[str | CreateResult] = []
async for chunk in client.create_stream(
messages=[UserMessage(content="Hello", source="user")],
# include_usage is not the default of the OpenAI API;
# it can be explicitly set,
# or simply not declared, which is the default.
extra_create_args={"stream_options": {"include_usage": False}},
):
chunks.append(chunk)
assert chunks[0] == "Hello"
assert chunks[1] == " Another Hello"
assert chunks[2] == " Yet Another Hello"
assert isinstance(chunks[-1], CreateResult)
assert chunks[-1].content == "Hello Another Hello Yet Another Hello"
assert chunks[-1].usage == RequestUsage(prompt_tokens=0, completion_tokens=0)
@pytest.mark.asyncio