Multimodal Tasks with Gemini
Firebase
Dec 18, 2023
Introduction
Powered by Gemini, Google’s state-of-the-art Large Language Model (LLM), this extension makes it easy to perform generative AI tasks and to store and manage the resulting conversations in Cloud Firestore.
The Multimodal Tasks with Gemini extension provides an out-of-the-box solution allowing you to perform AI/ML tasks on text and images, customizable with prompt engineering.
Google AI
Google DeepMind has released its newest Large Language Model (LLM), Gemini. It includes a new range of features that let developers converse and interact with AI easily.
Gemini offers a range of features and capabilities including:
A Multimodal Approach
Gemini provides a multimodal approach, allowing it to understand and process various forms of information. This also allows users to combine different types of content such as text, code, audio, images, and video to generate AI responses.
Deliberate Reasoning
Gemini surpasses previous models in its ability to engage in deliberate reasoning. This means that the language model is not only capable of processing and understanding information but also of applying logical reasoning to solve problems and make decisions.
Human-quality Text Generation
Gemini generates human-quality text, which means that it can create realistic and engaging narratives, write different kinds of creative content, and even translate languages.
Model Sizes
This extension uses the Gemini Pro and Gemini Pro Vision chat models for handling requests.
Google AI API
This Firebase extension communicates with Gemini using either the Google AI API (for developers) or Vertex AI.
Getting Started
Before installing, ensure you have a Firebase project set up. If it’s your first time, here’s how to get started:
This extension allows images to be sent in either base64 or Google Cloud Storage (GCS) URL format. Ensure you have set up Cloud Storage if you would like to use GCS URLs.
Once your project is prepared, you can integrate the Multimodal Tasks with Gemini extension into your application.
Extension Installation
Installing the “Multimodal Tasks with Gemini” extension can be done via the Firebase Console or using the Firebase CLI.
Option 1: Firebase Console
Navigate to the Firebase Extensions catalog and locate the “Multimodal Tasks with Gemini” extension. Click Install in Firebase Console and complete the installation process by configuring your extension’s parameters.
Option 2: Firebase CLI
For those who prefer using command-line tools, the Firebase CLI offers a straightforward installation command:
firebase ext:install googlecloud/firestore-multimodal-genai --project=projectId-or-alias
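Replace projectId-or-alias with your Firebase project ID, or with a project alias you have configured using firebase use.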
Configure Parameters
Configuring the “Multimodal Tasks with Gemini” extension involves several parameters that let you customize its behavior. Here’s a breakdown of each parameter:
To quickly get started with this extension, up to four fields must be provided, depending on your configuration.
Required Configuration
Cloud Functions location
The location field should be the region closest to where your project is located. In this example we will select Iowa (us-central1). See this guide for more information on Cloud Functions locations.
Prompt
A prompt represents the query that will be sent to Gemini with every input from the user. To allow customization per request, this field can substitute values using Handlebars templates. Here, we include a message with a value that can be substituted for each document that is written:
Provide a star rating from 1-5 of the following review text: {{comment}}
Based on the Handlebars configuration, the {{comment}} field will be replaced with the appropriate text before the prompt is sent to Gemini.
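As a simplified illustration (not the extension’s actual implementation), the substitution behaves roughly like this:
/** Simplified sketch of Handlebars-style substitution */
function substitute(template, data) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, field) => data[field] ?? '');
}

const prompt = substitute(
  'Provide a star rating from 1-5 of the following review text: {{comment}}',
  { comment: 'This is a nice place, would give it another visit' },
);
/** "Provide a star rating from 1-5 of the following review text: This is a nice place, would give it another visit" */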
Default Parameters
Gemini API Provider
Since Google AI requires an API key, you have the option to choose Vertex AI instead, which authenticates requests using application default credentials.
Generative Model
This extension offers two approaches for sending information to Gemini:
- Gemini Pro: For sending text-based requests.
- Gemini Pro Vision: For sending text- and image-based requests.
This guide demonstrates how both of these models can be used to produce a response from Gemini. Each of these models requires a bit of customization; the How it works guide below will demonstrate how to configure the extension for each scenario.
Optional Parameters
API key (A required field for Google AI)
If you selected Google AI as the Gemini API provider, you will need to provide a Google AI API key. This can be obtained through Google AI Studio.
To generate a key:
- Select Get API key.
- Select Create API key in new project or Create API key in existing project.
- Select the newly generated key and choose Copy.
Navigate back to your extension configuration and enter the value before selecting Create Secret.
This will store your secret securely in Google Secret Manager.
Variable Fields (optional)
Based on the prompt setting above, we should now set the variable field to comment; this should match the {{comment}} substitution value that was added to the prompt field.
Here’s a full demonstration of what the final result would look like:
prompt: Provide a star rating from 1-5 of the following review text: {{comment}}
comment: This is a nice place, would give it another visit
The final result sent to Gemini by the extension would then be:
Provide a star rating from 1-5 of the following review text: This is a nice place, would give it another visit
Image Field (A required field for Gemini Pro Vision only)
This field is used when sending a request containing both text and an image to Gemini, and only applies when the Gemini Pro Vision option has been selected.
This field will represent the Firestore field that contains an image value in one of two formats:
- Google Cloud Storage URL: This must be in the format gs://{projectId}.appspot.com/{filePath}
- Base64: This is a representation of an image that has been encoded as a Base64 string.
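For illustration, here is a minimal sketch of preparing an image in each format in a web app; it assumes the modular Firebase SDK and a hypothetical flowers/photo.jpg storage path:
import { getStorage, ref, uploadBytes } from 'firebase/storage';

/** Option 1: Upload the file to Cloud Storage and reference it by a gs:// URL */
async function uploadImage(file) {
  const storage = getStorage();
  const storageRef = ref(storage, 'flowers/photo.jpg'); // hypothetical path
  await uploadBytes(storageRef, file);
  /** Produces the expected gs://{projectId}.appspot.com/{filePath} format */
  return `gs://${storageRef.bucket}/${storageRef.fullPath}`;
}

/** Option 2: Encode the file as a Base64 string in the browser */
function toBase64(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    /** Drop the "data:image/...;base64," prefix, keeping the raw Base64 payload */
    reader.onload = () => resolve(reader.result.split(',')[1]);
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
}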
Gemini API Provider
This option selects which API the extension will use to send requests to and receive responses from Gemini: Google AI, which requires an API key, or Vertex AI, which authenticates with application default credentials.
Language model
Gemini Pro provides conversation with the Gemini AI.
Gemini Pro Vision provides a multimodal approach, meaning that you can send both text and image data. This will return a text-based response.
Prompt Field
This is the Firestore field which contains the query from the user when adding a document in Firestore. Use Handlebars to provide variable substitutions for on-demand prompts when adding a new Firestore document.
Response Field
This is the Firestore field to which Gemini’s response will be written once a document has been processed. This can remain as the recommended value, output.
Candidate Count (optional)
Gemini has the ability to return multiple candidate responses, allowing users to select the most appropriate one. In this instance we would like a single response, so keep this value as 1.
How it works
There are multiple approaches to using this extension, and this can depend on which model and approach you have chosen when configuring the extension. There are currently two options:
- Gemini Pro
- Gemini Pro Vision
The following section will demonstrate how you can use each option.
Process Text with Gemini Pro
In this section, we are using the following parameters to configure the extension:
- Language model: Gemini Pro
- Prompt: Provide a star rating from 1-5 of the following review text: {{comment}}
- Variable fields (optional): comment
To process text only with Gemini, we chose Gemini Pro as the language model.
Then, we provide context through a custom prompt to instruct the model on how to respond. In this case, the prompt asks Gemini to rate a review text on a scale of 1 to 5.
Finally, we have included a variable field to substitute custom values for each document; this uses Handlebars to inject the text value into the prompt on each request.
After installation and configuration, the extension will listen for any new documents that are added to the configured collection. To see it in action:
- Add a new document: By including a comment field with text, the AI will be prompted for a response.
- Tracking status: While a response is being generated, the extension updates the current Firestore document with a status field. This field will transition through PROCESSING, COMPLETED, and ERRORED states.
- Successful responses: When the AI has returned a successful response, a new field will be added containing that response.
- Regenerating a response: Updating the status field to anything other than “COMPLETED” will regenerate a response, as sketched below.
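For example, a regeneration could be triggered with a sketch like the following, assuming db is an initialized Firestore instance and a hypothetical document ID (the exact value you write is up to you, as long as it is not “COMPLETED”):
import { doc, updateDoc } from 'firebase/firestore';

/** Reset the status so the extension regenerates the response */
await updateDoc(doc(db, 'collection-name', 'document-id'), {
  status: 'REGENERATE', // any value other than "COMPLETED" triggers a rerun
});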
Steps to connect to the Multimodal Tasks with Gemini (Pro)
Before you begin, ensure that the Firebase SDK is initialized and you can access Firestore.
import { initializeApp } from 'firebase/app';
import { getFirestore, collection, addDoc, query, orderBy, onSnapshot } from 'firebase/firestore';

const firebaseConfig = {
  // Your Firebase configuration keys
};

const app = initializeApp(firebaseConfig);
const db = getFirestore(app);
Next, we can simply add a new document to the configured collection to start the conversation.
/** Add a new Firestore document */
await addDoc(collection(db, 'collection-name'), {
  comment: 'The best food I have ever tasted!',
});
Listen for any changes to the collection to display the conversation.
/** Create a Firebase query */
const collectionRef = collection(db, 'collection-name');
const q = query(collectionRef, orderBy('createTime'));
/** Listen for any changes **/
onSnapshot(q, (snapshot) => {
  snapshot.docs.forEach((docSnap) => {
    /** Get the comment, response, and status */
    const { comment, output, status } = docSnap.data();
    /** Update the UI status */
    console.log(status);
    /** Update the UI comment */
    console.log(comment);
    /** Update the UI response */
    console.log(output);
  });
});
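Note that onSnapshot returns an unsubscribe function, which you can call to stop listening once the UI no longer needs updates:
const unsubscribe = onSnapshot(q, (snapshot) => { /* ... */ });
/** Later, when the listener is no longer needed */
unsubscribe();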
Process Text and Images with Gemini Pro Vision
In this section, we are using the following parameters to configure the extension:
- Language model: Gemini Pro Vision
- Prompt: {{instruction}}
- Variable fields (optional): instruction
- Image field (optional): image
To process prompts containing both text and images with Gemini, we chose Gemini Pro Vision as the language model.
A prompt is included to direct Gemini; in this case we would like to provide a custom field along with the image. {{instruction}} represents the Firestore field on the document containing the prompt we would like to be processed with the image.
In addition, we have included the Variable fields value; this will contain the instruction to send as a prompt alongside the chosen image.
Lastly, we have provided an image field, which will store the image reference to use with the prompt from the user. Although optional in the configuration, it is important to note that this value is mandatory if you are using a multimodal approach.
After installation and configuration, the extension will listen for any new documents that are added to the configured collection. To see it in action:
- Add a new document: Two fields are required for the new document: a field containing the text to send to Gemini (instruction in this example), and an image field matching the field name from your configuration. The image field value will be either a Google Cloud Storage URL or a base64-encoded image.
- Tracking status: While a response is being generated, the extension updates the current Firestore document with a status field. This field will transition through PROCESSING, COMPLETED, and ERRORED states.
- Storage: If the image source is Google Cloud Storage (GCS), a buffer of the image will be downloaded based on the URL.
- Successful responses: When the AI has returned a successful response, a new field will be added containing that response alongside additional metadata.
- Regenerating a response: Updating the status field to anything other than “COMPLETED” will regenerate a response.
Steps to connect to the Multimodal Tasks with Gemini (Pro Vision)
Before you begin, ensure that the Firebase SDK is initialized and you can access Firestore.
import { initializeApp } from 'firebase/app';
import { getFirestore, collection, addDoc, query, orderBy, onSnapshot } from 'firebase/firestore';

const firebaseConfig = {
  // Your Firebase configuration keys
};

const app = initializeApp(firebaseConfig);
const db = getFirestore(app);
Next, we can simply add a new document to the configured collection to start the conversation.
/** Add a new Firestore document */
await addDoc(collection(db, 'collection-name'), {
  instruction: 'What kind of flower is this?',
  image: 'gs://your-project-id.appspot.com/example-image.jpg', // gs://{projectId}.appspot.com/{filePath}
});
Listen for any changes to the collection to display the conversation.
/** Create a Firebase query */
const collectionRef = collection(db, 'collection-name');
const q = query(collectionRef, orderBy('createTime'));
/** Listen for any changes **/
onSnapshot(q, (snapshot) => {
  snapshot.docs.forEach((docSnap) => {
    /** Get the instruction, response, and status */
    const { instruction, output, status } = docSnap.data();
    /** Update the UI instruction */
    console.log(instruction);
    /** Update the UI output */
    console.log(output);
    /** Update the UI status */
    console.log(status);
  });
});
Conclusion
By installing the Multimodal Tasks with Gemini extension, you can generate AI responses simply by adding a Firestore document containing text and, optionally, an image. Responses are highly configurable, offering many customization options for developers.