Gravio Blog
May 27, 2024

[Tutorial] Using Ollama, LLaVA and Gravio to Build a Local Visual Question and Answer AI Assistant

Tutorial on how to use Gravio, Ollama, and the LLaVA AI model to build a local Visual Question and Answer (VQA) application. Anyone can build this solution with no coding required and deploy it as a PoC, or even in a production environment if the use case fits.

Introduction

This article covers building a local generative AI assistant: a Visual Question and Answer (VQA) computer vision solution that combines Gravio with open-source technologies and runs entirely on-premise, with no internet connectivity required.

Approach

tldr; we will use Gravio, Ollama, and a multimodal Large Language Model (LLM) called the Large Language and Vision Assistant (LLaVA) to build and deploy a local AI assistant application that performs VQA.

How Does It Work?

  1. An image gets saved into a folder on your local machine.
  2. Gravio picks up this image and brings it into the Gravio platform as input data.
  3. Whenever a new image is detected in the folder, Gravio sends an HTTP POST request to the LLaVA model via the Ollama API, with the image and a prompt in the request body.
  4. LLaVA generates a response and returns it to Gravio.
  5. Gravio uses the AI response as part of the solution and sends it to the LINE messaging app (this step requires internet).
  6. Finally, the response is also logged in a text file.

Tutorial

Requirements

  • Gravio HubKit installed (available on Mac, Windows and Ubuntu)
  • Gravio Studio installed (available on macOS and Windows)
  • Ollama installed (Ollama is an open-source project that serves as a platform for running LLMs on your local machine)
  • Windows or macOS

Steps

Part 1: Setting up Ollama and Deploying LLaVA LLM

Step 1:

Download the Ollama installer from the Ollama website.

Step 2:

Run the installer and follow the steps accordingly.

Step 3:

Upon successful installation, we can check whether Ollama was installed properly. There are two ways to do this:

  1. System Tray (Windows)
  2. Using Command Prompt

To check using the System Tray, open your System Tray in Windows and you should see the Ollama icon.

To check using Command Prompt, open the Start Menu and type “cmd” and then press Enter.

In the command window, type “ollama” and press Enter; you should see a list of available commands and how to use them. If you do, Ollama has been installed correctly.
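For reference, running “ollama” with no arguments prints a usage summary along these lines (the exact list of commands varies by Ollama version):

Usage:
  ollama [command]

Available Commands:
  serve       Start ollama
  run         Run a model
  pull        Pull a model from a registry
  list        List models
  rm          Remove a model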

Step 4:

In this step, we will download and run the LLaVA model locally on our machine. You can download any LLM available in Ollama's model library, where you can also find out more about the available models.

Background on LLaVA: the Large Language and Vision Assistant (LLaVA) is a pre-trained, end-to-end multimodal model that combines a vision encoder with a large language model for general-purpose visual and language understanding. You can find out more about this model on their official website.

Navigate to the LLaVA model page on Ollama's website. There, you can copy the command “ollama run llava”.

Then paste the command into the same command window and press Enter. If this is your first time installing and running LLaVA, you should expect a download progress bar. Wait for the download and installation to complete; when it is done, you will be able to interact with the model directly. We are then ready to build the rest of the solution.
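Before wiring anything into Gravio, you can optionally sanity-check the model through the same HTTP API that Gravio will call later. Here is a minimal Python sketch, assuming Ollama is listening on its default port (11434) and a test image named smile.jpg sits in the working directory; both the file name and the prompt are placeholders:

import base64
import json
import urllib.request

# Read a test image and Base64-encode it, since the Ollama API
# expects images as Base64 strings
with open("smile.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

# Build the same kind of JSON body the Gravio Action will send later
body = json.dumps({
    "model": "llava",
    "prompt": "What is in this picture?",
    "stream": False,  # ask for one complete JSON response
    "images": [image_b64],
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Print just the generated answer text
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])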

Part 2: Creating Actions and Triggers in Gravio Studio

Step 1:

We need to create a folder in a location of your choice. This is the folder Gravio will watch for newly added images. In my case, I have created a folder called “ImageUpload”.

PS. you can also do this with an ONVIF camera if you have one!

Once you have created a Folder in your desired location, run Gravio HubKit and Gravio Studio. In Gravio Studio, access your localhost HubKit Node and go to the Device List page.

In the Device List, select “Add” and choose “Camera”. For the Camera Type, select Folder from the drop-down menu and specify the file path of the folder you have just created, for example C:\your\path\here. Then adjust the settings to your preference and save.

Next, we add a Layer in Gravio Studio so that the platform knows it has to pick up files from the folder. In the Device tab, select the “+” sign and Add Layer. Scroll down to “Camera” and save. Then bind the folder you have saved to the Layer and ensure that it is toggled on.

We have now enabled our folder as the input to Gravio, and the platform is watching it for new files.

Step 2:

It is time to create our Action so that we can build the next part of the solution. This Action will:

  1. Read the image file into raw format
  2. Send an HTTP POST to the Ollama API, converting the raw image into Base64 (the API expects Base64-encoded images)
  3. Send the LLaVA response as a Line message
  4. Log the AI response in a text file

We will only need 4 steps in our Action. The components are:

  1. FileRead
  2. HTTP Request
  3. Line Message
  4. FileWrite

Add a new Action by selecting the “+” on the Actions page, and name it something meaningful; I called mine LLaVAAPI.

File Read Component

Let’s set up the FileRead component so that the file to be read is the image that was just added to the ImageUpload folder. We can set this in the Pre-Mappings section like this:

cp.FileName = tv.Data
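In Gravio's mapping expressions, cp refers to the component's own properties and tv to values coming from the trigger, so this line points the FileRead component's File Name at the file that fired the trigger.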

HTTP Request

Set up the HTTP Request component according to the API configuration of Ollama and LLaVA; you can refer to the Ollama API documentation for details. First, we can set the properties:

  1. Method - POST
  2. URL - http://localhost:11434/api/generate (if you have deployed Ollama locally, you can use this exact URL)
  3. Content-Type - application/json

And we can leave the rest as default.

We also have to set the request body so that the payload is correct. To do that, use the Pre-Mappings section:
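The body must match what the Ollama generate API expects: a JSON object carrying the model name, the prompt, and the image as a Base64-encoded string:

{
  "model": "llava",
  "prompt": "What is in this picture?",
  "stream": false,
  "images": ["<Base64-encoded image bytes>"]
}

In the pre-mappings, set cv.Payload to this JSON string, Base64-encoding the image bytes produced by the FileRead step. A sketch of such a mapping (Base64Encode is an assumed function name here; check the function reference in your version of Gravio Studio):

cv.Payload = "{\"model\": \"llava\", \"prompt\": \"What is in this picture?\", \"stream\": false, \"images\": [\"" + Base64Encode(cv.Payload) + "\"]}"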

Line Message

There are two main mappings to consider when using the Line Message component in this particular workflow.

  1. The AI response arrives as a JSON structure, so we need JSONPath to filter out just the response text to use as the Line message.
  2. After the message is sent, Line returns its own confirmation response, and we do not want that to become the Payload when moving on to the final step, writing the AI response to a text file.

Note: You will need your Line token in order to use this component.

Set the Line Message component up like this:

The other fields can be left empty; the main settings are in the pre-mappings and post-mappings. In the pre-mappings:

cp.Message = JSONPath("$.response", cv.Payload)
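This works because, with streaming disabled, the Ollama API returns a single JSON object whose response field carries the generated text, roughly like this:

{"model": "llava", "created_at": "2024-05-27T06:00:00Z", "response": "The image shows ...", "done": true}

The JSONPath expression $.response extracts just that text for the Line message, leaving the rest of the JSON behind.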

Now to the final step, FileWrite.

FileWrite

We can set the FileWrite properties to the file name we would like to write to. In this case, I have used a simple text file, ai-results.txt. As for what to write into the text file, set it up in the pre-mappings like this:

cv.Payload = DateFormat(Now(), "02 Jan 06 15:04") + " LlaVa Response: " + cv.Payload + "\r"
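DateFormat here appears to use Go-style layouts, where the layout string "02 Jan 06 15:04" is a reference date written in the desired output format, so each appended line will look something like:

27 May 24 14:30 LlaVa Response: The image shows a dog sitting on a grass field.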

We are done! We can move onto building our final part, the Trigger.

Step 3:

Configuring the Trigger is straightforward. Go back to the Triggers tab and select Add Event Trigger.

Next, configure the Area, Layer and Device by setting them to the ImageUpload folder we created in Step 1.

Then, in “Actions” select the Action we have just created, in this case, LLaVAAPI.

Select “Save” and we are done! The application has been built and everything should be automated.

Part 3: Testing The Live Application

Step 1:

Prepare your own images, or find some online, for your specific test scenario. It can be anything you would like to ask the AI about.

Step 2:

Change the prompt before testing the application. To change the prompt sent to the AI, edit the LLaVAAPI Action we built: in the pre-mappings, the request body contains a prompt key. Amend its value to the question you would like to ask the AI, as shown below.
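For example, using the body sketch from Part 2, you would change only the value of the prompt key:

"prompt": "How many people are wearing safety helmets in this image?"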

Step 3:

Save or drop an image into the ImageUpload folder, or whichever folder you mapped in Gravio Studio in Part 2, Step 1. I used an image found via Google Search.

Step 4:

Watch the magic happen! Gravio should pick the image up, make an HTTP request with the image and the prompt, and get a response back from the LLaVA model. I have tried it with a few images and the responses came back as expected.

Step 5:

We can also view the results in the text file we created in the last step of the Action. Navigate to the folder where HubKit stores its data and open the text file. On Windows, this is the file path:

C:\ProgramData\HubKit\actmgr\data

I have also logged results from earlier tests in Japanese, and it works just as well. Try it with your preferred language and let us know whether it works for you too!

That’s it! We have officially built a versatile VQA solution, and there are many use cases you can apply it to almost out of the box. Although Line requires internet connectivity, you can remove that step and simply log the responses to a text file, a CSV, or even SQL. Let us know what you would like to create and we can show you in the next tutorial! See you soon!

Summary

If you want to build a solution without coding and with very short development timeframes, you can use Gravio as the middleware that connects various services; in this case Ollama, Line, and a text file. We have successfully built a local VQA application that ingests images from a folder and returns a response from an AI.

If you have any questions or need help, don’t hesitate to get in touch with us either on gravio.com (chat in the bottom right corner) or on https://link.gravio.com/slack - We’re excited to see what you will build with this technology!
