Revolutionary AI Camera App for Visually Impaired
Table of Contents
- Introduction
- The Need for an Intelligent Camera App
- What is VQA?
- Project Workflow
- Solution Architecture
- Technical Architecture
- Modeling Process
- Deployment
- Demo
- Results and Conclusion
Introduction
Helen Keller once said, "There is no better way to thank God for your sight than by giving a helping hand to someone in the dark." With this in mind, Manisha, Janvi, and Hashita have created an app that helps transform the visual world into an audible experience. More than 338 million people worldwide live with visual impairment, and this app aims to make their lives easier by providing information about who and what is around them.
The Need for an Intelligent Camera App
People with visual impairments navigate daily life just like anyone else, but sometimes the simplest things can become difficult to manage. Tasks like locating a matching pair of socks or telling a bottle of shampoo apart from a bottle of conditioner can be challenging. They need help with these tasks, not hand-holding, just a helping hand or, rather, a helping pair of eyes. Today's technology creates an environment where visual impairment doesn't have to hold people back. With that in mind, the aim of this project was to build an intelligent camera app that provides information about the world around you, which is particularly useful for a person with visual impairment.
What is VQA?
In general, Visual Question Answering (VQA) systems attempt to correctly answer questions in natural language regarding the input image. The broader goal is to design systems that can understand the components of an image, much as humans do, and communicate effectively about the image and what is happening in it in natural language. This is a challenging task, since it requires image-based and language-based models to interact and complement each other.
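To make the idea concrete, here is a minimal sketch of asking a question about a local image with the Hugging Face transformers pipeline and a publicly available ViLT checkpoint; the image file name and question are placeholders, not part of our app.

```python
# Minimal VQA sketch: image + natural-language question -> short answer.
# "street.jpg" and the question are placeholders for illustration only.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(image="street.jpg", question="Is the traffic light green?")
print(result[0]["answer"], result[0]["score"])  # e.g. "no" plus a confidence score
```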
Project Workflow
As part of our proof of concept, we collected data, performed the required preprocessing, and stored the data in GCS buckets. We also built a baseline model. In the prototyping stage, we started enhancing the baseline model, building a few APIs and front-end containers, and working on the deployment setup. In the final stage, we were able to build and deploy the proposed VQA app.
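A minimal sketch of the "store the data in GCS buckets" step, assuming the google-cloud-storage client; the bucket and file names below are hypothetical.

```python
# Hypothetical example: push one preprocessed file to a GCS bucket.
from google.cloud import storage

def upload_to_gcs(bucket_name: str, local_path: str, blob_name: str) -> None:
    """Upload a single preprocessed file to Google Cloud Storage."""
    client = storage.Client()                      # uses application-default credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)

upload_to_gcs("vqa-poc-data", "train_questions.json", "preprocessed/train_questions.json")
```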
Solution Architecture
The solution architecture identifies the building blocks of our app. In our case, the process involves the data scientist, who builds the model, and the user, who interacts with the app by uploading images, asking questions, and getting answers. On the execution side, the data scientist experiments with different models, trains them in Colab, and tracks their performance. The user's interaction with the app is handled by the front-end code, which returns results by communicating with the API service that gets the model prediction. The state describes how the data and models are managed; we use a GCS bucket for this.
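To make the front-end-to-API interaction concrete, here is a rough sketch of how the front end could request a prediction; the endpoint name, port, and field names are assumptions for illustration, not the app's actual contract.

```python
# Hypothetical client-side call from the front end to the API service.
import requests

def ask(api_url: str, image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        response = requests.post(
            f"{api_url}/predict",              # assumed endpoint name
            files={"image": f},
            data={"question": question},
            timeout=30,
        )
    response.raise_for_status()
    return response.json()["answer"]

print(ask("http://localhost:9000", "kitchen.jpg", "Is there a cup on the table?"))
```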
Technical Architecture
The technical architecture provides a blueprint of the system and illustrates how each component connects to the others. The code was written in IDEs, and each component was containerized during development of the app. Code changes were pushed to GitHub for source control. Colab notebooks were used for running the different models, which were stored in GCS buckets. Once development was done, all the container images were pushed to the Google Container Registry. The app was then deployed by creating a simple compute instance and setting up the containers in this VM. In our case, the front-end container and the API service container were set up in the VM, which also communicated with GCS to load the model weights. An Nginx container acts as the web server that serves the web app to the user.
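As a sketch of the "load the model weights from GCS" step, again assuming the google-cloud-storage client; the bucket name, object path, and local path are hypothetical.

```python
# Hypothetical example: the API container pulls model weights from GCS at startup.
import os
from google.cloud import storage

WEIGHTS_BUCKET = "vqa-app-models"          # assumed bucket name
WEIGHTS_BLOB = "vilt/best_model.pt"        # assumed object path
LOCAL_PATH = "/app/models/best_model.pt"   # assumed location inside the container

def download_weights() -> str:
    os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)
    client = storage.Client()
    client.bucket(WEIGHTS_BUCKET).blob(WEIGHTS_BLOB).download_to_filename(LOCAL_PATH)
    return LOCAL_PATH
```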
Modeling Process
We used the publicly available VQA dataset because it is larger than the alternatives. The dataset has over 120,000 images with at least three questions per image, and each question has ten ground-truth answers along with one given answer. We tried three different VQA models. The first was a CNN combined with an LSTM; both are heavy models, so inference was slow and the accuracy was not great either. We then tried to make it a little faster by using BERT as our language model. This drastically improved our accuracy to 51%, but the inference time was still quite high. We then came across VILT, which improved the accuracy of our baseline model by about 30% and gave satisfactory results.
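For reference, the VQA benchmark scores a predicted answer against the ten ground-truth answers; the sketch below shows the commonly used simplified form of that metric (we assume accuracy figures like the ones above are computed in roughly this way, and the example answers are made up).

```python
# Standard VQA accuracy in its common simplified form: an answer counts as fully
# correct if at least 3 of the 10 annotators gave it. The official metric also
# averages over subsets of nine annotators, which is omitted here for brevity.
def vqa_accuracy(predicted: str, ground_truth_answers: list[str]) -> float:
    matches = sum(gt.strip().lower() == predicted.strip().lower()
                  for gt in ground_truth_answers)
    return min(matches / 3.0, 1.0)

answers = ["blue", "blue", "navy", "blue", "dark blue",
           "blue", "blue", "navy", "blue", "blue"]
print(vqa_accuracy("blue", answers))   # 1.0: at least three annotators said "blue"
print(vqa_accuracy("navy", answers))   # 0.67: only two annotators said "navy"
```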
Deployment
We have modularized our app into three major microservices. The first is the front-end service, which contains all the HTML and CSS components. It talks to the back end, the API service, which is built using FastAPI; the API service uses the data tracker service, which continuously keeps track of the model we are experimenting with. We containerized each of these services using Docker during local development. Once development was done successfully on the local system, we pushed all of this to GCP using an Ansible script.
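A minimal sketch of what the FastAPI-based API service could look like, assuming a /predict endpoint that matches the front-end call shown earlier and a publicly available ViLT checkpoint; the real service also coordinates with the data tracker, which is omitted here.

```python
# Hypothetical FastAPI endpoint: accept an image and a question, return an answer.
import io

import torch
from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

app = FastAPI()
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

@app.post("/predict")
async def predict(image: UploadFile = File(...), question: str = Form(...)):
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    encoding = processor(pil_image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    answer = model.config.id2label[logits.argmax(-1).item()]
    return {"question": question, "answer": answer}
```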
Demo
Our app is integrated with the VILT model for getting the answers because that was the best-performing model for us. The app also supports speech-to-text and text-to-speech. The user can either take a picture or upload an image, and then ask a question aloud or type it in. The answer is returned in both text and audio format.
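One way to produce the audio version of an answer on the server side is a text-to-speech library such as gTTS; this is an illustrative option rather than necessarily what the app uses (a browser speech API would work just as well), and the answer string below is made up.

```python
# Illustrative text-to-speech step, assuming the gTTS library.
from gtts import gTTS

def answer_to_audio(answer: str, out_path: str = "answer.mp3") -> str:
    """Convert the model's text answer into an MP3 the app can play back."""
    gTTS(text=answer, lang="en").save(out_path)
    return out_path

answer_to_audio("There is a red mug on the table.")
```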
Results and Conclusion
We trained three different VQA models. VILT improved the accuracy of our baseline model by about 30% and gave satisfactory results. We could do much better by fine-tuning our models further on more realistic datasets: instead of artificially generated VQA datasets, we could use data from real-world scenarios. We could also personalize the app to recognize more context-specific scenarios and train the model to read documents and answer questions related to them, which would help visually impaired people enormously.
Highlights
- An intelligent camera app that provides information about the world around you, which is particularly useful for a person with visual impairment.
- VQA systems attempt to correctly answer questions in natural language regarding the input image.
- We used the publicly available VQA dataset due to its larger size as compared with the others.
- We came across VILT, which improved the accuracy of our baseline model by about 30% and gave satisfactory results.
- We can train our model to read documents and answer questions related to the documents.
FAQ
Q: What is VQA?
A: Visual Question Answering (VQA) systems attempt to correctly answer questions in natural language regarding the input image.
Q: What dataset did you use for your project?
A: We used the publicly available VQA dataset due to its larger size as compared with the others.
Q: What is VILT?
A: VILT is a model that drastically simplifies the processing of visual inputs by handling them in the same convolution-free manner as textual inputs. This makes VILT up to 10 times faster than previous VLP models.
Q: Can your app read documents and answer questions related to the documents?
A: Yes, we can train our model to read documents and answer questions related to the documents.