How to Make Your Own Computer Vision Model With Little to No Experience

Jaylen Schelb
15 min read · Dec 10, 2021


Photo by Pietro Jeng on Unsplash

Take a moment to look around you. What objects do you see? Your phone? Maybe a water bottle? What else? Now, how were you able to identify those objects just by looking at them? Kind of a silly question; we as humans have an exceptional ability of visually identifying objects that we are familiar with. What about computers, though?


Computer vision, more so deep learning as a whole, has seen massive innovation in recent years all across the board. Take self-driving cars, for example. Companies like Tesla have taken the automotive world and turned it completely upside down with autonomous technology thanks in part to innovations in areas like object detection models.

Object detection models, like the name implies, are deep learning models that are specifically trained to detect a certain set of objects. If the model is working properly, then it is able to detect specific objects with almost startling accuracy, like in the video demo below.

Pretty cool, right? Now, one may think that a model like this requires great amounts of computing power and expertise to create; if you had said this ten years ago, you would have been correct. In today’s world, however, creating an object detection model has become easier than ever thanks to newer technologies and projects like TensorFlow and YOLO (we will get into these a little bit later).

Earlier in the year, I was tasked with creating an object detection model for a web app. At first, this sounded almost impossible for someone like me to take on. Sure, I had some previous machine learning experience under my belt, but that was really it as far as data science went. Up until that point, deep learning and computer vision were concepts that I never imagined myself being able to tackle this early in my journey through the subject. Nevertheless, I accepted the seemingly arduous task and began researching. With some great mentors by my side, I slowly started to realize that this wasn’t going to be nearly as bad as I initially thought. This isn’t to say that it was easy by any stretch, but I was seeing that it definitely was more than just possible.

Long story short, I was able to complete my task and get a working model implemented in the web app, making predictions off of a user’s video feed. Despite this, it felt like half the time I was just clicking buttons without truly knowing what I was doing. I could always go to my mentors and teachers for help and clarification, but the whole process was still a bumpy ride to say the least. After I completed the initial model, I was able to enroll in a deep learning class at my university. Let me tell you, this seven-week course cleared up just about every misunderstanding I had with the whole process and beyond. I felt refreshed knowing that I finally understood the puzzle as a whole, not just random pieces here and there.

This leads me to the purpose of this article: I want to walk people through making an object detection model of their own, while explaining the reasoning behind each step so as to maintain a clear level of understanding throughout the entire process. Don’t get me wrong, these models probably aren’t going to be used in any self-driving vehicles anytime soon, but the notion that technologies like these are out of reach for the average tech enthusiast is completely false. It is with this article that I hope to shed some light on the true accessibility and sheer power that these technologies and projects bring to the public.

Getting Started

Every object detection model has a purpose. What will yours be? In my case, my model was trained to identify various parts of the human throat for a medical-related app. Make sure your model is meant to identify actual, concrete objects, like a chair or a table. Although there are models out there that can identify abstract things like emotions, implementing those will not be covered here.

Acquiring Images

Once you have decided what you want your model to identify, it is time to find some images to train your model on. This step is often very tedious and time-consuming unless you are lucky enough to find a nice existing dataset. When selecting your images, be mindful of what you are or aren’t including. Remember that your dataset is the foundation of your model, and without a solid foundation, your model will never perform well, no matter how much you try to optimize it. It is very important to find images that closely replicate what your model will be analyzing in the wild, ideally with many images of the same scene captured with variations in camera, lighting, and angle. Keep in mind that your model will also learn from what is not there, so adding “negative” images (images that do not contain anything you want to identify) to the dataset is beneficial. In my case, my model would be frequently exposed to the background of various rooms in a house, so I made sure to put in negative images of living rooms, bedrooms, bathrooms, etc. Aim to have roughly 0–10% of your images be negative.

How many images should I add to my dataset?

For a generic object detection model, many experts recommend at least 1,000 images for good results. This number of quality images is not easily attainable in many circumstances, but it is a recommended long-term goal at the very least, especially if you plan to put your model into production in the future. Nevertheless, a dataset of only a couple hundred images should suffice for observing initial results.

Annotating Images

Before we dive in, I want to first explain the differences between a few computer vision techniques.


Image classification is somewhat self-explanatory; each image is put into a category depending on what the model finds in it. For example, one could have an image classification model that determines whether or not a cat appears in an image. Image classification is one of the least resource-intensive techniques, as the model is not required to locate anything specific in an image, only to assign it to a category (class). Contrast this with object detection, where the model is required to locate something specific in the image with a bounding box. A bounding box is simply a rectangular region drawn around an object in an image. Finally, semantic segmentation is similar to object detection, except that the object is outlined exactly instead of with a standard bounding box. This technique unsurprisingly takes the most computing power of the three. As stated previously, we will be working with object detection models for this walkthrough.

I wanted to bring up these distinctions because each technique is annotated, or labeled, differently. This means that images annotated for an object detection model will not work for training a semantic segmentation or image classification model, and vice versa. In a landscape where these model types are talked about quite frequently, it is important to be clear that, although similar, the processes for making these models differ from one another right from the start.

There are many great online annotation services out there. Two good examples are Roboflow and IBM Cloud Annotations; both are great services and offer much more beyond just annotations. Since Roboflow is integrated with YOLO (the model architecture we will be using), it would probably be more seamless to start there. I, however, used IBM Cloud Annotations for my dataset. Both offer a really nice feature called auto labeling, which takes an existing model trained on the same objects and uses it to aid you in annotating more images. If the model being used is accurate enough, this can greatly reduce the time it takes to annotate images. Read more about this here and here. Granted, this feature is most likely only useful once you get your first iteration of your model up and running, but it is nevertheless good to keep in mind, with hopes of potentially saving loads of time down the road.

CONTENT WARNING: THE REST OF THE ARTICLE CONTAINS IMAGES OF HUMAN MOUTHS/THROATS. IF THIS MIGHT BE A PROBLEM FOR YOU, DO NOT SCROLL ANY FURTHER.

Tips for annotations

Annotating images can seem quite trivial at first, but know that there are some best practices to take note of. First off, make sure to draw the boxes as tightly as possible around the object, while still keeping the entire object within the box.

Screenshot of IBM Cloud Annotations interface

Sometimes the bounding box may contain a good amount more than just the object itself. In the case of the tongue label below, this is required, as the bounding box is as tight as it can go while still encasing the entire object.

Even if the object is occluded by another object like with the label below, it is still imperative to draw the box around the entire object no matter what. As previously stated, your model will learn from what is not there just as much from what is, so omitting all or some of an object that your model was built to identify makes for potential problems in the future.

Splitting Images

Great, we have our images and they are all labeled. Now what? After uploading my images to Roboflow, I can split them into training, validation, and testing sets.

Screenshot from Roboflow interface

The training set refers to the images that the model will be trained on for each training run, the validation set refers to the images that the model will be tested against for each training run, and the testing set refers to the images that are set aside to assess the performance of the model after training.

How should I split my data?

There is no definitive answer, but you will always want a majority of your images in your training set. In past iterations of my model I have used 70%/20%/10%, 80%/10%/10%, and other similar splits (train/valid/test).
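If you are splitting a folder of images by hand rather than with Roboflow's slider, a random shuffle-and-copy is all it takes. Here is a minimal sketch, assuming a hypothetical dataset/images folder of JPEGs:

```python
import random
import shutil
from pathlib import Path

# Hypothetical layout: all annotated images live in dataset/images.
source_dir = Path("dataset/images")
splits = {"train": 0.7, "valid": 0.2, "test": 0.1}

images = sorted(source_dir.glob("*.jpg"))
random.seed(42)        # fixed seed so the split is reproducible
random.shuffle(images)

start = 0
for name, fraction in splits.items():
    count = round(len(images) * fraction)  # rounding may leave a stray image or two at the end
    split_dir = Path("dataset") / name
    split_dir.mkdir(parents=True, exist_ok=True)
    for img in images[start:start + count]:
        shutil.copy(img, split_dir / img.name)
    start += count
```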

Preprocessing

Preprocessing refers to a wide range of deterministic steps performed on all images prior to training. Within Roboflow, I can easily select what kind of preprocessing I want to apply.

In my case (and probably yours, too) all that is needed is to auto-orient and resize the images. Resizing is self-explanatory, while auto-orient simply discards EXIF rotations and standardizes pixel ordering.
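To make those two steps concrete, here is a small sketch of what they amount to if you were doing them yourself with Pillow; the folder path is a placeholder:

```python
from pathlib import Path
from PIL import Image, ImageOps

# Hypothetical folder of training images.
for path in Path("dataset/train").glob("*.jpg"):
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)  # auto-orient: bake the EXIF rotation into the pixels
    img = img.resize((640, 640))        # stretch to the 640x640 size used later in training
    img.save(path)
```

Conveniently, YOLO-format labels are stored as coordinates normalized to the image size, so a plain resize like this does not invalidate existing annotations.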

Roboflow offers a handful of other preprocessing options as well. Explaining each one is out of scope for this article, but clicking on an option brings up a nice explanation and demonstration, so feel free to explore!

Augmentation

The next step in the pipeline is augmentation. Augmentation is the process of creating more training examples by distorting the input images in some way, shape, or form. These distortions cover a multitude of things including flipping, blurring, and even adding noise to an image.
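To give a feel for what those distortions look like in practice, here is a tiny illustrative sketch using Pillow and NumPy on a hypothetical example.jpg:

```python
import numpy as np
from PIL import Image, ImageFilter, ImageOps

img = Image.open("example.jpg")  # hypothetical input image

flipped = ImageOps.mirror(img)                     # horizontal flip
blurred = img.filter(ImageFilter.GaussianBlur(2))  # mild blur

# Gaussian pixel noise
pixels = np.asarray(img).astype(np.float32)
noisy_pixels = np.clip(pixels + np.random.normal(0, 15, pixels.shape), 0, 255)
noisy = Image.fromarray(noisy_pixels.astype(np.uint8))
```

Note that for object detection, geometric augmentations like the flip also have to transform the bounding box coordinates, which is exactly the kind of bookkeeping the tooling handles for you.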

Just like with preprocessing, Roboflow offers numerous augmentation options. Clicking on each option within the site will bring up an explanation, along with an article detailing when to use it. For our case, YOLO actually does online augmentation during training, so it is not recommended that any augmentations get applied here.

Exporting the Dataset

With all of the parameters set, the dataset can now be generated. Once generated, it needs to be exported in the correct format.

For Roboflow in particular, exporting to the YOLO format is exceptionally easy. There are a few routes you can take for obtaining the exported dataset; since I trained my model in Google Colab, I opted for the download code for simplicity’s sake.
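For reference, the download code Roboflow generates looks roughly like the snippet below; the API key, workspace, project, and version number are placeholders that Roboflow fills in for you when you copy it:

```python
# Run inside the Colab notebook; all names below are placeholders.
!pip install roboflow

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("yolov5")  # pulls the dataset down in YOLO format
```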

Setting up the Model

It is now time to start working with the actual model architecture! YOLO provides a super nice tutorial notebook to help you along, which also covers some of the previous steps explained above. It is encouraged that you start with this notebook for training, modifying things to your liking as time goes on. Since the notebook guides you nicely through the process, I am going to focus less on step-by-step instructions and more on what each step actually means.
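For orientation, the setup cells at the top of the notebook boil down to roughly the following: cloning the repository and installing its requirements.

```python
# Roughly what the notebook's setup cells do (run in Colab).
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!pip install -r requirements.txt

import torch
print(f"Setup complete. Using torch {torch.__version__}")
```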

Training the Model

For many people, this is the point where things can start to get a little confusing. Take a look at the training line in the notebook:

Screenshot from YOLO tutorial notebook
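In text form, that cell is roughly the following; the values are only examples, and {dataset.location} assumes you pulled the data down with the Roboflow snippet from earlier:

```python
# Rough shape of the training cell; adjust values and paths to your own project.
!python train.py --img 640 --batch 16 --epochs 300 \
    --data {dataset.location}/data.yaml --weights yolov5s.pt --cache
```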

Terms like data and cache make sense, but what about the others? What is all of this doing? Let’s break it down.

img

The ‘img’ argument defines a specific size for all of the images used in training. Many object detection models are trained to analyze a specific image size, and this one is no different. So what should your image size be? Well, it kind of depends. In most cases, a smaller image size (320x320, 640x640, etc.) is desired in favor of model speed. There are some cases, though, where a higher resolution (1280x1280 and beyond) is a good idea. If, for example, the object your model is identifying is always tiny relative to the image, then the model could benefit from a higher resolution. This inherently introduces the balancing act of model speed versus model performance. Speed is favored in most situations, so going with a smaller image size at first is probably the best idea. For my model I stuck with the default image size of 640x640 (as seen in the preprocessing step) and carried it over to my training.

batch

‘batch’ refers to batch size, the number of images the model processes in each training step. Larger batch sizes require more memory per step and also play a role in model accuracy, although the relationship is far from black and white; the optimal batch size depends on many factors, and a bigger batch size does not automatically mean a better model. Because of this ambiguity, it is a good idea to stick with the default size at first and adjust it in future iterations if need be.

epochs

Epochs refers to the number of times to run through your training data. As the training cell describes, 3000+ epochs are commonplace. Despite this, I suggest starting your first training around 300 epochs and going from there.

weights

Weights refer to the parameters of a model that aid in making decisions. This is a very high level and vague definition as a proper definition requires understanding of how a neural network (deep learning model) works in general, which is out of scope here. There are plenty of resources available on the internet for enlightening yourself on this topic. For now, know that weights are what gives a model its decision-making ability.

Why are we selecting weights when we haven’t trained a model yet?

In this case, YOLO takes advantage of a technique called transfer learning. Transfer learning involves using pre-trained weights to aid in a model’s training: you are essentially “transferring” the knowledge learned from one dataset and then fine-tuning it for the new domain. Imagine you had an object detection model that could detect cars and trucks, and you now want to create a new model that detects motorcycles. Instead of starting from scratch, what if you could take some of the knowledge from the cars-and-trucks model and use it to aid in training? After all, cars and trucks are quite similar to motorcycles in the grand scheme of things, so it is not all that absurd to consider something of the sort. That, at a high level, is what transfer learning does.

In this case, YOLO provides four different pre-trained models to choose from.


Each is trained on a variation of the COCO dataset, meaning that many object detection models can benefit from using one of them as a starting point.

Which pre-trained model should I use?

When choosing a pre-trained model for transfer learning, it is best to use the smallest and fastest one practical enough for your task. Although the larger models provide more accuracy, they are also much slower. This means that if the YOLOv5s model does the trick, there should be little incentive to switch to anything bigger.

Observing Training

With that all out of the way, we can now comfortably start training our model. Another really nice feature of YOLO is its integration with W&B (Weights & Biases). If you completed this step earlier in the notebook, then you are able to log in and see a complete rundown of what is happening during training.
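If you skipped that step, the setup is roughly just installing the wandb client and logging in with your account before kicking off training (the exact cell may differ between notebook versions):

```python
# Rough sketch of the W&B setup step.
!pip install wandb
import wandb
wandb.login()
```

Here is what some of the metrics might look like for a given training run.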

Screenshot from W&B interface

So what do all of these metrics mean? Let’s take a look.

mAP

mAP stands for mean average precision. Put simply, it compares the bounding boxes in the ground-truth annotations to the boxes the model draws; the closer the match, the better the score. This should make a strong case for ensuring bounding boxes are highly accurate during image annotation.
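Under the hood, that box comparison is built on IoU (intersection over union), which is simple enough to compute yourself; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # partially overlapping boxes -> ~0.39
```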

Precision

Precision measures how many of the model’s detections were actually correct. Do not let this lead you into thinking that it is the end-all-be-all of metrics, however. Although a good indication of how your model is performing, there is much more to analyzing a model’s performance than precision alone.

Recall

Without getting into the weeds, recall measures how many of the objects that were actually present the model managed to find. A higher recall is better.
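If it helps, both precision and recall boil down to simple ratios of true positives, false positives, and false negatives; a quick illustrative sketch with made-up numbers:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    # Precision: of everything the model detected, how much was actually correct?
    precision = true_positives / (true_positives + false_positives)
    # Recall: of everything actually present, how much did the model find?
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# e.g. 80 correct detections, 10 false alarms, 20 missed objects
print(precision_recall(80, 10, 20))  # -> (~0.89, 0.80)
```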

Optimizing the Model

At this point, the first iteration of your model is complete! Congratulations! You can now see how your model performs on testing images never before seen by it. How did it do?

Along comes the process of optimization to increase performance. One of the greatest things you can do to increase the overall quality of your model is to add more high-quality annotated images to the dataset. Diminishing returns do play a role in this, but the statement still stands.

The next step to optimization is fine-tuning the hyperparameters.


YOLO exposes a long list of tunable hyperparameters (found here), and this is also where the online augmentation can be modified. After initially training with the default hyperparameters, feel free to start tweaking different values, retraining, and seeing what happens. Like many things in this field, tuning hyperparameters is an art that you will get more comfortable with over time.
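One low-friction way to experiment is to copy the default hyperparameter file, tweak a few values, and pass the copy back in with the --hyp flag. The file name and path below are placeholders and vary between YOLOv5 versions, so check your own checkout:

```python
import yaml

# Path and file name are placeholders; they differ between YOLOv5 versions.
with open("data/hyps/hyp.scratch-low.yaml") as f:
    hyp = yaml.safe_load(f)

hyp["lr0"] = 0.005    # smaller initial learning rate
hyp["fliplr"] = 0.0   # turn off horizontal flips (e.g. if left/right orientation matters)

with open("hyp.custom.yaml", "w") as f:
    yaml.safe_dump(hyp, f)

# Then retrain with the custom file:
# !python train.py --img 640 --batch 16 --epochs 300 \
#     --data {dataset.location}/data.yaml --weights yolov5s.pt --hyp hyp.custom.yaml
```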

Overfitting

One thing you need to be mindful of when training is overfitting. Overfitting refers to a model fitting too closely to its training data. This may look like a good sign at first, but it unfortunately means that your model has most likely memorized patterns specific to your dataset and cannot properly handle any new data that you throw at it. Overfitting can be detected by observing that the model is doing significantly better on the training set than on the testing set. In addition, validation metrics (the metrics shown a couple of images above) that start to dip and rise drastically are also a sign of overfitting.

As an initial attempt to reduce/eliminate overfitting, try lowering the number of training epochs. Increasing some of the augmentation hyperparameters is a good tactic as well.

Further Optimization

All in all, model optimization is a broad term that encompasses many methods and techniques. For a more in-depth look at what more you can do to specifically improve a YOLO model, check out this resource from the creators themselves.

Model Export

Okay, so now you have this fancy new object detection model; how do you do anything with it? YOLO provides numerous export formats to work with, each with its own intended use. For example, since my throat model was developed to run in a web browser, I exported it in a TensorFlow web model format. Here is a screenshot of the model analyzing my phone video from a web browser.

Model implementation beyond export is quite user-specific, but rest assured that the production possibilities are endless with ample time and research. To get started with exporting in specific formats, view the documentation here.
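For reference, exports go through YOLO's export script; for the TensorFlow.js web format I used, the step looks roughly like the line below. The weights path depends on your training run, and the available --include values vary by YOLOv5 version, so treat this as a sketch rather than a recipe.

```python
# Rough shape of the export step; adjust the weights path to your own run.
!python export.py --weights runs/train/exp/weights/best.pt --include tfjs
```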

Conclusion

Hopefully by now you have a nicely working object detection model that is ready to be put to use! Not as hard as you may have once thought, right? If you enjoyed and/or were inspired by the work that you did alongside me, I highly encourage you to dive deeper into this! Things like object detection models are only scratching the surface of what the field of data science is capable of. The road is never ending!

Resources

The following resources are great starting points in getting to know more about object detection, YOLO, and just data science in general.

YOLOv5 GitHub Page

Roboflow Documentation

Roboflow Glossary of Common Computer Vision Terms

How a Neural Network Works
