But first, a bit of history…
Some time ago, I came across a post on LinkedIn in which someone showcased a video of himself sitting in front of his computer, opening and closing his left palm; by doing so, he was controlling the T-Rex sprite and dodging cacti (the plural of cactus :)) in what is known as the T-Rex Run game.
At first, I found the entire post interesting. Then I paused for a while and thought about the technologies used: Machine Learning, Deep Neural Networks, OpenCV, Keras. Really? That seemed a bit blown out of proportion, but I was still prepared to give the guy the benefit of the doubt.
However, I was really disappointed that (at that time at least) the author was basically avoiding sharing any code. You can try and search for it on LinkedIn; I decided not to share the link to the article.
So I decided to make an attempt at replicating the effort (no matter how pointless the application's context really is, it could still provide the learning foundations for future projects)!
I had already written a piece of code which I could potentially use to control the sprite, provided the remaining portion of the code could interpret or recognize my hand gestures.
My immediate reaction was to look for ‘something’ already done to fill in the first half of the solution (hand gesture recognition), while I would fill in the second half (sending virtual keystrokes to a web browser running the T-Rex game based on the recognized hand gesture).
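That second half can be sketched quite simply: map a recognized gesture label to a key press and send it to the active window. The gesture labels and the mapping below are hypothetical placeholders, and pyautogui is just one library capable of sending virtual keystrokes — this is a sketch, not the code from the post.

```python
# Sketch: map a recognized gesture label to a virtual keystroke.
# The gesture labels and key mapping are hypothetical placeholders.

GESTURE_TO_KEY = {
    "open_palm": "space",   # jump over a cactus
    "closed_fist": "down",  # duck
}

def key_for_gesture(gesture):
    """Return the key mapped to a gesture, or None to do nothing."""
    return GESTURE_TO_KEY.get(gesture)

def send_keystroke(gesture):
    """Press the mapped key in the currently focused window."""
    key = key_for_gesture(gesture)
    if key is not None:
        import pyautogui  # imported lazily; needs a desktop session
        pyautogui.press(key)
    return key
```

The recognizer would simply call `send_keystroke(...)` once per classified frame; any gesture without a mapping is ignored.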
… and I was much disappointed here as well. I found plenty of examples (some of them shared on GitHub too!) – but none of them worked straight off the bat! From minor typos to fundamental bugs which, no matter how much I tried correcting them, would take me one step closer but never across the finish line!
So I decided to do everything on my own – from scratch!
- Training a Neural Network to Detect Gestures with OpenCV in Python
I like these two articles because they provided the most information in the most structured way. Well elaborated and well explained. Well done!
Back to my problem… There are basically two ways you could approach this:
- applying computer vision concepts and geometry algorithms, or
- applying machine learning concepts and algorithms
Although computer vision is a concept frequently employed in machine learning, in the context of my solution the two approaches are fundamentally different.
- The 1st approach is a static one – design, deploy and forget (well, not quite forget – one could always improve). It is also susceptible to background noise: the background color affects the outcome, and the skin tone plays a big role too. There are plenty of things to consider.
- The 2nd approach is a dynamic one. You create a model, you train (and test) it, and you deploy it. It is less susceptible to background noise and more agnostic of skin tone, for example. Over time, you could improve your model by providing more training data (to be fair, this step is definitely a must!).
And there’s plenty more…
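To make the first approach's fragility concrete, here is a minimal sketch of its typical opening step: a fixed-range skin mask. The HSV bounds below are illustrative assumptions, not values from my code — and that is precisely the point: a change in lighting, background or skin tone pushes pixels outside the fixed box.

```python
import numpy as np

# Sketch of the static (geometry) approach's first step: a fixed-range
# skin mask over an HSV image. The bounds are illustrative assumptions;
# pixels outside this box (different lighting, different skin tone)
# are simply lost, which is why the approach is scene-sensitive.

LOWER = np.array([0, 48, 80])     # assumed lower HSV bound for skin
UPPER = np.array([20, 255, 255])  # assumed upper HSV bound for skin

def skin_mask(hsv_image):
    """Return a boolean mask of pixels inside the fixed HSV range."""
    return np.all((hsv_image >= LOWER) & (hsv_image <= UPPER), axis=-1)

# A tiny 2x2 "image": only the top-left pixel falls inside the range.
demo = np.array([[[10, 100, 120], [30, 100, 120]],
                 [[10,  10, 120], [10, 100,  50]]])
mask = skin_mask(demo)
```

In a real pipeline the mask would feed contour extraction and the convex hull computation; the machine learning approach sidesteps these hand-tuned bounds entirely.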
At first, I started working on live frame captures, then realized it would be a lot easier if I developed a working principle using just static images. Furthermore, for simplicity’s sake, I pulled some hand gesture images off the Internet, modified them a bit and started coding.
Have a look at the code in a Jupyter notebook format:
My approach was to apply the same algorithm to all five hand gestures and examine the results. Here’s what I came up with:
The white circles represent a cluster of vertices, and the red circles represent a singled-out (or “filtered”) vertex of the convex hull.
If we go back to the definition of a convex hull (the minimum n-sided convex polygon that completely encloses an object), we shouldn’t be that surprised by the outcome. And mind you, these are theoretically ideal samples. Nevertheless, it seems the number 5 hand gesture is best represented!
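The “singled-out” red circles can be produced by a simple distance filter over the hull vertices: walk the vertices and keep one representative per cluster of nearby points. A minimal pure-Python sketch of that filtering step (the pixel threshold is a hypothetical value, and the input mimics what `cv2.convexHull` would return):

```python
import math

def filter_hull_vertices(vertices, min_dist=20.0):
    """Collapse clusters of nearby convex-hull vertices to single points.

    vertices: list of (x, y) tuples, e.g. unpacked from cv2.convexHull.
    min_dist: hypothetical pixel threshold; a vertex closer than this
              to an already-kept vertex belongs to the same cluster.
    """
    kept = []
    for x, y in vertices:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept

# Two tight clusters of hull vertices, around (0, 0) and (100, 0):
hull = [(0, 0), (3, 2), (1, 4), (100, 0), (102, 3)]
filtered = filter_hull_vertices(hull)  # one representative per cluster
```

This greedy pass keeps the first vertex of each cluster, which is good enough to turn the white circles into red ones — deciding which of the kept vertices are actually fingertips is a separate problem.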
It’s clear I need to improve my code and identify only those vertices representing the fingertips. But that is for Part 2.