It is a feature-based simultaneous localization and mapping (SLAM) system made in NodeJs. Its objective is to generate 3D maps from video input. This isn't new, and better alternatives already exist both in the corporate and in the open source world, although I wasn't able to find one in NodeJs. This is a complex problem, and because we only use video input and no sensor data (such as GPS or odometry), we are bound to find noise in our maps.
Someone has to do the heavy lifting; in this case we use OpenCV, a library focused on computer vision. To start things off we have to load a video from NodeJs, hand it to OpenCV and play it frame by frame.
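As a rough sketch of that frame loop, assuming the opencv4nodejs bindings and a placeholder file name, it could look something like this:

```js
// Sketch: reading a video frame by frame with the opencv4nodejs bindings.
// 'room.mp4' is just a placeholder for whatever video you feed in.
const cv = require('opencv4nodejs');

const cap = new cv.VideoCapture('room.mp4');

let frame = cap.read();
while (!frame.empty) {
  // work on the current frame here (feature extraction, matching, ...)
  cv.imshow('frame', frame);
  cv.waitKey(1);

  frame = cap.read();
}
```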
Next we find which features are the most desirable in each frame (a feature is something that is easy to spot right away in a picture, from a computer's point of view). There is a ton of research on what makes a feature good or bad; OpenCV already implements functions to search for them using ORB detection. This process is known as feature extraction. We can then display all the features on each frame.
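A minimal sketch of the extraction step, again assuming opencv4nodejs; the feature count of 3000 is just an example value to play with:

```js
// Sketch: ORB feature extraction on a single frame (opencv4nodejs, API assumed).
const cv = require('opencv4nodejs');

const orb = new cv.ORBDetector(3000); // maximum number of features to extract

function extractFeatures(frame) {
  const gray = frame.bgrToGray();
  const keyPoints = orb.detect(gray);               // where the features are
  const descriptors = orb.compute(gray, keyPoints); // what they look like
  return { keyPoints, descriptors };
}

// Draw the detected features on the frame so we can inspect them visually.
function drawFeatures(frame, keyPoints) {
  return cv.drawKeyPoints(frame, keyPoints);
}
```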
Next we need to somehow link old features with new ones: if we see a feature, it's likely it will be spotted again in the following frame. This is called feature matching. The idea is to keep all the features that have been seen at least twice. There is more to how features are matched and which ones get discarded, but for now we can just fiddle with OpenCV until we see good matches. Some parameters we can adjust are the total number of features to extract and the focal length of the camera used.
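A sketch of how that matching could look, using opencv4nodejs' brute-force Hamming matcher; the distance threshold and the keypoint `.point` property name are assumptions to check against your binding version:

```js
// Sketch: matching the previous frame's descriptors against the current
// frame's (opencv4nodejs, API assumed).
const cv = require('opencv4nodejs');

function matchFeatures(prev, curr) {
  // ORB descriptors are binary, so Hamming distance is the usual choice.
  const matches = cv.matchBruteForceHamming(prev.descriptors, curr.descriptors);

  // Keep only the stronger matches; the threshold of 50 is just a starting
  // point to fiddle with per video.
  const good = matches.filter(m => m.distance < 50);

  // Turn matches into point pairs for the later triangulation step.
  // `.point` is assumed to be the keypoint's image position in this binding.
  return good.map(m => ({
    from: prev.keyPoints[m.queryIdx].point, // position in the previous frame
    to: curr.keyPoints[m.trainIdx].point,   // position in the current frame
  }));
}
```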
At last, once we have a list of matched features, we can use triangulation to guess which 3D coordinate each one relates to. This is cool, but how? Glad you asked. The idea is to figure out how we (as the camera) move, as a rotation and translation (Rt), and then use the previous frame's pose together with the current one to triangulate each matched point into 3D global space.
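A sketch of the pose-recovery half of that, assuming cv.findEssentialMat and cv.recoverPose are exposed by the bindings, and using a made-up focal length and principal point:

```js
// Sketch: estimating camera motion (R, t) from the matched point pairs.
// The focal length and principal point are placeholder values; they depend
// on the camera that recorded the video.
const cv = require('opencv4nodejs');

const focal = 700;                  // focal length, in pixels
const pp = new cv.Point2(640, 360); // principal point, roughly the image centre

function estimatePose(pairs) {
  const from = pairs.map(p => p.from); // points in the previous frame
  const to = pairs.map(p => p.to);     // the same points in the current frame

  // Essential matrix relating the two views, with RANSAC to reject outliers.
  const { E } = cv.findEssentialMat(from, to, focal, pp, cv.RANSAC, 0.999, 1.0);

  // Decompose it into the rotation and translation between the two frames.
  const { R, T } = cv.recoverPose(E, from, to, focal, pp);
  return { R, T };
}

// With the previous frame's accumulated pose and this new (R, T), each view
// gets a 3x4 projection matrix P = K [R | t]; triangulation then finds the 3D
// point X that projects onto both matched pixels (cv.triangulatePoints does
// this in OpenCV, assuming the bindings expose it).
```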
Now take a video of a room with multiple lamps hanging from the ceiling, recorded looking upwards while moving in circles. This is the view as each step is applied:
- Start
- Feature extraction
- Feature matching
- Points in 3D coords
There are some intrinsic camera values that depend on the input video, and because we lack more information there is noise in the resulting points, so complete automation is possible but off the table for now. Also, I wanted something that could work outside the browser; sadly, 3D rendering of the points doesn't work great in NodeJs, as I was unable to display them directly. But they could be piped into another program, as sketched below.
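For example, the triangulated points could simply be written to stdout, one per line, and consumed by whatever viewer you like; the point shape (x, y, z fields) and the viewer script name below are assumptions for illustration:

```js
// Sketch: instead of rendering in NodeJs, dump the triangulated points to
// stdout so a separate program (a Python or three.js viewer, say) can render
// them. One "x y z" line per point is the simplest possible format.
function emitPoints(points) {
  for (const p of points) {
    process.stdout.write(`${p.x} ${p.y} ${p.z}\n`);
  }
}
```

Something like `node slam.js | python3 viewer.py` would then hand the point cloud off to a separate renderer.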