Structure from Motion (SfM) algorithms allow us to reconstruct the 3D structure of a scene from a set of two or more distinct images whose relative pose is unknown. SfM is the main technique behind visual odometry.

Two-View SfM

For a given 3D feature $X$, the relative pose $(R, T)$ of the camera between two views may not be known. We can use the stereo vision process to find $(R, T)$ as well as $X$. For example, the relative pose may come from the movement of the camera between two time instances, $t_1$ and $t_2$. This is called Two-View SfM.

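Assume the camera pose at time $t_1$ is the reference (i.e., world) frame. The image of $X$ taken at $t_1$ is then given by

$$\lambda_1 u_1 = K X$$

and the image taken at $t_2$ by

$$\lambda_2 u_2 = K (R X + T)$$

where $K$ is the camera intrinsic matrix, $u_1$ and $u_2$ are homogeneous pixel coordinates, and, because image points are projective, each $u_i$ comes with an unknown positive depth/scale $\lambda_i$.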

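Working in normalized (calibrated) image coordinates, i.e. pixel coordinates $u_i$ with the intrinsics $K$ removed:

$$x_i = K^{-1} u_i, \quad i = 1, 2$$

Because the world frame is the first camera, the 3D point's coordinates in camera 1 are simply $X = \lambda_1 x_1$.

In these coordinates the intrinsics reduce to the identity, and the geometry is purely Euclidean:

$$\lambda_2 x_2 = \lambda_1 R x_1 + T$$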

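Taking the cross product of both sides of the above equation with $T$ eliminates the translation term, since $T \times T = 0$:

$$\lambda_2 [T]_\times x_2 = \lambda_1 [T]_\times R x_1$$

Finally, taking the dot product with $x_2$ on both sides makes the left-hand side vanish, because $[T]_\times x_2$ is perpendicular to $x_2$. Dividing by $\lambda_1 > 0$ gives:

$$x_2^\top [T]_\times R \, x_1 = 0$$

where:

  • $[T]_\times$ is the skew-symmetric matrix that implements the cross product with $T$, i.e. $[T]_\times v = T \times v$
  • $E = [T]_\times R$ is called the essential matrix
  • $x_2^\top E \, x_1 = 0$ is called the epipolar constraint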
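
As a quick numerical sanity check of the epipolar constraint, the NumPy sketch below projects one synthetic 3D point into two views, converts the pixels to normalized coordinates, and evaluates $x_2^\top E \, x_1$. The intrinsics `K`, the pose `R`, `T`, the point `X`, and the helper names `skew` and `rotation_y` are assumptions made up for illustration; none of them come from the notes above.

```python
import numpy as np


def skew(t):
    """Return the 3x3 skew-symmetric matrix [t]x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])


def rotation_y(angle):
    """Rotation by `angle` radians about the camera y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Hypothetical intrinsics and relative pose, chosen only for illustration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = rotation_y(0.1)
T = np.array([0.5, 0.0, 0.1])

# A 3D point expressed in the first camera (world) frame.
X = np.array([0.3, -0.2, 4.0])

# Pixel projections: lambda_i * u_i = K @ (point in camera i), then divide by the depth.
u1 = K @ X
u1 = u1 / u1[2]
u2 = K @ (R @ X + T)
u2 = u2 / u2[2]

# Normalized (calibrated) coordinates: x_i = K^{-1} u_i.
x1 = np.linalg.solve(K, u1)
x2 = np.linalg.solve(K, u2)

# Essential matrix and epipolar residual, which should vanish up to rounding error.
E = skew(T) @ R
residual = x2 @ E @ x1
print(f"epipolar residual: {residual:.2e}")
```

With noisy measurements the residual is no longer exactly zero, which is why $E$ is estimated from many correspondences rather than from a single one.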

SfM Procedure

In general, the SfM procedure involves:

  1. Computing the essential matrix $E$ from point correspondences
  2. Extracting $R$ and $T$ from $E$
  3. Triangulating to retrieve the 3D coordinates $X$
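
The first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a production pipeline: it uses a plain linear (eight-point style) estimate of $E$, the standard SVD decomposition into four pose candidates, and a positive-depth check to pick one. The function names (`estimate_essential`, `decompose_essential`, `depths`, `select_pose`) and the synthetic scene are hypothetical, and the data are noise-free. Real systems typically add outlier rejection such as RANSAC.

```python
import numpy as np


def estimate_essential(x1, x2):
    """Linear estimate of E from N >= 8 normalized correspondences.

    x1, x2: (N, 3) arrays of homogeneous normalized image points.
    """
    # Each correspondence gives one linear equation x2^T E x1 = 0 in the 9 entries of E.
    A = np.stack([np.kron(b, a) for a, b in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)  # right singular vector of the smallest singular value
    # Project onto the set of essential matrices: singular values (1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt


def decompose_essential(E):
    """Return the four (R, T) candidates encoded by E, with ||T|| = 1."""
    U, _, Vt = np.linalg.svd(E)
    # E is only defined up to sign, so flip signs to get proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = U[:, 2]
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t), (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]


def depths(R, T, a, b):
    """Least-squares depths (lambda1, lambda2) with lambda2 * b = lambda1 * R @ a + T."""
    M = np.column_stack([R @ a, -b])
    lam, *_ = np.linalg.lstsq(M, -T, rcond=None)
    return lam


def select_pose(candidates, x1, x2):
    """Keep the candidate that puts the most points in front of both cameras."""
    def n_in_front(R, T):
        return sum(bool(np.all(depths(R, T, a, b) > 0)) for a, b in zip(x1, x2))
    return max(candidates, key=lambda c: n_in_front(*c))


# Synthetic, noise-free scene (all values are made up for illustration).
rng = np.random.default_rng(0)
angle = 0.1
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
T_true = np.array([0.5, 0.0, 0.1])
X = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(20, 3))
X2 = X @ R_true.T + T_true
x1 = X / X[:, 2:3]    # normalized points in view 1
x2 = X2 / X2[:, 2:3]  # normalized points in view 2

E = estimate_essential(x1, x2)
R_est, T_est = select_pose(decompose_essential(E), x1, x2)
print("rotation recovered:", np.allclose(R_est, R_true, atol=1e-6))
print("translation direction recovered:",
      np.allclose(T_est, T_true / np.linalg.norm(T_true), atol=1e-6))
```

Note that the epipolar constraint is homogeneous in $T$, so only the direction of the translation can be recovered from $E$. The sketch therefore compares `T_est` against the unit-norm version of `T_true`.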

Triangulation

Once the relative camera pose $(R, T)$ is known, each pair of corresponding rays $\lambda_1 x_1$ and $\lambda_2 x_2$ can be intersected to recover the 3D point $X$.

Conceptually:

  • Each image point defines a ray in space.
  • The 3D point is the intersection (or best intersection) of these rays.

This process is called triangulation.

Mathematically, we solve for the depths $\lambda_1, \lambda_2$ that best satisfy:

$$\lambda_2 x_2 = \lambda_1 R x_1 + T$$

In practice, triangulation is done using linear least squares or nonlinear optimization.
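
A minimal sketch of the linear least-squares variant for a single correspondence is shown below. It rearranges the depth equation into a 3x2 linear system, solves it with `numpy.linalg.lstsq`, and returns the midpoint of the two closest points on the rays. The function name `triangulate` and the test values are assumptions for illustration. The nonlinear refinement mentioned above (for example, minimizing reprojection error) is not shown.

```python
import numpy as np


def triangulate(R, T, x1, x2):
    """Linear least-squares triangulation of one correspondence.

    Solves lambda2 * x2 = lambda1 * R @ x1 + T for the two depths and returns
    (X, lambda1, lambda2), where X is expressed in the first camera frame.
    """
    # Rearranged as [R x1, -x2] @ [lambda1, lambda2] = -T: 3 equations, 2 unknowns.
    M = np.column_stack([R @ x1, -x2])
    (lam1, lam2), *_ = np.linalg.lstsq(M, -T, rcond=None)
    # With noisy rays the two estimates differ slightly, so take their midpoint.
    X_from_cam1 = lam1 * x1
    X_from_cam2 = R.T @ (lam2 * x2 - T)
    return 0.5 * (X_from_cam1 + X_from_cam2), lam1, lam2


# Hypothetical pose and 3D point, used only to exercise the function.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.5, 0.0, 0.1])
X_true = np.array([0.3, -0.2, 4.0])

# Normalized image points of X_true in the two views.
x1 = X_true / X_true[2]
X2 = R @ X_true + T
x2 = X2 / X2[2]

X_est, lam1, lam2 = triangulate(R, T, x1, x2)
print("recovered point:", X_est)
print("depths:", lam1, lam2)
```

If $T$ was extracted from the essential matrix, its length is unknown, so the depths and the reconstructed point are determined only up to a common scale factor.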