2D image sensors:
- Monochrome: black/white (a single 2D matrix of intensities), e.g. 512x512
- Color: 3 sets of 2D matrix data for RGB, e.g. 512x512x3
- Bit depth: data size of each pixel (e.g. 8 bit: 0 ~ 255)
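
As a rough illustration of the sizes above, the following sketch computes the uncompressed storage of an image and the largest value a pixel can hold. The function names `image_size_bytes` and `max_pixel_value` are made up for this example, and the calculation assumes unsigned pixels with no padding or compression.

```python
def image_size_bytes(rows, cols, channels=1, bit_depth=8):
    """Uncompressed storage in bytes for a rows x cols x channels image."""
    total_bits = rows * cols * channels * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


def max_pixel_value(bit_depth):
    """Largest value an unsigned pixel of the given bit depth can hold."""
    return 2 ** bit_depth - 1


print(image_size_bytes(512, 512))              # grayscale, 8 bit: 262144
print(image_size_bytes(512, 512, channels=3))  # RGB, 8 bit: 786432
print(max_pixel_value(8))                      # 255
```
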
Image Coordinates (MATLAB)
Pixel indices:
- Row and column indices are ordered from top to bottom, and from left to right.
- In general, there are three indices: row $r$, column $c$, and color channel $k$, with $I(r, c, k)$ denoting the pixel value.
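
To make the index order concrete, here is a small sketch that stores a 2x3 RGB image as nested lists (rows, then columns, then channels). It assumes the three indices are row, column, and color channel. The helper `pixel` is hypothetical and only exists to mimic MATLAB's 1-based `I(r, c, k)` lookup on top of Python's 0-based lists.

```python
# A tiny 2x3 RGB image: rows (top to bottom) -> columns (left to right) -> channels.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],        # top row
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],  # bottom row
]


def pixel(image, r, c, k):
    """MATLAB-style 1-based lookup I(r, c, k) on a nested-list image."""
    return image[r - 1][c - 1][k - 1]


num_rows = len(image)
num_cols = len(image[0])
num_channels = len(image[0][0])

print(num_rows, num_cols, num_channels)  # 2 3 3
print(pixel(image, 1, 1, 1))             # red channel of the top-left pixel: 255
print(pixel(image, 2, 2, 3))             # blue channel of the bottom-middle pixel: 128
```
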

Spatial coordinates:
- Intrinsic coordinate: Representing locations in image on a continuous plane
- World coordinate (mapping the intrinsic coordinate to the spatial frame of reference)
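
The relationship between the three coordinate systems can be sketched as follows. This assumes the MATLAB convention in which the center of the pixel at row `r`, column `c` has intrinsic coordinates `(x, y) = (c, r)`, so the image spans 0.5 to `num_cols + 0.5` in `x`. The world mapping is assumed to be a simple axis-aligned linear scaling between user-chosen limits; the function names are hypothetical.

```python
def index_to_intrinsic(r, c):
    """Pixel indices (row, column) -> intrinsic coordinates (x, y) of the pixel center."""
    return (float(c), float(r))


def intrinsic_to_world(x, y, num_rows, num_cols, x_limits, y_limits):
    """Map intrinsic (x, y) to world coordinates by linear scaling.

    x_limits and y_limits are (min, max) world extents of the image edges;
    the left/top image edge sits at intrinsic coordinate 0.5.
    """
    pixel_extent_x = (x_limits[1] - x_limits[0]) / num_cols
    pixel_extent_y = (y_limits[1] - y_limits[0]) / num_rows
    x_world = x_limits[0] + (x - 0.5) * pixel_extent_x
    y_world = y_limits[0] + (y - 0.5) * pixel_extent_y
    return (x_world, y_world)


# A 4x4 image covering a 2 m x 2 m region: each pixel is 0.5 m wide.
x, y = index_to_intrinsic(1, 4)  # top row, last column
print(intrinsic_to_world(x, y, 4, 4, (0.0, 2.0), (0.0, 2.0)))  # (1.75, 0.25)
```
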

Camera Optics
A thin lens with focal length $f$ forms a sharp image of an object at distance $z_o$ from the lens on an image plane located at distance $z_i$ behind the lens:
$$\frac{1}{f} = \frac{1}{z_o} + \frac{1}{z_i}$$
When the object is far away ($z_o \to \infty$), the image forms at the focal plane ($z_i \to f$).
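
A short numerical sketch of the thin-lens relation, assuming the symbols `f` (focal length), `z_o` (object distance), and `z_i` (image distance), all in the same units. The function name `image_distance` is made up for this example. It shows the image distance approaching the focal length as the object moves away.

```python
import math


def image_distance(f, z_o):
    """Solve 1/f = 1/z_o + 1/z_i for the image distance z_i."""
    if math.isinf(z_o):
        return f  # object at infinity: image forms at the focal plane
    if z_o <= f:
        raise ValueError("object at or inside the focal length: no real image")
    return f * z_o / (z_o - f)


# 50 mm lens, object moving farther away (distances in mm).
for z_o in (100.0, 1000.0, 1e6, math.inf):
    print(z_o, image_distance(50.0, z_o))
```

An object at twice the focal length images at twice the focal length (here 100 mm), and the result settles toward 50 mm as `z_o` grows.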

If we let the aperture shrink to a point (or equivalently consider very distant scenes with object distance $z_o \gg f$), we obtain the pinhole camera: all rays from a scene point pass through a single point (the optical center) and intersect a plane at distance $f$. This “ray-through-a-point” geometry is the basis for the projection equations below.
Pinhole Camera Model
Consider a point $\mathbf{X}_c = (X_c, Y_c, Z_c)^\top$ in camera coordinates with the origin at the pinhole and the $Z_c$ axis pointing forward. Its image coordinates (in metric units, not pixels) on an image plane at distance $f$ are $(x, y)$. Similar triangles give us:
$$x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}$$
The division by $Z_c$ is a hallmark of perspective projection; points farther away (large $Z_c$) appear closer to the principal point (center of the image from the camera’s geometric perspective).
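
The similar-triangles result can be sketched in a few lines. The function name `project_pinhole` is hypothetical, and the sketch assumes the point lies in front of the camera (positive depth). Doubling the depth halves the image coordinates, which is the foreshortening described above.

```python
def project_pinhole(point_c, f):
    """Project a camera-frame point (X, Y, Z) onto the image plane at distance f.

    Returns metric image coordinates (x, y) = (f * X / Z, f * Y / Z).
    """
    X, Y, Z = point_c
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


print(project_pinhole((1.0, 2.0, 4.0), f=2.0))  # (0.5, 1.0)
print(project_pinhole((1.0, 2.0, 8.0), f=2.0))  # twice as far: (0.25, 0.5)
```
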


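We can represent the above with a homogeneous representation:
$$\lambda\,\tilde{\mathbf{x}} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{\mathbf{X}}_c$$
where $\tilde{\mathbf{x}} = (x, y, 1)^\top$ and $\tilde{\mathbf{X}}_c = (X_c, Y_c, Z_c, 1)^\top$, and
where $\lambda$ is the unknown projective scale (here $\lambda = Z_c$).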
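
The homogeneous form can be checked numerically with a plain matrix-vector product. This is a minimal sketch using nested lists; `mat_vec` and `project_homogeneous` are illustrative names. The last component of the product is the projective scale, and dividing by it recovers the metric image coordinates.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def project_homogeneous(point_c, f):
    """Apply the 3x4 perspective matrix to a homogeneous camera-frame point.

    Returns ((x, y), lam), where lam is the projective scale.
    """
    P0 = [
        [f, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    X_h = list(point_c) + [1.0]      # (X, Y, Z, 1)
    lx, ly, lam = mat_vec(P0, X_h)   # (lam * x, lam * y, lam)
    return (lx / lam, ly / lam), lam


print(project_homogeneous((1.0, 2.0, 4.0), f=2.0))  # ((0.5, 1.0), 4.0)
```

In this example the scale equals the depth of the point (4.0), matching the division by depth in the non-homogeneous equations.
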
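In practice, 3D points are given in a world frame as $\mathbf{X}_w = (X_w, Y_w, Z_w)^\top$. The rigid motion from world to camera coordinates is:
$$\mathbf{X}_c = R\,\mathbf{X}_w + \mathbf{t}$$
with rotation matrix $R \in SO(3)$ and translation $\mathbf{t} \in \mathbb{R}^3$.
In homogeneous form, we can then transform between the two as
$$\tilde{\mathbf{X}}_c = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \tilde{\mathbf{X}}_w$$
where $\tilde{\mathbf{X}}_c = (X_c, Y_c, Z_c, 1)^\top$ and $\tilde{\mathbf{X}}_w = (X_w, Y_w, Z_w, 1)^\top$. We call $[R \mid \mathbf{t}]$ the extrinsic matrix; it positions the camera within the world.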
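
As a sketch of the world-to-camera step, the code below builds the 4x4 homogeneous transform from a rotation and a translation and applies it to a world point. The names `homogeneous_transform` and `world_to_camera` are hypothetical, and the example pose (a 90 degree rotation about the Z axis plus a 5 unit shift along Z) is chosen only because its matrix entries are exact.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def homogeneous_transform(R, t):
    """Stack a 3x3 rotation R and a translation t into the 4x4 matrix [[R, t], [0, 1]]."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def world_to_camera(R, t, X_w):
    """Return the camera-frame coordinates R @ X_w + t using the homogeneous form."""
    T = homogeneous_transform(R, t)
    X_c_h = mat_vec(T, list(X_w) + [1.0])
    return tuple(X_c_h[:3])


# Example pose: rotate 90 degrees about Z, then translate 5 units along Z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 5.0]

print(world_to_camera(R, t, (1.0, 0.0, 0.0)))  # (0.0, 1.0, 5.0)
```
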
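Real sensors measure pixels, not metric lengths. We let:
- $s_x$ and $s_y$ be the pixel densities (pixels/meter) along the sensor’s $x$ and $y$ axes.
- $(c_x, c_y)$ be the principal point (the pixel where the optical axis hits the sensor, typically near the image center)
Converting the metric projection $(x, y)$ to pixel coordinates $(u, v)$ gives:
$$u = s_x x + c_x = f_x\,\frac{X_c}{Z_c} + c_x, \qquad v = s_y y + c_y = f_y\,\frac{Y_c}{Z_c} + c_y$$
- $f_x = s_x f$ and $f_y = s_y f$ are called the focal lengths in pixels.
Then, the intrinsic calibration matrix, which combines the focal length with the metric-to-pixel conversion, is:
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
so that $\lambda\,(u, v, 1)^\top = K\,(X_c, Y_c, Z_c)^\top$ with $\lambda = Z_c$.
- The image plane is the surface where the 3D world is projected to form a 2D image.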
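
The pixel conversion can be sketched as below. The symbols `s_x`, `s_y` (pixel densities) and `c_x`, `c_y` (principal point) are naming assumptions, as are the function names. The numbers (a 4 mm lens and 4 micrometer pixels, giving a focal length of about 1000 pixels) are illustrative values rather than data from any particular camera.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def intrinsic_matrix(f, s_x, s_y, c_x, c_y):
    """Build K from focal length f (m), pixel densities (px/m), and principal point (px)."""
    return [
        [s_x * f, 0.0, c_x],
        [0.0, s_y * f, c_y],
        [0.0, 0.0, 1.0],
    ]


def camera_to_pixel(K, point_c):
    """Project a camera-frame point (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point_c
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    lu, lv, lam = mat_vec(K, [X, Y, Z])  # (lam * u, lam * v, lam), with lam = Z
    return (lu / lam, lv / lam)


K = intrinsic_matrix(f=0.004, s_x=250000.0, s_y=250000.0, c_x=320.0, c_y=240.0)
u, v = camera_to_pixel(K, (0.1, -0.05, 2.0))
print(round(u, 6), round(v, 6))  # approximately 370.0 215.0
```

A point on the optical axis (X = Y = 0) lands on the principal point regardless of its depth.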

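Putting the pieces together, a world point $\mathbf{X}_w = (X_w, Y_w, Z_w)^\top$ projects to image pixels $(u, v)$ via
$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R \mid \mathbf{t}] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
where $K$ is the intrinsic calibration matrix, $[R \mid \mathbf{t}]$ is the extrinsic matrix, and $\lambda$ is the projective scale.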
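
The full chain can be sketched end to end by forming the 3x4 projection matrix once and reusing it for every point. This assumes an intrinsic matrix `K` with focal lengths and principal point in pixels, a rotation `R`, and a translation `t`. All names and numbers are illustrative, and lens distortion is ignored.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def projection_matrix(K, R, t):
    """Form the 3x4 matrix P = K [R | t]."""
    Rt = [list(R[i]) + [t[i]] for i in range(3)]
    return mat_mul(K, Rt)


def project(P, X_w):
    """Project a world point to pixel coordinates (u, v) by dividing out the scale."""
    lu, lv, lam = mat_vec(P, list(X_w) + [1.0])
    if lam <= 0:
        raise ValueError("point is not in front of the camera")
    return (lu / lam, lv / lam)


K = [[1000.0, 0.0, 320.0],
     [0.0, 1000.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]   # camera axes aligned with the world axes
t = [0.0, 0.0, 5.0]     # world origin sits 5 units in front of the camera

P = projection_matrix(K, R, t)
print(project(P, (0.5, -0.25, 0.0)))  # (420.0, 190.0)
print(project(P, (0.0, 0.0, 0.0)))    # world origin maps to the principal point: (320.0, 240.0)
```
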