The first part of this section describes the basic principles of the pinhole camera model used in computer vision and the procedures for lens distortion compensation. After that, these computer vision techniques are applied to cameras installed into UAVs. The last part of the section presents and describes the method used to estimate the target height from still oblique images acquired with cameras installed into UAVs.
2.1. Pinhole Camera Model and Computer Vision
In computer vision, cameras are usually modeled with the pinhole camera model [14]. The model is inspired by the simplest camera, in which the light from an object enters through a small hole (the pinhole). The model considers a central projection, using the optical center of the camera and an image plane that is perpendicular to the camera's optical axis (see Figure 1). In the physical camera, a mirrored (inverted) image is formed behind the camera center but, conventionally, the image plane is represented in front of the camera center. The pinhole camera model represents every 3D world point P (expressed by the world coordinates $X$, $Y$, $Z$) by the intersection between the image plane and the camera ray that connects the optical center with the world point P (this intersection is called the image point, denoted I in Figure 1).
The pinhole camera projection can be described by the following linear model [10]:

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R\ T] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \quad (1)$$

where $(u, v)$ are the pixel coordinates of the image point and $\lambda$ is an arbitrary scale factor. K is the calibration matrix [14], defined as follows:

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \quad (2)$$

$f_x$ and $f_y$ represent the focal length expressed in pixels. $c_x$ and $c_y$ are the coordinates of the image center expressed in pixels, with origin in the upper left corner (see Figure 1). $s$ is the skew coefficient between the x and y axes; this latter parameter is very often 0.
The focal length $F$ (which can here be considered as the distance between the image plane and the optical center) can also be expressed in metric terms (e.g., mm instead of pixels) considering the following expressions [14]:

$$F = f_x \, \frac{W_s}{w} \quad (3)$$

$$F = f_y \, \frac{H_s}{h} \quad (4)$$

$w$ and $h$ are, respectively, the image width and height expressed in pixels; $W_s$ and $H_s$ are the width and the height of the camera sensor expressed in world units (e.g., mm). Usually, $f_x$ and $f_y$ have the same value, although they may differ for several reasons, such as flaws in the digital camera sensor or when the lens compresses a widescreen scene into a standard-sized sensor. R and T in (1) are the rotation and translation matrices of the camera, respectively, in relation to the world coordinate system. These are the extrinsic parameters, which define the so-called "camera pose". [R T] is a 3 × 4 matrix composed of the three columns of the rotation matrix R and the translation vector T as the fourth column. Image and object points are represented in this model in homogeneous coordinates.
R is defined in this case by the angles around the axes (X, Y and Z) of the world coordinate system needed to rotate the image coordinate system axes until they are coincident (or parallel) with the world coordinate system axes. In the case of a rotation around the X-axis by an angle $\theta_x$, the rotation matrix $R_x$ is given by (5) [14]:

$$R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{bmatrix} \quad (5)$$

Rotations by $\theta_y$ and $\theta_z$ about the Y and Z axes can be written as:

$$R_y = \begin{bmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{bmatrix} \quad (6)$$

$$R_z = \begin{bmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad (7)$$
A rotation R about any arbitrary axis can be written in terms of successive rotations about the X, Y and Z axes, using the matrix multiplication shown in (8):

$$R = R_z(\theta_z)\, R_y(\theta_y)\, R_x(\theta_x) \quad (8)$$

In this formulation $\theta_x$, $\theta_y$ and $\theta_z$ are the Euler angles. T is a 3-dimensional vector that defines the position of the camera with respect to the origin of the world coordinate system. Scaling does not take place in the definition of the camera pose; the scale is introduced by the focal length and the sensor size. The next section describes the lens distortion effects and procedures for their correction.
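For illustration, the following short Python sketch (using only NumPy) builds the calibration matrix of (2), composes the rotation of (8) and projects a world point according to (1). The numeric values are arbitrary placeholders, not the parameters of any specific camera, and the multiplication order in (8) follows the Z–Y–X convention assumed above.

```python
import numpy as np

def rotation_matrix(theta_x, theta_y, theta_z):
    """Combined rotation R = Rz @ Ry @ Rx of (8); angles in radians."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])   # (5)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # (6)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])   # (7)
    return Rz @ Ry @ Rx

def project(point_xyz, K, R, T):
    """Pinhole projection (1): world point -> pixel coordinates (u, v)."""
    RT = np.hstack([R, T.reshape(3, 1)])   # 3 x 4 extrinsic matrix [R T]
    p_h = np.append(point_xyz, 1.0)        # homogeneous world point
    uvw = K @ RT @ p_h                     # homogeneous image point
    return uvw[:2] / uvw[2]                # normalize by the third component

# Example with placeholder values: fx = fy = 3648 px, center at the image middle, no skew.
K = np.array([[3648.0, 0.0, 2736.0],
              [0.0, 3648.0, 1824.0],
              [0.0, 0.0, 1.0]])
R = rotation_matrix(0.0, np.radians(30.0), 0.0)   # e.g., 30 deg rotation about Y
T = np.array([0.0, 0.0, 0.0])
print(project(np.array([2.0, 1.0, 20.0]), K, R, T))
```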
2.2. Lens Distortion and Compensation
The pinhole model does not account for the fact that real lenses may produce several different non-linear distortions. The most significant defect is radial distortion, caused by differences in light refraction along the spherical shape of the lens. Other distortions, such as tangential distortion, which is generated when the lens is not parallel to the imaging sensor or when several component lenses are not aligned on the same axis, are of minor relevance for the purposes of this study and will not be considered. Radial distortions can usually be classified as either barrel distortions or pincushion distortions (Figure 2); to a first approximation they are quadratic, meaning they increase with the square of the distance from the image center.
Removing a distortion means obtaining an undistorted image point, which can be considered as projected by an ideal pinhole camera, from a distorted image point. The simplest way to model the radial distortion is with a shift applied to the pixel coordinates [16]. The radial shift of coordinates modifies only the distance of every pixel from the image center. Let $r_d$ represent the observed distance (distance of the distorted image coordinates from the center) and $r_u$ the distance of the undistorted image coordinates from the center. The observed distance for a point I $(u_I, v_I)$ in the image (see Figure 1) can be calculated as follows [16]:

$$r_d = \sqrt{(u_I - c_x)^2 + (v_I - c_y)^2} \quad (9)$$
With these notations, the function that can be used to remove lens distortion is [16]:

$$r_u = f(r_d) \quad (10)$$

However, before applying the compensation function $f$, we need to underline that the model would be of little use if images with the same distortion but different resolutions had different distortion parameters. Therefore, all pixels should be normalized to a dimensionless frame, so that the image resolution becomes irrelevant. In the dimensionless frame, the diagonal radius of the image is always 1 and the lens center is (0, 0) [16].
The formula to transform the pixel coordinates into dimensionless coordinates is the following [16]:

$$u_n = \frac{u - c_x}{\tfrac{1}{2}\sqrt{w^2 + h^2}}, \qquad v_n = \frac{v - c_y}{\tfrac{1}{2}\sqrt{w^2 + h^2}} \quad (11)$$

$u_n$ and $v_n$ are the dimensionless pixel coordinates and $w$, $h$ are the image width and height in pixels.
The dimensionless coordinates defined in (11) can be used to calculate a normalized distance $r_{d,n}$ considering the formula given in (9). $r_{d,n}$ can then be used to approximate the normalized $r_{u,n}$ with its Taylor expansion [16]:

$$r_{u,n} = r_{d,n}\,(1 + k_1\, r_{d,n}^2 + k_2\, r_{d,n}^4 + k_3\, r_{d,n}^6 + \dots) \quad (12)$$

$k_1, k_2, k_3, \dots$ are the radial distortion coefficients. The "perfect" approximation would be a polynomial of infinite degree; however, for the purpose of this project, because the target is always centered in the image center (see Section 2.5), the radial distortion does not affect the measured pixels as much as it would if the measurements were made at longer radial distances, as Figure 2 illustrates. Therefore, considering just $k_1$ in (12) is deemed sufficient for this analysis. $r_{u,n}$ calculated with (12) needs to be denormalized to obtain the undistorted $u$ and $v$ image coordinates for the image under study.
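As a minimal sketch of this compensation pipeline, the following Python function normalizes a distorted pixel, applies (12) truncated at $k_1$, and denormalizes the result; the coefficient value used in the call below is purely illustrative, not a calibrated parameter of any real camera.

```python
import math

def undistort_point(u_d, v_d, width, height, cx, cy, k1):
    """Remove first-order radial distortion from a pixel (u_d, v_d).

    The pixel is normalized so that the image half-diagonal is 1 (see (11)),
    the truncated Taylor expansion (12) is applied to the radial distance,
    and the corrected point is mapped back to pixel coordinates.
    """
    half_diag = 0.5 * math.hypot(width, height)   # normalization factor
    xn = (u_d - cx) / half_diag                    # dimensionless coordinates
    yn = (v_d - cy) / half_diag
    r_dn = math.hypot(xn, yn)                      # normalized distorted radius
    scale = 1.0 + k1 * r_dn ** 2                   # (12) truncated at k1
    return cx + xn * scale * half_diag, cy + yn * scale * half_diag

# Illustrative call for a 5472 x 3648 image with a hypothetical k1 = -0.05.
print(undistort_point(3000.0, 2000.0, 5472, 3648, 2736.0, 1824.0, -0.05))
```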
2.4. On-Board Sensor and Data Analysis for Height Estimation
Large UAVs, also called MALEs (Medium Altitude Long Endurance [21]), are usually fitted with three-gimbaled advanced imaging systems and accurate positioning systems, often resorting to differential positioning techniques for higher accuracy. These systems are calibrated and capable of performing transformations in real time and of embedding the instantaneous camera pose and other information (such as Field of View (FOV), image footprint projected on the ground, and measured slant range, when available) into the acquired video stream using the KLV encoding protocols, in accordance with military standards [22].
On the other hand, non-military lightweight UAVs available on the regular market are not always fitted with advanced imaging systems and very accurate positioning systems. For example, the DJI Phantom 4 PRO (a widely used multi-rotor platform of 1.388 kg, employed to collect data for testing the approach described in this paper, see Section 3. Results) cannot generate KLV-embedded metadata, but it can generate ancillary tags in the Exchangeable Image File Format (EXIF) of still images, which provide, among other information, the position of the aircraft, the aircraft orientation and the camera orientation at the moment of acquisition of the still image. The DJI Phantom 4 PRO has a GPS/GLONASS positioning system [23]. The actual uncertainty of this positioning system, after refinement with accelerometric information provided by the IMU, is ±0.5 m vertically and ±1.5 m horizontally [23]. The camera of this UAV has a pivoted support (one gimbal) with a single degree of freedom around the Y axis (pitch angle, see Figure 3). Angular values are measured with an uncertainty of ±0.02° [23]. However, this angular accuracy refers to the gimbal zero reference; thus, in real case scenarios, the uncertainty associated with the Euler angles could be slightly higher. Although not specified in any available technical documentation, considering the available information on this UAV, it is assumed here that the transformations employed to provide the information in the EXIF tags are the following: (a) a translation defined by the positioning coordinates of the UAV body, (b) a rotation based on the Euler angles of the body, followed by (c) a 1D rotation of the camera (pitch angle). Therefore, the position of the camera of the DJI Phantom 4 PRO is defined by the positional coordinates of the UAV body, while the orientation is given by a yaw angle defined by the flight orientation, a pitch angle defined by the camera orientation and a roll angle defined by the flight orientation.
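A minimal sketch of this assumed chain of transformations, expressed with the rotation matrices of (5)–(7), is given below. The helper names and the angle values are illustrative only, and the exact composition used by the manufacturer is not documented.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def camera_rotation(yaw_deg, roll_deg, gimbal_pitch_deg):
    """Assumed camera orientation: body yaw/roll followed by the 1D gimbal pitch (about Y)."""
    yaw, roll, pitch = np.radians([yaw_deg, roll_deg, gimbal_pitch_deg])
    body = rot_z(yaw) @ rot_x(roll)   # body attitude from the flight orientation
    gimbal = rot_y(pitch)             # single-axis camera rotation (pitch)
    return body @ gimbal

# Example: 120 deg heading, level flight, camera pitched 30 deg down (illustrative values).
print(camera_rotation(120.0, 0.0, -30.0))
```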
The camera sensor is a 1″ Complementary Metal-Oxide Semiconductor (CMOS) with 20 MP effective pixels, a resolution of 5472 × 3648 pixels and a size of 13.2 mm × 8.8 mm; the lens has a focal length of 8.8 mm with no optical zoom and a diagonal FOV of 84° [23].
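Using (3) and (4), these specifications can be converted into the pixel-based quantities used by the pinhole model; the short check below recovers a focal length of roughly 3648 pixels and reproduces the declared 84° diagonal FOV.

```python
import math

# DJI Phantom 4 PRO camera specifications (from the manufacturer's data above).
width_px, height_px = 5472, 3648        # image resolution in pixels
sensor_w_mm, sensor_h_mm = 13.2, 8.8    # physical sensor size in mm
focal_mm = 8.8                          # lens focal length in mm

# Focal length expressed in pixels, from (3) and (4).
fx = focal_mm * width_px / sensor_w_mm
fy = focal_mm * height_px / sensor_h_mm
print(round(fx, 1), round(fy, 1))       # ~3648.0 px in both directions

# Diagonal field of view implied by the sensor geometry (expected ~84 deg).
half_diag_mm = 0.5 * math.hypot(sensor_w_mm, sensor_h_mm)
print(round(2 * math.degrees(math.atan(half_diag_mm / focal_mm)), 1))
```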
Assuming a lightweight UAV like the one described in Figure 3, a feature standing vertically on the ground (for example, a pole), and that the UAV has a heading (yaw angle) and a pitch angle appropriate to point at the target and a roll angle equal to zero, the situation graphically represented in Figure 4 occurs.
The point marked in Figure 4b is the image center, which is obtained, as already described, by the intersection between the image plane and the optical axis (see Figure 4a). The optical axis is centered on the target, not necessarily on its midpoint but on some point of the pole. The image point I is given by the intersection of the image plane with the camera ray that connects the tip (highest point) of the pole with the camera center. This point is expressed by the image coordinates $(u_I, v_I)$, while $r_{d,I}$ represents its distance from the image center or, more precisely, from the principal point. Moreover, $r_{d,I}$ is a distorted value that needs to be compensated to obtain the distance $r_{u,I}$ of the ideal undistorted image. The procedure to obtain such an undistorted distance was already discussed in the previous section (see (12)). Similarly, the image point J is the intersection of the image plane with the ray that connects the bottom of the pole (lowest point) with the camera center. This point is expressed by the image coordinates $(u_J, v_J)$, while $r_{d,J}$ represents its distance from the image center, which needs to be compensated to get $r_{u,J}$, the undistorted distance from the center in the ideal undistorted image. The line I–J in the image plane is the projected height of the pole, expressed in pixels. Consider now the case in which the roll angle is different from zero, graphically represented in Figure 5.
When the roll angle is different from zero, the line IR–JR, which is the representation of the pole in the image plane, does not appear parallel to the v axis, as in the previous case, but rotated by an angle equal to the roll angle itself, as can be inferred from (5). As mentioned above, the observed distances (respectively $r_{d,I_R}$ and $r_{d,J_R}$) must be compensated to obtain the distances $r_{u,I_R}$ and $r_{u,J_R}$ of the ideal undistorted image.
The method assumes that the observed distances $r_{d,I}$ and $r_{d,J}$ are input data that can be measured directly on the image or derived from an automatic point-detection algorithm yielding the image coordinates of the endpoints of the pole. Direct measurement has the drawback of requiring post-processing, which contradicts the aim of near real-time results, but it is more precise. Automatic detection always carries the risk of returning false results, especially when the contrast between object and background is poorly defined; the same can be said when the object does not have sharp contours. For testing the uncertainty of the method, we opted for the more reliable direct measurement, but in the future an adequate automatic algorithm should be developed or adopted in order to automate the workflow as far as possible.
The next section describes how to estimate the height of a target standing vertically (a pole), considering the elements discussed so far in this paper. As an example, we use a lightweight UAV such as the DJI Phantom 4 PRO, but the approach can be extended to any imaging system installed on a steerable moving platform.
2.5. Estimating Target Height with a Camera Fitted into UAVs
The approach proposed in this study for the estimation of target height using a camera fitted into a UAV requires the UAV to point at the target, as depicted in Figure 2 and Figure 3, starting with the case in which the roll is zero (see Figure 6).
The pitch angle, which can also be identified with $\theta_y$ (see (6)), is a known value, while the angles $\alpha$, $\beta$, $\gamma$ and $\delta$ in Figure 6 are not originally known, but they can be retrieved using simple trigonometric calculations:

$$\alpha = \arctan\!\left(\frac{r_{u,I}}{F}\right) \quad (13)$$

$$\gamma = \theta_{pitch} - \alpha \quad (14)$$

$$\beta = \arctan\!\left(\frac{r_{u,J}}{F}\right) \quad (15)$$

$$\delta = \theta_{pitch} + \beta \quad (16)$$

where $r_{u,I}$ can be calculated with (12) in the previous section, starting from the observed $r_{d,I}$ in the image plane (see Figure 6). Similarly, $r_{u,J}$ refers to the point J (see also Figure 4). F is the focal length, which was defined by (3) and (4) (it is assumed here for simplicity that $f_x = f_y$).
V in Figure 6 is the vertical distance between the base of the target and the camera center, while H is the horizontal distance between the target and the camera center. H and V are not related to the topography, as can be inferred from Figure 6, because they depend only on the relative height and horizontal distance between the UAV and the foot of the target. If the coordinates of the target are known, then H and V are also known, since the positioning coordinates of the camera are available (see Section 2.3. Cameras installed into UAVs). The positional uncertainty and how it impacts the estimation of the target height will be treated later in this paper; however, since it is much more practical to handle horizontal distances (using a Geographic Information System (GIS), for example) than vertical elevations, and because we want to reduce the number of parameters of the model (a possible source of error), V is always calculated as a function of H, as defined in (17) below:

$$V = H \tan(\delta) \quad (17)$$
The angles $\alpha$, $\beta$, $\gamma$ and $\delta$ are now known, as well as the pitch angle, V and H. These elements can be used to calculate the height of the target using triangle similarity. In fact, P (see Figure 6) can be calculated as follows:

$$P = \frac{V}{\tan(\gamma)} \quad (18)$$

p is the horizontal distance between the base of the target and the camera ray that passes through the tip (highest point) of the target, which can be calculated as follows:

$$p = P - H \quad (19)$$

Finally, the height of the target can be calculated by (20):

$$hgt = p \tan(\gamma) \quad (20)$$
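To make the chain of computations concrete, the following Python sketch reproduces the geometry of (13)–(20) under the assumptions stated above (undistorted pixel distances, the optical axis intersecting the pole, and $f_x = f_y$). The numeric call approximates the scenario used later in Section 2.6 (H = 20 m, 30° pitch) and is illustrative rather than a reference implementation.

```python
import math

def target_height(h_dist, pitch_deg, r_u_top_px, r_u_bot_px, focal_px):
    """Estimate the height of a vertical target from one oblique image.

    h_dist      horizontal camera-to-target distance H (m)
    pitch_deg   camera pitch angle below the horizontal (deg)
    r_u_top_px  undistorted radial distance of the tip (point I) from the image center (px)
    r_u_bot_px  undistorted radial distance of the base (point J) from the image center (px)
    focal_px    focal length expressed in pixels (fx = fy assumed)
    """
    pitch = math.radians(pitch_deg)
    alpha = math.atan(r_u_top_px / focal_px)   # (13) tip ray above the optical axis
    gamma = pitch - alpha                      # (14) tip ray measured from the horizontal
    beta = math.atan(r_u_bot_px / focal_px)    # (15) base ray below the optical axis
    delta = pitch + beta                       # (16) base ray measured from the horizontal
    v_dist = h_dist * math.tan(delta)          # (17) camera height above the target base
    big_p = v_dist / math.tan(gamma)           # (18) ground reach of the tip ray
    small_p = big_p - h_dist                   # (19) horizontal offset behind the base
    return small_p * math.tan(gamma)           # (20) target height

# Illustrative call: ~125 px above and ~121 px below the image center, F ~ 3648 px.
print(round(target_height(20.0, 30.0, 125.0, 121.0, 3648.0), 3))   # ~1.80 m
```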
As already mentioned, the horizontal distance between the target and the camera center can be determined if the coordinates of the target are known. In practice, this is the case only when dealing with immobile features such as light poles, trees or buildings. If the position of the target is not known, as may happen for moving targets such as humans or vehicles, the best solution is to use a device (such as a laser range finder) to measure the instantaneous camera-to-target distance (slant range) [24], as required by dedicated military standards for UAV metadata sets [25]. As already mentioned, advanced imaging systems are very often fitted with such devices, and the instantaneous distance measurements can be stored in the KLV metadata set [21] and transferred in real time to the ground controller [22].
The slant range is a distance aligned with the optical axis of the camera (see Figure 6) and can be used to calculate the horizontal distance H using the following formula:

$$H = SR \cos(\theta_{pitch}) \quad (21)$$

where SR is the measured slant range.
It is important to highlight that the slant-range distances measured by laser range finders are also affected by a certain error that must be considered during the estimation of the target height. For example, an error of ±0.3 m along the slant range, when the horizontal camera-to-target distance is 20 m and the pitch angle is 30°, results in a horizontal error of ±0.26 m.
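A short numeric check of (21) and of the error figure quoted above, with the values taken from the example in the text:

```python
import math

def horizontal_distance(slant_range_m, pitch_deg):
    """Horizontal camera-to-target distance H from the slant range, as in (21)."""
    return slant_range_m * math.cos(math.radians(pitch_deg))

pitch = 30.0
slant = 20.0 / math.cos(math.radians(pitch))          # slant range giving H = 20 m (~23.1 m)
print(round(horizontal_distance(slant, pitch), 2))    # 20.0 m
print(round(horizontal_distance(slant + 0.3, pitch)
            - horizontal_distance(slant, pitch), 2))  # ~0.26 m horizontal error
```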
Let us now analyze the case in which the roll angle is different from zero: in this case, as already discussed (see Figure 5), the points I and J are not located along the v axis passing through the center of the image. In other words, a vertical feature appears "tilted" in the image by an angle equal to the roll. However, as can be inferred from (8) and as graphically represented in Figure 7, I and J lie in the same (vertical) plane as the pitch angle. Thus, the approach presented in this paper does not need to consider the roll angle for the calculation of the target height. In this case, it is only necessary to perform a distortion correction to obtain $r_{u,I_R}$ and $r_{u,J_R}$ and use these values in the formulas previously described (see (13) and (15)).
As previously discussed, the camera-to-target horizontal distance (H) is a required parameter in the calculations. If the target is moving (like a person), a laser range finder should preferably be used, because it can measure the instantaneous slant range, from which we can derive H (the horizontal distance between the camera and the target). These devices are usually fitted only in larger UAVs, but we believe it is important to analyze this option in our paper. Regular-market UAVs (like the DJI Phantom used in the field test of this paper) do not normally have such devices, but H can be calculated considering the coordinates of the target and the coordinates of the UAV. If the target is a fixed feature, like an electrical pole or a tree, then it is most probably visible in reference ortho-photo maps or topographic maps, where it is possible to retrieve its coordinates (even publicly available geo-webservices such as Google Earth can be used for this purpose). If the target is not fixed (therefore moving, like a human), then it is necessary to find its coordinates at the moment the image was taken. This is not an easy task, but practical experience has shown that, especially in urban areas, the image can be taken when the target is located over a recognizable marker on the ground (like road markings) or feature (like a building corner). The coordinates of this marker or feature, which can once again be retrieved from available maps or geo-webservices, can be used to indirectly obtain the horizontal target-to-camera distance.
2.6. Uncertainty Analysis
Assuming an accurate correction of image distortion, the relevant sources of error may come from: the horizontal camera-to-target distance, the number of pixels spanning the target from the image center toward the top and/or bottom, and angular measures.
The Computer-Aided Design (CAD) software Rhinoceros [26] was combined with a spreadsheet to visually analyze how the aforementioned uncertainties impact the height estimation. In this CAD environment, it was possible to create a 3D scene with real planar coordinates, place a UAV (e.g., a DJI Phantom 4 PRO) at a desired location and altitude, rotate its camera according to the Euler angles (e.g., a 30° pitch angle), create a target of defined height (e.g., a pole of 1.80 m), and place it at a desired distance from the UAV. Moreover, since the pinhole model after distortion correction is purely geometrical, it was also possible to recreate the geometry of the camera (field of view, image plane, etc.). In this way, all the parameters are under control, including the number of pixels spanning the feature, making it possible to simulate different scenarios and analyze the uncertainty.
Firstly, let us consider the following scenario: a DJI Phantom 4 PRO is looking at a target of 1.80 m; the laser range finder (this UAV does not have a laser range finder, but we assume here that it does) measured a slant range of 23.1 m to the target (equal to 20 m horizontally); the total number of pixels spanning the feature in this case would be 248 (see Figure 8a). If the laser range finder has an uncertainty of ±0.30 m (±0.26 m horizontally), the real horizontal distance of the UAV from the target must lie between 19.74 m and 20.26 m (see Figure 8b,c). In this case, the error associated with the estimated height is ±0.02 m (varying from 1.78 m to 1.82 m, as shown in Figure 8b,c). Moreover, in this case, the difference in elevation between the UAV and the target is not known, because we do not have the coordinates of the target. Let us now consider another possible source of error: a miscounted number of pixels (this can be described as a collimation error, in which a camera ray does not collimate exactly with the top or bottom of the target). Practical experience has shown that there may be an uncertainty of ±3 pixels for a simple linear vertical feature like a pole; this uncertainty occurs at both the top and the bottom of the feature. If the target is more "complex" than a pole (like a person), this uncertainty could exceed 3 pixels. Considering the same situation described above (a DJI Phantom 4 PRO looking at a target of 1.80 m, 20 m horizontal distance, 30° pitch), an uncertainty of ±3 pixels at the top generates an error in the estimated height equal to ±0.021 m (see Figure 8d,e, where the error is exaggerated to underline the issue). The same uncertainty at the bottom of the feature generates an error of 0.023 m. These two errors, although quite similar, should be kept separate because they may differ significantly when the pitch assumes high values.
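Using the geometric reconstruction sketched in Section 2.5, the sensitivity of the estimated height to such a collimation error can be approximated by differentiating $hgt = H(\tan\delta - \tan\gamma)$ with respect to the measured pixel distances. The assumed pixel values below (≈125 px to the tip and ≈121 px to the base, with the optical axis near mid-height of the pole) are illustrative only.

```python
import math

H, pitch_deg, F = 20.0, 30.0, 3648.0
tip_px, base_px = 125.0, 121.0   # assumed pixel distances of tip and base from the center
gamma = math.radians(pitch_deg) - math.atan(tip_px / F)    # tip ray below the horizontal
delta = math.radians(pitch_deg) + math.atan(base_px / F)   # base ray below the horizontal

# Height error produced by a 3-pixel collimation error at the top and at the bottom.
err_top = H * (1.0 + math.tan(gamma) ** 2) / F * 3.0
err_bot = H * (1.0 + math.tan(delta) ** 2) / F * 3.0
print(round(err_top, 3), round(err_bot, 3))   # ~0.021 m and ~0.023 m
```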
According to the statistical propagation of error, these three errors can be combined as follows:

$$E = \sqrt{(hgt_H - hgt)^2 + (hgt_{PixTop} - hgt)^2 + (hgt_{PixBot} - hgt)^2} \quad (22)$$

where E denotes the error, $hgt$ is the real height of the target (in this case 1.800 m), $hgt_H$ is the calculated height of the target with the uncertainty related to the horizontal distance (1.820 m), $hgt_{PixTop}$ is the calculated height of the target with the uncertainty related to the number of pixels toward the top (1.821 m), and $hgt_{PixBot}$ is the calculated height of the target with the uncertainty related to the number of pixels toward the bottom (1.823 m). This analysis tells us that, statistically, we should expect an error of ±0.040 m.
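The combination in (22) is a straightforward root-sum-of-squares; a short check with the values quoted above gives approximately the figure reported in the text:

```python
import math

hgt = 1.800                                             # real height of the target (m)
deviations = [1.820 - hgt, 1.821 - hgt, 1.823 - hgt]    # H, PixTop and PixBot cases
print(round(math.sqrt(sum(d * d for d in deviations)), 3))   # ~0.037 m, i.e. about ±0.04 m
```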
The analysis conducted above does not consider the uncertainty associated with the angular measures. The declared angular uncertainty of the Phantom 4 PRO is ±0.02° (see Section 2.4) but, as already explained, this angular accuracy refers to the gimbal zero reference. Thus, in real case scenarios, it is necessary to consider a larger possible uncertainty; in this analysis, an angular uncertainty of ±1° is assumed for the pitch angle. In the case under consideration, the H value is calculated from the pitch angle and the slant range (see (21)), so there is a direct dependency between these three variables. According to the error propagation law, the uncertainty of the product is obtained by combining the uncertainties of the two factors (slant range: ±0.30 m and pitch angle: ±1°). The resulting uncertainty associated with the horizontal distance H can then be used to replace the horizontal-distance term in (22). The uncertainty of H given by the combination of the slant-range uncertainty and the angular uncertainty is ±0.32 m, which generates, statistically, an error of ±0.05 m when combined with the other uncertainties (PixTop and PixBot) in Equation (22). Once again, this is valid for a pitch equal to 30° and H equal to 20 m, which is the most probable scenario for the purpose of this project. In other conditions, the error will clearly be different and should therefore be recalculated.
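A minimal sketch of this propagation, combining the two contributions to H in quadrature, approximately reproduces the ±0.32 m figure quoted above (assumed values as in the text):

```python
import math

slant, pitch_deg = 23.1, 30.0
d_slant, d_pitch = 0.30, math.radians(1.0)     # assumed slant-range and pitch uncertainties

pitch = math.radians(pitch_deg)
dh_slant = math.cos(pitch) * d_slant           # contribution of the slant-range error to H
dh_pitch = slant * math.sin(pitch) * d_pitch   # contribution of the pitch error to H
print(round(math.hypot(dh_slant, dh_pitch), 3))   # ~0.33 m, close to the ±0.32 m reported above
```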
Let us now consider a more realistic scenario, taking into account consumer-level UAVs, such as the DJI Phantom 4 PRO, which are usually not fitted with laser range finders. In this case, we need to take into account the positional uncertainty of the UAV (the declared horizontal uncertainty of the Phantom 4 PRO is ±1.50 m [26]), the precision of the maps used to get the coordinates of the target (for a map at 1:4000 scale, we may consider an uncertainty of ±1 m [27]), the uncertainty related to the number of pixels at the top and bottom, and ±1° for the pitch, as seen before. The statistical error generated by the combination of these uncertainties can be calculated considering (22). To better analyze how the error may vary as a function of the relative position of the UAV with respect to the target, 20 cases with a random position of the UAV (from 5 m to 50 m horizontal distance from the target) and camera pitch angle (between 5° and 75°) were generated in a spreadsheet and visualized in the 3D CAD. These artificially created cases gave us full control over all the parameters, making it possible to isolate and analyze each uncertainty and to calculate their combination E. The results of this analysis are reported in Table 1.
It is interesting to analyze how the variables (horizontal camera-to-target distance, number of pixels toward the top/bottom, and pitch angle) behave when plotted against the corresponding errors generated in the calculation of the height (Figure 9). Since the horizontal camera-to-target distance has two associated uncertainties (UAV positional accuracy and target positional accuracy), this variable was plotted against those two errors (Figure 9a,b). In both cases, the relationship can be very well approximated by a negative power function, which tells us that the error is lower when the horizontal distance is higher. This should not be surprising: looking at the geometry in Figure 4 (and the associated formulas), it is evident that the same horizontal positional uncertainty of the UAV has a stronger impact (higher uncertainty in the estimation of the height) when the target is closer. Concerning the number of pixels (both top and bottom, Figure 9c,d), we can instead notice that the error generally increases when the number of pixels decreases. This is logical, since the number of pixels spanning a feature decreases with the distance, and the associated error is higher when the target is farther from the UAV. The angular uncertainty of the pitch generates a trend that increases exponentially when the angle is higher (Figure 9e). Let us now consider the statistical error vs. pitch (Figure 9f). Although quite scattered, Figure 9f shows that the error increases when the angle increases. Concerning the number of pixels (in this case, we considered all the pixels for simplicity), Figure 9g shows that the error increases when the number of pixels also increases. Let us now consider the overall statistical uncertainty vs. the horizontal distance (Figure 9h): a negative power function can be used to approximate this distribution (in this case, the regression function is $y = 2.7182x^{-0.905}$). This well-predictable behavior is probably due to the strong component of uncertainty related to the horizontal distance. The error varies from ±0.65 m, when the horizontal distance between the UAV and the target is 5 m, to 0.1 m when the distance is around 50 m. Finally, we should notice that the error at around 50 m (see case 11 in Table 1) is slightly higher than the previous one (case 10). This indicates that, possibly, the error grows again after 50 m due to the uncertainties related to the number of pixels and pitch (especially for high angles).
Theoretically, the regression curve that approximates the distribution of the statistical error against H ($y = 2.7182x^{-0.905}$) could be used to predict quite accurately the uncertainty of a height estimation at any horizontal distance (since $R^2 = 0.9849$). More practically, we need to take into account the following: (a) the function is valid for a target of 1.80 m, but in a real case scenario the height of the target is clearly not known; (b) in a real case scenario we would not have the exact horizontal distance, only a distance affected by a certain uncertainty; (c) the function is only valid for the system considered (DJI Phantom 4 PRO) and for the conditions described.
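For example, evaluating this regression at the extremes of the simulated range reproduces the error band reported above; this is only an empirical rule of thumb tied to the conditions of Table 1.

```python
def predicted_error_m(h_dist_m):
    """Empirical statistical error vs. horizontal distance, from the regression in Figure 9h."""
    return 2.7182 * h_dist_m ** -0.905

print(round(predicted_error_m(5.0), 2))    # ~0.63 m at 5 m (the table reports ~0.65 m)
print(round(predicted_error_m(50.0), 2))   # ~0.08 m at 50 m (about 0.1 m)
```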
Regarding the first point (a), let us consider Figure 10, where the statistical error has been calculated for ten different values of H considering target heights of 1.60 m, 1.80 m and 2.0 m. A different target height clearly generates a different distribution of the error, but the distribution is always well aligned along a negative power function. Moreover, the distribution flattens when the horizontal distance is above 15 m. Regarding point (b), we can use the same rationale: the regression curve that approximates the distribution of the statistical error against H becomes rather flat for values above 15 m to 20 m. Therefore, values of H above 15 m or 20 m, even if affected by uncertainty, can be used in the regression equation reported above without introducing a noticeable error. Finally, regarding point (c), we can only say that every case is different: conditions and uncertainties must be carefully considered and analyzed, because they may generate different errors.
All in all, we can say that in cases such as the one described in this study (where the horizontal uncertainty is constant and provides the largest contribution to the overall uncertainty), the statistical error has a well-predictable distribution (negative power). This behavior can be used to predict the uncertainty a priori using only the horizontal target-to-camera distance. However, this prediction should only be used (cautiously) when this distance is above 15 m to 20 m (assuming that the target is a person).