1. Introduction
In recent years, robots have been used to automate many tasks to improve productivity at manufacturing and other production sites. Real-time estimation of object position and posture using image processing is an essential function for autonomous robots that handle objects in such automation. Various image-processing methods exist for estimating object position and posture.
Methods using stereo cameras or RGB-D cameras can estimate the position, posture, and shape of an object from multiple RGB images or depth images, enabling the handling of general objects without any modification of the handling target. There have also been many attempts to estimate the position and posture of objects using machine learning [2,3,4,5,6]. In general, however, these methods have disadvantages such as the high implementation cost of collecting large data sets and the time required for training. They are therefore not suitable for applications that demand low cost and low computational resources.
On the other hand, visual markers are support tools that facilitate object identification and position and posture estimation. Since visual markers exploit known shape information and image features, a single 2D camera is sufficient to identify marker IDs and estimate the relative position and posture with respect to the camera. Although the marker must be fixed to the target object, this approach has the advantage of being inexpensive to implement. A variety of marker patterns have been proposed for various applications [7,8,9,10,11].
Among visual markers, AR markers have the advantage of easily enabling augmented reality and providing information to users in a more intuitive manner. Most AR markers use the principle of projective transformation to estimate the marker's position and posture by finding the homogeneous transformation matrix that represents the marker's coordinate system as seen from the camera coordinate system. The open-source ARToolKit marker library [12] is a typical example, and this marker technology is used for self-position estimation [13,14,15,16,17,18] and mapping [19,20,21] for mobile robots, making it one of the essential technologies in navigation systems.
On the other hand, AR markers can also be used for handling specific objects [22,23,24]. However, the size and recognition accuracy of AR markers are problematic when handling small objects or objects with complex shapes. Conventional AR markers require a large area on the target object, and overly conspicuous markers are undesirable from an aesthetic point of view in a real environment. Several previous studies have attempted to reduce marker size or improve marker aesthetics. Zhang et al. [25] developed a curved-surface marker that can be attached to cylindrical objects as small as 6 mm in diameter, enabling real-time tracking of ultrasound probes. However, these markers can only be recognized within a depth range of 30 to 125 mm from the camera, making them unsuitable for object handling that requires a large workspace. Costanza et al. [26] created markers that are unobtrusive to users in their living environment with "d-touch", an open-source system that allows users to design their own markers according to their aesthetic sense. However, neither miniaturization nor recognition accuracy was verified quantitatively.
It is also difficult to consistently achieve the recognition accuracy needed for accurate object handling. To address this problem, Douxchamps et al. [27] improved accuracy and robustness by physically increasing marker size and using high-density patterns to reduce noise and discretization in marker recognition. This method can recognize markers at a maximum of 0.06 to 4 ppm, but marker miniaturization then becomes a trade-off. Yoon et al. [28] presented a coordinate transformation algorithm that obtains the globally optimal camera posture from the local transformations of multiple markers, improving the accuracy of pose estimation. Yu et al. [29] presented a robust pose estimation algorithm using multiple AR markers and showed its effectiveness in real-time AR tracking. Hayakawa et al. [30] presented a 3D toothbrush positioning method that recognizes AR markers on each face of a dodecahedron attached to a toothbrush and achieved a motion tracking rate of over 99.5%. However, these methods require multiple markers to achieve high recognition accuracy, which in turn requires a large space for attaching them to objects.
There are two approaches to improving recognition performance with a single marker while maintaining the marker size: using filters, and using circular dots as feature points. Filter-based methods [31,32] reduce jitter between frames and stabilize posture recognition, but do not guarantee accurate recognition. Among the methods that use circular dots for posture estimation, Bergamasco et al. [33,34] achieved robustness against occlusion with markers that exploit the projective properties of a set of circular dots together with an ellipticity algorithm. In addition to circular dots, Tanaka et al. [35,36,37,38] presented an AR marker that uses lenticular lenses or microlens arrays to change its pattern depending on the viewing angle, reducing posture estimation error and improving robustness against changes in distance and illumination. These techniques have dramatically improved the recognition performance of a single marker and enhanced its practicality. They are therefore also promising for marker miniaturization, but this has not been demonstrated to date.
However, even if marker miniaturization and high recognition accuracy can be achieved, a general-purpose camera system limits the range of marker recognition, making practical operation difficult. Because the markers are so small, they cannot be recognized with high accuracy from an overhead view of the workspace; conversely, a magnified view narrows the field of view, making it difficult to recognize the surrounding environment necessary for handling. To solve this problem, Toyoura et al. [39] presented a monospectrum marker that enables real-time detection from blurred images, thereby extending the recognition range. However, the recognition of translational positions has an average error of 5 to 10 mm, which does not meet the accuracy required for object handling. Another disadvantage is that the system requires a high-performance GPU for real-time detection.
Based on the background described above, this study develops a prototype micro AR marker of 10 mm per side that is compatible with the high-accuracy recognition method of Tanaka et al. [38], and constructs a low-cost, high-accuracy marker recognition system using a general-purpose web camera. The micro AR marker is printed on a glass substrate by high-resolution photolithography, so that the marker image is not easily degraded even when magnified by the camera. First, we demonstrate that this AR marker inherently provides very high position and posture recognition accuracy despite its ultra-compact size. We also reveal that the recognition range of a conventional camera system is insufficient for practical use. To solve this problem, we then present a new dynamic camera parameter control method that maintains high recognition accuracy over a wide field of view, and demonstrate its effectiveness through a series of experiments.
This paper is organized as follows. Section 1 describes the background and objectives of this study. Section 2 describes the overall system configuration. Section 3 describes the process of the proposed camera control system, i.e., the algorithm for camera parameter optimization. Section 4 describes the results of the evaluation experiments of the proposed camera control system. Section 5 discusses the experimental results. Finally, Section 6 summarizes this paper and outlines future issues.
3. Camera Control System
This section describes the dynamic camera control processes that properly adjust the zoom, focus, and calibration data to finally determine the position and posture of the AR markers. These three parameters are collectively referred to as the "camera parameters".
3.1. Marker Recognition Process
As shown in Figure 6, the proposed camera control system consists of two processes: the Scanning process, which scans the camera's shooting range to detect AR markers, and the Iteration process, which optimizes the camera parameters based on the detected AR marker positions to determine the final AR marker position and posture. Here, it is assumed that the AR marker remains within the camera's shooting range while the camera parameters are being changed.
Table 2 shows the camera parameters used in this system and the quantities used to calculate them. As will be discussed in Section 4.4, if the camera zoom function is not used when recognizing micro AR markers, the AR marker occupies only a very small area of the image and the recognition range becomes narrow. Conversely, when the zoom magnification is large, the position of AR markers close to the camera is recognized poorly. Therefore, the zoom value must be controlled dynamically to obtain a wide recognition range. The zoom value W is proportional to the depth distance z of the AR marker from the camera, defined in Figure 2, and is calculated by Equation 1. The camera magnification n is also proportional to W, as shown in Equation 2.
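As a rough illustration of Equations 1 and 2, the following Python sketch computes the zoom value from the depth distance using the constants listed in Table 2 (C = 1000 1/m, zoom range 100 to 500, minimum depth 0.05 m). The clamping behavior and the assumption that the magnification equals W divided by the minimum zoom value are illustrative guesses, since the equations themselves are not reproduced in this text.

```python
# Hypothetical sketch of Equations (1) and (2): zoom value W proportional to the
# depth distance z, and camera magnification n proportional to W.
# Constants are taken from Table 2; the clamping and n = W / W_MIN are assumptions.
C = 1000.0                # conversion coefficient of zoom value (1/m)
W_MIN, W_MAX = 100, 500   # zoom value range of the camera
Z_MIN = 0.05              # minimum depth distance (m)

def zoom_value(z: float) -> int:
    """Assumed form of Equation (1): W = C * z, clamped to the camera's zoom range."""
    z = max(z, Z_MIN)
    return int(min(max(C * z, W_MIN), W_MAX))

def magnification(W: int) -> float:
    """Assumed form of Equation (2): n proportional to W, with n = 1 at minimum zoom."""
    return W / W_MIN

print(zoom_value(0.3), magnification(zoom_value(0.3)))  # e.g. 300, 3.0
```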
As shown by the data in Section 4.3, a fixed-focus camera has a narrow recognition range for the AR markers, covering only 0.2 m in depth distance. An auto-focus camera, on the other hand, often fails to recognize the AR markers because it unexpectedly focuses on background regions with greater contrast than the micro AR markers. Therefore, the focus must be controlled explicitly to obtain a wide recognition range. The focus value F is also determined from the marker's depth distance z using Equation 3; the smaller the focus value, the farther the focal point is from the camera. The two constants used in Equation 3 were obtained in advance by measuring the optimum focus value as a function of the depth distance from the camera and fitting an exponential approximation to the measured results.
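Equation 3 is not reproduced here, but an exponential relation between focus value and depth distance could look like the sketch below. The constant values and the exact functional form are placeholders for illustration, not the authors' calibrated values.

```python
import math

# Hypothetical sketch of Equation (3): focus value F as an exponential function of the
# depth distance z. The constants A and B are placeholders; in the paper they are
# obtained by fitting measured optimum focus values against depth distance.
A = 150.0    # placeholder constant (focus value near the closest distance)
B = -3.0     # placeholder decay constant; smaller F means a farther focal point

def focus_value(z: float) -> int:
    """Assumed form F = A * exp(B * z), clamped to the camera's focus range 0-150."""
    return int(min(max(A * math.exp(B * z), 0), 150))

print(focus_value(0.1), focus_value(1.0))  # the focus value decreases with distance
```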
The calibration data consist of the distortion coefficient matrix and the internal parameter matrix of the camera. The distortion coefficient matrix describes the lens distortion and is represented as a 1 × 5 matrix containing the radial and tangential distortion coefficients, as in Equation 4. The internal parameter matrix describes camera-specific parameters and is represented as a 3 × 3 matrix containing the focal lengths and the optical center, as shown in Equation 5. In this work, 31 calibration data sets were prepared, one for every increment of 5 in the focus value. The system selects the most appropriate calibration data based on the calculated focus value and applies it to the next recognition process.
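Equations 4 and 5 are not reproduced here, but if the calibration follows the common OpenCV convention, the data selected for each focus value would resemble the sketch below. The container layout, the placeholder values, and the rounding to the nearest multiple of 5 are assumptions made for illustration.

```python
import numpy as np

# Hypothetical layout of one calibration data set, following the common OpenCV
# convention: a 1x5 distortion vector (k1, k2, p1, p2, k3) and a 3x3 camera matrix
# containing the focal lengths (fx, fy) and optical center (cx, cy). Values are placeholders.
calibration_table = {
    F: {
        "dist": np.zeros((1, 5)),                # Equation (4), assumed OpenCV form
        "K": np.array([[900.0,   0.0, 960.0],    # Equation (5), assumed OpenCV form
                       [  0.0, 900.0, 540.0],
                       [  0.0,   0.0,   1.0]]),
    }
    for F in range(0, 151, 5)                    # 31 data sets, every 5 focus steps
}

def select_calibration(focus_value: float):
    """Pick the calibration data whose focus value is closest (rounded to a multiple of 5)."""
    key = int(round(focus_value / 5.0) * 5)
    key = min(max(key, 0), 150)
    return calibration_table[key]
```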
The two processes shown in Figure 6 are described in detail below.
In the Scanning process, AR markers are initially detected by scanning within the camera's shooting range. The Scanning process follows the steps below.
- (1a) The camera parameters are set to the initial values W = 500 and F = 150, i.e., the zoom at its maximum and the focus at its nearest point. The maximum zoom provides a wide recognition range in the depth direction, and sweeping the focus from near to far allows the scan to be processed faster.
- (1b) AR markers are detected within a single image frame output from the camera.
- (1c) If no AR marker is found in the image, the focus value F is reduced by 15 to focus on a more distant point.
- (1d) Steps (1b) – (1c) are repeated until an AR marker is detected.
- (1e) When an AR marker is detected for the first time, its initial position and posture are obtained and passed as initial values to the subsequent Iteration process.
According to the above algorithm, the scanning takes a maximum of 11 frames of images before the AR marker is detected. Since the frame rate of the camera used is 30 fps, the maximum scanning time is theoretically about 0.33 seconds. In reality, however, even if the AR marker was at the farthest point within the recognition range, the scanning time was only about 0.3 seconds. This is because the AR marker could be detected even when the focus position was in front of the AR marker, and detection was possible from the 10th frame of the image.
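A minimal Python-style sketch of the Scanning process is given below. The camera interface (set_zoom, set_focus, grab_frame) and the detect_markers routine are hypothetical stand-ins; this illustrates the loop structure of steps (1a) – (1e) rather than the authors' actual implementation.

```python
# Minimal sketch of the Scanning process (steps 1a-1e). The camera interface
# and detect_markers() are hypothetical stand-ins for the system's actual routines.
def scanning_process(camera, detect_markers):
    W, F = 500, 150                      # (1a) maximum zoom, nearest focus point
    camera.set_zoom(W)
    while F >= 0:
        camera.set_focus(F)
        frame = camera.grab_frame()
        markers = detect_markers(frame)  # (1b) detect AR markers in one image frame
        if markers:
            # (1e) return the initial position/posture for the Iteration process
            return markers[0].position, markers[0].posture, W, F
        F -= 15                          # (1c) shift the focus to a more distant point
    return None                          # no marker found within the focus sweep
```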
In the Iteration process, the camera parameters are optimized to determine the final marker position and posture with enhanced accuracy. The Iteration process follows the steps below.
- (2a) The initial position and posture recognition values of the AR marker are received from the Scanning process.
- (2b) The camera parameters are updated based on the recognized depth distance z of the AR marker.
- (2c) The next position and posture recognition values are obtained with the updated camera parameters.
- (2d) The absolute errors between the successive position values and between the successive posture values are calculated. If the errors are larger than the respective thresholds, steps (2b) – (2c) are repeated.
- (2e) If the absolute errors calculated in step (2d) are smaller than the thresholds, the latest position and posture values are output as the final recognition values.
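The convergence loop of steps (2a) – (2e) can be sketched as follows. The functions recognize_marker and update_camera_parameters are hypothetical stand-ins, and the per-element absolute error metric and the threshold units (1.0 mm and 0.01 deg, per Table 3) are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of the Iteration process (steps 2a-2e). recognize_marker() and
# update_camera_parameters() are hypothetical stand-ins for the system's routines.
def iteration_process(camera, recognize_marker, update_camera_parameters,
                      p0, q0, eps_p=1.0, eps_q=0.01):
    p_prev, q_prev = np.asarray(p0), np.asarray(q0)   # (2a) initial values from Scanning
    while True:
        z = p_prev[2]                                  # recognized depth distance (assumed z component)
        update_camera_parameters(camera, z)            # (2b) update zoom, focus, calibration
        p, q = recognize_marker(camera)                # (2c) new position/posture values
        p, q = np.asarray(p), np.asarray(q)
        err_p = np.max(np.abs(p - p_prev))             # (2d) position error (mm, assumed)
        err_q = np.max(np.abs(q - q_prev))             #      posture error (deg, assumed)
        if err_p < eps_p and err_q < eps_q:
            return p, q                                # (2e) final recognition values
        p_prev, q_prev = p, q
```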
3.2. Dynamic Camera Parameter Controller
Figure 7 shows a block diagram of the camera parameter optimization process within the Iteration process of Figure 6. The final outputs are the marker's position and posture. The Iterator judges whether the recognized position and posture have converged to an accuracy within the set thresholds, as in Equations 6 and 7.
The Iterator also calculates the marker's depth distance z for updating the camera parameters, taking into account the magnification of the image due to zooming. When the image is magnified by a factor of n by zooming, the "apparent" depth distance becomes 1/n of the real value z. Therefore, the zoom value W is input to the LEAG-Library so that the real marker position is recognized by compensating for the apparent depth distance, as shown in Equation 8.
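Under the relation just described, the depth compensation of Equation 8 can be sketched as follows; the exact form in the paper is not reproduced, so this is an assumed reconstruction, and the relation n = W / W_min is carried over from the earlier zoom sketch.

```python
# Assumed reconstruction of Equation (8): the library reports an "apparent" depth
# distance under digital zoom; the real depth is recovered by multiplying by the
# current magnification n (see Equation (2)).
def real_depth(z_apparent: float, W: int, W_min: int = 100) -> float:
    n = W / W_min           # camera magnification, assumed n = W / W_min
    return n * z_apparent   # z = n * z_apparent
```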
After the depth distance z is calculated, the zoom value W and the focus value F are updated by Equations 1 and 3, respectively. In addition, the lens distortion coefficients and the internal parameter matrix are selected from the calibration data list based on the focus value F.
5. Discussion
Using the proposed dynamic camera control system, the 10 mm square micro AR marker can be recognized with high accuracy in both depth distance and rotation angle. The depth recognition range is 1.0 m, which is five times greater than the range obtained with fixed camera parameters. In the most closely related recent study, Inoue et al. [41] proposed an AR marker pose estimation system using a 64 mm × 64 mm AR marker, in which the recognition accuracy was 4% in the depth direction and the depth recognition range was 0.1 ∼ 0.7 m. Compared with this state-of-the-art work, the proposed method achieves significantly better performance in terms of marker size, recognition accuracy, and recognition range.
Regarding the iteration time, the proposed system requires a maximum of 0.6 to 0.7 seconds for convergence of recognition values for both position and posture. For application to object tracking in a robotic manipulation system, the iteration time can be significantly reduced except for the initial detection. This is because the previous marker recognition value and the kinematic information of the robot arm can be utilized to efficiently determine the initial values of the camera parameters at the next recognition.
On the other hand, the recognition performance of the proposed system depends on the hardware specifications. A camera with higher zoom magnification capability, including an optical zoom, could further expand the recognition range. The iteration time also depends on the computational resources. For low-cost implementation, this study uses an ordinary laptop PC without a GPU; however, the real-time performance of the system could be further improved by using an industrial camera with a high frame rate and a processor with GPU acceleration.
Figure 1. The 10 mm square micro AR marker prototyped in this study.
Figure 2. Definition of the marker position and posture with respect to the camera.
Figure 3. Logicool BRIO C1000eR®.
Figure 4. System architecture.
Figure 5. Correlation diagram of ROS nodes.
Figure 6. Flowchart of proposed method.
Figure 7. Block diagram of camera parameter optimization.
Figure 8. Experimental setup: (a) Definition of AR marker position and angles. (b) Scene of the experiment.
Figure 9. Recognition error rate of the marker's depth distance (z) with fixed camera parameters.
Figure 10. Recognition error rate of the marker's depth distance (z) using dynamic focus control with constant zoom values (W).
Figure 11. Recognition error rate of the depth distance (z) with the proposed camera control for different marker angles.
Figure 12. Iteration time of the depth distance (z) to converge with the proposed camera control for different marker angles.
Figure 13. Convergence of the depth distance (z) measurement values to the true values by iteration: (a) z = 1.0 m; (b) z = 0.5 m; (c) z = 0.1 m.
Figure 14. Recognition error of the marker angle with the proposed camera control for different depth distances (z).
Figure 15. Iteration time of the marker angle with the proposed camera control for different depth distances (z).
Table 1. Camera specifications.

Product name | Logicool BRIO C1000eR®
Output resolution | 1920 × 1080 (FHD)
Frame rate | 30 fps
Diagonal FOV | 90°
Digital zoom | 1x - 5x
Size (mm) | 102 × 27 × 27
Table 2. Parameters used in the proposed camera control system.

Parameter | Symbol | Value
Zoom value | W | Variable
Maximum zoom value | | 500
Minimum zoom value | | 100
Depth distance of AR marker | z | Variable (m)
Minimum depth distance | | 0.05 (m)
Conversion coefficient of zoom value | C | 1000 (1/m)
Camera magnification | n | Variable
Focus value | F | Variable
Constants of focus value | | Obtained in advance by exponential approximation (Equation 3)
Distortion coefficient matrix | | Determined by calibration
Radial distortion coefficients | |
Tangential distortion coefficients | |
Internal parameter matrix | | Determined by calibration
Focus distance | |
Optical center | |
AR marker position | | Iterative output variable
AR marker posture | | Iterative output variable
Threshold of position error | | Arbitrarily set
Threshold of posture error | | Arbitrarily set
Table 3. Initial camera parameter values and error thresholds.

Parameter | Symbol | Value
Zoom value | W | 500
Focus value | F | 150
Threshold of position error | | 1.0 (mm)
Threshold of posture error | | 0.01 (deg)