1 Overview of the Project

Fig. 1 Screenshot of our web application

Fig. 2 Screenshots of our Android app

URWalking started as a student project for indoor navigation at the University of Regensburg. Later on, the project was integrated into the research project NADINE (funded by BMBF, see Footnote 1) to implement a navigation aid for public transport users covering all public transport stops in Nuremberg (Germany). Furthermore, the system was incorporated into the research project DIVIS (funded by IuK Bayern) for the implementation of advanced indoor tracking strategies that mainly use the inertial sensors of smartphones combined with spatial and behavioural knowledge. We have developed a web application for use in any browser and a smartphone app for Android devices. The interface was iteratively improved by updating the implementation to incorporate results of evaluations of previous versions [2, 19].

URWalking’s knowledge base connects 5,130 rooms in 27 buildings of the university itself and the nearby University of Applied Sciences. The area covered by the system exhibits the following quantitative features:

Space in total: 4.3 km\(^2\)
Length of all paths: 110 km
Doors: 1,563
Elevators: 320
Stairs: 151

We have also drawn maps for each floor in each of the modelled buildings. In doing so, we incorporated results from several user studies [3, 17].

Currently, the system is in daily use at the University of Regensburg as a web application and an Android app. It processes between 100 and 5,000 route inquiries per day—a typical pattern for a university navigation system: when a new term starts, many people are new to the campus and seek assistance in finding locations. The inquiries decrease as soon as these people have acquired a spatial map of the campus.

2 Components of the System

Fig. 3 A shortest-path route compared to a route that takes human preferences for wayfinding decisions into account

URWalking consists of four main components: the basic component is the web server that stores all maps of the covered environment and calculates routes. For URWalking to be used in a browser, we developed a web client as the second component of the URWalking system (see Fig. 1). As the web client lacks indoor positioning, users can give interactive feedback when they want to see the next routing instruction. The third component is our Android application. It provides the same functionality as the web client plus indoor tracking of the user’s current position along the route (see Fig. 2). The fourth component is YAMA—a web app for creating, editing, and maintaining map data (see Fig. 4).

Fig. 4 Screenshot of the YAMA tool displaying an annotated map of a floor in the central building of the university

2.1 URWalking Web Server

Routes are calculated by a shortest path algorithm. Edge weights have been learned using genetic optimisation in order to reflect human route preferences, which often lead to routes that differ considerably from shortest routes in terms of time or distance [15, 24, 25]. When selecting a destination, users can choose which type of route calculation they prefer. Routes that mimic human wayfinding decisions are typically longer than shortest routes, but less complex in terms of transitions between indoor and outdoor areas and changes in walking direction (see Fig. 3). These preferences seem linked to a person’s spatial map of an environment and the person’s capability to recall details. It is a challenging issue for future research to analyse this relationship in detail.
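
To illustrate the two route types, the following minimal sketch (in Python, with made-up node names, edge lengths, and preference penalties; it is not the actual server code) runs the same search once with a purely metric cost and once with a preference-weighted cost:

import heapq

# Toy graph: each edge carries a metric length (m) and a learned preference
# penalty (higher = less attractive for human wayfinders); all values are made up.
graph = {
    "entrance":   [("corridor_A", {"length": 40, "penalty": 1.0})],
    "corridor_A": [("stairs_1",   {"length": 15, "penalty": 6.0}),   # floor change, turn
                   ("corridor_B", {"length": 60, "penalty": 1.2})],
    "stairs_1":   [("room_101",   {"length": 20, "penalty": 1.0})],
    "corridor_B": [("room_101",   {"length": 25, "penalty": 1.0})],
}

def route(start, goal, edge_cost):
    """Dijkstra search returning (cost, node list) under the given cost function."""
    queue, visited = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, attrs in graph.get(node, []):
            heapq.heappush(queue, (cost + edge_cost(attrs), neighbour, path + [neighbour]))
    return float("inf"), []

print(route("entrance", "room_101", lambda a: a["length"]))                 # shortest route
print(route("entrance", "room_101", lambda a: a["length"] * a["penalty"]))  # preference-aware route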

The web server provides an API used by both client applications for route calculation and for access to maps and route instructions (see Figs. 1 and 2).

2.2 Web Application

The purpose of the web app is to visualize maps and routing instructions in a browser window. As the app runs on all common browsers, it requires no installation and is used by the majority of our users.

2.3 Android Application

The Android application (see Fig. 2) relies on the web server for the calculation of routes and appropriate instructions. It is available on the Google Play Store (see Footnote 2) and constitutes our experimental framework for adding indoor tracking of users in order to provide a user experience similar to car navigation systems and to update maps and routing instructions automatically. In February 2022, there were around 1,700 active installations. Reliable indoor tracking of users over the several hours they spend on campus performing very different activities is still an open research issue.

2.4 Data for Indoor Navigation

For calculating routes and generating routing instructions, the web server uses maps that model the environment [24]. Our concept for mapping indoor environments is hybrid. Firstly, it is graph-based (modelling the accessibility relations between locations that are needed for route calculation). Secondly, it is hierarchical in order to formalize the structure of the environment (separate areas of an environment, separate buildings in an area, separate levels in each building). Thirdly, it is semantic: nodes come in several categories in order to capture the various functions of locations in buildings (e.g. doors, toilets, offices, corridors, or landmarks—see Fig. 4). Edges can also be of different categories: e.g. indoor connections between two nodes, outdoor connections, stairs, elevators, street crossings. Node and edge categories allow route calculation to account for user preferences [15]: e.g. handicapped users prefer elevators over stairs. Finally, it is spatial: for indoor position tracking we snap positions to edges as the most probable footways on a route, but sometimes we have to consider the spatial dimensions of locations: e.g. for broad corridors, a grid is a better model because it allows users higher degrees of freedom to move within the corridor.
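
The following sketch illustrates this hybrid model with simple Python data classes; the category names and attributes are assumptions chosen for the example and do not reproduce URWalking’s actual data schema:

from dataclasses import dataclass
from enum import Enum, auto

class NodeCategory(Enum):
    DOOR = auto(); OFFICE = auto(); CORRIDOR = auto(); TOILET = auto(); LANDMARK = auto()

class EdgeCategory(Enum):
    INDOOR = auto(); OUTDOOR = auto(); STAIRS = auto(); ELEVATOR = auto(); STREET_CROSSING = auto()

@dataclass
class MapNode:
    node_id: str
    category: NodeCategory   # semantic layer: function of the location
    area: str                # hierarchical layer: area / building / level
    building: str
    level: int
    x: float                 # spatial layer: position on the floor map
    y: float

@dataclass
class MapEdge:
    source: str
    target: str
    category: EdgeCategory   # used to penalise or exclude edges per user profile
    length_m: float

def accessible(edges, forbidden=frozenset({EdgeCategory.STAIRS})):
    """Example of a preference filter: drop stairs edges, e.g. for wheelchair users."""
    return [e for e in edges if e.category not in forbidden]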

For visualizing the model as maps of indoor areas, we used Inkscape to generate scalable maps in the SVG vector graphics format. All other (symbolic) data can be created, edited, and maintained with the editing tool YAMA (see Fig. 4) that is part of URWalking.

As all data for running URWalking in an environment can be configured with YAMA, URWalking can be set up for any environment by repeating our systematic procedure for creating the data for the navigation service on our campus. For this purpose, our software is available as freeware upon request (see Footnote 3).

3 Applications in Research Contexts

URWalking implements path planning based on state-of-the-art algorithms. It showcases that real-time route calculation is tractable in a client-server framework with several thousand requests per day, many of them concurrent. In line with recent findings in the literature [1, 5, 12, 26], URWalking generates routing instructions that incorporate the most salient landmark close to the user. The instructions are generated fully automatically by inspecting the landmarks’ properties, such as their visibility in advance [13, 22]. To the best of our knowledge, no other freely available pedestrian indoor navigation system offers these features.
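
As a minimal illustration of this selection rule, the sketch below picks the most salient landmark among those visible in advance of the upcoming decision point; the candidate objects and their salience values are made up:

def pick_landmark(candidates):
    """candidates: list of dicts with 'name', 'salience', and 'visible_in_advance'."""
    visible = [c for c in candidates if c["visible_in_advance"]]
    return max(visible, key=lambda c: c["salience"], default=None)

chosen = pick_landmark([
    {"name": "red sculpture",     "salience": 0.90, "visible_in_advance": True},
    {"name": "vending machine",   "salience": 0.60, "visible_in_advance": True},
    {"name": "lecture hall door", "salience": 0.95, "visible_in_advance": False},
])
print(chosen["name"])   # -> red sculpture: most salient among the visible candidates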

A further unique property of URWalking is that it has been running for several years as a service for members of the university. As we log usage data, we are collecting a corpus of naturalistic user data in realistic settings. Furthermore, we conduct experiments using URWalking to investigate two important research questions in assisted wayfinding: (1) Can we predict, in real time, how users perceive the current situation at any point during a wayfinding task and which decision they will take next? (2) Can we validate conceptual models that claim to contribute to an answer by building machine learning models that implement these concepts?

3.1 Data Driven Validation of Wayfinding Models

A prominent conceptual model of human decision making during wayfinding is the one proposed in [8]. The authors measure the complexity of a routing instruction (type) \(t_i\) in an outdoor environment \(e\) as:

$$\begin{aligned} c(e,t_i) = (1-w_1)\cdot \text{b}(t_i,e) + w_1\cdot \bigl(\beta \cdot \text{v}(t_i,e)+(1-\beta )\cdot \text{r}(t_i,e)\bigr) \end{aligned}$$

Here, \(\text{ b }(t_i,e)\) is the branching factor, i.e. the number of options for deciding how to continue the route. \(\text{ r }(t_i,e)\) is the ease of detecting a mentioned landmark in the physical environment, and \(\text{ v }(t_i,e)\) is the visibility in advance of landmarks. In order to understand how such models can be validated, adjusted, or modified for indoor environments, we conducted various controlled eye tracking studies. These are controlled experiments in which we record video data synchronized with gaze data (obtained from the SMI Eye Tracking Glasses 2) and performance data such as hesitations or errors during wayfinding and the time needed to follow URWalking’s routing instructions.
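
The measure can be transcribed directly into code; the weights \(w_1\) and \(\beta\) and the example values below are illustrative assumptions, not values reported in [8]:

def instruction_complexity(b, v, r, w1=0.5, beta=0.5):
    """c(e, t_i) for branching factor b, visibility in advance v, and ease of
    detection r of the referenced landmark (all for instruction type t_i in
    environment e); w1 and beta weight the landmark-related terms."""
    return (1 - w1) * b + w1 * (beta * v + (1 - beta) * r)

# e.g. a decision point with three options and a landmark that is easy to spot
print(instruction_complexity(b=3, v=0.9, r=0.8))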

In order to address question (1), we interpret \(\text{ b }(t_i,e)\) and \(\text{ r }(t_i,e)\) as predictors for how users will perceive a routing instruction at their current position. In our work, we try to find signals that allow us to implement a model that can serve as a proxy for \(\text{ b }(t_i,e)\) and \(\text{ r }(t_i,e)\), respectively.

In our analyses [1, p. 96], we found that, in contrast to outdoor environments, \(\text{ b }(t_i,e)\) does not influence task performance, i.e. correct human decisions. We conclude that indoor contexts seem to be clearer in terms of changes in direction due to architectural elements such as corridors, stairs, entries, or elevators.

For a data-driven operationalization of \(\text{ r }(t_i,e)\), we applied state-of-the-art machine learning models from computer vision to predict the visual salience of landmarks in route instructions from photographs of the landmarks [4]. We fine-tuned a pretrained VGG 19 CNN on the photographs with the respective salience as target variable. Results indicate that high-level style, high-level content, and visual complexity of the photographs are the best features the CNN can generate for predicting landmark salience. We conclude that \(\text{ r }(t_i,e)\) is correlated with \(\text{ v }(t_i,e)\). This observation would allow us to automatically extract landmarks from visualisations of objects in indoor environments. As a consequence, we can simplify the approach in [18]: there is no need to classify objects as POIs to compute their salience. Instead, we can estimate it from photographs of the objects.
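
The fine-tuning step can be sketched with torchvision’s pretrained VGG 19 and a single-output regression head; the frozen feature extractor, hyper-parameters, and the data loader are assumptions for illustration and may differ from our actual training setup:

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for param in model.features.parameters():   # keep the convolutional features frozen
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 1)    # replace the 1000-class head by one salience score

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)

def train_epoch(loader):
    """loader yields (images, salience): (B, 3, 224, 224) tensors and (B, 1) targets."""
    model.train()
    for images, salience in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), salience)
        loss.backward()
        optimizer.step()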

However, the limitation of this approach is that the photographs isolate the landmarks and do not show them in their usual surroundings with many visual distractors.

3.2 Real-Time Prediction of Areas of Interest

Fig. 5 Examples of video frames with objects identified and labelled by YOLO (see [21] for details)

In order to overcome this limitation, we decided to automatically recognize objects in the video stream recorded by the Eye Tracking Glasses and thereby detect which objects users focus on. This could result in a better proxy for \(\text{ r }(t_i,e)\). For object recognition, we used the YOLO [23] model as a state-of-the-art neural network. As our corpus was small, we used YOLO pretrained on the COCO dataset [16] without any fine-tuning. Our assumption was that COCO contains classes of objects that are typical in our indoor wayfinding video streams, such as doors, stairs, or hallways (see Fig. 5).
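
The detection step can be sketched as follows. Our experiments used the original YOLO [23]; purely for illustration, the sketch uses a current off-the-shelf implementation (ultralytics), and the weight file, video file name, and confidence threshold are assumptions:

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                          # COCO-pretrained weights, no fine-tuning
video = cv2.VideoCapture("wayfinding_session.mp4")  # video stream from the eye tracker
while True:
    ok, frame = video.read()
    if not ok:
        break
    for result in model(frame, conf=0.25, verbose=False):
        for box in result.boxes:
            label = model.names[int(box.cls)]       # detected COCO class
            print(label, float(box.conf))           # and its confidence
video.release()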

Table 1 YOLO’s classification results for our video data

The results in Tab. 1 indicate that the pretrained model recognizes too many classes with low confidence and accuracy values. From these results, we learned that fine-tuning is indispensable. For GeoAI, data sets that cover indoor objects more specifically than COCO would be beneficial for automatically detecting the objects that are focussed during wayfinding. Such data would allow us to better understand which environmental stimuli influence human decision making. In order to contribute to research question (2), we are currently annotating our data in order to fine-tune YOLO on the environment of the Regensburg campus. Our objective is to come up with an improved proxy for \(\text{ r }(t_i,e)\).
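
Once the annotation is finished, the fine-tuning itself is a standard training run; the sketch below again uses the ultralytics implementation as a stand-in, and the dataset file, class list, and hyper-parameters are assumptions:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # start from COCO-pretrained weights
model.train(
    data="regensburg_indoor.yaml",     # hypothetical dataset: doors, stairs, signs, ...
    epochs=100,
    imgsz=640,
)
metrics = model.val()                  # evaluate on held-out annotated frames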

3.3 Prediction of Landmark Salience from Gaze Data

As the lack of reliable object recognition is a drawback for finding a proxy for \(\text{ r }(t_i,e)\), we tried a less supervised approach: ignoring all visual data, we analysed the gaze data recorded synchronously with the video streams. We based our analysis on fixations. In earlier research, fixation duration was used as an indicator for the difficulty of extracting the information processed [11], and fixation frequency was considered a factor of search efficiency [9]. These variables can be used to analyze the cognitive processes during wayfinding [14] and indicate how the next situation is perceived. Related results in eye tracking research point out the distinction between ambient and focal visual processing of visually perceived information [20]: during ambient processing, information is explored superficially and input from peripheral vision may control eye movements; during focal processing, central vision becomes dominant, the collected information is processed, and salient objects are recognized and interpreted.

Fig. 6 The fixation frequency on (Fixation-on) and outside (Fixation-out) the display, averaged over all subjects for each landmark in the corresponding routing instruction

As automated detection of fixations on the smartphone’s display was not reliable, we annotated each frame manually with a binary label: Is the gaze position on the navigation aid’s display or outside of it? From the annotated data, we could extract the frequency of fixations anywhere on the navigation aid’s display and outside of it (see Fig. 6).

To compute the frequencies of interest (on the display: \(F_{\text{ on }}\); outside of the display: \(F_{\text{ out }}\)), we applied Empirical Mode Decomposition [7] to the gaze data: according to the results in e.g. [6, 20], focal processing is characterized by high fixation frequencies. As we worked with two different stimuli visible at the same time (display and environment), we normalized \(F_{\text{ out }}\) by \(F_{\text{ on }}\) and calculated a relation between the degree of focal processing outside of and on the display. For that purpose, we defined the quotient PS (perceived salience of landmarks in routing instructions):

$$\begin{aligned} PS = \frac{F_{\text{ out }}}{F_{\text{ on }}}. \end{aligned}$$

To avoid PS being undefined, we set \(PS=F_{\text{ out }}\) if \(F_{\text{ on }}=0\). The intuition behind PS is the following distinction between two cases:

  • \(PS>1\): the degree of focal processing in the environment is higher than that on the display. For the current routing instruction, the test person can focus their attention on a few objects in the environment and needs little effort to explore the environment to find the landmark referred to in the instruction.

  • \(PS\le 1\): the degree of focal processing in the environment is low relative to that on the display. So, proportionally, test persons need more ambient processing to explore the environment before eventually locating the referred landmark.
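
A minimal computational sketch of PS, assuming fixations have already been detected (in our analysis via Empirical Mode Decomposition) and labelled as being on or off the display; the input format and example values are assumptions:

def perceived_salience(fixations_on_display, duration_s):
    """fixations_on_display: list of booleans, one per detected fixation
    (True if the fixation landed on the navigation aid's display)."""
    n_on = sum(1 for on_display in fixations_on_display if on_display)
    n_out = len(fixations_on_display) - n_on
    f_on = n_on / duration_s    # fixation frequency on the display (Hz)
    f_out = n_out / duration_s  # fixation frequency in the environment (Hz)
    return f_out if f_on == 0 else f_out / f_on

# PS > 1: focal processing mainly in the environment; PS <= 1: mainly on the display.
print(perceived_salience([True, False, False, True, False], duration_s=4.0))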

Fig. 7 The perceived salience score (PS) averaged over all subjects for each landmark

Figure 7 visualizes the perceived salience score PS averaged over all test persons. The plot shows that, despite careful selection of the landmarks, PS is far from constant. We believe we have learned the following lesson: while the concept of PS appears quite simplistic, we can provide evidence that, for our data, it correlates significantly with the concept of visual salience [12, 26] based on subjective self-reports (Spearman \(r = 0.656\), \(p = 0.0042\)). So, in fact, landmarks rated high for visual salience are also perceived as visually salient in the complex physical environment in which they are embedded. This observation is exciting, as it allows PS to be interpreted as a real-time proxy for visual salience ratings. As a consequence, in the sense of research question (2), PS can be beneficial for the automated generation of routing instructions that refer to landmarks that are salient for the user at the moment the instruction is given. This is a major advantage over ratings that are collected with questionnaires in a non-naturalistic way.
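
The reported correlation can be recomputed from the per-landmark scores with a single call to scipy; the arrays below are placeholders, not our measured values:

from scipy.stats import spearmanr

ps_scores        = [1.8, 0.6, 2.4, 1.1, 0.9]   # mean PS per landmark (placeholder)
salience_ratings = [4.2, 2.1, 4.8, 3.0, 2.6]   # self-reported visual salience (placeholder)
r, p = spearmanr(ps_scores, salience_ratings)
print(f"Spearman r = {r:.3f}, p = {p:.4f}")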

However, PS is too simplistic to explain the viewing process completely. Future GeoAI research should take up the challenge of getting more out of eye tracking data by applying more detailed models of viewing behaviour in order to understand the influence of architectural constraints imposed by the indoor surroundings, e.g. the width of corridors. A second limitation of our study is the lack of information about which objects were fixated by the test persons. An automated procedure for extracting objects in the environment that were fixated significantly more often than others would be a great step towards automated identification of salient objects from gaze data. In this way, the bias of experimenters in choosing areas of interest could be removed. This is another strong argument for GeoAI to create indoor wayfinding databases in order to fine-tune state-of-the-art image classifiers (see Sect. 3.2).

3.4 Real-Time Prediction of Assistance Needs

However, we can get an idea of the influence indoor environments have on eye movements by analysing fixations further. As stated in Sect. 3.3, the PS score showed much more variance than the visual salience ratings of the landmarks chosen for the routing instructions. For a deeper analysis of this behaviour, we aggregated fixations between two routing instructions and generated heatmaps for all aggregations. We then used the distribution of dwell time [10, p. 535] for each aggregated heatmap as a measure for the similarity of the gaze behaviour of our test persons. In this way, we could also take the spatial distribution of fixations into account, not only their overall frequency in the environment.

Fig. 8 Differences in dwell time between subsequent route segments

With the landmarks in the routing instructions carefully chosen as the objects rated best between two subsequent wayfinding decision points and satisfying established criteria for landmarks [22] better than other objects, we assumed that the viewing behaviour should follow a similar pattern for each instruction: read the instruction, identify the landmark in the environment, and continue walking. Consequently, the viewing behaviour in two subsequent route segments should be similar if the environment did not have a high impact on the gaze behaviour (e.g. by forcing the user to take a turn at a crossing or while climbing a staircase). We quantified this impact by calculating the normalized mean square error (NMSE) between the observed dwell time distribution and its estimation from the distribution of the preceding aggregation. The resulting NMSE values for our test route are presented in Fig. 8. From an inspection of the segments with high and low NMSE, we conclude that the NMSE is low if there is no change in direction between a segment and the subsequent one, while it otherwise tends to increase. The highest values, in segments 16, 17, and 20, were calculated in staircases, where persons have to change direction while going up the stairs and reorient themselves continuously.
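
A sketch of this segment-to-segment comparison, assuming dwell times have already been aggregated into a grid per route segment; the grid size, placeholder values, and the exact NMSE normalisation are illustrative assumptions:

import numpy as np

def dwell_distribution(dwell_time_grid):
    """Normalise a grid of dwell times (seconds per cell) to a distribution."""
    grid = np.asarray(dwell_time_grid, dtype=float)
    return grid / grid.sum()

def nmse(observed, predicted):
    """Normalized mean square error between two dwell time distributions."""
    observed, predicted = np.ravel(observed), np.ravel(predicted)
    return float(np.mean((observed - predicted) ** 2) / np.var(observed))

segment_prev = dwell_distribution([[0.2, 1.4], [3.1, 0.3]])   # placeholder dwell times
segment_next = dwell_distribution([[2.0, 0.1], [0.4, 1.5]])
print(nmse(segment_next, segment_prev))   # high value: gaze behaviour changed markedly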

4 Conclusions and Current Research Interests

In this paper, we presented the URWalking system, which assists users during indoor wayfinding. Usage data indicates that users appreciate indoor navigation even when indoor positioning is not available, as is the case in our web application. From think-aloud protocols we even know that no positioning at all is better than positioning with poor performance. In this sense, URWalking is innovative, as most other indoor navigation systems try to solve the positioning issue first. URWalking, instead, first serves to collect large data sets that we will leverage to improve indoor positioning algorithms in the future.

Another important issue that still waits for a better solution is how to make URWalking better understand users’ natural language descriptions of their destinations. Often, users do not know room identifiers. Our current implementation is capable of repairing minor spelling errors; however, in many cases users describe an event they want to attend, a person they want to meet, or a service of the university they want to use. As we cannot control user inputs and do not have a redundant mechanism for determining destinations, for most inquiries we lack reliable ground truth that we could use in machine learning approaches to improve the prediction accuracy for the destinations users request wayfinding information for.
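
The spelling repair can be approximated with standard fuzzy string matching; the sketch below uses Python’s difflib and made-up destination names and is not our actual implementation:

import difflib

known_destinations = ["PT 3.0.84", "Mensa", "Zentralbibliothek", "Audimax",
                      "Lehrstuhl fuer Informationswissenschaft"]

def resolve_destination(query, cutoff=0.6):
    """Return the best-matching known destination, or None if nothing is close enough."""
    matches = difflib.get_close_matches(query, known_destinations, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(resolve_destination("Zentralbiblothek"))   # minor spelling error is repaired
print(resolve_destination("talk by Prof. X"))    # event descriptions remain unresolved -> None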

Finally, from the analyses discussed in this project report we learn that real-time tracking of human wayfinding behaviour is a difficult task. It still needs progress in the GeoAI community on the construction of conceptual models, on their empirical evaluation, and on AI algorithms for analysing wayfinding behaviour at run-time in order to provide situation-specific assistance to users.

In order to move ahead, we are currently trying to find proxies for gaze data, which we can only collect in controlled experiments but not from users under naturalistic conditions. Therefore, it is one of the important issues on our research agenda to find out which interaction data can serve as proxies for human viewing behaviour. Beyond developing our models using data that we collect from URWalking users, we contribute to this field by sharing our log data with the community (see Footnote 4), by integrating available indoor tracking implementations in our application, and by comparing their performance. In this way, we hope that our system can inspire the community to address many of the issues that require solutions in order to develop better AI-based wayfinding aids.