1. Introduction
Extended Reality (XR), sometimes also referred to as media-generated reality [1], is a term that generalizes concepts such as Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) [2]. VR is the simulated experience of 3D virtual environments. A VR user is “isolated” to some extent from the surrounding ambient environment, and controls and interacts with computer-generated visual, auditory, and haptic components of a virtual environment, often in a first-person (endocentric) experience. Such experiences may entail the use of a head-mounted display (or HMD) [3]. In AR, a user is presented with simultaneous exposure to virtual elements and real-time perception of the physical environment, resulting in a “hybrid”, mediated experience [4]. MR-based applications may combine VR and AR techniques while including the anchoring of virtual elements within the real world [5].
Current applications of XR include gaming, audio-visual media, design, social media, training, and education. In the case of education, XR-based learning environments can surpass traditional instructional learning by enhancing, stimulating, or motivating student understanding [6]. A study involving an MR learning environment tested in a high school chemistry class showed that students can achieve significant learning gains when such educational tools are co-designed with educators [7]. Another study, based on Bloom’s cognitive complexity levels [8], evaluated the effectiveness of university students learning English as a second language via a 3D VR learning platform, and the results showed that the use of the tool assisted in the development of higher-level thinking [9]. Furthermore, Villena-Taranilla et al. observed that VR, especially immersive VR, promotes greater learning in comparison to control conditions in studies involving students in the Kindergarten to sixth-grade range [10].
One method that can potentially yield larger learning gains in XR-based learning is Action Observation (AO). AO refers to the process of learning or practicing a movement by observing and imitating another person’s or avatar’s actions. The act of observing movement can induce the same neural activity within the premotor, motor, and parietal cortices as when the movement is performed by oneself [11]. AO activates mirror neurons within the brain, which are inherent to neurocognitive functions relating to social behavior [12]. A complete neurological discussion of AO is outside the scope of this article, but interested readers are encouraged to consult the cited sources. In a study involving upper limb motor function measurements of children with cerebral palsy, a significant improvement was found in the functional score of a group before and after AO treatment relative to a control group [13]. Similarly, AO therapy with intensive repetitive practice of observed actions was found to provide a significant improvement of motor functions in chronic stroke patients [14].
In VR-based applications of AO, the actions of an exemplar, real or virtual, real-time or recorded, are often simultaneously displayed atop a learner’s avatar. Onebody is a system that allows students to receive remote real-time posture guidance, rendered in the first person in place of their body [15]. A study indicated that Onebody, relative to other modalities of movement instruction (including third-person VR), resulted in significantly higher matching accuracy for upper limb posture. In another study, tai chi moves were taught to subjects via either 2D video or a 3D immersive VR system, and subjects of the latter were demonstrated to have comparatively learned more [16]. In a further study involving subjects learning to use a prosthetic arm for the first time, it was found that subjects who were exposed to first-person VR-based AO were able to complete a bilateral manual task significantly faster than subjects who observed third-person VR-based AO or standard 2D video [17].
Conversely, the limitations of AO, including VR-based AO, have been noted. In the aforementioned prosthetic-limb training experiment, for example, results were less promising for the comparatively easier unimanual task of picking up and moving blocks using the prosthetic limb. Consequently, it was suggested that VR-based AO is more likely to exhibit efficacy in tasks involving relatively more complex coordination challenges. In a separate experiment involving a VR-based full-body-tracking tai chi training system [18], two immersion techniques that could be defined as examples of AO were significantly more difficult for subjects to follow. The average positional error of one of those AO-based modalities was statistically significantly higher relative to a condition based on a traditional teaching environment.
Virtual Co-embodiment (VC) is a relatively new concept and research field describing applications in which multiple users share control of a single avatar. In a recent study [19], a VR-based dual-task hand movement coaching application with a first-person perspective employed a VC mode that tracked and averaged two users’ hand positions to control a single avatar’s arms. Subjects who used this mode to practice showed improved motor skill learning efficiency relative to both a control group that practiced alone and a group that practiced via an AO mode. The AO mode, despite its subjects exhibiting higher learning than the control group, was thought to have a relatively low sense of agency, or the subjective feeling of initiating and controlling an action. Because of this, it is believed that AO-based practice, even first-person VR-based AO practice, does not excel at helping users retain learned motor skills. Results also alluded to the AO mode being less effective than VC in the short term. Similar conclusions were drawn when subjects learned movements via a first-person VR-based AO tool called Just Follow Me: subjects could accurately mimic movements immediately after learning, but the learning was not substantially retained.
As authors interested in exploring XR-based applications to teach drumming, we considered both AO and VC. While weighted-average-based VC had observable success in improving subjects’ ability to perform a dual task [19], this was for a relatively slow action with primarily smooth patterns of motion. Conversely, the trajectory of a drumstick in use is inherently full of sharp turns and direction changes that can occur almost instantaneously. If a similar VC approach were applied to a drumming tool, double hits in the performance audio or glitchy motions in the virtual scene seem unavoidable. Although some research has been conducted on musical applications of AO [20], studies entailing AO and drumming in particular are scarce. In a study involving subjects listening to music with groove, it was found that such music engages listeners’ motor system, an effect also induced by AO.
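The weighted-average scheme and its failure mode for percussive motion can be illustrated with a minimal sketch (the function and values are hypothetical illustrations, not taken from the cited system):

```python
def co_embodied(value_a, value_b, weight_a=0.5):
    """Blend two users' tracked hand states (position or velocity along
    one axis) into the shared avatar's state, as in weighted-average
    Virtual Co-embodiment."""
    return weight_a * value_a + (1.0 - weight_a) * value_b

# A drum stroke reverses direction almost instantaneously. If user A's
# stick tip is still descending (vertical velocity -1.0 m/s) while user
# B's has already rebounded (+1.0 m/s), the blended velocity collapses
# to zero: the avatar's stick hesitates at the drum head, which can
# surface as a double hit in the audio or a glitchy motion in the scene.
v_avatar = co_embodied(-1.0, 1.0)
```

For slow, smooth motions the two inputs rarely disagree this sharply, which is consistent with the smoother dual task for which the scheme succeeded.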
Conversely, VR-based musical tools are not uncommon, and past studies have suggested such tools have the potential both to support musical rehearsal [21] and to improve motor function and reduce reported feelings of anxiety in chronic stroke patients [22]. Despite the existence of VR-based drumming applications [23], research on their effectiveness is underrepresented. In the authors’ previous work, a VR-based system was designed to teach drumming exercises through first-person interaction with an exemplar’s demonstrations, with inconclusive results [24].
1.1. Research Question
As VR-based and, more generally, XR-based AO research has shown both positive and negative effects on subject learning, our research question is as follows: Do the affordances of a first-person MR-based AO tool designed to help non-musicians practice drumming result in different levels of improvement relative to simply practicing with first-person video demonstrations? Because such AO tools improve the learning of novices relative to a control group in some cases [15,17] and impair such learning in others [18], the goal of this work is to compare the improvement of novice drummers learning rhythms via XR-based AO against the improvement of drummers learning via video. Conclusions drawn from this comparison may strengthen understanding of AO-related concepts and influence the methods and tools for teaching music as a motor skill.
1.2. Background
To better contextualize our research, we review several related terms.
1.2.1. Video See-Through
Just as HMDs can be used for VR applications, they can also be used to enhance real environments with virtual elements in MR experiences. Video See-Through (or Passthrough, as the feature is referred to by Meta [25]) visually displays one’s ambient real environment via video feed. Although both VR and MR technology can be used effectively in fields such as education [6], in certain applications, users may be tasked with interacting with virtual components and real objects simultaneously. MR, and specifically video see-through capabilities, may help provide such experiences seamlessly. Potentially due in part to such capabilities and the ubiquity of MR technology, the global MR market size was valued at about 811 billion USD in 2021 and is projected to grow to 19,489 billion USD by 2030 [26].
1.2.2. Haptic Displays
The main functions of a haptic device include actuation, the display of forces from the virtual environment via actuators contacting a user’s body, and sometimes sensing, the tracking of movement or force of a user to control a virtual avatar [
27]. Current handheld haptic devices in
vr applications support large-scale body movement, are easy to use and put down, and provide vibrotactile feedback [
27]. Such a display can be controlled through the playback of audio signals or direct programming of signal frequencies, amplitudes, and durations. Common use cases of haptics include using vibration to reinforce a player’s actions and signify danger or urgency within the game.
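Direct programming of a vibrotactile pulse reduces to specifying a frequency, an amplitude, and a duration, which can then be rendered as a sample buffer for the actuator. The sketch below is illustrative only (the function and parameter names are our own, not any particular SDK’s API):

```python
import math

def vibration_samples(freq_hz, amplitude, duration_s, sample_rate=8000):
    """Render a vibrotactile pulse as a sine burst: one sample per tick
    of the actuator's drive clock, scaled to the requested amplitude."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# A 100 ms burst at 170 Hz and half amplitude: a typical short game cue.
pulse = vibration_samples(170.0, 0.5, 0.1)
```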
1.3. Relevant Rhythmic Terms and Musical Practices and Technologies
1.3.1. Rudiments
Rudiments, two examples of which are shown in Figure 1a, are widely agreed-upon exercises deemed essential for drummers to practice [28]. These exercises may help musicians practice rhythm, dynamics, or sticking, the concept of “assigning” certain notes of an exercise to particular hands to increase fluidity. Rudiments help drummers hone important aspects of playing, such as control, coordination, and endurance. Because achieving complete mastery of a rudiment is an ongoing pursuit, drummers regularly practice rudiments even after achieving high proficiency.
1.3.2. Polyrhythms
Polyrhythms, in contrast, are not usually considered essential for beginner drummers’ practice routines, and are often regarded as a concept that requires substantial practice [29]. They are made up of simultaneously expressed musical lines based on different but mathematically related tempos [30]. A polyrhythm can be broken down into elements known as a basic pulse and counterrhythm(s) [31]. Although multiple inherent tempos can be realized through deep listening, Western notation is able to notate most polyrhythms in a single stave based on a single tempo, as shown in Figure 1b. Despite this, players often have difficulty learning to play, or even properly “feel”, polyrhythms. Apart from studying with a teacher, two common techniques for learning polyrhythms are subdividing the measure into the least common multiple of the divisions of the primary pulse and counterrhythm, and uttering mnemonic phrases that reflect the cadence of a polyrhythm [32]. Certain cultures, such as some in West Africa, pass down musical tradition and polyrhythms entirely through oral transmission [33].
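The subdivision technique can be made concrete: dividing one measure into lcm(pulse, counter) slots places every onset of both voices on a common grid. The sketch below is an illustration of that arithmetic, not code from any cited source:

```python
from math import lcm

def polyrhythm_grid(pulse, counter):
    """Subdivide one measure into lcm(pulse, counter) slots and mark
    which voice sounds on each slot: P = basic pulse onset only,
    C = counterrhythm onset only, B = both, . = neither."""
    n = lcm(pulse, counter)
    grid = []
    for slot in range(n):
        p = slot % (n // pulse) == 0    # pulse onsets, evenly spaced
        c = slot % (n // counter) == 0  # counterrhythm onsets
        grid.append("B" if p and c else "P" if p else "C" if c else ".")
    return "".join(grid)

# 3:2 resolves onto six slots; the voices coincide only on the downbeat.
grid_3_2 = polyrhythm_grid(3, 2)  # "B.PCP."
```

Reading the composite grid aloud slot by slot is effectively what the mnemonic-phrase technique encodes.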
2. Materials and Methods
2.1. Apparatus
Our system uses a Meta Quest 2 (Meta Platforms, Menlo Park, CA, USA) HMD connected via Quest Link to a workstation running the Unity 2023.1.0b6 game engine. The frontal cameras of the HMD are used for Meta Passthrough, displaying a real-time, gray-scale, stereoscopic view to the user. Within this view, users can see and strike Roland PD-7 electronic drum pads with a pair of standard-sized drumsticks. The Unity project augments the user’s view with virtual content, including 3D models of drums [34] and drumsticks, for observation and interaction. These accompany their physical counterparts placed within the user’s reach. Atop the Meta Quest 2 HMD, the user also wears HD380 Pro headphones (Sennheiser, Wedemark, Germany) to aurally monitor both the exemplar performance of the virtual drumming and the user’s own performance via a TD-25 drum sound module (Roland, Hamamatsu, Japan).
The audio of the virtual exemplar demonstration used a “floor tom” sound for the right virtual drum and a “snares-off” snare sound for the left virtual drum. This pairing was chosen for both auditory and practical reasons: the floor tom and snare occupy significantly different frequency ranges, and the set-up used in the experiment was similar to common placements of a snare and floor tom pair. Subjects monitored their own performance on the drum pads as a wood block sound, chosen for its transient-like sonic qualities. For the two rudiment exercises, the two pads shared the same sound; for the polyrhythmic exercises, the sound of the left pad was relatively higher in pitch than the right, because when practicing polyrhythms it is often recommended to use two different-sounding instruments or timbres to differentiate between hands [31].
The architecture of the full system, as seen in Figure 2, shows the user’s means of multimodal interaction with the system. The drum pads, played by the user’s hands (which receive vibrations from the controllers), send MIDI signals that Unity interprets in real-time, influencing the MR scene.
The virtual drums and sticks are instantiated as prefabs. They can be moved and re-instantiated about the scene to best fit the layout of the physical pads, improving the integration of the MR experience. The virtual drumstick pair uses keyframe-based programmed animations to position the sticks and strike the virtual drums, playing a variety of rhythms that the user is expected to observe and practice along with.
Real-time MIDI capability in Unity is handled by the MIDI Player Tool Kit Pro asset [35]. MIDI data are received from the TD-25 and used for concurrent feedback purposes. As a way of encouraging rhythmic adjustment while a user practices, their timing is compared against the timing of an exemplar. This timing difference is used to give positive or negative feedback depending on whether or not it falls within a predetermined window of 50 ms. This threshold was chosen because the precedence effect occurs for delay times between 2 and 50 ms [36].
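The feedback rule just described amounts to a single comparison per stroke. The actual system is implemented in Unity, so the Python sketch below (with a hypothetical function name) is only an illustration of the logic:

```python
def timing_feedback(user_time_ms, exemplar_time_ms, window_ms=50.0):
    """Compare a user's stroke time against the exemplar's ideal stroke.

    Returns (within_window, signed_error_ms): feedback is positive when
    the absolute difference falls inside the window (50 ms by default,
    matching the precedence-effect bound); the signed error could also
    drive "early"/"late" hints.
    """
    error = user_time_ms - exemplar_time_ms
    return abs(error) <= window_ms, error

ok, err = timing_feedback(1030.0, 1000.0)  # 30 ms late: within window
```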
In addition to a pair of standard drumsticks, users also held a Meta Quest 2 controller in each hand while practicing, as shown in Figure 3. This grasp is based on matched grip and allows the user to relatively easily hold the controller between the fingers that are not part of the drumstick fulcrum. Haptic feedback, expressed via the controllers, is synced with the virtual drumsticks’ exemplar animations and reinforces the timing and sticking of the rhythmic exercises. For every right or left stroke, the corresponding controller vibrated for 200 ms at half of its maximum amplitude, starting at the exact time of the ideal stroke.
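The vibration cues can be thought of as a schedule derived from the exemplar's stroke list: one event per stroke, routed to the matching hand's controller. The 200 ms duration and half amplitude mirror the values stated above; the event format itself is an illustrative stand-in, not the Meta SDK's API:

```python
def haptic_schedule(stroke_times_ms, stickings,
                    duration_ms=200, amplitude=0.5):
    """Build per-controller vibration events from an exemplar's strokes.

    Each stroke triggers the matching hand's controller at the ideal
    stroke time, for a fixed duration and amplitude."""
    return [{"hand": hand, "start_ms": t,
             "duration_ms": duration_ms, "amplitude": amplitude}
            for t, hand in zip(stroke_times_ms, stickings)]

# Alternating eighth notes at eighth note = 120 bpm: one stroke every
# 500 ms, hands alternating R, L, R, L.
events = haptic_schedule([0, 500, 1000, 1500], ["R", "L", "R", "L"])
```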
For portions of the experiment involving video demonstrations, a Dell E2010H monitor was used. Stereo loudspeakers were used for initial demonstrations, whereas the headphones were used elsewhere.
2.2. Participants
Twenty subjects (15 male, 5 female), students recruited from the University of Aizu, participated in this study (age: years). The experiment was conducted following the ethical guidelines of the University of Aizu. Before the start of the experiment, subjects self-reported their age, hearing issues, dominant hand, whether or not they were able to read Western musical notation, and any prior musical experience. As there was an aim to recruit primarily non-musicians and novices, 55% of subjects had no prior experience with musical instrument practice. Reports from subjects with musical experience ranged from having played bass guitar for a lifetime total of ten hours to having practiced trumpet for two years at the age of 10. No subjects reported hearing issues, and 95% were right-handed. Subjects were given information and instructions via a script read aloud before being given an informed consent form to sign. Ten subjects were randomly placed in the control group and ten in the experimental group.
2.3. Procedure
One of the authors met with each subject individually and facilitated the experiment. The procedure consisted of five sections in total: a tutorial section, two sections to teach two drumming rudiments, and two sections to teach two polyrhythms. The goal of each section was to teach and potentially improve a subject’s skill at the execution of a rhythmic exercise. The tutorial section was meant only to familiarize subjects with the flow of the experiment and their assigned modality for practice, and taught a simple exercise consisting of eighth notes played with alternating hands. After the tutorial, subjects were asked if they wanted to make any adjustments to their set-up, including headphone volume and seat or drum pad height. After this, no further adjustments were made to the apparatus during the experiment.
The following four sections sequentially consisted of doubles and paradiddles, the two rudiments shown in Figure 1a, and the 3:2 and 3:4 polyrhythms shown in Figure 1b. This succession was decided based on ascending rhythmic complexity. All five sections’ exercises were played at a tempo of eighth note = 120 beats per minute (bpm). Subject performances during these sections, or experimental blocks, were recorded as quantitative data. The experiment took about 35–45 min for subjects in the control group and 45–55 min for subjects in the experimental group. The difference in length was due to preparations of the MR apparatus that pertained only to the experimental group.
Each of the five aforementioned sections consisted of the same four-phase process to help each subject learn the rhythmic exercise corresponding to that section, as shown in Figure 4. In the first phase, subjects were shown a video demonstration of the rhythm via a computer monitor and stereo loudspeakers. The video was shot in third-person perspective via a camera placed behind and above the drummer’s performance, as shown in Figure 5. The video started with the in-tempo clicking of a metronome for one measure before the demonstration of the exercise began. The exercise was played for four measures, resulting in a 22-second-long video. The audio of each of the videos in this first phase was quantized and panned, to rectify timing imprecision and to achieve stereo separation, respectively.
After observing the video demonstration of the first phase, subjects were asked to complete a baseline recording of the observed rhythm for phase 2. These recordings for each exercise are also referred to as trial 1 recordings. Subjects were given headphones through which a click track was played for them to drum along with. They were asked to recall the demonstration to the best of their abilities and try to mimic the rhythm just observed.
After the baseline recording, subjects were asked to complete a short practice session for phase 3. This portion depended on the subject’s assigned mode of practice: Video or MR. Subjects practicing via standard video in the control group were asked to play along while watching a video shot in first-person, shown in Figure 6a. Subjects in the MR group were helped with the set-up and fitting of the HMD and headphones. Subjects then placed their hands on the drum pads as the virtual drum and drumstick objects were instantiated into the scene based on hand-tracking. After this set-up, subjects grabbed the Meta Quest 2 controllers with some combination of their ring, index, and pinky fingers, and the MR scene (shown in Figure 6b) was launched. In addition to a monochrome representation of their surroundings, a sticking diagram, the same as that used for the video-based practice content, was displayed on a virtual screen in front of the subject. Within the scene, subjects were asked to focus on and try to mimic the movements of the virtual drumsticks playing virtual drums atop the physical pads. Vibrational cues expressed via the controllers also reinforced the pattern of each exercise’s rhythm and sticking. Subjects of both groups used headphones in this phase. Each practice session, regardless of rhythmic exercise or subject group, consisted of four repetitions of the exercise, a duration of about two and a half minutes.
The fourth and final phase of each section was a final recording of the rhythmic exercise. Once again, subjects wore headphones and listened to a metronome while recording a four-measure performance of the previously practiced exercise. The data from these recordings are referred to as trial 2.
There was no verbal instruction to subjects within the practice media prepared for the Video or MR groups. However, while control group subjects were getting situated with headphones and MR group subjects with the MR apparatus, there was some communication between the subject and the author facilitating the experiment. After finishing the four phases, participants volunteered their thoughts about the whole experience. These data were not analyzed.
3. Results
Two metrics were used to analyze the data: the difference in the maximum number of consecutive correct strokes between the post-practice recording (trial 2) and the pre-practice recording (trial 1), and the absolute timing error of each recording’s strikes. For the second metric, we compared the timing and sticking of each stroke against an ideal performance. These data were analyzed through a series of linear mixed models in R [37], eased by the library lme4 [38]. The goodness of fit of the final models was confirmed with diagnostics available in the DHARMa library [39]. Starting with a simple model that included the random effect of each participant, we added potentially relevant factors (group and block) and compared the nested models via ANOVA. When necessary, non-nested models were compared using the Bayesian Information Criterion (BIC).
A correct hit was defined as one that used the intended hand and occurred within ms of an ideal stroke. We subtracted the number of correct hits of the trial 1 recording from that of the trial 2 recording for both groups to gauge performance improvement.
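A minimal sketch of the consecutive-correct-stroke metric follows. It is a simplification (our own illustrative code, not the analysis scripts): it assumes played strokes are already aligned one-to-one with ideal strokes, and it parameterizes the timing window, defaulting to the 50 ms window used by the system's concurrent feedback:

```python
def longest_correct_run(strokes, ideal, window_ms=50.0):
    """Maximum number of consecutive correct strokes in one trial.

    A stroke is correct when it uses the intended hand and lands within
    the timing window of its ideal counterpart; `strokes` and `ideal`
    are matched lists of (time_ms, hand) pairs.
    """
    best = run = 0
    for (t, hand), (t_ideal, hand_ideal) in zip(strokes, ideal):
        if hand == hand_ideal and abs(t - t_ideal) <= window_ms:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

ideal = [(0, "R"), (500, "L"), (1000, "R"), (1500, "L")]
played = [(20, "R"), (480, "L"), (1100, "R"), (1490, "L")]
# The third stroke is 100 ms late and breaks the run: the maximum is 2.
```

The per-exercise improvement metric is then this value for trial 2 minus the value for trial 1.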
We found no significant effect of group [ ] or block [ ] on the number of correct consecutive strokes, as shown in Figure 7.
For the absolute timing error, we only analyzed the data when the hand used by the participant corresponded to the intended hand. This was the case % of the time. We found significant effects on the absolute timing error of the interactions between Block and Trial [ ], Block and Group [ ], and Trial and Group [ ]. Post hoc analyses based on Tukey’s honest significant difference between estimated least-square means were computed for the significant interactions with the library means.
As illustrated in Figure 8, all the blocks yielded significant differences within the same trial, except in the case of Doubles–Paradiddles, with an estimated difference of ms [z-ratio ]. Differences across trials for a given block are summarized in Table 1. According to this table, both the Video and MR treatments improved the timing with which participants could perform the exercises for all the blocks except “Doubles”. In this case, we observed a floor effect, suggesting that participants found it too easy to perform with correct timing.
The effect of the interaction between block and group on the absolute timing error is illustrated in Figure 9. Within the same group, the absolute timing errors between blocks were significantly different, except between Doubles and Paradiddles in the MR group (difference = ms, z-ratio ). No significant differences between the two groups were found in any of the blocks, as summarized in Table 2.
Perhaps the interaction between trial and group is the most interesting for the purposes of our study. These results are shown in Figure 10 and summarized in Table 3. According to these, there are no significant differences between groups for a given trial; however, the absolute timing errors were significantly lower in the second trial relative to the first. Crucially, the absolute timing error difference between the first and second trials for the MR group was 239 ms larger than that of the Video group (z-ratio ). The latter finding indicates that participants benefited more from the MR treatment than the Video treatment.
4. Discussion
The displayed results, particularly the interaction between trial and group shown in Figure 10, suggest there is a timing error-related benefit for novices employing at least one of the affordances of the multimodal MR tool for practice involving rudiments and polyrhythms. The difference of 239 ms in the improvement of absolute timing error between groups is statistically significant. Moreover, it signifies that subjects practicing via MR achieved twice as great a reduction in timing error as subjects practicing via video. In time-sensitive activities such as rhythm, this reflects a meaningful difference. Based on the experimental findings, we suspect the overlaying of the exemplar avatar within the MR scene helped users perform with more accuracy. We also believe the multimodal expression via vibration in the controllers helped subjects internalize the rhythm during practice.
At first, it seemed contradictory that MR-based AO had no observable effect on the improvement of subject performance with respect to the maximum number of consecutive correct strikes. Our current interpretation relates to the scale of novice drummers’ timing inaccuracies. While MR-based practice may have helped subjects improve their timing error in ways video practice did not, the improvement was not enough to have a noticeable effect on the maximum number of consecutive correct strikes. It seems possible that such subjects’ timing improved, but not to the point where a correct hit was detected using a window of 50 ms. Using a wider window (for example, 75 ms) for correct-strike detection in a reanalysis of the data could yield observable differences between the two methods; however, 50 ms is considered the point at which two auditory events start to be perceived independently rather than fused, as per the precedence effect.
The aim to recruit novice musicians, and thereby avoid ceiling effects from experienced drummers playing exercises they might already be familiar with, was mostly fulfilled. This also helped equalize the initial capabilities of subjects. As shown in Figure 10, the absolute timing errors of both groups before the practice sessions were at comparable levels. However, this potentially results in the study not being representative of more experienced drummers and musicians.
As expected, subjects were generally not able to perform the two polyrhythmic exercises as proficiently as the rudiments. Surprisingly, two subjects (both of the control group) were able to perform all rudiments and polyrhythms correctly in both the phase 2 and phase 4 recordings. Another relatively rare case was a subject performing all exercises except the 3:4 polyrhythm correctly for both the pre- and post-recordings, which was observed twice in the MR group and once in the control group.
4.1. Limitations of the Study
The experiment was limited in that only 20 subjects could be recruited. Whether our findings also apply to a general population needs to be validated with further studies involving more subjects.
Another limitation was the diversity of subjects’ language abilities. Native Japanese, English, and Chinese speakers were recruited for the experiment, but instructions were given in English and Japanese only. Despite the care taken by the authors to prepare scripts in both languages, there was potential for some shortcomings in understanding due to subjects’ backgrounds.
The experiment was also limited to four rhythmic exercises. While the two rudiments, doubles and paradiddles, are almost universally used across many genres and percussion instruments, polyrhythms are less common and often considered more advanced.
In the same vein, whether the benefits observed in our experiment persist over time and not just in a short period needs to be investigated in longitudinal studies.
4.2. Future Lines of Research
4.2.1. MR Virtual Co-embodiment Extension
It is of interest to extend the current MR experience to incorporate VC techniques. This may include a weighted-average-based VC application for drumming with brushes, implements used instead of drumsticks that allow drummers to express rhythm via smooth lateral movement as opposed to striking.
4.2.2. Exploration of Multimodal Interfaces
The haptic interface used in this study could be considered a limitation, as it may have weighed down the MR-group subjects’ hands during practice sessions. A lighter, more unobtrusive solution would be easier to implement, and it is of interest to conduct a study comparing subjects using different interfaces for haptic feedback. In addition, experimenting with the qualities of the programmed vibration, including frequency, duration, and intensity, may yield interesting findings.
4.2.3. Extensions for More Drums and Other Instruments
We are also interested in extending development to explore training for a full drum set. This would include triggers on foot-operated pedals in addition to the hands. Beyond that, piano and keyboard instruments (which involve coordination of fingers as opposed to wrists and feet) as well as brush technique on snare drum may also be effective extensions of this application. Significant pedagogical benefits may be found with training for brushes in particular, as such technique is focused on continuous positioning of the hands, as opposed to just the moments a drum is to be struck.
4.2.4. Integration of HMDs with Higher Specifications
Extending the experience with a Meta Quest 3 or Apple Vision Pro HMD is an obvious next step. Due to their higher display resolution, wider field of view, and full-color video see-through capabilities, utilizing such devices would increase immersion for users of this system [40,41].