Introduction

Database technology is nowadays an essential part of everyday life. The constantly changing requirements for storage systems [1] were only one of the important building blocks in the development of this technology. In the field of libraries in particular, the requirement of accessing “all the world’s literature from a single computer terminal” was already a topic of discussion in the late 1960s, before the internet was invented [2]. At that time, the first computerized literature databases such as MEDLARS emerged [3]. With the rise of the internet, the number of special-purpose database systems grew rapidly. Access to scientific literature became easier, and the first specialist libraries, e.g., for the field of nuclear magnetic resonance [4], emerged as well. Such specialist literature databases have also been created in the field of complementary and alternative medicine (CAM). A 2009 review counted a total of 45 online accessible databases covering various topics such as phytotherapy, traditional Chinese medicine, or music therapy [5].

One of these databases is CAMbase. The Chair of Medical Theory and Complementary Medicine at the Witten/Herdecke University initiated the first version in 1998, enabling users to easily find relevant scientific literature on CAM. In 2007, CAMbase v2.0 was released; it was implemented using Extensible Markup Language (XML) protocols and interfaces, in accordance with the requirements of the Open Archives Initiative [6]. Most importantly, CAMbase v2.0 was equipped with a semantic-syntactic search algorithm that deconstructs a search query into linguistic (i.e., semantic and grammatical) parts and then transmits the relevant documents in XML-packaged form [7,8,9]. At the time CAMbase v2.0 was released, this proprietary decomposition of search queries was much more detailed than that of common stemming algorithms. Besides forming the word stem, the word order, the word ending, and even umlauts, which are specific to the German language, were used for searching the index and calculating the relevance of documents. Page numbers and stop words such as "the", "on", or "and" were removed from the search query in advance, whereas the Boolean operators between the words were still taken into account [7, 8]. Because the linguistic parts were recognized, even relatively similar search queries (e.g., “treatment of hospital patients” and “treatment of patients in hospitals”) ultimately resulted in different documents. In sum, CAMbase v2.0 was at the cutting edge of information technology for a specialist literature database at the time of its development.
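
To illustrate the kind of query decomposition described above, the following minimal Python sketch drops stop words while retaining Boolean operators. It is only a schematic approximation under simplified assumptions; the stemming, word order, and umlaut handling of the proprietary algorithm are not reproduced here.

```python
# Schematic illustration of the described query decomposition; the actual
# algorithm of CAMbase v2.0 was proprietary and is not reproduced here.
STOP_WORDS = {"the", "on", "and", "of", "in"}   # example stop words only
OPERATORS = {"AND", "OR", "NOT"}                # Boolean operators are kept

def decompose(query: str) -> list[str]:
    """Drop stop words from a query but keep Boolean operators between terms."""
    return [word for word in query.split()
            if word in OPERATORS or word.lower() not in STOP_WORDS]

print(decompose("treatment of patients in hospitals"))
# -> ['treatment', 'patients', 'hospitals']
```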

As already noted in the early paper by Barraclough [2], and contrary to common assumptions, hosting an online literature database is not an easy task in many respects. In addition to functionality, security must be ensured for the underlying hardware and software. The longer an operating system (OS) has been in use, the more likely it is that existing vulnerabilities will be found, as evidenced by the numerous reported vulnerabilities that can pose substantial risks [10, 11]. On the other hand, there is also more time to develop patches or newer OS versions that are distributed without these vulnerabilities. A decrease in vulnerability risk was demonstrated, for example, using a mean risk factor calculation for the three versions Microsoft Windows 7, 8, and 10 [12].

CAMbase v2.0 has been running on the same 32-bit OS since its release. As the database can be accessed publicly, it has been at risk of being attacked, e.g., by denial-of-service (DoS) attacks [13] or intrusions with mostly malicious and unethical intentions [14, 15]. Among the wide spectrum of measures for securing an OS, even the mere selection of the right one has been shown to improve an intrusion-tolerant system [16]. All these facts led to the need to migrate CAMbase v2.0 to a modern 64-bit OS.

An essential requirement for the migration was to preserve the previous data. According to Haynes [17], it is important for patient care to follow the current best evidence-based medicine. Even though CAM pursues the approach of addressing the evidence while emphasizing the patients and their relationship with the practitioner, the evidence remains limited [18], which may in turn limit patient care and underlines the need for such data.

Apart from the preservation of the previous data, a further requirement concerns the effort that must be invested in the migration. To preserve user-friendliness, the key components of CAMbase v2.0 should be migrated without major changes. These components include the established graphical user interface (GUI) and the approach of generating both the website and the retrieved documents on the client side using Extensible Stylesheet Language Transformations (XSLT) and XML protocols [6, 8].

This technical report describes the migration process of CAMbase, the challenges that had to be overcome, and a final evaluation of the system. More precisely, the chapter “Migration Process” gives an overview of the initial system architecture, outlines the issues with a pure migration, and justifies the replacement of an important component of the system architecture, namely the semantic-syntactic algorithm, with a current search engine that uses a score ranking algorithm (final development version: CAMbase v3.0). Then, the chapter “Comparison” presents the pre-trained language models and the statistical analysis that largely incorporates these models. This is followed by the chapter “Results”, in which it is analyzed whether the new retrieval processes affect the performance of the system in terms of speed, accuracy, and reliability. The basic assumption is that current search engines can keep up with the semantic algorithm of 2007 despite using a different approach. The following research questions are considered: First, does the system retrieve the same search results after the change to a score ranking algorithm as it did before? And second, to what extent is the performance of the system affected by this change? The “Lessons Learned” chapter revisits the challenges of this migration and the approach that was used to solve them. The advantages and disadvantages of the approach are also highlighted there. The final chapter rounds off this technical report with some concluding remarks.

Migration Process

System Architecture

The architecture of CAMbase v2.0 follows the layered architecture pattern for smaller applications [19]. CAMbase v2.0 has three main layers with specific roles and responsibilities, namely GUI, business logic, and database (architecture on the left side of Fig. 1). The GUI layer handles the graphical presentation, which takes place on the client side after the data (i.e., web elements as well as literature from the database) has been transmitted by the server in XML protocols. The business logic layer, where the semantic-syntactic algorithm is located, is responsible for processing user input, search queries, and data retrieval. The last layer is the database, which at the time of the migration contained 115,355 entries (e.g., books, case reports, clinical studies, or experimental work) dating from 1906 onwards. This three-layered architecture is intended to make it simple to develop a new database by replacing just one of the layers.

Fig. 1

Model of the system architecture before (CAMbase v2.0) and after the migration (CAMbase v3.0) taken from [20]

As already stated, CAMbase v2.0 has been running on the same 32-bit OS since its release. When migrating CAMbase v2.0 to a 64-bit OS, multiple errors occurred (e.g., missing libraries, missing literature references, overlapping GUI elements, or misinterpreted search queries). Although the pre-defined requirements seemed to be satisfied, the technical differences of the new architecture had to be taken into account carefully. Otherwise, such a migration can even result in software vulnerabilities if the intricacies of the architecture are not considered, which shows the complexity of this process [21, 22]. Since only binary files and no source code of the legacy search algorithm were available, a complete inspection or replication was not possible. In order to maintain user acceptance, and thus the online literature database itself, the complete search algorithm had to be replaced.

Search Engine Solution

Over the years, and especially since the release of CAMbase v2.0, search engines have improved significantly through the comparison and development of different indexing approaches [23]. In addition, research in this area has compared search engines and rated their search capabilities and functionalities in order to find the most relevant documents [24,25,26,27]. Therefore, the legacy search algorithm was replaced with the search engine Apache Solr. This fast and popular search engine is based on Apache Lucene, which itself has powerful indexing capabilities and supports many search features [28, 29]. A comparison with Xapian already showed Solr’s good performance in searching for the most important documents [30]. Some of the key features of Solr are that it is ready to deploy, open source, and centrally configurable, and that it allows full-text search and scalable search across multiple servers [29, 31]. In a search, the relevance of a document determines its position among the retrieved documents. For this purpose, a relevance factor is calculated for each document, which takes into account, among other things, the frequency of query words within the document or a specific relevance-increasing boost [32]. This is by no means a semantic interpretation, but Solr's configuration options provide plenty of scope for more specific searches. For example, queries can be split into alphanumeric tokens or modified by adding stemming or phonetic algorithms.
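
As an illustration of such score-based retrieval, the following Python sketch queries Solr’s HTTP API and returns documents together with their computed relevance scores; a boost on the title field is included as an example. The core name cambase and the field names title and abstract are purely illustrative assumptions and not the actual CAMbase v3.0 configuration.

```python
import requests

# Illustrative assumptions: the core name "cambase" and the fields "title" and
# "abstract" are placeholders, not the actual CAMbase v3.0 schema.
SOLR_SELECT = "http://localhost:8983/solr/cambase/select"

def search(query: str, rows: int = 10) -> list[dict]:
    """Run a relevance-ranked full-text search against Solr."""
    params = {
        "q": query,
        "defType": "edismax",      # extended DisMax query parser
        "qf": "title^2 abstract",  # boost matches in the title field
        "fl": "id,title,score",    # also return the computed relevance score
        "rows": rows,
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["response"]["docs"]

for doc in search("massage therapy"):
    print(f'{doc["score"]:.3f}  {doc.get("title")}')
```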

System Architecture Adjustment

On the 64-bit system, all files related to the legacy algorithm were removed, which also removed the entire business logic layer from the three-layered architecture. Afterwards, Solr (version 8.9.0) was installed. Various routines and preparations (e.g., the definition of fields and user roles) followed before the cleaned data were imported. Cleaned here means, for example, that duplicates were removed and document types were unified. Since Solr is not intended to be used as a stand-alone system, another layer was implemented to allow the communication between the GUI and Solr (architecture on the right side of Fig. 1). For this purpose, a PHP: Hypertext Preprocessor (PHP) script supported by the PHP Extension Community Library (PECL) was used, so that this layer can parse user input into Solr-understandable queries and return the retrieved documents to the users.
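
The following minimal Python sketch indicates what such a cleaning step could look like under simple assumptions (duplicate detection by normalized title, unification of type labels); the actual record structure and cleaning rules of CAMbase are not reproduced here.

```python
# Illustrative sketch of the cleaning step; the record structure and the type
# labels are hypothetical and do not reflect the actual CAMbase data model.
raw_records = [
    {"title": "Mistletoe therapy in oncology.", "type": "clinical study"},
    {"title": "Mistletoe Therapy in Oncology",  "type": "Clinical study"},
    {"title": "Acupuncture for chronic pain",   "type": "Clinical Study"},
]

TYPE_MAP = {"clinical study": "Clinical Study"}  # unify spelling variants

def title_key(title: str) -> str:
    """Normalization key for duplicate detection: lower case, no trailing dot."""
    return title.strip().rstrip(".").lower()

cleaned, seen = [], set()
for record in raw_records:
    key = title_key(record["title"])
    if key in seen:
        continue  # drop duplicate entries
    seen.add(key)
    record["type"] = TYPE_MAP.get(record["type"].lower(), record["type"])
    cleaned.append(record)

print(len(cleaned), "records ready for import into Solr")
```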

The next stage was to approximate the syntactic search of CAMbase v2.0. Solr offers many ways to narrow down a search. In the end, a light stemming method was embedded to also search for slightly varying word forms, similar to what the legacy algorithm did. Converting the words (in both index and query) to lower case further extends the reach in finding relevant documents. In contrast to the legacy algorithm, the words of a query are handled independently by separating them at blanks. The words still had to be joined with the proper Boolean operator to achieve a qualitatively high approximation. Users themselves can nevertheless narrow down the query using the special query characters supported by Solr. A list of synonyms was not used, as this would have been too time-consuming for the migration, although it could lead to a more accurate approximation.
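
A minimal Python sketch of how user input could be turned into a query string along these lines is shown below; it only illustrates the lower-casing and the joining with a Boolean operator, not the actual PHP middleware, and it assumes that stemming is handled by Solr’s analysis chain.

```python
def build_solr_query(user_input: str, operator: str = "AND") -> str:
    """Lower-case the input, split it at blanks, and join the terms with a
    Boolean operator, approximating the behaviour described for CAMbase v3.0."""
    terms = [term for term in user_input.lower().split() if term]
    return f" {operator} ".join(terms)

print(build_solr_query("Treatment of hospital patients"))
# -> "treatment AND of AND hospital AND patients"
```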

Comparison

Pre-Trained Language Model

To compare CAMbase v2.0 and CAMbase v3.0, the title of each retrieved document is used for a semantic comparison with two pre-trained language models.

Pre-trained language models can be considered state of the art in natural language processing and semantic text similarity detection. The models Bidirectional Encoder Representations from Transformers (BERT), A Lite BERT (ALBERT), and Embeddings from Language Models (ELMo) are impressive examples of this, especially when they are fine-tuned [33,34,35]. There are already numerous optimization approaches in the literature [33, 36,37,38]. The approach presented here focuses solely on Sentence-BERT (SBERT).

SBERT is based on BERT; it maintains the accuracy of BERT while reducing the effort, and thus improving the performance, needed to compare large numbers of literature titles. In order to compare those titles, SBERT uses siamese and triplet network structures to derive semantically meaningful sentence embeddings [37]. A comparison is then made by taking the cosine similarity of the two embedding vectors A and B of a sentence pair as the similarity score [39], according to the formula:

$$\cos \theta = \frac{\sum_{i = 1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i = 1}^{n} A_{i}^{2}} \, \sqrt{\sum_{i = 1}^{n} B_{i}^{2}}}; \quad A = \left( A_{1}, \ldots, A_{n} \right),\; B = \left( B_{1}, \ldots, B_{n} \right)$$

Technically, the values of this approach range between -1 and 1 (see Fig. 2) and can thus be interpreted like correlation coefficients: values close or equal to 1 indicate a high correlation, while values close or equal to -1 also indicate a high correlation, but in the opposite direction. However, values lower than 0 are not expected, so only values between 0 and 1 are considered in the following.

Fig. 2

SBERT approach to compute similarity scores in accordance with [37]

A general-purpose model (all-MiniLM-L6-v2) was selected in a previous comparison [20]. As the name suggests, this model can be applied to many use cases, e.g., comparing literature titles retrieved from two database systems. However, it was trained only on English data. This still allows an overall comparison of all provided document titles but does not take other languages into account. Therefore, this search reliability comparison additionally refers to a multi-language model (paraphrase-multilingual-MiniLM-L12-v2), which takes translations from other languages into account.
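
The following Python sketch, assuming the sentence-transformers package, shows how a similarity score between two document titles could be computed with the two models mentioned above; the titles are illustrative and not taken from CAMbase.

```python
from sentence_transformers import SentenceTransformer, util

# The two pre-trained models used in the comparison.
MODELS = ["all-MiniLM-L6-v2", "paraphrase-multilingual-MiniLM-L12-v2"]

title_a = "Mistletoe therapy in oncology"        # illustrative titles,
title_b = "Die Misteltherapie in der Onkologie"  # not taken from CAMbase

for name in MODELS:
    model = SentenceTransformer(name)
    embeddings = model.encode([title_a, title_b], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: {score:.3f}")
```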

Statistical Analysis

In this approach, CAMbase v2.0, equipped with the proprietary syntactic algorithm, and CAMbase v3.0, equipped with Solr and a score ranking algorithm, are compared. The comparison is based on 36 search queries, which were suggested by experts from the field of CAM. They are a combination of terms from the list of Wieland et al. [40] and German key terms (see Table 3 and Fig. 5). The queries were executed in both systems with the four restrictions “All words”, “Keywords”, “Abstracts”, and “Titles”. Thus, four dependent pairs of outcome variables are obtained per query, i.e., the retrieved documents, their titles, and the required query time.

The values of the outcome variables were then manually entered into a data sheet for further analysis. Firstly, the number of retrieved documents was compared. For this purpose, a mean value was calculated for each restriction by dividing the sum of the retrieved documents by the number of executed search queries. Secondly, the mean query times were compared analogously to the number of retrieved documents. Finally, the titles of the retrieved documents were compared. Here, SBERT was applied using the two models described above (all-MiniLM-L6-v2 and paraphrase-multilingual-MiniLM-L12-v2). The similarities calculated by SBERT were based on the document titles, with only the most similar title being considered. The calculation was carried out in both directions, i.e., all titles of the documents from CAMbase v2.0 were compared with those from CAMbase v3.0 and vice versa. Mean values were then calculated by summing the SBERT values within the search queries, across all restrictions, and dividing by the number of retrieved documents. These means represent the reliability of the systems.
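
A minimal Python sketch of this bidirectional best-match calculation, again assuming the sentence-transformers package and purely hypothetical title lists, might look as follows; the means reported below were, of course, computed on the full result sets of both systems.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def best_match_mean(titles_from: list[str], titles_to: list[str]) -> float:
    """For every title of one system, keep only the cosine similarity to its
    most similar counterpart in the other system, then average these maxima."""
    emb_from = model.encode(titles_from, convert_to_tensor=True)
    emb_to = model.encode(titles_to, convert_to_tensor=True)
    similarity = util.cos_sim(emb_from, emb_to)  # |titles_from| x |titles_to|
    return similarity.max(dim=1).values.mean().item()

# Hypothetical title lists for a single query; the real lists come from the
# documents retrieved by CAMbase v2.0 and CAMbase v3.0, respectively.
titles_v2 = ["Morita therapy.", "Morita therapy in Japan"]
titles_v3 = ["Morita Therapy", "Morita therapy in Japan", "Naikan and Morita"]

print("v2.0 -> v3.0:", best_match_mean(titles_v2, titles_v3))
print("v3.0 -> v2.0:", best_match_mean(titles_v3, titles_v2))
```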

The statistical analysis concluded with a t-test for each part, using the functions of Microsoft Excel for Windows and a significance level of 5%. For the graphical displays, means and their 95% confidence intervals were used accordingly.
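
Although the calculations were carried out in Microsoft Excel, an equivalent paired t-test can be sketched in Python with SciPy; the query times below are purely hypothetical.

```python
from scipy import stats

# Purely hypothetical mean query times (in seconds) per search query.
times_v2 = [0.84, 1.02, 0.91, 1.15, 0.77]
times_v3 = [0.21, 0.34, 0.25, 0.40, 0.19]

t_statistic, p_value = stats.ttest_rel(times_v2, times_v3)
print(f"t = {t_statistic:.3f}, p = {p_value:.4f}")
if p_value < 0.05:  # significance level of 5%, as in the report
    print("The difference in means is statistically significant.")
```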

Results

Quantitative Reliability

In a first case, the words of a search query were joined with the operator “OR” in CAMbase v3.0. The differences to the legacy system were tremendous and were easily verified as significant with a t-test on these interim results [41]. This is because, with this setting, a query to CAMbase v3.0 retrieves a union of documents if the search query consists of multiple words. The more words the query contains, the larger the union can become. The significant differences are in accordance with user statements that the content of a search no longer corresponded to what they were used to, although fortunately without errors. A few users even seemed slightly positive about the larger range of documents, because they might finally obtain more results for their systematic reviews and meta-analyses. Nevertheless, a second case was conducted with the operator “AND”. This led to a more comparable number of documents between both systems. Because an intersection is now created, the mean number of documents retrieved in CAMbase v3.0 is no longer statistically different from that retrieved in CAMbase v2.0. This applies to all restrictions, and the change did not go unnoticed by users either, who described the content of a search as very accurate. Table 1 contains the t-test results based on the means. In addition, a graphical overview of the means is shown in Fig. 3.

Table 1 Results of the paired t-test, comparing the mean numbers of documents of CAMbase v2.0 and CAMbase v3.0, where the operator in CAMbase v3.0 is first set to “OR” and then to “AND”
Fig. 3

Means of the number of documents separated into the restrictions “All words”, “Abstract”, “Title”, and “Keywords”. The striped bar represents CAMbase v2.0, the dotted bar represents CAMbase v3.0 using the operator “OR”, and the filled bar represents CAMbase v3.0 using the operator “AND”. Means are partially taken from [20]

On the one hand, the Solr-based system retrieves more documents if a union is built manually or automatically using the operator “OR”. On the other hand, a similar number of documents can be retrieved with an intersection using the operator “AND”. This can also be seen in the bar charts and applies to all four restrictions. For the restriction “All words”, the number of documents was slightly lower in CAMbase v2.0 (mean = 193) than in CAMbase v3.0 (mean = 210), despite the fact that CAMbase v2.0 still operated with a semantic-syntactic search algorithm to find more relevant documents.

Performance

Performance was compared via the query times, analogously to quantitative reliability, by setting the operator in CAMbase v3.0 first to “OR” and then to “AND”. Both cases showed an improvement in performance compared to the legacy system (see Table 2). Except for the restriction “Title” after setting the operator to “AND”, the t-test provides statistical evidence of this improvement. As this restriction only barely missed significance, the results demonstrate that the processing of queries via Solr is overall more efficient than the algorithm of CAMbase v2.0.

Table 2 Results of the paired t-test, comparing the mean query times of CAMbase v2.0 and CAMbase v3.0, where the operator in CAMbase v3.0 is first set to “OR” and then to “AND”

Figure 4 displays the query times recorded for CAMbase v2.0 and the two variants of CAMbase v3.0. Obviously, Solr outperforms the legacy algorithm regardless of the operator. Solr also shows, when the operator is changed from “AND” to “OR”, that its performance is maintained despite the increasing number of retrieved documents.

Fig. 4

Means of the query times separated into the restrictions “All words”, “Abstract”, “Title”, and “Keywords”. The striped bar represents CAMbase v2.0, the dotted bar represents CAMbase v3.0 using the operator “OR”, and the filled bar represents CAMbase v3.0 using the operator “AND”. Means are partially taken from [20]

Search Reliability

As the search in CAMbase v3.0 with the “AND” operator proved closer to the legacy system, only this case is analyzed with respect to search reliability. Regardless of the applied model, the values from SBERT indicate a high level of consistency when comparing the titles retrieved from CAMbase v2.0 with those retrieved from CAMbase v3.0 and vice versa. No mean calculated within the 36 search queries was below 0.5, as shown in Fig. 5. With both models, the best result was achieved with the search query “Craniosacral Manipulation” (N = 2). The exact same documents were retrieved before and after the migration, leading to an SBERT value of 1. The search with “Morita Therapy” (N = 3) performed the worst for CAMbase v2.0: SBERT calculated a value of 0.639 with the general-purpose model and a value of 0.669 with the multi-language model. For the final system, the worst SBERT value was 0.661 with the general-purpose model and 0.68 with the multi-language model, obtained when searching for “Bee Products” (N = 11). These low values only occur on one side, i.e., a search with these queries in the other system performed much better. The reason for this effect is a difference in the number of retrieved documents between the systems. If one system retrieves more documents than the other within an overall small set, the additional documents obviously have no counterpart among the documents retrieved by the system with the smaller set. This can be seen by looking at the equivalent values of 0.968 and 0.939, calculated with the general-purpose model and the multi-language model, respectively, for the query “Bee Products” (N = 7) and the value of 1 calculated with both models for the query “Morita Therapy” (N = 2). This unilateral effect can also explain lower values if the number of retrieved documents is relatively high in both systems. Again, the difference leads to a lack of equivalent documents, with a large difference amplifying the effect. In CAMbase v2.0, for example, the query “Arts therapy” (N = 409) resulted in a value of 0.995 with both models. However, in CAMbase v3.0, the same query (N = 1317) resulted in a value of 0.727 with the general-purpose model and a value of 0.78 with the multi-language model.

Fig. 5

Overview of the means calculated from the semantic analysis via the multi-language model of SBERT. Calculations were performed for CAMbase v2.0 (circles) and CAMbase v3.0 (squares) using the “AND” operator accordingly and separately for each search query. Error bars denote the 95% confidence interval and crosses denote the equivalent means of the general-purpose model from [20]

It was also observed that a large proportion of values was just below 1, even when the number of documents was equal in both systems. The reason for this is the cleaning process of the data. While, for example, umlauts were coded in CAMbase v2.0, all words were indexed in plain text in CAMbase v3.0. In such cases, SBERT also differentiated between titles when additional special characters occurred, such as a dot at the end or quotation marks used for highlighting titles in the data. For a human user, those titles may be considered identical, but SBERT registered a slight difference. The multi-language model seemed to handle these slight differences better. However, a general worsening or improvement between the models could not be determined by contrasting the t-values in Table 3. Although SBERT showed differences in both systems, the document comparisons for most search queries are not significantly different. Only five (“Arts Therapy”, “Autogenes Training”, “Chinese Traditional Medicine”, “Krebs”, and “Massage”) of the 36 search queries led to significant differences between CAMbase v2.0 and CAMbase v3.0. The other queries led to means that are close to each other in both systems, as displayed in Fig. 5 by the steep slope of the means on the right side.

Table 3 Alphabetically ordered list of search queries, derived from the list of Wieland et al. [40] and extended with German key terms, and the results of the t-test for independent samples, where the t1 values correspond to the general-purpose model and the t2 values to the multi-language model

Overall, the comparison of the search reliability with SBERT shows a promising result and thus indicates that the legacy system and the final system are very similar.

Lessons Learned

In this technical report, the realization and evaluation of the migration of an online literature database from a 32-bit to a 64-bit OS are presented. As the pure migration was unsuccessful, the proprietary search algorithm had to be replaced with Apache Solr, which changed the semantic search to a score-based search and required a data migration. By integrating Solr into CAMbase v3.0, the main goal of providing a useful and functional literature database for CAM could be achieved. The approach of implementing a ready-made search engine has proven to be a good solution for providing users with similar search results without abandoning the graphical user interface and the modular structure of the historically grown database.

Compared to the time when CAMbase v2.0 was released, there are several notably good open-source search engine solutions available today. Solr was chosen because there is a large community behind it that drives development. In addition, Solr’s documentation is quite extensive and covers different use cases, which clearly helped with the installation, the configuration of the project, the import of the literature data, and even with linking Solr to the GUI through an existing library solution. Despite the time investment, Solr remains a flexible solution that even led to an increase in performance after replacing the legacy algorithm, which could be useful in similar projects.

A methodological limitation is that this technical report omitted the calculation of sensitivity and precision for relevant and irrelevant retrieved documents, as suggested by Lefebvre et al. [42]. Although the analysis was not intended for this type of evaluation, it could be considered in further analyses or in the analysis of similar projects. Instead, SBERT, a derivative of the language representation model BERT [43], was used to ensure quality aspects such as data accuracy and data accessibility, which could be affected by the migration [44]. BERT itself has already proven to be a remarkable method for detecting similarities in textual or bibliographic data in similar contexts [45,46,47]. The results with SBERT were generally sufficient. According to these results, the documents retrieved through the 36 specific search queries showed an overall high level of agreement between CAMbase v2.0 and CAMbase v3.0. The two chosen language models had some minor issues with additional punctuation marks or coded German umlauts, which were somewhat more pronounced in the general-purpose model (all-MiniLM-L6-v2) than in the multi-language model (paraphrase-multilingual-MiniLM-L12-v2). Nevertheless, both demonstrated similar results of good quality, which indicates their accuracy and robustness. The general-purpose model stands out a bit, as it was trained on English data only and could still handle the mixture of English and German documents. Which model is better depends on the use case. In our case, neither of the two models fits perfectly. A more optimal model would be a mixture of the two, trained on both languages without considering translations. As a recommendation, even if the results were sufficient, a model should be trained appropriately for its specific use case, e.g., by fine-tuning, which can lead to better results [48].

The addition of small qualitative surveys of user statements helped to ensure and improve the data quality as well. At first, CAMbase v3.0 showed a significant change in the retrieved documents when the operator was set to “OR”, which users immediately noticed. Users could no longer find their usual literature but were delighted with the wide range of documents available, although it took longer to find the right literature. However, this did not correspond to the goal of an equivalent online database. Therefore, the operator in CAMbase v3.0 was finally set to “AND”. Now, users describe the retrieved literature as more accurate, which is in accordance with the former analysis, and have a much better experience [20]. The fact that CAMbase v3.0 is a new system was hardly noticed, which could be due to the retained GUI. In contrast, users miss the functionality of easily narrowing down their search with words from a thematic landscape [8]. This functionality has only been partially implemented. Instead, the search can now be manually influenced by Boolean operators. However, users will need a certain training period to use these new functions. This highlights the need for an online tutorial, a feature for further development. Yet, all statements came from only a few supportive users. A larger sample could reveal more critical and detailed statements, which could be collected in a more systematic, qualitative study.

Conclusions

The assessment of various parameters, e.g., after a data migration, is important for the quality management of bibliographical data [49], especially for sensitive or confidential data such as in the medical field. Possible data changes can be measured and categorized to support data quality [50]. In this report, the data were evaluated by means of user statements and a semantic textual analysis. The combination of both resulted in a well-accepted final system.

In sum, this technical report may serve as a blueprint for similar projects. If the implementation is followed carefully, Solr can be considered, to some extent, as an alternative to or replacement for a search engine that uses a semantic algorithm. In particular, the semantic text analysis via SBERT has proven to be a promising tool for quality management; it is therefore highly recommended and should be used and investigated in further analyses.