Abstract
The orchestration of cloud computing infrastructures is challenging given the number, heterogeneity, and dynamicity of the resources involved and the highly distributed nature of the applications that use them for computation and storage. The volume of relevant monitoring data can be significant, and the ability to collect, analyze, and act on this data in real time is critical for the infrastructure’s efficient use. In this study, we introduce a methodology that addresses the diverse, dynamic, and voluminous nature of cloud resources and the applications that they support. We use knowledge graphs to represent computing and storage resources and the relationships between them and the applications that use them. We then train GraphSAGE to learn vector representations of the infrastructure’s properties that preserve the structural properties of the graph. These representations are provided as input to two unsupervised machine learning algorithms, CBLOF and Isolation Forest, for the detection of storage and computing overuse events; CBLOF performs better across all our evaluation metrics. We have also developed re-optimization mechanisms that, following the detection of such events, ensure the performance of the served applications. Evaluated in a simulated environment, our methods demonstrate improvements in anomaly detection and infrastructure optimization. The results indicate the potential of this closed-loop operation to adapt dynamically to the evolving demands of cloud infrastructures. By integrating data representation and machine learning methods with proactive management strategies, this research offers a scalable, intelligent solution for modern cloud infrastructures.
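To make the embedding step more concrete, the following is a minimal sketch of a single GraphSAGE-style mean aggregation over a toy resource graph. It is not the authors' implementation: the node names, edges, and utilization values are hypothetical, and the learned weight matrix and nonlinearity of a trained GraphSAGE layer are deliberately omitted, so each output vector is simply the node's own features concatenated with the mean of its neighbors' features.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy graph: servers host applications.
# Each node carries two illustrative features: [cpu_util, storage_util].
features = {
    "server1": [0.35, 0.40],
    "server2": [0.30, 0.45],
    "server3": [0.95, 0.90],  # an overused resource, for illustration
    "app_a": [0.20, 0.10],
    "app_b": [0.25, 0.15],
}
edges = [
    ("server1", "app_a"),
    ("server2", "app_a"),
    ("server2", "app_b"),
    ("server3", "app_b"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def sage_mean_layer(node_features, adj):
    """One untrained GraphSAGE-style step: concat(self, mean of neighbors).

    A trained layer would additionally apply a learned weight matrix and
    a nonlinearity to the concatenated vector; both are left out here.
    """
    out = {}
    for node, x in node_features.items():
        nbrs = adj.get(node, [])
        if nbrs:
            agg = [mean(node_features[n][i] for n in nbrs) for i in range(len(x))]
        else:
            agg = [0.0] * len(x)  # isolated node: no neighborhood signal
        out[node] = x + agg
    return out


if __name__ == "__main__":
    embeddings = sage_mean_layer(features, build_adjacency(edges))
    for node, vec in embeddings.items():
        print(node, [round(v, 3) for v in vec])
```

In a full pipeline, vectors of this kind (produced by a trained model rather than this untrained sketch) would be stacked into a matrix and passed to an unsupervised detector such as CBLOF or Isolation Forest, which assigns each resource an anomaly score.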
Data Availability
Data will be made available on request.
Acknowledgements
This work is supported by the EU research project MARSAL (101017171).
Funding
Open access funding provided by HEAL-Link Greece.
Author information
Contributions
K.M., P.K., and P.S. wrote the main manuscript text. K.M. prepared the figures and software. E.V. provided funding and project supervision. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mitropoulou, K., Kokkinos, P., Soumplis, P. et al. Anomaly Detection in Cloud Computing using Knowledge Graph Embedding and Machine Learning Mechanisms. J Grid Computing 22, 6 (2024). https://doi.org/10.1007/s10723-023-09727-1
DOI: https://doi.org/10.1007/s10723-023-09727-1