Peer-Reviewed

Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming

Received: 10 June 2021    Accepted: 9 July 2021    Published: 2 August 2021
Abstract

The amount of data generated has grown enormously since the rise of the Internet. This data, usually collected in different formats and from multiple sources, is popularly termed big data. Big data contains uncertainty. To handle that uncertainty, probabilistic reasoning is used to develop probabilistic models that encode general knowledge about a domain. These models are used together with an inference algorithm to support decision makers, especially in uncertain situations. Developing such models draws on extensive knowledge of statistics, machine learning, and probability theory, so it is usually a difficult undertaking. Probabilistic programming was introduced to simplify the development of complex models. In addition, decision makers often need knowledge from both historic and current data to make sound decisions, which creates the need to unify the processing of historic and real-time data with low latency. The Lambda architecture was introduced for this purpose. This paper presents Kognitor, a framework that simplifies the design and development of difficult models using probabilistic programming and the Lambda architecture. The framework is evaluated through a case study that highlights the potential of probabilistic programming to simplify model development and enable real-time reasoning on big data, thereby demonstrating the framework's effectiveness, and the results of this evaluation are presented. The Kognitor framework can guide the effective and easier implementation of complicated real-life situations as probabilistic models, which will benefit the big data processing domain and decision makers. Kognitor achieves cost-effectiveness by using contemporary big data tools and technology on commodity hardware. The framework will also benefit academia with respect to the use of probabilistic programming.
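
As a rough illustration of how a Lambda-style pipeline can combine historic and real-time data for probabilistic reasoning, the following Python sketch keeps a belief computed from historic events (batch layer) and folds in recent events at query time (speed layer). It is not the Kognitor implementation: the Beta-Bernoulli model, the names `BetaBelief`, `batch_view`, and `serve`, and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about the probability of a binary event."""
    alpha: float = 1.0  # prior pseudo-count of positive events (assumed uniform prior)
    beta: float = 1.0   # prior pseudo-count of negative events

    def update(self, events):
        """Return a new belief after observing an iterable of 0/1 events."""
        events = list(events)
        positives = sum(events)
        return BetaBelief(self.alpha + positives,
                          self.beta + len(events) - positives)

    def mean(self):
        """Posterior mean of the event probability."""
        return self.alpha / (self.alpha + self.beta)


def batch_view(historic_events):
    """Batch layer: belief recomputed from the full historic data set."""
    return BetaBelief().update(historic_events)


def serve(batch, recent_events):
    """Query time: fold events newer than the last batch run into the batch view."""
    return batch.update(recent_events)


historic = [1, 0, 0, 1, 0, 0, 0, 1]  # hypothetical archived observations
recent = [1, 1]                      # hypothetical events since the last batch run
belief = serve(batch_view(historic), recent)
print(round(belief.mean(), 3))
```

With three positive events among the eight hypothetical archived observations and two more positives arriving afterwards, the merged belief is Beta(6, 6), so the script prints 0.5. Because the Beta update is additive, the batch result can be reused and only the recent events need processing at query time, which is the low-latency property the Lambda architecture aims for.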

Published in International Journal on Data Science and Technology (Volume 7, Issue 2)
DOI 10.11648/j.ijdst.20210702.12
Page(s) 32-39
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Big Data Processing, Probabilistic Model, Lambda Architecture, Probabilistic Programming

Cite This Article
  • APA Style

    Arinze Anikwue, Boniface Kabaso. (2021). Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming. International Journal on Data Science and Technology, 7(2), 32-39. https://doi.org/10.11648/j.ijdst.20210702.12

    ACS Style

    Arinze Anikwue; Boniface Kabaso. Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming. Int. J. Data Sci. Technol. 2021, 7(2), 32-39. doi: 10.11648/j.ijdst.20210702.12

    AMA Style

    Arinze Anikwue, Boniface Kabaso. Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming. Int J Data Sci Technol. 2021;7(2):32-39. doi: 10.11648/j.ijdst.20210702.12

    @article{10.11648/j.ijdst.20210702.12,
      author = {Arinze Anikwue and Boniface Kabaso},
      title = {Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming},
      journal = {International Journal on Data Science and Technology},
      volume = {7},
      number = {2},
      pages = {32-39},
      doi = {10.11648/j.ijdst.20210702.12},
      url = {https://doi.org/10.11648/j.ijdst.20210702.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdst.20210702.12},
     year = {2021}
    }
    

    TY  - JOUR
    T1  - Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming
    AU  - Arinze Anikwue
    AU  - Boniface Kabaso
    Y1  - 2021/08/02
    PY  - 2021
    N1  - https://doi.org/10.11648/j.ijdst.20210702.12
    DO  - 10.11648/j.ijdst.20210702.12
    T2  - International Journal on Data Science and Technology
    JF  - International Journal on Data Science and Technology
    JO  - International Journal on Data Science and Technology
    SP  - 32
    EP  - 39
    PB  - Science Publishing Group
    SN  - 2472-2235
    UR  - https://doi.org/10.11648/j.ijdst.20210702.12
    VL  - 7
    IS  - 2
    ER  - 


Author Information
  • Arinze Anikwue, Information Technology Department, Cape Peninsula University of Technology, Cape Town, South Africa

  • Boniface Kabaso, Information Technology Department, Cape Peninsula University of Technology, Cape Town, South Africa
