
Introduction
The past two decades have witnessed rapid advances in Artificial Intelligence (AI), Machine Learning (ML), and Blockchain technologies. During the same period, the production and sharing of data on a worldwide scale have also increased significantly, thanks to social media, mobile telecommunication, and the Internet of Things (IoT). Data is expected to grow from 41 to 175 zettabytes (1 zettabyte equals 10^21 bytes, or a billion terabytes) between 2019 and 2025 (Holst, 2020). This huge increase in data volume is associated with high velocity in data sharing through the internet and cloud computing, and with a variety of data formats, including images, texts, videos, and audio. These technology and data developments were accompanied by growth in the revenue generated by the businesses operating in this ecosystem. For instance, revenue from Big Data and business analytics amounted to U.S. dollars 189.1 billion in 2019 (a 12% increase compared to 2018), and this figure is projected to grow to U.S. dollars 274.3 billion in 2022 (Farmingham, 2019). Comparatively, annual global Artificial Intelligence (AI) software revenue is forecast to grow from $10.1 billion in 2018 to $126.0 billion by 2025 (Tractica, 2019). Machine Learning apps and platforms, meanwhile, generated revenue of more than U.S. dollars 42 billion in 2019, versus 38 billion that went towards AI, including advancing smart robots, virtual assistants, and natural language processing (Felman, 2019). Investments in Security Token Offerings (STOs), Initial Coin Offerings (ICOs), and crypto-currencies are also increasing, drawing the attention of regulators and of small and medium enterprises towards these important financing tools. Global Blockchain technology revenue is expected to grow from U.S. dollars 3 billion in 2020 to 39 billion in 2025 (Liu, 2020). The significant attention paid to these technologies is driven by the attractiveness of their financial applications.
More specifically, the financial services industry, as well as commercial organizations and governments, are focusing their attention on employing these technologies in their strategic plans, both to automate their processes and to make use of the huge amounts of data available from a variety of sources to execute smart decisions and achieve efficient operations. The convergence of Artificial Intelligence, Machine Learning, Blockchain, and Big Data technologies with finance is collectively discussed in the context of what is commonly known as Fintech (Kissell and Mack, 2020).
The article is organized as follows:
- The first section lists the objectives of this article.
- The second section discusses the definition of Fintech and identifies an analysis framework for this buzzword.
- The third section discusses Artificial Intelligence.
- The fourth and fifth sections discuss Machine Learning and Big Data.
- The article is then concluded with a summary and recommendation section.
Although Blockchain is an important technology under the Fintech paradigm, it is not discussed in this article. Readers interested in Blockchain technology and its financial applications are referred to my article on crypto-assets (Sarhan, 2020).
Keywords: Fintech, Artificial Intelligence (AI), Big Data, Machine Learning (ML), Blockchain, Algorithm, Statistics.
Fintech
Fintech is a broad term that means different things to different people (Allen, Gu, and Jagtiani, 2020). However, it can be thought of as the intersection between technology and finance (Gomber, Koch, and Siering, 2017). For some, it relates to three technology deployments: the application of Distributed Ledger Technology (DLT) in finance, the provision of financial advice through robots (robo-advisors), and peer-to-peer financing (crowdfunding) (Preece, 2016). To others, it means the technology-driven innovation occurring in the financial services industry, or it refers to companies (new or start-ups) that are involved in developing new technologies and their applications, including the business sector that comprises such companies (Kissell and Mack, 2020). Gomber et al. (2017) argue that one needs to think of Fintech under the framework of the digital finance cube. The authors believe that the finance functions, the technologies applied to the finance functions, and the institutions involved in the digitization of the finance functions are the three dimensions of the digital finance cube (see figure (1)).
Therefore, one can define Fintech under this framework as follows: the activities undertaken around finance functions (financing, investments, money, payments, insurance, and financial advice) in order to digitalize them, using financial technologies adopted or developed by organizations in both the finance and IT industries. For instance, the digital investment function has been developed to be provided through robots (robo-advisors). This robotic financial advice service is the result of cooperation between investment management firms and IT service companies specialized in Artificial Intelligence (AI) and Machine Learning (ML). This article focuses on Fintech from the perspective of the four main technologies that are considered the most important financial technologies in the 21st century: Artificial Intelligence, Machine Learning, Big Data, and Blockchain. The article discusses the first three; Blockchain is not covered here as it is the subject of another article by the author. The next section discusses Artificial Intelligence.
Artificial Intelligence (AI)
AI is one of the four pillars of Fintech, as discussed in the previous section. Since the 1950s, scientists have tried to mimic human cognitive capabilities, and these efforts led to today's revolutionary digitized world with the help of robots and machines. This section discusses AI and is organized as follows: First, the section discusses the definition of AI. Second, it relates AI to Machine Learning and Deep Learning. Third, it briefly outlines the history of AI. The section then demonstrates the status of research with regard to AI. The applications of AI and the issues facing AI endeavors are discussed thereafter. The section concludes with a summary.
Definition
It is difficult to form a unified definition of AI (Gao, Jia, Zhao, Chen, Xu, Geng and Song, 2019). There is not even a widely agreed-upon scientific definition of intelligence. Rather, there is a list of characteristics that AI researchers are trying to replicate, including perception, action, reasoning, adaptation and learning, communication, planning, autonomy, creativity, reflection and awareness, aesthetics, and organization (Hanoavar, 2016). Nevertheless, AI can be defined as “The enterprise of understanding and building intelligent systems” (Hanoavar, 2016, p. 2) or the “science of making machines do things that would require intelligence if done by men” (Marvin Minsky cited in Crevier, 1993, p. 9). AI, in fact, is the human quest to build a machine that replicates our cognitive and adaptive abilities and that can solve complex problems based on an understanding of the surrounding environment and an internal logic that directs its actions in a way that optimizes the output.
Link to Machine Learning and Deep Learning
The terms AI, Machine Learning, and Deep Learning are often used interchangeably (Marr, 2016 and Nocholson, 2019). However, ML is a subset of AI: AI is the broader concept of machines being able to carry out tasks smartly, while ML is an application of AI based on the idea that we should give machines access to data and let them learn for themselves (Marr, 2016). In turn, as will be explained later in the Machine Learning section, Deep Learning is a subset of Machine Learning.
History
The term AI was first coined by John McCarthy in 1955, and the first AI program was presented during the Dartmouth conference in 1956 (Anyoha, 2017). AI technology thereafter underwent many ups and downs until its revolutionary potential was revealed when “Deep Blue”, IBM's chess-playing computer, beat the world chess champion, Garry Kasparov, in 1997. More recently, in 2017, a Deep Learning algorithm called “AlphaZero” used pattern recognition techniques to reach world-champion-level chess play. In contrast to Deep Blue, this program did not use a domain knowledge base; it taught itself by training for four hours playing against itself (Rasekhschaffe and Jones, 2019).
AI in research
Research in the AI field has gained momentum in recent years, which indicates that the academic community's interest is focused on this hot topic. In a bibliometric study of research articles sourced from the Web of Science, Lei and Liu (2018) count 1,188 articles published between 2007 and 2016 with the term “Artificial Intelligence” in the title, with 188 articles published in 2016 versus 72 in 2007. The top 10 countries involved in this research are led by the USA, followed in descending order by the UK, Iran, Spain, China, Italy, India, Turkey, Canada, and France. The topics researched include neural networks, genetic algorithms, fuzzy logic, optimization, support vector machines, Machine Learning, modeling, and prediction. The development in the number of articles is shown in figure (3).
The history of AI is summarized in the timeline presented in figure (2).
Applications
AI is a multidisciplinary science built on contributions from various fields, including computer science, mathematics, psychology, biology, engineering, and philosophy. The applications of AI are many and include, among others, natural language processing (NLP), wherein the machine can talk to humans and understand natural language (e.g. Siri and Alexa); image recognition (e.g. self-driving cars); recommendation engines (e.g. Amazon) (Rasekhschaffe and Jones, 2019); and intelligent robots that learn from mistakes and execute human orders. These robots have sensors that enable them to obtain information about the surrounding environment, as well as efficient processors and huge memory, allowing them to exhibit intelligence (Tutorials Point, 2020a).
Issues
AI issues are discussed from various perspectives, including, among others: First, AI threatens privacy, as it can recognize speech and could theoretically understand every e-mail and telephone conversation (Tutorials Point, 2020b). Second, machines under AI are developed to mimic human intelligence and learn from mistakes, similar to the AlphaZero algorithm, and might become smarter than humans to the point that humans cannot control them, thereby exposing us to existential risk. Bill Gates (Microsoft), Elon Musk (SpaceX), and Stephen Hawking (physicist) have voiced such concerns (Wikipedia, 2020a). Third, AI carries other risks, such as a decrease in demand for human labor and autonomous weapons (artificial soldiers). Fourth, some Machine Learning algorithms are complex and produce outputs that are considered a “black box”. This creates challenges for users of these algorithms because they cannot interpret how the algorithms reached their conclusions. XAI (Explainable Artificial Intelligence) is an approach to better understand how the machine reached the conclusion in its output. An ACCA report in 2020 clarifies that XAI “means to be able to shine a light on its inner workings and/or to reveal some insight on what factors influenced its output, and to what extent, and for this information to be human-readable, i.e not hidden within impenetrable lines of code” (ACCA, 2020, p. 8).
Summary
This section discussed the definition of AI, briefly listed the historical milestones of AI as a science, and discussed the applications of and issues accompanying AI. The analysis in this section has provided a brief understanding of AI to help the reader recognize its nature and differentiate it from Machine Learning and Deep Learning. The following section will discuss the Machine Learning (ML) concept, the types of ML algorithms, and the applications of ML in finance.
Machine Learning (ML)
As discussed in the previous section, AI as a science works toward creating machines that execute tasks that would require the intelligence of men to do. Machine Learning is one of the tools used in AI to help humans perform tasks that are difficult to perform without the help of machines due to the large number of variables involved. This section discusses Machine Learning (ML) and is organized as follows: First, the definition of ML is discussed. Second, ML is compared and contrasted with statistical tools. Subsequently, the section discusses the mechanism of how ML algorithms work. The section then discusses how to choose among available ML algorithms. The benefit of aggregating ML algorithms and the applications of ML in finance are discussed, and the section then concludes with a summary.
Definition
Machine Learning is a sub-field of Artificial Intelligence that gives computers the ability to learn without being explicitly programmed to solve the problem (El-Afly and Mohammed, 2020; Parveena and Jiaganesh, 2017). This definition simply means that ML algorithms are programmed to learn. As will be discussed later in this section, ML algorithms learn from large amounts of data and apply that learning to predict (generalize). This process can be thought of simply as “learn the pattern, apply the pattern” (DeRose and Le Lannou, 2020). In effect, the algorithm determines the structure in the data in order to solve a classification or regression problem.
What is the difference from statistical prediction tools?
First, statistical tools rely on theoretical assumptions about data structure. For instance, comparing paired-sample means with a t-test assumes that the data are normally distributed, while linear regression assumes that there is a linear relationship between the dependent and independent variables and that the error terms are stationary. In contrast, ML algorithms do not have such restrictions and can handle problems with many variables (high dimensionality) or with a high degree of non-linearity (DeRose and Le Lannou, 2020). Second, statistical tools and ML share the same objective of minimizing forecast error, usually measured by the Root Mean Square Error (RMSE), but they differ in how they reach this goal: statistical tools use linear processes while ML uses non-linear ones. In addition, ML algorithms are more data-demanding and rely heavily on computer science (Makridakis, Spiliotis and Assimakopoulos, 2018). Despite the practicality of ML algorithms in solving complex problems compared to statistical models, there is an argument that statistical tools outperform their ML counterparts when tested out-of-sample and over different time horizons (Makridakis et al., 2018). Moreover, there is evidence from the M4 competition (a competition between 61 forecasting methods applied to 100,000 time series) that combining statistical forecasters provides better results than pure ML or pure statistical methods (Makridakis, Spiliotis and Assimakopoulos, 2020).
How do Machine Learning algorithms work?
The ML mechanism depends on the type of algorithm. In the literature, there are two ways to group the algorithms: first, by the way the algorithm learns from data, and second, by grouping them by style or function (Brownlee, 2019). This article employs the first method, i.e. grouping by the way the algorithm learns from data.
The various ML algorithms grouped by style or function are depicted in figure (4). The interested reader is referred to the Wikipedia article on Machine Learning for more types and algorithms. The ML types by the way they learn are supervised, unsupervised, Neural Networks (Deep Learning), and reinforcement learning (DeRose and Le Lannou, 2020).
Supervised Machine Learning
Under supervised ML, the algorithms are trained on specific data to learn the patterns in it. Before training starts, the variables should be divided into a target and features, which are equivalent to the dependent and independent variables respectively. Data are obtained and divided into three parts: the training set, the cross-validation set, and the testing set. The training set (in-sample) is used to train the algorithm, while the cross-validation set is used to test and fine-tune it. Finally, the trained and tuned algorithm is tested using the test data set (out-of-sample).
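As a minimal illustration of the three-way split described above, the sketch below partitions a data set into training, cross-validation, and test sets. The 60/20/20 proportions and the `split_data` helper name are illustrative choices, not a prescribed standard:

```python
import random

def split_data(rows, train=0.6, val=0.2, seed=42):
    """Shuffle and partition observations into training (in-sample),
    cross-validation, and test (out-of-sample) sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * train)
    n_val = int(n * (train + val))
    return rows[:n_train], rows[n_train:n_val], rows[n_val:]

# 100 toy observations split 60/20/20.
train_set, val_set, test_set = split_data(range(100))
```

The seed makes the shuffle reproducible, so repeated runs fine-tune the algorithm against the same cross-validation set.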
Supervised ML algorithms perform two types of tasks (Mallikarjun and Abbasi, 2020): The first is classification, which sorts observations into distinct categories and can be done for categorical or ordinal target variables. An example of a categorical target is credit card fraud detection: the target is binary (1 for fraudulent, 0 for legitimate), while the features are the transaction characteristics. An example of an ordinal target is classifying firms into credit categories. Examples of algorithms used for classification tasks are logit, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Classification and Regression Trees (CART). The second task of supervised ML algorithms is regression, which makes predictions of a continuous target variable. Examples of algorithms of this type are linear regression, penalized regression (e.g. LASSO), CART, and random forests.
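To make the classification task concrete, here is a toy sketch of the K-Nearest Neighbors (KNN) algorithm named above, applied to invented fraud-style data (label 1 = fraudulent, 0 = legitimate). The data points and the choice of k = 3 are illustrative assumptions:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Predict the class of feature vector x by majority vote
    among its k nearest training examples (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical transactions: (features, label), 1 = fraudulent.
train = [((1.0, 1.0), 0), ((1.2, 0.9), 0), ((0.9, 1.1), 0),
         ((8.0, 9.0), 1), ((8.5, 8.7), 1), ((9.1, 9.3), 1)]
```

A new transaction near the legitimate cluster, e.g. `knn_classify(train, (1.1, 1.0))`, is voted legitimate; one near the fraudulent cluster is voted fraudulent.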
The accuracy of prediction is measured using different metrics, including the RMSE (Root Mean Square Error), which is the square root of the average squared difference between the actual data in the out-of-sample set and the forecasted data. In theory, the lower the RMSE the better; however, one should be cognizant of two issues: overfitting and underfitting. Basically, if the algorithm is too complex, or the problem is of high dimension, the model might end up memorizing the pattern and representing it too perfectly. On the out-of-sample data, such an algorithm would then perform poorly as measured by the RMSE (overfitting); this is called variance error. On the other hand, if the algorithm is not able to capture the pattern in the training data, it will report poor performance in-sample, i.e. when executed on the in-sample data it reports a high RMSE between the predicted results and the actual data (underfitting); this is called bias error. These errors are usually related to the complexity of the algorithm. The optimal scenario is to find a trade-off between the variance and bias errors by finding the right algorithm that fits the data; such an algorithm possesses a complexity level that minimizes both errors. See figure (5), which shows the fitting curve and the model fitting scenarios: the fitting curve is the trade-off between bias error, variance error, and model complexity, while the model fitting scenarios show that what is needed is a good-fit model that neither overfits nor underfits. In addition, figure (6) shows the learning curve, which relates how accurately the algorithm learns to predict to the sample size; the assumption is that the larger the sample size, the better the accuracy.
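The RMSE described above can be computed directly; the small helper below is a straightforward sketch of the formula (the square root of the mean squared difference between actual and predicted values):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared
    difference between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Each forecast misses by 1, so the RMSE is exactly 1.0.
error = rmse([3, 5, 7], [2, 6, 8])
```

Comparing this value between the in-sample and out-of-sample sets is one way to spot the underfitting and overfitting patterns discussed above.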
Unsupervised Machine Learning
Under this type of ML algorithm, there is only input data and no output variable, i.e. “unlike supervised learning above there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data” (Brownlee, 2016b). Data is supplied to the algorithm with the objective of uncovering its implicit structure and then applying the same structure for prediction to any new data point out-of-sample. This type of ML algorithm performs three tasks: The first is dimensionality reduction, which “focuses on reducing the number of features while retaining variation across observations to preserve the information contained in that variation.” (Mallikarjun and Abbasi, 2020, p. 5). An example is PCA (Principal Component Analysis), which can be used in exploratory data analysis to discover the main features before training another algorithm. The second task is clustering, which sorts observations into homogeneous groups; examples include K-Means and hierarchical clustering algorithms. A use case for clustering is grouping companies according to similarities that are not captured by human judgment or by industry or sector classifications, which is an important input for portfolio diversification and credit risk profiling. The last task is association rule learning, which identifies the rule that best explains the relationship between variables; an example is the Apriori algorithm. A well-known illustration, cited on Wikipedia, is that by analyzing supermarket sales one might find that customers who buy onions and potatoes are likely to also buy hamburger meat, a piece of information that can be used in marketing.
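As an illustration of the clustering task, the sketch below implements a bare-bones version of the K-Means algorithm mentioned above in pure Python. The toy points and iteration count are illustrative assumptions:

```python
import math
import random

def k_means(points, k, iters=20, seed=0):
    """Partition 2-D points into k clusters by alternating between
    assigning each point to its nearest centroid and recomputing
    each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Two well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = k_means(points, 2)
```

With clearly separated groups, the algorithm recovers the two clusters regardless of which points are drawn as initial centroids.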
Neural Networks (Deep Learning Network)
The name of this type of algorithm hints at its nature: a Neural Network (NN) is built to work in a fashion similar to how our brains work. DeRose and Le Lannou (2020) discuss the tasks of NN algorithms and indicate that NNs perform classification and regression, as in supervised ML, but can also support reinforcement learning.
The network is divided into three layers: the input, the hidden, and the output layers. In the case of a Deep Learning network, the number of hidden layers is at least 3 and often more than 20, and this is considered the backbone of the AI revolution (DeRose and Le Lannou, 2020).
DeRose and Le Lannou (2020) describe the NN as follows: The input layer comprises nodes that capture the data input (features). The hidden layer, where the learning occurs, also comprises nodes (neurons) that assign arbitrary weights to transmissions from the input layer. The feature inputs are scaled (normalized using their maximum) so that their values vary between 0 and 1. Each hidden node performs two activities: the summation operator and the activation function. First, the summation operator is simply the weighted sum of the inputs from the input layer. Second, the activation function transforms the weighted input from the summation operator and is usually non-linear, such as the sigmoid, an S-shaped function whose outputs lie between 0 and 1. The output of the hidden layer to the output layer is based on the activation function. Learning occurs when the error is corrected: the error results from comparing the actual output with the output predicted by the NN, and the weights in the hidden layer are readjusted accordingly (backpropagation). The process is then repeated until the predicted results reach an acceptable level of accuracy. See figure (7), which explains the difference between a regression model and an NN: under regression analysis, the relation between the Y variable and the Xs is linear, while under the NN the Zs are the nodes of the hidden layer, transmitting values between 0 and 1 (the activation function here is the rectified linear unit function), which is a better fit for non-linear relations (i.e. where the rate of change in the output differs at different levels of the input).
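The summation operator and sigmoid activation described above can be sketched as a single forward pass through one hidden layer. The weights below are arbitrary starting values of the kind backpropagation would later adjust; the layer sizes are illustrative:

```python
import math

def sigmoid(z):
    """S-shaped activation that squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, hidden_weights, output_weights):
    """One forward pass: each hidden node applies the summation
    operator (weighted sum of inputs) and then the sigmoid
    activation; the output node does the same over the hidden layer."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, features)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# 2 scaled features -> 3 hidden nodes -> 1 output node.
y_hat = forward([0.4, 0.9],
                [[0.1, -0.2], [0.3, 0.5], [-0.4, 0.2]],
                [0.6, -0.1, 0.8])
```

Training would compare `y_hat` with the actual output and nudge the weights to shrink the error, repeating until the accuracy is acceptable.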
Researchers are exploring the use of NN ML to model asset prices. Such prices are noisy, stochastic processes with unstable relationships, and NNs are thought to be able to help in understanding how the market works (DeRose and Le Lannou, 2020). Nevertheless, the comparison between the performance of ML algorithms and statistical models in such exercises should be kept in mind (see the “What is the difference from statistical prediction tools?” section of this article).
Reinforcement learning
Reinforcement learning is based on an agent (the algorithm) that has no prior labeled data, in contrast to supervised learning algorithms. The agent takes actions that maximize its rewards over time through trial and error, subject to the constraints imposed by the surrounding environment (DeRose and Le Lannou, 2020). For instance, moving a pawn forward in chess can either be a correct move that returns a reward or an incorrect move that does not return a reward. If the movement results in a reward, the machine saves this knowledge for future actions, and vice-versa. This type of algorithm is used, for example, in self-driving cars and robots, and it is the same approach used by AlphaGo, which beat the world champion at the ancient game of Go in 2017 (Wikipedia, 2020b).
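The trial-and-error reward mechanism described above can be sketched with tabular Q-learning, a standard reinforcement learning method. The five-state corridor environment, the reward of 1 at the goal, and the hyperparameters are illustrative assumptions, not drawn from the cited sources:

```python
import random

def train_q(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Tabular Q-learning on a 5-state corridor: the agent starts at
    state 0, moves left (action 0) or right (action 1), and earns a
    reward of 1 only upon reaching state 4. It has no labeled data
    and learns purely from the rewards its actions return."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(5)]  # q[state][action]
    for _ in range(episodes):
        s = 0
        while s != 4:
            # Mostly act greedily, but explore with probability eps.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == 4 else 0.0
            # Save what the reward taught us for future actions.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train_q()
```

After training, the learned values favor moving right in every state, which is the shortest path to the reward.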
How to choose a suitable Machine Learning algorithm?
To answer this question, the thinking process has three steps: The first step is determining the objective of the ML task, more specifically whether the task is classification, regression, dimensionality reduction, clustering, or a complex task such as face recognition, image classification, speech recognition, or natural language processing. The second step is to know the capabilities of the ML algorithms, and the third step is to know what type of variables each ML algorithm can deal with that matches the objective determined in the first step. More specifically, supervised ML can perform either classification or regression tasks; the suitable variables are continuous for regression tasks and categorical or ordinal for classification tasks. For instance, the SVM (Support Vector Machine) algorithm is an example of a supervised ML algorithm that can classify any new out-of-sample observation after being trained on data that are categorical in nature (e.g. yes/no, performing/non-performing, etc.). Unsupervised ML is capable of handling dimensionality reduction and clustering tasks on continuous as well as categorical or ordinal variables. The PCA (Principal Component Analysis) algorithm is an example of an unsupervised ML algorithm that can reduce the number of features in the data, for instance so that it fits on a computer screen. It does so by finding homogeneous features and grouping them while keeping heterogeneous features in other groups, which reduces the number of features and provides insight into the data. The groups of features produced by PCA are called eigenvectors, and each eigenvector is assigned an eigenvalue that represents how much of the data's variation is captured by that particular eigenvector.
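To illustrate how PCA's eigenvectors and eigenvalues relate to the variation they capture, the sketch below finds the first principal component of 2-D data using power iteration on the covariance matrix. The data points and the power-iteration approach are illustrative choices, not the only way to compute PCA:

```python
import math

def top_component(data, iters=100):
    """First principal component of 2-D data via power iteration on
    the covariance matrix, plus the share of total variance it
    explains (its eigenvalue divided by the trace)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Covariance matrix entries.
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    v = (1.0, 0.0)  # arbitrary starting vector
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    eigenvalue = (v[0] * (cxx * v[0] + cxy * v[1])
                  + v[1] * (cxy * v[0] + cyy * v[1]))
    return v, eigenvalue / (cxx + cyy)

# Points scattered tightly around the line y = x: one component
# should capture almost all of the variation.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
vec, explained = top_component(data)
```

Here the top eigenvector points along the diagonal and its eigenvalue accounts for nearly all the variance, so one feature can stand in for two.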
Finally, Neural Networks, Deep Learning, and reinforcement learning are sophisticated algorithms that deal with continuous and categorical variables and can handle complex tasks or learn by themselves. Figure (8) presents an example of how ML algorithms can be selected based on the three factors discussed in this section, specifically the type of ML algorithm, the nature of the variables, and the objective of the task.
Can one aggregate ML algorithms to get better results?
To reduce the variance error and achieve a balance between the variance and bias errors, one can use the “ensemble method”. This method is based on the idea of aggregating the results of multiple algorithms (weak learners) to obtain better results. The aggregation can be done in two ways: either the same algorithm is applied to different versions of the training data generated using bootstrapping (known as bagging), or different algorithms are applied to the same training data, where each newly added algorithm corrects the errors of the previous one until the training set is predicted well or the optimal number of models has been added (known as boosting) (DeRose and Le Lannou, 2020; Rasekhschaffe and Jones, 2019; Brownlee, 2016a).
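The bagging variant described above can be sketched as follows: a deliberately weak one-threshold learner is trained on bootstrap resamples of the training data, and the individual predictions are aggregated by majority vote. The learner, the toy data, and the number of models are illustrative assumptions:

```python
import random
from collections import Counter

def one_rule_learner(sample):
    """A deliberately weak learner: choose the single threshold on the
    lone feature that misclassifies the fewest points in this sample."""
    best, best_err = 0.0, float("inf")
    for t, _ in sample:
        err = sum((x > t) != y for x, y in sample)
        if err < best_err:
            best, best_err = t, err
    return lambda x, t=best: int(x > t)

def bagging_predict(data, x, n_models=25, seed=7):
    """Bagging: train the weak learner on bootstrap resamples of the
    training data, then aggregate predictions by majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]  # bootstrap resample
        votes[one_rule_learner(boot)(x)] += 1
    return votes.most_common(1)[0][0]

# Toy one-feature data: class 0 below ~2, class 1 above ~4.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
```

Individual bootstrap models occasionally pick a poor threshold, but the majority vote smooths those mistakes out, which is the variance-reducing effect bagging is used for.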
What are the applications of ML in finance?
ML algorithms are being applied in finance in various areas. For instance, to name but a few, ML algorithms have been used for fraud detection (Adepoju, Wosowei, lawte and Jaiman, 2019), Anti-Money Laundering (AML) investigations, portfolio optimization, stock selection (Rasekhschaffe and Jones, 2019), and bank risk management (Leo, Sharma and Maddulety, 2019).
Summary
ML algorithms paved the way for a new era of analysis and prediction wherein the machine can learn without being programmed for a specific task. They provide humans with support to comprehend and analyze huge data sets that exhibit high-dimensional (multivariate) and non-linear relationships. Some complex algorithms can even learn by themselves and beat the world champions in the games of Chess and Go. The applications of ML in finance are numerous and are still being explored in practice and in academia. However, one should be cognizant of the empirical evidence noted in the M4 competition, where combined statistical models were able to outperform the ML algorithms out-of-sample. At the same time, with the huge increase in the size and types of data (Big Data) discussed in the introduction to this article, the use of ML is becoming more important, if not a must. The following section discusses Big Data and illustrates the steps taken in carrying out a Big Data project.
Big Data
The previous section discussed ML and its applications. The large amounts of data available to organizations are considered assets, and organizations are currently monetizing those assets with the help of ML. Big Data and its related technologies are a hot topic these days. This section discusses Big Data and is organized in four areas: First, the section discusses the definition of Big Data. Second, it discusses how Big Data are managed and what the Big Data management challenges are. Third, it discusses how to implement a Big Data project, and finally, the section concludes with a summary.
Definition
The data we traditionally knew comprised figures and texts arranged in tables. Big Data, however, “Unlike traditional data, …. refers to large growing data sets that include heterogeneous formats: structured, unstructured and semi-structured data.” (Oussous, Benjelloun, Lahcen and Belfkih, 2018, p. 433). It requires more real-time analysis and brings opportunities by helping us understand hidden values, while by the same token carrying challenges in how to organize and manage it (Chen, Mao and Liu, 2014). Nowadays, data has grown to be more than tables that can be analyzed using traditional statistical models. Data currently includes, among others, images, videos, audio, social media chat, mobile telecommunication messages and calls, and web searches, all generated on a per-second basis. The massive development in social media, mobile telecommunication, analytics, cloud computing, and the Internet of Things (IoT) creates challenges for data management and analysis. Big Data is characterized by 4Vs, namely Volume, Velocity, Variety, and Value, plus Veracity as an additional V (Mallikarjun and Abbasi, 2020; Oussous et al., 2018; Chen et al., 2014). Volume refers to the quantity of data, velocity to the speed at which data is generated, variety to the diversity in types of data, veracity to the credibility of data, and value is the hidden characteristic that is discovered by analysis. This is where the importance of Big Data stems from: the high volume, combined with the timely acquisition of data and the diversity of its nature, makes real-time data analytics an activity that brings economic advantages to users and organizations. In terms of revenue from data analytics, for example, Business and Data Analytics (BDA)-related software revenue was expected to reach U.S. dollars 67.2 billion in 2019, of which 44% is software services provided via public cloud, growing at a compound annual growth rate (CAGR) of 32.3% (IDC, 2019).
How is Big Data managed, and what are the Big Data management challenges?
Big Data management aims to “ensure reliable data that is easily accessible, manageable, properly stored and secured” (Oussous et al., 2018). Therefore, Big Data management can be discussed from five perspectives: production, acquisition, storage, analysis, and security. First, Big Data are “generated by financial markets (e.g., stock and bond prices), businesses (e.g., company financials, production volumes), governments (e.g., economic and trade data), individuals (e.g., credit card purchases, social media posts), sensors (e.g., satellite imagery, traffic patterns), and the Internet of Things, or IoT, (i.e., the network of interrelated digital devices that can transfer data among themselves without human interaction)” (Mallikarjun and Abbasi, 2020, p. 1). Second, Big Data are either acquired by organizations themselves through their own systems, as with Amazon, Google, Facebook, and Yahoo, or acquired from third-party specialized vendors. Third, traditional data management and analysis systems are based on relational database management systems (RDBMS), which can handle structured data but not semi-structured or unstructured data. Chen et al. (2014) argue that RDBMS consume ever more hardware and are becoming expensive to manage. Solutions proposed to deal with this problem include, for instance, cloud computing and distributed storage systems such as NoSQL databases. The former enables businesses to handle Big Data efficiently with elasticity in infrastructure (e.g., IaaS: Infrastructure as a Service; SaaS: Software as a Service; PaaS: Platform as a Service). The latter has achieved great success in processing clustered tasks. Apache Hadoop is a well-known Big Data technology with a supporting community. Hadoop is open source and runs tasks in parallel clusters through MapReduce, a programming model used for Big Data analysis.
Hadoop does not copy entire remote data files into memory; rather, it runs tasks where the data are stored, thereby lowering server and network communication load (the reader is referred to Oussous et al. (2018) for further reading on the Hadoop infrastructure and layers). The fourth management activity is the storage of Big Data. Storage is one of the challenges that raises questions about data redundancy and the data life cycle. Redundant data are those data points, generated in the IoT for instance, that add no value, while the Big Data life cycle covers all decisions concerning how long data should be kept in storage. The fifth management activity is data analysis. The tools used for analyzing Big Data include open-source programming languages such as Python and R, as well as Apache Mahout. The latter is capable of running large-scale ML algorithms on Big Data that are difficult to implement, for instance, in R.
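To illustrate the MapReduce programming model just described, the following is a minimal sketch in plain Python (not Hadoop itself) of the canonical word-count task: the map phase emits key–value pairs and the reduce phase aggregates them per key. The documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (token, 1) pair for every word in the document
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(pairs):
    # Reducer: sum the emitted counts per key (token)
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# In Hadoop, each document could be mapped by a different cluster node
# where the data resides; here the phases run sequentially for illustration
documents = ["big data needs big tools", "data tools"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(intermediate)
```

In a real cluster, the mappers run in parallel next to the stored data blocks, and the framework shuffles the intermediate pairs to the reducers by key.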
Big Data is an asset, and all organizations are trying to monetize that asset by taking timely decisions based on insights from the analysis undertaken on these data. Sometimes those decisions must be instant. Big Data management challenges therefore stem from the 4Vs. First, the volume and velocity of data are increasing. This requires the capability to capture these data by having a proper infrastructure that can generate, acquire, store, and analyze the data on a timely basis. Current systems and hardware might be capable neither of capturing the data in a timely manner nor of storing and analyzing it. The same applies to human capital: the question is whether organizations have the professional skills to deal with Big Data throughout its management cycle. Furthermore, the security and privacy of data are a great concern. Governments are paying more attention to personal data security, as with the GDPR in Europe and bank secrecy laws worldwide. Those laws protect the privacy of personal data, so organizations have to take measures to avoid breaching them; otherwise, the consequences are catastrophic, including penalties and reputational damage. The second challenge of data management is the variety of the data. One needs proper analysis tools, as discussed earlier, to uncover the hidden patterns (value) in the huge volume of unstructured and structured data. This requires ensuring that the data captured are reliable (veracity), that the analysis tools can handle huge volumes (e.g., Apache Mahout, R, and Python), and that the algorithms employed can handle non-linear and multivariate data, such as ML algorithms. Organizations are therefore at the forefront of reaping the benefits of Big Data; at the same time, they face many challenges that require proper measures to avoid the inherent risks.
How to implement a Big Data project?
The discussion in this section is based on the work of Mallikarjun and Abbasi (2020) unless otherwise stated.
Working with data provides insights and supports business management and investment decisions. For instance, structured data are used to predict stock performance or to assess the creditworthiness of a customer. With the addition of unstructured data such as texts, social media, images, videos, and audio, however, the scope, the benefits, and the challenges of Big Data became wider. An example of the challenges of Big Data analysis is data privacy and the ethical use of data. An example of the benefits is one study in the United States which revealed that positive public sentiment on Twitter can predict the Dow Jones Industrial Average up to three days later with nearly 87% accuracy.
This section discusses Big Data projects dealing with structured and textual types of unstructured data. The steps in implementing a Big Data project under both types of data are similar, except that under the unstructured textual data project the objective is ultimately to obtain data similar to structured data that can be used for modeling purposes. Figure (9) depicts the difference in the Big Data project steps between structured and unstructured textual data.
If the project is a financial modeling project, both structured and unstructured data analysis can be executed. For instance, predicting stock performance can be modeled using structured data including stock fundamentals and unstructured data including what is written in social media and newspapers. The output of the unstructured data project can be used on its own or as an input to the analysis of the structured data.
The steps of a structured data project include the following. The first is the conceptualization of the project: determining the output (stock price up or down, for example), who the users of the output are, and whether the output will be embedded in current business processes. The second step is the collection of relevant data related to the stock performance analysis. Such data are in tabular format, with rows representing instances, columns representing features, and cells representing particular values. The third step is data preparation and wrangling. This step comprises data cleansing and data preprocessing: the former is executed to clean the data tables of missing or incorrect values, while the latter is related to aggregating, filtering, and selecting the relevant data. The fourth step is data exploration, which comprises exploratory data analysis (e.g., using histograms, bar charts, and box plots; see figure (10) for a histogram and box plot), feature engineering (creating new features by changing or transforming existing features), and feature selection (using statistical techniques to identify features that are of interest individually or when analyzed with other features; such techniques include multicollinearity analysis and the Chi-square test). The fifth step is model training, which involves selecting the appropriate ML algorithm as discussed in the ML section and testing and fine-tuning the algorithm.
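As a small illustration of the data cleansing step, the sketch below (in Python, with hypothetical records) replaces missing values in one feature with the column mean, which is one common way of handling missing values before model training:

```python
def impute_missing(rows, column):
    # Replace missing (None) values in one feature with the column mean,
    # a common cleansing step before model training
    observed = [row[column] for row in rows if row[column] is not None]
    mean = sum(observed) / len(observed)
    for row in rows:
        if row[column] is None:
            row[column] = mean
    return rows

# Hypothetical table: one record has a missing income value
records = [{"income": 50.0}, {"income": None}, {"income": 70.0}]
cleaned = impute_missing(records, "income")
```

Other cleansing strategies, such as dropping the incomplete rows or imputing the median, follow the same pattern.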
Figure (11) shows a data source in the tabular format of structured data before cleansing and after cleansing.
Figure (12) shows the data after applying data preprocessing. This step, for instance, resulted in combining the salary and other income columns under a total income column (aggregation). In addition, it resulted in converting state names to abbreviations, removing the date of birth and replacing it with age, and removing the currency sign that was associated with the amounts.
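The preprocessing transformations just described can be sketched as follows in Python. The record fields (salary, other income, date of birth) and their values are hypothetical, and the age calculation is simplified to ignore the birthday month:

```python
from datetime import date

def preprocess(record, today=date(2020, 1, 1)):
    # Aggregate salary and other income into one total_income feature,
    # stripping the currency sign, and replace date of birth with age
    salary = float(record["salary"].lstrip("$"))
    other = float(record["other_income"].lstrip("$"))
    age = today.year - record["dob"].year  # simplified: ignores month/day
    return {"total_income": salary + other, "age": age}

# Hypothetical raw record as it might appear in the source table
row = {"salary": "$50000", "other_income": "$1000", "dob": date(1985, 6, 1)}
processed = preprocess(row)
```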
In the stock sentiment project under discussion, and in contrast with the structured data analysis, unstructured data include texts from online news articles, social media, and internal and external documents such as the financial statements, as well as any other openly available data sources. The steps of the unstructured data analysis comprise the following. The first is problem formulation, which includes identifying the classification objective, the input and output, and how the output will be utilized. The second step is data curation, which starts with the collection of textual data sources via web spidering (scraping or crawling) programs. At this step, it is important to obtain reliable data for training the supervised ML: such textual data should have reliable labeling that will guide the training of the ML algorithm. For instance, the labels could indicate whether the textual data carry negative or positive sentiment. The third step has two activities. The first activity is text preparation and wrangling, which includes cleansing the textual data of annotation imported from the source, such as punctuation, question marks, HTML tags, white spaces, numbers, and currency signs. Figure (13) shows what this cleansing looks like when applied to a section of financial statements published online.
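A minimal sketch of such text cleansing, using Python regular expressions on a hypothetical scraped sentence, might look as follows:

```python
import re

def clean_text(raw):
    # Strip HTML tags, currency signs, numbers, punctuation,
    # and redundant white space from scraped text
    text = re.sub(r"<[^>]+>", " ", raw)       # HTML tags
    text = re.sub(r"[$€£]", " ", text)        # currency signs
    text = re.sub(r"\d+", " ", text)          # numbers
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse white space

# Hypothetical fragment of a scraped financial statement page
sample = "<p>Net income rose to $1,200!</p>"
cleaned = clean_text(sample)
```

The order of the substitutions matters: tags are removed first so that their contents are not mistaken for punctuation to keep.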
The second activity in the third step is data preprocessing, which prepares the data to be ready in a structured format. This activity is simply normalizing the data and includes stemming, which obtains the base of the inflected forms of a word, such as “analyze” instead of “analyzed” or “analyzing”; lowercasing, such as “The” to “the”; removing high-frequency and low-frequency words; and removing stop words that are irrelevant to the stock sentiment, such as “is” and “the”. Figure (14) shows the Bag of Words (BOW). The BOW is created after normalization and is the representation used in text analysis.
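The normalization steps of lowercasing, stop-word removal, and BOW construction can be sketched as follows; the stop-word list and sample texts are hypothetical, and stemming is omitted for brevity (in practice a library stemmer would be applied as well):

```python
def normalize(tokens, stop_words=frozenset({"is", "the", "a", "and"})):
    # Lowercase every token and drop stop words that carry no sentiment
    return [t.lower() for t in tokens if t.lower() not in stop_words]

def bag_of_words(token_lists):
    # The BOW is the collection of distinct normalized tokens across all texts
    bow = set()
    for tokens in token_lists:
        bow.update(tokens)
    return sorted(bow)

# Hypothetical tokenized sentences
texts = [["The", "market", "is", "strong"], ["Strong", "earnings"]]
normalized = [normalize(t) for t in texts]
bow = bag_of_words(normalized)
```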
The final step in the preprocessing activity is the preparation of the document-term matrix (DTM). But first, tokenization of the text should be defined: a token is equivalent to a word in the BOW just described. The DTM is a matrix whose rows are texts, whose columns are the tokens, and whose cell values are the number of times a token was mentioned in a particular text. Figure (15) shows a sample DTM. The DTM is important as it represents the text in the tabular format usually used in structured data analysis.
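A DTM can be built from normalized token lists in a few lines of Python; the two toy documents below are hypothetical:

```python
def document_term_matrix(token_lists):
    # Rows are texts, columns are the BOW tokens (sorted for stable order),
    # and each cell counts how often a token occurs in that text
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    matrix = [[tokens.count(term) for term in vocabulary]
              for tokens in token_lists]
    return vocabulary, matrix

# Two hypothetical normalized texts
docs = [["stock", "up", "stock"], ["stock", "down"]]
vocab, dtm = document_term_matrix(docs)
```

Each row of the resulting matrix is the tabular representation of one text, ready for the same modeling steps used with structured data.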
The fourth step in the textual data analysis of the stock sentiment project is text exploration. This step includes the following activities. The first activity is text visualization to identify the most relevant and repetitive words in the text (see figure (16)). The second activity is calculating the Chi-square of word association in positive and negative sentences in the text or in different documents. The third activity is calculating term frequency (TF), which is the ratio of how many times a token has been used in all texts to the total number of tokens in the texts. The fourth activity is feature engineering (e.g., N-grams, which are multi-word associations, if they are meaningful to the analysis; for instance, the word “market” can be associated with “stock” to form “stock-market”, which is commonly used in financial analysis and has an association with stock sentiment projects). The fifth activity is feature selection, using techniques such as document frequency (DF) to decide which features to include and exclude (DF is a measure of how many documents a token appears in compared to the total number of documents). Finally, the resulting sentiment output of the above steps can either be used directly for a decision or be combined with structured variables.
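TF and DF as just defined translate directly into code; the token lists below are hypothetical:

```python
def term_frequency(token, token_lists):
    # TF: occurrences of the token across all texts
    # divided by the total number of tokens in the texts
    all_tokens = [t for tokens in token_lists for t in tokens]
    return all_tokens.count(token) / len(all_tokens)

def document_frequency(token, token_lists):
    # DF: share of documents in which the token appears at least once
    return sum(token in tokens for tokens in token_lists) / len(token_lists)

# Three hypothetical tokenized documents
docs = [["stock", "up"], ["stock", "down"], ["earnings"]]
tf = term_frequency("stock", docs)       # 2 of 5 tokens
df = document_frequency("stock", docs)   # appears in 2 of 3 documents
```

Tokens with very high DF (e.g., near 1.0) behave like stop words, while very rare tokens add noise; both are candidates for exclusion during feature selection.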
Textual data ML model performance should be evaluated for goodness of fit using several techniques (e.g., the confusion matrix linked to Type I and Type II error analysis, and the Root Mean Square Error, RMSE).
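As a sketch of these evaluation measures, the following computes the four confusion-matrix counts (with false positives and false negatives corresponding to Type I and Type II errors, respectively) and the RMSE, using hypothetical actual and predicted labels:

```python
import math

def confusion_counts(actual, predicted):
    # Counts for a binary classifier: true/false positives and negatives
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # Type I
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # Type II
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def rmse(actual, predicted):
    # Root Mean Square Error between actual and predicted values
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical labels: 1 = positive sentiment, 0 = negative sentiment
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
```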
As far as ML algorithm goodness of fit is concerned in this project, the concept of hyperparameters should be understood and contrasted with parameters. Parameters are simply what is discovered through modeling, for instance, the coefficients in a regression, while hyperparameters relate to the ML algorithm itself and not to the data, as parameters do. Examples of hyperparameters are the number of hidden layers in an NN and the depth of the tree in CART. Training the ML algorithm using different hyperparameters and then comparing the results is an important step in finding the best-performing model (this is called Grid Search). Hyperparameters work as control variables that help in fine-tuning the model (regularization) to a point of best fit (a good-fit model). Figure (17) shows that trade-off.
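Grid search itself is straightforward to sketch: train the model once per hyperparameter combination and keep the combination with the best validation score. The training and scoring functions below are toy stand-ins with purely illustrative numbers, not a real ML algorithm:

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    # Try every hyperparameter combination; keep the best validation score
    best_score, best_params = float("-inf"), None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = train_fn(**params)
        score = score_fn(model)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy stand-ins: the "model" is just its hyperparameters, and the
# hypothetical validation score peaks at depth=3, learning_rate=0.1
train = lambda depth, learning_rate: (depth, learning_rate)
score = lambda m: -abs(m[0] - 3) - abs(m[1] - 0.1)
grid = {"depth": [1, 2, 3], "learning_rate": [0.1, 0.5]}
best_params, best = grid_search(train, score, grid)
```

In practice the score would come from evaluating the trained model on a held-out validation set rather than from a closed-form function.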
Finally, figure (18) shows an example of the output of the prediction of sentiment for each sentence in the text, each associated with a p-value. This is because logistic regression ML is used in this example, and the logistic regression algorithm found to predict with the highest accuracy was the one using a p-value threshold of 0.60. In this context, the p-value is the predicted probability that a sentence in the test data contains positive sentiment. Therefore, if the p-value calculated at the sentence level is above 0.60, that sentence probably carries positive sentiment.
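Applying such a threshold to the predicted probabilities is a simple decision rule; the per-sentence probabilities below are hypothetical, not taken from the figure:

```python
def classify_sentences(probabilities, threshold=0.60):
    # Label a sentence positive when its predicted probability of
    # positive sentiment exceeds the threshold chosen during tuning
    return ["positive" if p > threshold else "negative"
            for p in probabilities]

# Hypothetical per-sentence probabilities from a fitted logistic regression
probs = [0.82, 0.44, 0.61, 0.59]
labels = classify_sentences(probs)
```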

Summary
This section discussed the Big Data definition and identified its 4Vs, which stand for volume, velocity, variety, and value. Further, Big Data are either structured, in the regular tabular format, or unstructured, such as images, texts, videos, and audio. The section identified certain sources of Big Data including, among others, social media, mobile telecommunication, and the IoT. The section then discussed the management challenges of Big Data, more specifically the acquisition, storage, analysis, and security of Big Data. Those are important concerns for an organization to consider in its IT and risk management strategies. Finally, the section illustrated the steps undertaken in executing a Big Data project: the data collection, the curation and preprocessing, the data exploration, feature selection and engineering, and finally the ML selection, testing, and implementation. In this type of project, one needs to know the objective of the project and conceptualize the required inputs and outputs, in addition to being equipped with a proper understanding of model biases, including over- and underfitting.
Summary and recommendation of the article
In this article, Fintech as a buzzword of the 21st century was discussed and analyzed under the digital finance cube framework. The article was motivated by the vast expansion in technological developments on different fronts in financial services. According to the discussion in this article, Fintech can be considered an umbrella for discussing the use cases of technology as applied in the various finance functions. The technologies discussed under the Fintech paradigm are Artificial Intelligence, Machine Learning, Big Data, and Blockchain. Those technologies are the result of a meeting between finance and IT. They brought humanity many benefits and opportunities as well as risks. Among the benefits, to mention but a few, are Robo-advisors (wealth management planning robots), algorithms that can solve complex statistical problems (Machine Learning algorithms), value from data (Big Data analytics), and new ways of raising finance (ICOs, STOs). On the other hand, among the risks are data security, the privacy of individuals’ information, existential risk for the human race, and loss of jobs. The challenge for governments and organizations is to regulate the use of these technologies in a way that is ethical and secure, and that helps to make the best use of them.
Fintech research is an open and wide area with many green fields that have not been explored yet. There are many directions for future research, including, among others, the comparison of the statistical accuracy of ML algorithms against traditional statistical methods from theoretical and empirical points of view, the future role of Big Tech firms (technology firms such as Amazon and Alibaba) in the financial services industry, and finally the application of technology by Fintech and technology firms to each finance function as discussed in the digital finance cube framework.