
Introduction

The past two decades witnessed major developments in Artificial Intelligence (AI), Machine Learning (ML), and Blockchain technologies. During the same period, the production and sharing of data on a worldwide scale also increased significantly, thanks to social media, mobile telecommunication, and the Internet of Things (IoT). Data is expected to grow from 41 to 175 zettabytes (1 zettabyte equals 10^21 bytes, or a billion terabytes) between 2019 and 2025 (Holst, 2020). This huge increase in data volume is associated with high velocity of data sharing through the internet and cloud computing, and with a wide variety of data forms, including images, texts, videos, and audio. These technology and data developments were accompanied by growth in the revenue generated by the businesses operating in this ecosystem. For instance, revenue from Big Data and business analytics was projected at U.S. dollars 189.1 billion for 2019 (a 12% increase compared to 2018) and is expected to grow to U.S. dollars 274.3 billion in 2022 (Farmingham, 2019). Comparatively, annual global Artificial Intelligence (AI) software revenue is forecast to grow from $10.1 billion in 2018 to $126.0 billion by 2025 (Tractica, 2019). Machine Learning apps and platforms, in turn, generated revenue of more than U.S. dollars 42 billion in 2019, versus 38 billion of revenue that went towards AI, including advancing smart robots, virtual assistants, and natural language processing (Felman, 2019). Investments in STOs, ICOs, and crypto-currencies are also increasing, drawing the attention of regulators and of small and medium enterprises to these important financing tools. Global Blockchain technology revenue is expected to grow from U.S. dollars 3 billion in 2020 to 39 billion in 2025 (Liu, 2020). The significant attention to these technologies is driven by the attractiveness of their financial applications. More specifically, the financial services industry, as well as commercial organizations and governments, are focusing on employing those technologies in their strategic plans to automate processes and to make use of the huge amounts of data available from a variety of sources in order to make smart decisions and achieve efficient operations. The convergence of Artificial Intelligence, Machine Learning, Blockchain, and Big Data technologies with finance is collectively discussed in the context of what is commonly known as Fintech (Kissell and Mack, 2020).

The article is organized as follows:

  1. The first section lists the objectives of this article.
  2. The second section discusses the definition of Fintech and identifies an analysis framework for this buzzword.
  3. The third section discusses Artificial Intelligence.
  4. The fourth and fifth sections discuss Machine Learning and Big Data.
  5. The article then concludes with a summary and recommendation section.

Although blockchain is an important technology under the Fintech paradigm, it is not discussed in this article. Readers interested in Blockchain technology and its financial applications are referred to my article on crypto-assets (Sarhan, 2020).

Keywords: Fintech, Artificial Intelligence (AI), Big Data, Machine Learning (ML), Blockchain, Algorithm, Statistics.

Fintech

Fintech is a broad term that means different things to different people (Allen, Gu, and Jagtiani, 2020). However, it can be thought of as the intersection between technology and finance (Gomber, Koch, and Siering, 2017). For some people, it relates to three technology deployments: the application of Distributed Ledger Technology (DLT) in finance, the provision of financial advice through robots (Robo Advisors), and peer-to-peer financing (crowdfunding) (Preece, 2016). To others, it means either the technology-driven innovation occurring in the financial services industry, or it refers to companies (new firms or start-ups) that are involved in developing new technologies and their applications, including the business sector that comprises such companies (Kissell and Mack, 2020). Gomber et al. (2017) argue that one needs to think of Fintech under the framework of the digital finance cube. The authors consider the finance functions, the technologies applied to those functions, and the institutions involved in the digitization of the finance functions to be the three dimensions of the digital finance cube (see figure (1)).

Therefore, one can define Fintech under this framework as follows: the activities that are undertaken around the finance functions (financing, investments, money, payments, insurance, and financial advice) in order to digitalize them, using finance technologies that are adopted or developed by organizations in both the finance and IT industries. For instance, the digital investment function has been developed to be provided through robots (Robo Advisors); this robotic financial advice service is the result of cooperation between investment management firms and IT service companies specialized in Artificial Intelligence (AI) and Machine Learning (ML). This article focuses on Fintech from the perspective of the four technologies that are considered the most important financial technologies of the 21st century. More specifically, the article will discuss Big Data, Machine Learning, and Artificial Intelligence; Blockchain is not discussed here as it is covered in another article by the author. The next section discusses Artificial Intelligence.

Artificial Intelligence (AI)

AI is one of the four pillars of Fintech discussed in the previous section. Since the 1950s, scientists have tried to mimic human cognitive capabilities, and these efforts led to today's revolutionary digitized world, with the help of robots and machines. This section discusses AI and is organized as follows: First, it discusses the definition of AI. Second, it relates AI to Machine Learning and Deep Learning. Third, it briefly outlines the history of AI. The section then reviews the status of AI research. The applications of AI and the issues facing AI endeavors are discussed thereafter. The section concludes with a summary.

Definition

It is difficult to form a unified definition of AI (Gao, Jia, Zhao, Chen, Xu, Geng and Song, 2019). There is not even a widely agreed-upon scientific definition of intelligence. Rather, there is a list of characteristics that AI researchers are trying to replicate, including perception, action, reasoning, adaptation and learning, communication, planning, autonomy, creativity, reflection and awareness, aesthetics, and organization (Hanoavar, 2016). Nevertheless, AI can be defined as "The enterprise of understanding and building intelligent systems" (Hanoavar, 2016, p. 2) or the "science of making machines do things that would require intelligence if done by men" (Marvin Minsky cited in Crevier, 1993, p. 9). AI, in fact, is the human quest to build a machine that replicates our cognition and adaptation abilities and that can solve complex problems based on an understanding of the surrounding environment and an internal logic that directs its actions in a way that optimizes the output.

Link to Machine Learning and Deep Learning

The terms AI, Machine Learning, and Deep Learning are often used interchangeably (Marr, 2016 and Nocholson, 2019). However, ML is a subset of AI: AI is the broader concept of machines being smart enough to perform tasks, while ML is an application of AI based on the idea that we should give machines access to data and let them learn for themselves (Marr, 2016). In turn, as will be explained later in the Machine Learning section, Deep Learning is a subset of Machine Learning.

History

The term AI was first coined by John McCarthy in 1955, and the first AI program was presented during the Dartmouth conference in 1956 (Anyoha, 2017). The technology thereafter went through many ups and downs until the revolution in AI was revealed when "Deep Blue", IBM's chess-playing system, beat the world chess champion, Garry Kasparov, in 1997. More recently, in 2017, a Deep Learning program called "AlphaZero" used pattern recognition techniques to become the strongest chess player in the world. In contrast to Deep Blue, this program did not use a domain knowledge base and was able to teach itself by training for four hours playing against itself (Rasekhschaffe and Jones, 2019).

AI in research

Research in the AI field has gained momentum in recent years, which indicates that the academic community's interest is focused on this hot topic. Based on a bibliometric study of research articles sourced from the Web of Science, Lei and Liu (2018) find that 1,188 articles with the term "Artificial Intelligence" in the title were published between 2007 and 2016, with 188 articles published in 2016 versus 72 in 2007. The top 10 countries involved in this research include the USA at the top, followed in descending order by the UK, Iran, Spain, China, Italy, India, Turkey, Canada, and France. The topics researched include neural networks, genetic algorithms, fuzzy logic, optimization, support vector machines, Machine Learning, modeling, and prediction. The development in the number of articles is shown in figure (3).

The history of AI can be represented in a timeline as presented in figure (2) below:

Applications

AI is a multidisciplinary science built on contributions from various fields, including computer science, mathematics, psychology, biology, engineering, and philosophy. The applications of AI are many, including, among others, natural language processing (NLP), whereby the machine can talk to a human and understand natural language (e.g. Siri and Alexa), image recognition (e.g. self-driving cars), recommendation engines (e.g. Amazon) (Rasekhschaffe and Jones, 2019), and intelligent robots that learn from mistakes and execute human orders. These robots have sensors that enable them to obtain information about the surrounding environment, as well as efficient processors and large memory, which allow them to exhibit intelligence (Tutorials Point, 2020a).

Issues

AI issues are discussed from various perspectives, including, among others, the following: First, AI threatens privacy, as it can recognize speech and could theoretically understand every e-mail and telephone conversation (Tutorials Point, 2020b). Second, machines under AI are developed to mimic human intelligence and learn from mistakes, similar to the AlphaZero algorithm, and might become smarter than humans to the point that humans cannot control them, thereby exposing us to existential risk; Bill Gates (Microsoft), Elon Musk (SpaceX), and Stephen Hawking (physicist) have voiced similar concerns (Wikipedia, 2020a). Third, AI carries other risks such as a decrease in demand for human labor and autonomous weapons (artificial soldiers). Fourth, some Machine Learning algorithms are complex and produce outputs that are considered a "black box". This creates challenges for the users of these algorithms, as they cannot interpret how the algorithms reached their conclusions. XAI (Explainable Artificial Intelligence) is an approach to better understand how the machine reached the conclusion in its output. A 2020 ACCA report clarifies that explainability "means to be able to shine a light on its inner workings and/or to reveal some insight on what factors influenced its output, and to what extent, and for this information to be human-readable, i.e. not hidden within impenetrable lines of code" (ACCA, 2020, p. 8).

Summary

This section discussed the definition of AI, briefly listed the historical milestones of AI as a science, and reviewed the applications of AI and the issues that accompany it. The analysis provided a brief understanding of AI to help the reader recognize its nature and differentiate it from Machine Learning and Deep Learning. The following section discusses the Machine Learning (ML) concept, the types of ML algorithms, and the applications of ML in finance.

Machine Learning (ML)

As discussed in the previous section, AI as a science works toward creating machines that execute tasks that would otherwise require human intelligence. Machine Learning is one of the tools used in AI to help humans perform tasks that are difficult to perform without the help of machines due to the large number of variables involved. This section discusses Machine Learning (ML) and is organized as follows: First, the definition of ML is discussed. Second, ML is compared and contrasted with statistical tools. Subsequently, the section discusses the mechanism of how ML algorithms work and how to choose among the available ML algorithms. The benefit of aggregating ML algorithms and the applications of ML in finance are then discussed, and the section concludes with a summary.

Definition

Machine Learning is a sub-field of Artificial Intelligence that gives computers the ability to learn without being explicitly programmed to solve the problem (El-Afly and Mohammed, 2020; Parveena and Jiaganesh, 2017). This definition simply means that ML algorithms are programmed to learn. As will be discussed later in this section, ML algorithms learn from large amounts of data and apply that learning to predict (generalize). This process can be thought of simply as "learn the pattern, apply the pattern" (DeRose and Le Lannou, 2020). In effect, the algorithm determines the structure in the data in order to solve a classification or regression problem.

What is the difference from statistical prediction tools?

First, statistical tools rely on theoretical assumptions and data structure. For instance, a paired-sample t-test assumes that data are normally distributed, while linear regression assumes that there is a linear relationship between the dependent and independent variables and that the error terms are stationary. In contrast, ML algorithms do not have such restrictions and can handle problems with many variables (high dimensionality) or with a high degree of non-linearity (DeRose and Le Lannou, 2020). Second, statistical tools and ML share the same objective of minimizing forecast error, usually the Root Mean Square Error (RMSE), but they differ in how they reach this goal: statistical tools use linear processes while ML uses non-linear ones. In addition, ML algorithms are more data demanding and rely heavily on computer science (Makridakis, Spiliotis and Assimakopoulos, 2018). Despite the practicality of ML algorithms in solving complex problems compared to statistical models, there is an argument that statistical tools outperform their ML counterparts when tested out-of-sample and over different time horizons (Makridakis et al., 2018). Moreover, there is evidence from the M4 competition (a competition between 61 forecasting methods applied to 100,000 time series) that combining statistical forecasters provides better results than pure ML or pure statistical methods (Makridakis, Spiliotis and Assimakopoulos, 2020).

How do Machine Learning algorithms work?

The ML mechanism depends on the type of algorithm. In the literature, there are two ways to group the algorithms: first, by the way the algorithm learns from data, and second, by similarity in style or function (Brownlee, 2019). This article employs the first method, i.e. grouping by the way the algorithm learns from data.

The various ML algorithms grouped by style or function are depicted in figure (4); the interested reader is referred to the Wikipedia article on Machine Learning for further types and algorithms. Grouped by the way they learn, the ML types are supervised learning, unsupervised learning, Neural Networks (Deep Learning), and reinforcement learning (DeRose and Le Lannou, 2020).

Supervised Machine Learning

Under supervised ML, the algorithms are trained on specific data to learn the patterns in it. Before starting the training, the variables should be defined as the target and the features, which are equivalent to the dependent and independent variables respectively. Data are obtained and divided into three parts: the training set, the cross-validation set, and the testing set. The training set (in-sample) is used to train the algorithm, the cross-validation set is used to test and fine-tune the algorithm, and finally the trained and tuned algorithm is tested using the test data set (out-of-sample).
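
To make the three-way split concrete, here is a minimal sketch in Python using scikit-learn. The synthetic dataset and the 60/20/20 proportions are illustrative assumptions, not prescriptions from the source.

```python
# Minimal sketch of the train / cross-validation / test split described above.
# Assumes scikit-learn is installed; proportions are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1000, 5))            # features (independent variables)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # target (dependent variable)

# First split off the test set (out-of-sample), then carve a cross-validation
# set out of the remaining data for tuning the algorithm.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```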

Supervised ML algorithms perform two types of tasks (Mallikarjun and Abbasi, 2020). The first is the classification task, which classifies variables into distinct categories; this can be done for categorical or ordinal target variables. An example of a categorical target is credit card fraud detection: the target is categorical (in this case binary, 1 for fraudulent and 0 for legitimate), while the features are the transaction characteristics. An example of an ordinal target is classifying firms into credit categories. Examples of algorithms used for classification tasks are logit, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and the Classification and Regression Tree (CART). The second task of supervised ML algorithms is regression, which makes predictions of a continuous target variable. Examples of this type of algorithm are linear regression, penalized regression (LASSO), logistic regression, the Classification and Regression Tree (CART), and random forests.
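
The two task types can be illustrated with a short, hedged sketch: a support vector classifier for a binary (fraud-style) target and a LASSO regressor for a continuous target. The data, labels, and hyperparameter values are synthetic assumptions made purely for illustration.

```python
# Hypothetical illustration of the two supervised task types named above:
# a classifier for a binary target and a regressor for a continuous target.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Classification task: binary target (1 = "fraudulent", 0 = "legitimate")
y_class = (X[:, 0] * 2 + rng.normal(size=200) > 0).astype(int)
clf = SVC(kernel="rbf").fit(X, y_class)
print("predicted class:", clf.predict(X[:1]))

# Regression task: continuous target (e.g. a price or a return)
y_reg = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
reg = Lasso(alpha=0.1).fit(X, y_reg)
print("predicted value:", reg.predict(X[:1]))
```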

The accuracy of prediction is measured using different measures, including the RMSE (Root Mean Square Error), which is the square root of the average of the squared differences between the actual data in the out-of-sample set and the forecasted data. In theory, the lower the RMSE the better; however, one should be cognizant of two issues: overfitting and underfitting. Basically, if the algorithm is too complex or the problem has high dimensionality, the model might end up memorizing the pattern and representing it too perfectly. In the out-of-sample data, such an algorithm would then show poor performance as measured by the RMSE (overfitting); the associated error is called variance error. On the other hand, if the algorithm is not able to capture the pattern in the training data, it will report poor performance in-sample, i.e. if the algorithm is executed on the in-sample data, it will report a high RMSE between the predicted results and the actual in-sample data (underfitting); this error is called bias error. These errors are usually related to the complexity of the algorithm. The optimal scenario is to find a trade-off between the variance and bias errors by finding the right algorithm that fits the data; such an algorithm should possess a complexity level that minimizes both errors. See figure (5), which shows the fitting curve and the model fitting scenarios: the fitting curve is the trade-off between bias error, variance error, and model complexity, while the model fitting scenarios show that what is needed is a good-fit model that neither overfits nor underfits. In addition, figure (6) shows the learning curve, which relates how accurately the algorithm learns to predict to the sample size; the assumption is that the larger the sample size, the better the accuracy.
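
The RMSE and the bias-variance trade-off can be made concrete with a short sketch: fitting polynomial models of increasing complexity to the same synthetic data and comparing in-sample and out-of-sample RMSE. The data-generating process and the chosen degrees are illustrative assumptions.

```python
# Sketch of the bias-variance trade-off: as model complexity (polynomial
# degree) grows, in-sample RMSE keeps falling while out-of-sample RMSE
# eventually rises again (overfitting). Synthetic data, illustrative only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # non-linear signal plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in (1, 4, 15):   # underfit, reasonable fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    rmse_in = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
    rmse_out = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"degree={degree:2d}  in-sample RMSE={rmse_in:.3f}  out-of-sample RMSE={rmse_out:.3f}")
```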

Unsupervised Machine Learning

Under this type of ML algorithm, there is only input data and no output variable, i.e. "unlike supervised learning above there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data" (Brownlee, 2016b). Data are supplied to the algorithm with the objective of uncovering the implicit structure and then applying that structure for prediction to any new data point out-of-sample. This type of ML algorithm is concerned with performing three tasks. The first task is dimensionality reduction, which "focuses on reducing the number of features while retaining variation across observations to preserve the information contained in that variation" (Mallikarjun and Abbasi, 2020, p. 5); an example of this type of ML is PCA (Principal Component Analysis), used for instance in exploratory data analysis to discover the main features before training another algorithm. The second task is clustering, which sorts observations into homogeneous groups; examples include K-Means and hierarchical clustering algorithms. An example of clustering is grouping companies according to similarities that are not captured by human judgment or by industry or sector classification, which is an important input for portfolio diversification and credit risk profiling. The last task is association rule learning, which identifies the rule that best explains the relationship between the variables; an example is the Apriori algorithm. A classic example, found in Wikipedia, is that by analyzing supermarket sales one might find that customers who buy onions and potatoes are likely to buy hamburger meat, a piece of information that can be used in marketing.
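
A minimal sketch of the first two unsupervised tasks follows: dimensionality reduction with PCA and clustering with K-Means. The data and the number of components/clusters are illustrative assumptions, not values taken from the source.

```python
# Sketch of two unsupervised tasks described above: dimensionality reduction
# with PCA and clustering with K-Means, on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))            # e.g. 10 fundamental ratios for 500 firms

# Dimensionality reduction: keep the components explaining most of the variance
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Clustering: sort observations (firms) into homogeneous groups
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reduced)
print("cluster labels of first five firms:", kmeans.labels_[:5])
```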

Neural Networks (Deep Learning Networks)

The name of this type of algorithm hints at its nature: a Neural Network (NN) is built to work in a fashion similar to how our brains work. DeRose and Le Lannou (2020) discuss the tasks of NN ML algorithms and indicate that NNs perform the classification and regression tasks of supervised ML, but can also be applied beyond supervised settings, as in reinforcement learning.

The network is divided into three layers: the input, the hidden, and the output layers. In the case of a Deep Learning network, the number of hidden layers is at least 3, and often more than 20, and this is considered the backbone of the AI revolution (DeRose and Le Lannou, 2020).

DeRose and Le Lannou (2020) describe the NN as follows: The input layer comprises nodes that capture the data input (features). The hidden layer, where the learning occurs, also comprises nodes (neurons) that assign arbitrary weights to the transmissions from the input layer. The feature inputs are scaled (normalized using their maximum) so that their values vary between 0 and 1. Each hidden-layer node performs two activities: the summation operator and the activation function. First, the summation operator is simply the weighted sum of the inputs from the input layer. Second, the activation function transforms the weighted input from the summation operator and is usually non-linear, such as the sigmoid, an S-shaped function whose outputs lie between 0 and 1. The output of the hidden layer to the output layer is based on the activation function. Learning occurs when the error is corrected: the error results from comparing the actual output with the output predicted by the NN, and it is reduced by readjusting the weights in the hidden layer (backward propagation); the process is repeated until the predicted results reach an acceptable level of accuracy. See figure (7), which explains the difference between the regression model and the NN. Under regression analysis, the relation between the Y variable and the Xs is linear, while under the NN the Zs are the nodes of the hidden layer, which transmit non-linearly transformed values (the activation function in that example is the rectified linear unit), making the NN a better fit for non-linear relations (i.e. the rate of change in the output differs at different levels of input).
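
The description above can be illustrated with a toy forward pass through a single hidden layer. The weights below are arbitrary starting values chosen for illustration; in training they would be adjusted by backward propagation.

```python
# Toy forward pass mirroring the description above: scaled inputs, a weighted
# summation operator at each hidden node, and a sigmoid activation function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # S-shaped, outputs between 0 and 1

features = np.array([3.0, 7.0, 1.0])
x = features / features.max()              # scale inputs to the 0-1 range

W_hidden = np.array([[0.2, -0.5, 0.1],     # arbitrary weights: 4 hidden nodes x 3 inputs
                     [0.7,  0.3, -0.2],
                     [-0.4, 0.6,  0.5],
                     [0.1,  0.1,  0.9]])
w_output = np.array([0.5, -0.3, 0.8, 0.2]) # weights from hidden layer to the output node

hidden_in = W_hidden @ x                   # summation operator at each hidden node
hidden_out = sigmoid(hidden_in)            # activation function
prediction = sigmoid(w_output @ hidden_out)
print("predicted output:", prediction)
```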

Researchers are exploring the use of NN ML to model asset prices. Those prices are noisy, stochastic processes with unstable relationships, and NNs are thought to be able to help in understanding how the market works (DeRose and Le Lannou, 2020). Nevertheless, the comparison between the performance of ML algorithms and statistical models in such exercises should be kept in mind (see the section "What is the difference from statistical prediction tools?" above).

Reinforcement learning

Reinforcement learning is based on having an agent (the algorithm) with no prior labeled data, in contrast to supervised learning algorithms. The agent takes actions that maximize its rewards over time through trial and error, subject to the constraints imposed by the surrounding environment (DeRose and Le Lannou, 2020). For instance, moving a pawn forward on the chessboard can either be a correct move that returns a reward or an incorrect move that does not. If the move results in a reward, the machine saves this knowledge for future actions, and vice versa. This type of algorithm is used, for example, in self-driving cars and robots, and it was used by AlphaGo, which beat the world champion at the ancient game of Go in 2017 (Wikipedia, 2020b).
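
The trial-and-error reward idea can be sketched with a deliberately simple epsilon-greedy agent that learns, purely from observed rewards, which of several actions pays off best. This toy is an assumption-laden illustration of the principle, not the algorithm behind AlphaGo or self-driving cars.

```python
# Toy illustration of learning from rewards by trial and error:
# an epsilon-greedy agent estimates the value of three actions.
import random

true_reward_prob = [0.2, 0.5, 0.8]       # unknown to the agent
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                            # exploration rate

random.seed(0)
for step in range(1000):
    if random.random() < epsilon:                          # explore
        action = random.randrange(3)
    else:                                                   # exploit best estimate
        action = max(range(3), key=lambda a: estimates[a])
    reward = 1 if random.random() < true_reward_prob[action] else 0
    counts[action] += 1
    # incremental update of the estimated value of the chosen action
    estimates[action] += (reward - estimates[action]) / counts[action]

print("learned reward estimates:", [round(e, 2) for e in estimates])
```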

How to choose a suitable Machine Learning algorithm?

To answer this question, the thinking process has three steps. The first step is determining the objective of the ML task: specifically, whether the task is classification, regression, dimensionality reduction, clustering, or a complex task such as face recognition, image classification, speech recognition, or natural language processing. The second step is to know the capabilities of the available ML algorithms, and the third step is to know what types of variables each ML algorithm can deal with, matching these with the objective determined in the first step. More specifically, supervised ML can perform either classification or regression tasks; the suitable variables are continuous for regression tasks and categorical or ordinal for classification tasks. For instance, the SVM (Support Vector Machine) algorithm is an example of a supervised ML algorithm that can perform classification tasks for any new out-of-sample observation after being trained on data with categorical labels (e.g. yes/no, performing/non-performing, etc.). Unsupervised ML is capable of handling dimensionality reduction and clustering tasks on continuous as well as categorical or ordinal variables. The PCA (Principal Component Analysis) algorithm is an example of an unsupervised ML algorithm that can reduce the number of features in the data, for instance so that they fit on a computer screen. It does that by grouping homogeneous features together while keeping heterogeneous features in other groups; doing so reduces the number of features and provides insight into the data. The PCA groups of features are called eigenvectors, and each eigenvector is assigned an eigenvalue that represents how much of the data variation is captured by that particular eigenvector. Finally, Neural Networks, Deep Learning, and reinforcement learning are sophisticated algorithms that deal with continuous and categorical variables as well as with complex tasks, or that learn by themselves. Figure (8) presents an example of how ML algorithms can be selected based on the three factors discussed in this section: the type of ML algorithm, the nature of the variables, and the objective of the task.

Can one aggregate ML algorithms to get better results?

To reduce the variance error and achieve a balance between the variance and bias errors, one can use the "ensemble method". This method is based on the idea of aggregating the results of multiple algorithms (weak learners) to obtain better results. The aggregation can be done in two ways: either the same algorithm is applied to different versions of the training data, generated using bootstrapping (known as bagging), or different algorithms are applied to the same training data, wherein each newly added algorithm corrects the errors of the previous ones until the training set is predicted perfectly or the optimal number of models has been added (known as boosting) (DeRose and Le Lannou, 2020; Rasekhschaffe and Jones, 2019; Brownlee, 2016a).
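
A short sketch of the two aggregation approaches follows: a bagging ensemble (the same base learner on bootstrapped samples of the training data, scikit-learn's default base learner being a decision tree) and a boosting ensemble (learners added sequentially, each correcting the previous ones). The data and hyperparameters are illustrative assumptions.

```python
# Sketch of bagging vs. boosting on synthetic data, illustrative only.
import numpy as np
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)        # bootstrapped samples
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)  # sequential correction

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_tr, y_tr)
    print(name, "out-of-sample accuracy:", round(model.score(X_te, y_te), 3))
```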

What are the applications of ML in finance?

ML algorithms are being applied in finance in various areas. To name but a few, ML algorithms have been used for fraud detection (Adepoju, Wosowei, Lawte and Jaiman, 2019), Anti-Money Laundering (AML) investigations, portfolio optimization, stock selection (Rasekhschaffe and Jones, 2019), and bank risk management (Leo, Sharma and Maddulety, 2019).

Summary

ML algorithms have paved the way for a new era of analysis and prediction wherein the machine can learn without being programmed for a specific task. They provide humans with support to comprehend and analyze huge data sets that exhibit high-dimensional (multivariate) and non-linear relationships. Some complex algorithms can even learn by themselves and beat the world champions at chess and Go. The applications of ML in finance are numerous and are still being explored in practice and in academia. However, one should be cognizant of the empirical evidence noted in the M4 competition, where combined statistical models were able to outperform ML algorithms out-of-sample. At the same time, with the huge increase in the size and types of data (Big Data) discussed in the introduction of this article, the use of ML is becoming more important, if not a must. The following section discusses Big Data and illustrates the steps taken in carrying out a Big Data project.

Big Data 

The previous section discussed ML and its applications. The large amounts of data available to organizations are considered assets, and organizations are currently monetizing those assets with the help of ML. Big Data and related technologies are therefore an important and much-discussed topic. This section discusses Big Data and is organized in four areas: First, it discusses the definition of Big Data. Second, it discusses how Big Data are managed and what the Big Data management challenges are. Third, it discusses how to implement a Big Data project, and finally, it concludes with a summary.

Definition

The data we are used to comprise figures and texts arranged in tables. Big Data, however, "unlike traditional data, ... refers to large growing data sets that include heterogeneous formats: structured, unstructured and semi-structured data" (Oussous, Benjelloun, Lahcen and Belfkih, 2018, p. 433). It needs more real-time analysis and brings opportunities by helping us understand hidden values, while by the same token carrying challenges of organization and management (Chen, Mao and Liu, 2014). Nowadays, data has grown to be more than tables that can be analyzed using traditional statistical models; it includes, among others, images, videos, audio, social media chat, mobile telecommunication messages and calls, and web searches, all of which are generated on a per-second basis. The mass development in social media, mobile telecommunication, analytics, cloud computing, and the Internet of Things (IoT) creates challenges for data management and analysis. Big Data is characterized by 4Vs, namely Volume, Velocity, Variety, and Value, plus Veracity as an additional V (Mallikarjun and Abbasi, 2020; Oussous et al., 2018; Chen et al., 2014). Volume refers to the quantity of data, velocity to the speed at which data are generated, variety to the diversity of data types, veracity to the credibility of the data, and value to the hidden characteristic that is discovered by analysis; this is where the importance of Big Data stems from. The high volume, combined with the timely acquisition of data and the diversity of its nature, makes real-time data analytics an activity that brings economic advantages to users and organizations. In terms of revenue from data analytics, for example, Business and Data Analytics (BDA) related software revenue was expected to reach U.S. dollars 67.2 billion in 2019, of which 44% are software services provided via public cloud, growing at a compound annual growth rate (CAGR) of 32.3% (IDC, 2019).

How is Big Data managed and what are the Big Data management challenges?

Big Data management aims to "ensure reliable data that is easily accessible, manageable, properly stored and secured" (Oussous et al., 2018). Big Data management can therefore be discussed from five perspectives: production, acquisition, storage, analysis, and security. First, Big Data are "generated by financial markets (e.g., stock and bond prices), businesses (e.g., company financials, production volumes), governments (e.g., economic and trade data), individuals (e.g., credit card purchases, social media posts), sensors (e.g., satellite imagery, traffic patterns), and the Internet of Things, or IoT, (i.e., the network of interrelated digital devices that can transfer data among themselves without human interaction)" (Mallikarjun and Abbasi, 2020, p. 1). Second, Big Data are either acquired by organizations themselves through their own systems, as with Amazon, Google, Facebook, and Yahoo, or acquired from specialized third-party vendors. Third, traditional data management and analysis systems are based on relational database management systems (RDBMS), which are capable of handling structured data but not semi-structured and unstructured data. Chen et al. (2014) argue that RDBMS systems use ever more hardware and are becoming expensive to manage. Solutions proposed to deal with this problem include, for instance, cloud computing and distributed systems such as NoSQL databases. The former enables businesses to handle Big Data efficiently with elastic infrastructure (e.g. IaaS: Infrastructure as a Service, SaaS: Software as a Service, PaaS: Platform as a Service), while the latter has achieved great success in processing clustered tasks. Apache Hadoop is a well-known Big Data technology with a supporting community. Hadoop is open-source and runs tasks in parallel across clusters through MapReduce, a programming model used for Big Data analysis. Hadoop does not copy whole distant data files into memory; rather, it runs the tasks where the data are stored, thereby lowering the server and network communication load (the reader is referred to Oussous et al. (2018) for further reading on the Hadoop infrastructure and layers). The fourth management activity is the storage of Big Data. Storage is one of the challenges that raise questions about data redundancy and the data life cycle: redundant data are data points, generated for instance in the IoT, that add no value, while the Big Data life cycle covers all decisions about how long data should be kept in storage. The fifth management activity is data analysis. The tools used for analyzing Big Data include open-source programming languages such as Python and R, as well as Apache Mahout; the latter is capable of running ML algorithms on large-scale data, which is difficult to do, for instance, using R.
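
To give a feel for the MapReduce programming model mentioned above, here is a conceptual sketch written in plain Python. It is not Hadoop's actual API; it simply illustrates, under simplified assumptions, how a map step emits key-value pairs and a reduce step aggregates them, using the classic word-count example.

```python
# Conceptual illustration of the MapReduce programming model (not Hadoop code).
from collections import defaultdict

documents = [
    "big data needs big infrastructure",
    "hadoop runs tasks where the data is stored",
]

# Map step: each document independently emits (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group the emitted pairs by key (word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```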

Big Data is an asset, and organizations are trying to monetize that asset by taking timely, sometimes instant, decisions based on insights from the analysis undertaken on this data. The Big Data management challenges therefore stem from the 4Vs. First, the volume and velocity of data are increasing. This requires the capability to capture these data through a proper infrastructure that can generate, acquire, store, and analyze the data on a timely basis; current systems and hardware might not be capable of capturing the data in a timely manner, nor of storing and analyzing it. The same applies to human capital: the question is whether organizations have the professional skills to deal with Big Data throughout its management cycle. Second, the security and privacy of data are a great concern. Governments are paying more attention to personal data security, for example through the GDPR in Europe and bank secrecy laws worldwide, which protect the privacy of personal data; organizations have to take measures to avoid breaching such laws, otherwise the consequences are catastrophic, including penalties and reputational damage. A further challenge is the variety of the data. One needs proper analysis tools, as discussed earlier, to uncover the hidden patterns (value) in the huge volume of unstructured and structured data. This requires ensuring that the data captured are reliable (veracity), that the analysis tools can handle huge volumes (e.g. Apache Mahout, R, and Python), and that the algorithms employed, such as ML algorithms, can handle non-linear and multivariate data. Organizations are therefore at the forefront of reaping the benefits of Big Data while at the same time facing many challenges that require proper measures to avoid the inherent risks.

How to implement a Big Data project?

Unless otherwise stated, the discussion in this section is based on the work of Mallikarjun and Abbasi (2020).

Working with data provides insights and supports business management and investment decisions. For instance, structured data are used to predict stock performance or to assess the creditworthiness of a customer. However, with the addition of unstructured data such as texts, social media, images, videos, and audio, the scope and benefits, as well as the challenges, of Big Data became wider. An example of the challenges of Big Data analysis is data privacy and the ethical use of data. An example of its benefits is a study in the United States which revealed that positive public sentiment on Twitter can predict the Dow Jones Industrial Average up to three days later with nearly 87% accuracy.

This section discusses Big Data projects dealing with structured data and with textual unstructured data. The steps in implementing a Big Data project are similar for both types, except that under the unstructured textual data project, the objective is ultimately to obtain data similar to structured data that can be used for modeling purposes. Figure (9) depicts the differences in the project steps between structured and unstructured textual data.

If the project is a financial modeling project, both structured and unstructured data analysis can be executed. For instance, predicting stock performance can be modeled using structured data, including stock fundamentals, and unstructured data, including what is written in social media and newspapers. The output of the unstructured data project can be used on its own or as an input to the analysis of the structured data.

The steps of a structured data project are as follows: The first step is the conceptualization of the project, to determine the output (stock price up or down, for example), who the users of the output are, and whether the output will be embedded in current business processes. The second step is the collection of relevant data related to stock performance analysis; such data are in tabular format, with rows representing instances, columns representing features, and cells representing particular values. The third step is data preparation and wrangling. This step comprises data cleansing and data preprocessing: the former is executed to clean the data tables of missing or incorrect values, while the latter relates to aggregating, filtering, and selecting the relevant data. The fourth step is data exploration, which comprises exploratory data analysis (e.g. using histograms, bar charts, and box plots; see figure (10) for a histogram and box plot), feature engineering (creating new features by changing or transforming existing features), and feature selection (using statistical techniques, such as multicollinearity analysis and chi-square tests, to identify features that are of interest individually or when analyzed with other features). The fifth step is model training, which involves selecting the appropriate ML algorithm, as discussed in the ML section, and testing and fine-tuning the algorithm.

Figure (11) shows a data source in the tabular format of structured data, before and after cleansing.

Figure (12) shows the data after applying data preprocessing. This step, for instance, resulted in combining the salary and other income columns into a total income column (aggregation). In addition, it converted the state names to abbreviations, removed the date of birth and replaced it with age, and removed the currency sign that was associated with the amounts.
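
The preprocessing steps just described can be sketched in pandas. The column names, sample records, and state abbreviation map below are all illustrative assumptions, not the article's actual dataset.

```python
# Hypothetical sketch of the cleansing/preprocessing steps described above.
import pandas as pd

raw = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "state": ["New York", "Texas"],
    "date_of_birth": ["1985-03-02", "1990-11-20"],
    "salary": ["$55,000", "$48,000"],
    "other_income": ["$2,000", "$500"],
})

# Strip currency signs, aggregate income columns, convert the state name to an
# abbreviation, and replace date of birth with age.
for col in ("salary", "other_income"):
    raw[col] = raw[col].str.replace(r"[$,]", "", regex=True).astype(float)

raw["total_income"] = raw["salary"] + raw["other_income"]           # aggregation
raw["state"] = raw["state"].map({"New York": "NY", "Texas": "TX"})  # conversion
raw["age"] = pd.Timestamp("2020-01-01").year - pd.to_datetime(raw["date_of_birth"]).dt.year
clean = raw.drop(columns=["salary", "other_income", "date_of_birth"])
print(clean)
```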

In the stock sentiment project under discussion, and in contrast with the structured data analysis, the unstructured data include texts from online news articles, social media, and internal and external documents such as financial statements, as well as any other openly available data sources. The steps of the unstructured data analysis comprise the following: The first is problem formulation, which includes identifying the classification objective, the inputs and outputs, and how the output will be utilized. The second step is data curation, which starts with the collection of textual data sources via web spidering (scraping or crawling) programs. At this step, it is important to obtain reliable data for training the supervised ML algorithm; such textual data should have reliable labeling to guide the training. For instance, the labels could indicate that the textual data carry a sentiment that is either negative or positive. The third step has two activities. The first activity is text preparation and wrangling, which includes cleansing the textual data of annotation imported from the source, such as punctuation, question marks, HTML tags, white spaces, numbers, and currency signs. Figure (13) shows what this cleansing looks like when applied to a section of financial statements published online.

The second activity in the third step is data preprocessing, which prepares the data to be ready in a structured format. This activity essentially normalizes the data and includes stemming, which obtains the base word of the inflected forms of a word (such as "analyze" instead of "analyzed" or "analyzing"), lowercasing (such as "The" to "the"), removing high-frequency and low-frequency words, and removing stop words that are irrelevant to the stock sentiment, such as "is" and "the". Figure (14) shows the Bag of Words (BOW); the BOW is created after normalization and is the representation used in text analysis.
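
A minimal sketch of these normalization steps follows: lowercasing, removing punctuation, numbers, and HTML tags, dropping stop words, and a crude suffix-stripping "stemmer". The stop-word list and suffix rules are deliberately simplified assumptions for illustration only.

```python
# Simplified text cleansing and normalization pipeline, illustrative only.
import re

STOP_WORDS = {"is", "the", "a", "an", "of", "and", "to"}

def crude_stem(word):
    # Very rough stemming: strip a few common suffixes
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    text = text.lower()                          # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)        # remove punctuation and numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return [crude_stem(t) for t in tokens]       # stem to the base form

sentence = "The company is analyzing the <b>quarterly</b> results for 2019."
print(normalize(sentence))   # ['company', 'analyz', 'quarterly', 'result', 'for']
```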

The final step in the preprocessing activity is the preparation of the document-term matrix (DTM). Before that, tokenization of the text should be defined: a token is equivalent to a word in the BOW just described. The DTM is a matrix whose rows are texts, whose columns are the tokens, and whose cell values are the number of times a token appears in a particular text. Figure (15) shows a sample DTM. The DTM is important because it represents the text in the tabular format usually used in structured data analysis.
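
A DTM of this kind can be sketched with scikit-learn's CountVectorizer; the two example sentences are illustrative assumptions.

```python
# Sketch of a document-term matrix: rows are texts, columns are tokens, and
# cell values count how often each token appears in each text.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = [
    "revenue growth beats expectations",
    "weak revenue growth disappoints investors",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts)              # sparse matrix of token counts
dtm_table = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
print(dtm_table)
```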

The fourth step in the textual data analysis of the stock sentiment project is text exploration, which includes the following activities. The first activity is text visualization, to identify the most relevant and repetitive words in the text (see figure (16)). The second activity is calculating the chi-square statistic of word association with positive and negative sentences in the text or in different documents. The third activity is calculating the term frequency (TF), which is the ratio of how many times a token has been used across all texts to the total number of tokens in the texts. The fourth activity is feature engineering (e.g. n-grams, which are multi-word combinations retained when they are meaningful to the analysis; for instance, the word "market" can be combined with "stock" to form "stock-market", which is commonly used in financial analysis and is relevant to stock sentiment projects). The fifth activity is feature selection, using techniques such as document frequency (DF) to decide which features to include and exclude (DF measures the number of documents that contain a particular token relative to the total number of documents). Finally, the resulting sentiment output of the above steps can either be used directly for decision-making or be combined with structured variables.

The performance of the textual-data ML model should be evaluated for goodness of fit using several techniques (e.g. a confusion matrix linked to Type I and Type II error analysis, and the Root Mean Square Error (RMSE)).
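
A quick sketch of the confusion-matrix evaluation follows; the label vectors are illustrative assumptions.

```python
# Evaluating goodness of fit with a confusion matrix, from which Type I
# (false positive) and Type II (false negative) counts can be read.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = positive sentiment, 0 = negative
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"true negatives={tn}, false positives (Type I)={fp}, "
      f"false negatives (Type II)={fn}, true positives={tp}")
```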

As far as the ML algorithm's goodness of fit is concerned in this project, the concept of hyperparameters should be understood and contrasted with parameters. Parameters are simply what is discovered through modeling, for instance the coefficients in a regression, while hyperparameters relate to the ML algorithm itself and are not learned from the data as parameters are. Examples of hyperparameters are the number of hidden layers in an NN and the depth of the tree in CART. Training the ML algorithm using different hyperparameters and then comparing the results is an important step in finding the best performing model (this is called a grid search). Hyperparameters work as control variables that help in fine-tuning the model (regularization) to the point of best fit (the good-fit model). See figure (17), which shows that trade-off.
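
A grid search of this kind can be sketched with scikit-learn's GridSearchCV; the data, the choice of a CART-style decision tree, and the candidate depths are illustrative assumptions.

```python
# Sketch of a grid search over a hyperparameter (maximum tree depth) using
# cross-validation to find the best performing model.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

param_grid = {"max_depth": [2, 4, 6, 8, None]}   # candidate hyperparameter values
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best hyperparameter:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```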

Finally, figure (18) shows an example of the output of the sentiment prediction for each sentence in the text, associated with a p-value. This is because logistic regression ML is used in this example, and the logistic regression algorithm found to predict with the highest accuracy was the one using a p-value threshold of 0.60. In this context, the p-value is the predicted probability that a sentence in the test data contains positive sentiment; therefore, if the p-value calculated at the sentence level is above 0.60, that sentence probably carries positive sentiment.

Summary

This section discussed the definition of Big Data and identified its 4Vs: volume, velocity, variety, and value. Big Data are either structured, in the regular tabular format, or unstructured, such as images, texts, videos, and audio. The section identified certain sources of Big Data, including, among others, social media, mobile telecommunication, and the IoT, and discussed the management challenges of Big Data, more specifically the acquisition, storage, analysis, and security of Big Data; those are important concerns for an organization to consider in its IT and risk management strategies. Finally, the section illustrated the steps undertaken to execute a Big Data project: data collection, curation and preprocessing, data exploration, feature engineering and selection, and finally ML algorithm selection, testing, and implementation. In this type of project, one needs to know the objective of the project and conceptualize the required inputs and outputs, in addition to being equipped with a proper understanding of model biases, including overfitting and underfitting.

Summary and recommendations of the article

In this article, Fintech, a buzzword of the 21st century, was discussed and analyzed under the digital finance cube framework. The article was motivated by the vast expansion of technological developments on different fronts in financial services. According to the discussion in this article, Fintech can be considered an umbrella for the use-cases of technology as applied to the various finance functions. The technologies discussed under the Fintech paradigm are Artificial Intelligence, Machine Learning, Big Data, and Blockchain. Those technologies are the result of the meeting between finance and IT. They have brought humanity many benefits and opportunities as well as risks. Among the benefits, to mention but a few, are Robo-advisors (wealth management planning service robots), algorithms that can solve complex statistical problems (Machine Learning algorithms), value from data (Big Data analytics), and new ways of raising finance (ICOs, STOs). Among the risks are data security, the privacy of individuals' information, the existential risk to the human race, and the loss of jobs. The challenge for governments and organizations is to regulate the use of these technologies in a way that is ethical and secure, and that helps to make the best use of them.

Fintech research is an open and wide area with many green fields that have not yet been explored. There are many directions for future research, including, among others: the comparison of the statistical accuracy of ML algorithms against traditional statistical methods, from theoretical and empirical points of view; the future role of big tech firms (technology firms such as Amazon and Alibaba) in the financial services industry; and finally, the application of technology by Fintech and technology firms to each finance function as discussed in the digital finance cube framework.
