This is part 2 of a guest post by British Bondora investor ‘ParisinGOC’.
Read part 1 first.
Data Mining the Bondora data.
The initial process.
To help understand the specific data cleansing that the Bondora data set needed, I first made use of the RapidMiner metadata view – a summary of all the attributes presented to the software – showing attribute name, type, statistics (dependent on type; these include the least and most frequently occurring values, the modal value and the average value), range (min, max, and the quantity of each value for polynominal and text attributes) and, most critically, “Missings” and “Role”.
“Role” is the name given by RapidMiner to the special attributes that are needed to allow certain operations. In my case, the Decision Tree module needed to know which attribute was the “Target”, that is, the attribute that is the focus of the analysis and to which the Decision Tree has to relate all the other attributes in its processing. My “Target” was the “Default” attribute – a “Binominal” attribute (RapidMiner’s term for an attribute with just two values) – 1 if the loan had defaulted, 0 if not.
“Missings” is easy – this is the number of times an attribute has no valid value. For example, my import of the raw Bondora input data has 150 attributes. Only half of these attributes have no missing values. The remainder have between 13 and 19132 rows with missing values from a data set of 20767 rows.
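A minimal sketch of the same “Missings” count in Python with pandas, if that is easier to picture than the RapidMiner metadata view; the table and attribute names here are invented stand-ins for the real Bondora export:

```python
import pandas as pd

# A tiny, made-up stand-in for the Bondora export; attribute names are illustrative.
loans = pd.DataFrame({
    "Default": [1, 0, 0, 1],
    "TotalIncome": [1200.0, None, 900.0, 1500.0],
    "NrOfDependants": [None, 2, None, 0],
})

# Number of rows in which each attribute has no valid value,
# analogous to RapidMiner's "Missings" column in the metadata view.
missings = loans.isna().sum()
print(missings.sort_values(ascending=False))
```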
To know whether these â€œmissingsâ€ would impact my analysis, I needed to get to know the data in more detail.
I knew that Bondora had started to offer loans in Finland in summer 2013 with Spain following in October of that year and Slovakia in the first half of 2014.
I therefore decided not to bother with any loan issued prior to 2013.
Since default is defined by Bondora as 60+ days of missed payment, I also needed a cut-off date so I would not skew the data with recent loans that were less than 60 days old and therefore could not yet have defaulted anyway. (I move this date forward with each new revision of the tree(s).)
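The two date rules together define a simple window filter. A sketch, assuming a pandas table with one `LoanDate` column (the dates and cut-off below are invented for illustration):

```python
import pandas as pd

# Illustrative loan dates; the real data set has one LoanDate per row.
loans = pd.DataFrame({
    "LoanDate": pd.to_datetime(
        ["2012-06-01", "2013-03-10", "2014-05-20", "2014-07-01"]
    )
})

start = pd.Timestamp("2013-01-01")    # drop loans issued before 2013
cutoff = pd.Timestamp("2014-06-01")   # e.g. at least 60 days before the download date
window = loans[(loans["LoanDate"] >= start) & (loans["LoanDate"] < cutoff)]
print(len(window))
```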
I also wanted the Decision Tree module to make its decisions based only on data provided at the time of application, so all attributes dealing only with post-application events were dropped, except (of course) for the all-important Default attribute.
This left 22 attributes with 7278 rows of data – approximately one seventh of the original attributes and one third of the original rows. Only a very few of these attributes had any missings, so the affected rows were dropped. One remaining attribute – “Number of Dependants” – was problematic, having several hundred missings. I chose to interpret a missing entry as zero – in other words, no dependants. This fixed the attribute, but is a potential source of error. Of the 22 attributes, 19 were numeric or integer, meaning that they needed further work to ensure they could be used in the Decision Tree module.
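The two cleaning choices above (impute zero for missing dependants, then drop any row that still has a missing value) can be sketched as follows; the rows and values are invented:

```python
import pandas as pd

# Toy rows; attribute names mirror the text but the values are invented.
loans = pd.DataFrame({
    "NrOfDependants": [None, 2, 1, None],
    "TotalIncome": [1000.0, None, 1500.0, 800.0],
    "Default": [0, 1, 0, 1],
})

# Interpret a missing "Number of Dependants" entry as zero (a judgement call,
# and a potential source of error, as noted above).
loans["NrOfDependants"] = loans["NrOfDependants"].fillna(0)

# Drop the few remaining rows that still have any missing value.
clean = loans.dropna()
print(len(clean))
```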
I embarked upon a process of learning by failure as the Decision Tree module rejected each attribute in turn, leaving me to decide upon meaningful groupings for each attribute to form an input for the Decision Tree module. This set of decisions will need to be reviewed in future in case my choice of solution for any attribute introduces unexpected effects.
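To illustrate the end state – grouped, categorical attributes feeding a tree – here is a sketch using scikit-learn’s `DecisionTreeClassifier` in place of RapidMiner’s module; the loan table, band labels and perfect separability are all invented for demonstration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented, trivially separable toy data standing in for the cleaned loan table.
loans = pd.DataFrame({
    "IncomeBand": ["low", "low", "high", "high", "low", "high"],
    "Country": ["EE", "EE", "FI", "FI", "EE", "FI"],
    "Default": [1, 1, 0, 0, 1, 0],
})

# Grouped (categorical) attributes are one-hot encoded before the tree sees them.
X = pd.get_dummies(loans[["IncomeBand", "Country"]])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, loans["Default"])
pred = tree.predict(X)
```

On this deliberately separable toy data the tree reproduces the Default column exactly; real loan data will not split so cleanly.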
All in all, this took about 100 man-hours of effort, a pitiful figure when I reflected that I had once sold the idea of data mining within a large financial institution and had then gone on to set up an experimental data mining team. I consoled myself with the fact that it was over 15 years ago and, even then, I had recognised that the individual members of the team I built quickly knew a lot more about the subject than I did.
I am now on version 6 of my Bondora data set Decision Trees. The process of generating a new set of trees is no longer a huge task, except that Bondora is a dynamic environment and the data also changes rapidly and, sometimes, severely. Care has to be taken with every new download, and the downloaded data must be inspected closely before it can be imported and used.
The trees – there is one tree for each country – have for the most part changed only in detail across the updates, except for one major item.
I became annoyed that the attribute “Funded Amount” (the amount borrowed) was a major node in almost all the trees, but particularly the Estonian tree, which had the largest number of rows of data (over 4000). I constantly questioned why no attribute directly linked to the borrowers’ ability to pay the monthly instalments featured at all, while the amount initially borrowed featured heavily. After all, the amount initially borrowed could have hugely differing effects on the monthly repayments simply by changing the loan duration – say from 24 to 60 months.
I set about experimenting with the “Total Income” attribute, changing the criteria for grouping the various values, and hit upon a particular number of groups that had a dramatic effect on the overall shape of the Estonian Decision Tree. It seems that grouping “Total Income” into 19 distinct ranges with a target of approximately equal numbers of examples in each group (that is, grouping by bin size, not range) made this attribute THE most important selector for those Estonian borrowers whose employment status was “Fully Employed”. This resulted in the “Funded Amount” attribute all but disappearing from the Estonian tree.
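Equal-frequency binning of this kind (same number of examples per group, rather than equal-width income ranges) is what `pandas.qcut` does. A sketch with simulated incomes – the real attribute is Bondora’s “Total Income”, and these values are random stand-ins:

```python
import numpy as np
import pandas as pd

# Simulated incomes; the real attribute is Bondora's "Total Income".
rng = np.random.default_rng(0)
income = pd.Series(rng.uniform(500, 5000, size=4000))

# Equal-frequency binning: roughly the same number of borrowers per group
# (grouping by bin size, not by equal-width income ranges), with 19 groups.
bands = pd.qcut(income, q=19)
counts = bands.value_counts()
print(counts.min(), counts.max())
```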
Attempts to reduce the prominence of “Funded Amount” by manipulating the grouping of “Total Income” had nowhere near the same impact for Finland, Spain or Slovakia, but I still succeeded in reducing “Funded Amount” to a bit-part player in the other trees.
All of this reminded me of a phrase that the team I put together all those years ago used many times: “If you torture the data, it will tell you the truth you want to hear. The trick is to let the data speak its own truth.” Time will tell whether I have tortured the Bondora data into telling my truth and not its own.
Using the output
After looking into alternative visualisations for the many Decision Tree export options provided by RapidMiner, I could neither find nor engineer anything better than that provided by the package. The reason for looking was to avoid running the very large application that is RapidMiner simply to display something that could be kept easily to hand in a form that was lighter in resource usage.
As it is, I have stored the output from the module in the RapidMiner Data Repository on my PC and I load only the Decision Tree output as required, after first loading and minimising the RapidMiner application.
In part 3 you will read how ‘ParisinGOC’ used the analysed data to implement his investment strategy on Bondora.