This is a guest post by British Bondora investor ‘ParisinGOC’.
Financial institutions across the world have many ways of assessing whether a loan is worth making. A simple web search reveals that many use Data Mining. “Decision Trees” in particular are a well-studied tool within Data Mining, and I quickly found at least 2 papers (Mining Interesting Rules in Bank Loans Data and Assessing Loan Risks: A Data Mining Case Study), amongst many, pointing in this direction.
Having had some experience of Data Mining in a financial environment, I believed I could use these same techniques in my own P2P lending which, after over 12 months’ activity, I felt could be improved.
In this document, I explore the use of the freely available Data Mining Software “RapidMiner” and its Decision Tree capabilities when applied to the data available to investors from Bondora, a peer-to-peer (P2P) lending site.
Bondora is a P2P lending site based in Estonia that “unites investors and borrowers from all corners of the world”, allowing investors to invest funds to satisfy advertised borrowing needs.
Fundamentally, Bondora also provides comprehensive data to investors, allowing detailed data downloads of the individual loans held by the investor, as well as data on every application made to Bondora (originally known as Isepankur) since the first application on 21st February, 2009.
It is the complete Bondora data set that I have used as the raw data for analysis as it is the best data available to find out which potential borrowers are the right match to the potential lenders. Only if enough lenders feel that a loan application is worth investing in will the loan be fulfilled. Self-selection is taking place in both elements of the loan fulfilment and this data is the result of that interaction.
Also shown in this data are some elements of loan performance post-drawdown. Crucially, it shows those loans that subsequently defaulted (failed to make any payments for a period in excess of 60 days). Although Bondora will chase the debt on behalf of the investor and have a track record of some success, there is no guarantee that the investment, or any part of it, will be returned.
www.investopedia.com/terms/d/decision-tree.asp defines a Decision Tree as: “A schematic tree-shaped diagram used to determine a course of action or show a statistical probability.”
In this case, I am using the data provided by Bondora on all its previous applications to reveal how the resulting loans that share similar characteristics have performed.
Specifically, I am using this data to show the percentage of those previous loans that have defaulted and using this to indicate how a similar, new application may perform should the application succeed in attracting enough investors.
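The idea of "percentage of previous loans that defaulted" can be illustrated with a few lines of Python. This is only a sketch under assumptions: the records are made up, and the grouping attribute "Country" is a hypothetical stand-in for any attribute known at application time. Only the 0/1 "Default" attribute comes from the data described here.

```python
from collections import defaultdict

def default_rate_by_group(loans, key):
    """Return {group value: percentage of loans in that group that defaulted}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for loan in loans:
        group = loan[key]
        totals[group] += 1
        if loan["Default"]:
            defaults[group] += 1
    return {g: 100.0 * defaults[g] / totals[g] for g in totals}

# Made-up records for illustration only; not real Bondora data.
loans = [
    {"Country": "EE", "Default": 0},
    {"Country": "EE", "Default": 1},
    {"Country": "EE", "Default": 0},
    {"Country": "EE", "Default": 0},
    {"Country": "FI", "Default": 1},
    {"Country": "FI", "Default": 0},
]

rates = default_rate_by_group(loans, "Country")
print(rates)
```

A Decision Tree does essentially this, but chooses for itself which attributes to group by and in which order.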
In other words, I am using past performance data to show how future investments may perform – I feel sure I have seen this phrase somewhere before!
How Decision Trees are made
There are several different ways of making a Decision Tree. In my case, I use a product called “RapidMiner” – the free version (5.3.015 at the time of writing), which is downloadable. I chose this product because it is almost identical in look and feel to a Data Mining product I last used almost 20 years ago called “Clementine”. This product is now part of the popular and well-respected SPSS statistical package.
In RapidMiner, the user is presented with a workspace on which various data manipulation tools can be placed and connected in different ways, depending upon the desires of the operator. This plug-and-play approach has allowed me to try many tools and techniques, most of them far beyond my rudimentary understanding. Whether I can really appreciate what I have done is another matter entirely. I believe my understanding of RapidMiner is best summed up in the warranty the product shows on the splash screen at startup: “RapidMiner comes with absolutely no warranty”. Correspondingly, I neither give nor imply any guarantee or warranty within the following text!
In summary, RapidMiner Decision Tree algorithms search through the attributes of the data extracting the attribute that best separates the given examples. By recursively operating on the selected data to get the best fit, the algorithm keeps picking the best attribute and never looks back to reconsider earlier choices. When the choices of attribute classify the data within the set parameters, it then stops.
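The greedy, never-look-back behaviour described above can be sketched in plain Python. This is not RapidMiner's code: it assumes information gain as the measure of "best separates" (one common choice), stops only when a group is pure or no attributes remain, and uses made-up attributes ("Verified", "Region") and records.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def build_tree(rows, attributes, target):
    """Greedily pick the best attribute, split, and recurse; never backtrack."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf: pure group, or nothing left to split on
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Made-up examples; attribute names are hypothetical.
rows = [
    {"Verified": "yes", "Region": "north", "Default": False},
    {"Verified": "yes", "Region": "south", "Default": False},
    {"Verified": "no", "Region": "north", "Default": True},
    {"Verified": "no", "Region": "south", "Default": True},
]

tree = build_tree(rows, ["Verified", "Region"], "Default")
print(tree)
```

In this toy data "Verified" separates the defaults perfectly, so it is chosen at the root and "Region" is never used. A real tool adds further stopping parameters (minimum group size, maximum depth and so on) that this sketch leaves out.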
In my analysis, I look for categorisations of data that determine which loan applications went on to default (“True”) or not (“False”) and display this in the form of a simple tree diagram.
Data is dirty!
Any data collected over time by any organisation is subject to changing collection and processing procedures. This means that the data suffers from all sorts of errors, both in the individual attributes and when taken as a collection.
To create an accurate Decision Tree, it is critical that these errors are corrected or ameliorated in some way. This process can take many forms such as substitution (alternative, calculated, dummy or other acceptable values), summarisation (the data is too varied and a smaller number of values is necessary), aggregation (the data is combined with other attributes to create a new attribute) or whatever suits the analysis.
Furthermore, the data may be incomplete due to new or “better” data being collected in its place or only becoming available at some later date. This is particularly the case in the Bondora data set as the organisation has expanded into more countries over time, creating more attributes as it goes.
I therefore had to “clean” the Bondora data set to remove any faults that may misdirect the Decision Tree(s). This I achieved mainly by reducing the sample size to a set of examples that would give me enough of what I wanted to allow a reasonable stab at a result. (I have deliberately kept my language non-technical here, so you can see that I am not implying some formal analysis of confidence levels.)
Selecting the right data.
My analysis is designed to help me select those loan applications that are least likely to default.
When looking at a loan application in which to invest, the only data available at the time is that supplied by the applicant, supplemented by data generated by Bondora. This may include data such as how the applicant data has been verified (if at all) or whether the applicant has a history of bad payments.
Besides whether a previous loan has defaulted, the download also contains loan performance data. This includes whether a loan has been overdue and by how much, as well as how many bids (investors) contributed to the loan, the maturity date, etc. This added data is not available at the time of application, so I simply ignored it for my present analysis.
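Ignoring the post-drawdown data amounts to dropping those columns before the tree is built, while keeping the default flag as the thing to be predicted. A minimal sketch follows; the column names are hypothetical placeholders, not the actual headings in the Bondora download.

```python
# Hypothetical names for attributes that only exist after a loan is drawn down.
POST_DRAWDOWN = {"DaysOverdue", "BidCount", "MaturityDate"}

def application_time_view(record):
    """Keep only what an investor could see at application time, plus the target."""
    return {name: value for name, value in record.items() if name not in POST_DRAWDOWN}

# Made-up record for illustration only.
record = {
    "TotalIncome": 900,
    "PaymentHistory": 1000,
    "DaysOverdue": 75,
    "BidCount": 42,
    "MaturityDate": "2015-06-01",
    "Default": 1,
}

cleaned = application_time_view(record)
print(cleaned)
```

Leaving such columns in would let the tree "predict" defaults from information that is only known after the event.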
Decision Tree requirements
The Decision Tree build process is itself subject to certain limitations. For instance, if an attribute has an infinite range with no repeating values, a Decision Tree cannot form a decision around a particular value in the range as no one value would appear to be more significant than any other value of the same attribute.
Whilst such an attribute is unlikely, it is an extreme example of how numeric data – such as the real Bondora attribute “Total Income” – presents itself to the Decision Tree process. Essentially, there are so many differing values that no single value stands out as a determinant.
For the software to be able to build a Decision Tree, I had to seek out all the numeric attributes and change them to something that was accepted by the Decision Tree module. In the case of Total Income, this could be achieved by aggregating the data into ranges of income, for example 0 to 500 Euros per month.
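Aggregating a numeric attribute into ranges might look like the sketch below. The 500 Euro band width follows the example above; the function name, the label format and the choice of half-open ranges (so that exactly 500 falls into the next band) are assumptions made for illustration.

```python
def income_band(monthly_income, width=500):
    """Map a monthly income in Euros to a labelled range such as '0 to 500'.

    Ranges are half-open: an income of exactly 500 falls in '500 to 1000'.
    """
    if monthly_income < 0:
        raise ValueError("income cannot be negative")
    low = int(monthly_income // width) * width
    return f"{low} to {low + width}"

# Made-up incomes for illustration only.
incomes = [350, 500, 1250.5, 499.99]
bands = [income_band(x) for x in incomes]
print(bands)
```

Thousands of distinct income values collapse into a handful of labels, which gives the tree something it can actually split on.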
For other attributes – such as Payment History – this was a simple change to the data type as Bondora already classified this as 600, 700, 800, 900 or 1000, representing a scale from recent problems, through past problems to 1000, which actually means “None Recorded”. This naturally forms a particular data type in RapidMiner called “Polynominal” and is perfect for a Decision Tree analysis.
It’s all in the name really!
As expected, the natural output is a diagram that looks somewhat like a cartoon tree. A problem here is that the Bondora Trees are so large that it is impossible to get a full tree on screen and still be able to read the legends.
My preferred view (amongst many provided by RapidMiner) is called “Balloon”, where the branches from a node radiate in a full 360-degree circle, with further nodes in smaller radii. (See Illustration 1; annotation by editor: The False/True is the Decision Tree interpretation of the percentage of 0 or 1 from the “Default” attribute; the Blue/Red represent proportions of No/Yes respectively; RapidMiner displays numbers (Total/No/Yes) on mouseover.) This is a compact view of the full tree that requires the viewer to zoom in, effectively as a fly-through, but the view retains (for me anyway) a sense of position within the total tree. Whilst there are export options for each type of view, I have yet to find any 3rd-party software that replicates this view-and-zoom technique from the RapidMiner export. The exports from RapidMiner seem to lose the 3-D axis, rendering all outputs as 2-D. Zooming in on an export just makes everything bigger, losing the “fly-through” effect, with a consequent loss of data.
A pure text view is available and is useful for comparing the most recent iteration with a previous version – simply lining up the 2 versions and scrolling side-by-side down the screen easily reveals any changes.
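The side-by-side comparison can also be automated. The sketch below uses Python's standard difflib module to list only the lines that differ between two saved text views. The tree text shown is invented for illustration and does not reproduce RapidMiner's actual text format.

```python
import difflib

def changed_lines(old_text, new_text):
    """Return lines removed ('- ') from the old text or added ('+ ') in the new."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]

# Invented tree text for illustration only.
previous = "Income = low: True\nIncome = high: False"
latest = "Income = low: True\nIncome = high: True"

changes = changed_lines(previous, latest)
for line in changes:
    print(line)
```

Unchanged branches are suppressed, so only the parts of the tree that moved between iterations are shown.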
Continue reading: in part 2 you will read how ‘ParisinGOC’ applied this in his analysis for Bondora loans. Stay tuned!