Show your workings: Lenders push to demystify AI models

Machine learning could help with loan decisions—but only if banks can explain how it works. And that’s not easy.

  • Banks are categorizing machine learning models by the risk they pose to the institution, and choosing explainability techniques accordingly.
  • Lending decisions are considered a high-risk application and machine learning is being used sparingly in credit modeling.
  • Some banks are using easier-to-explain machine learning algos such as gradient boosted decision trees in their loan divisions, though neural networks remain mostly out of bounds.
  • Explainability techniques such as Shap and Lime often give contradictory results, so banks need to triangulate between them.
  • Quants at Wells Fargo claim to have discovered an explainability technique that may allow banks to use neural networks for credit modeling.

From derivatives pricing to credit card fraud detection—and a few places in between—artificial intelligence is extending its reach across the financial sector. But difficulties with explaining to regulators and senior management how self-learning algorithms work continue to hold back the use of machine learning in most banks’ core business of lending.

“Credit underwriting is the highest risk use of this technology and we would expect a great deal of explainability to be provided,” says a model risk manager at a US regional bank. “We are beginning to use it for credit decisions but are very hesitant and are not yet comfortable that the benefits outweigh the drawbacks.”

At the regional bank, the use of machine learning in credit underwriting has so far been limited to relatively small portfolios. The bank uses machine learning more extensively in other areas, including anti-money laundering and fraud prevention as well as marketing, where explainability is less of a concern.

“We do require an attempt at explainability from all machine learning models, but we acknowledge that different uses will have different needs in terms of the level of explainability,” says the model risk manager.

An important part of model validation is to understand exactly why an algorithm produces a given result. A self-learning algo that spouts unpredictable outputs leaves the bank at a higher risk of losses.

In credit underwriting, where lenders assess the suitability of customers for loans, the stakes are even higher. Banks could leave themselves open to costly lawsuits if their models unwittingly discriminate against particular social groups.

But while the risks may be high, so too are the rewards. Machine learning could transform credit underwriting by helping banks automate much of the drudgery in assessing loan applications. Greater speed and accuracy in this area could bring cost savings and a lower risk of loan losses.

So far, most applications of artificial intelligence in lending have involved decision tree methods, which use if/then rules and are considered the most basic and transparent forms of machine learning.

JP Morgan has been developing gradient-boosted decision tree models—where multiple decision trees are combined to reduce prediction error—to generate proprietary credit scores for use in its consumer and community bank. The machine learning models are able to take in hundreds of attributes. The bank says the models provide a finer-grained ordering of risk than traditional credit scoring models that rely on logistic regressions.
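As a rough illustration of the approach, rather than a description of JP Morgan’s production models, the sketch below trains a toy gradient-boosted classifier on synthetic loan data using scikit-learn. Every feature name, parameter and number is invented for demonstration.

```python
# Illustrative sketch only: a gradient-boosted tree model on synthetic loan
# data. Features, parameters and data are invented and do not reflect any
# bank's production credit scoring model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([
    rng.normal(50, 15, n),        # hypothetical: income, in thousands
    rng.uniform(0, 1, n),         # hypothetical: credit utilization
    rng.integers(0, 30, n),       # hypothetical: years of credit history
])
# Synthetic default flag loosely driven by utilization, income and history
p_default = 1 / (1 + np.exp(-(3 * X[:, 1] - X[:, 0] / 40 - X[:, 2] / 20)))
y = rng.binomial(1, p_default)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting stage fits a shallow tree to the errors of the ensemble so
# far, which is how combining many trees reduces prediction error
model = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                   learning_rate=0.05).fit(X_train, y_train)

# predict_proba gives a continuous default probability, i.e. a finer-grained
# ordering of risk than a coarse approve/decline rule
print(model.predict_proba(X_test[:5])[:, 1])
```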

The decision tree approach was specifically chosen because it is relatively easy to explain. “Explainability is a key consideration in choosing what tool to use for business problems. When the need for explainability is high you should choose tools accordingly,” says David Heike, head of risk modeling for consumer and community banking at JP Morgan Chase.

Two of the most popular techniques for explainability are Shapley Additive Explanations (Shap) and Local Interpretable Model-agnostic Explanations (Lime). Lime works by perturbing a model’s inputs around a single prediction and fitting a simple surrogate model that shows how the prediction was reached. Shap, which is drawn from game theory, attributes a model’s output to each of its input variables, in the same way one would apportion credit for a result among the players on a sports team.
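To make the two techniques concrete, the hedged sketch below applies the open-source shap and lime Python packages to a toy tree model. The model, data and feature names are invented, and real validation workflows are considerably more involved.

```python
# Illustrative sketch only: Shap and Lime attributions for a single prediction
# from a toy model. Model, data and feature names are invented.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["income", "utilization", "history_years"]   # hypothetical
X = rng.normal(size=(2_000, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 1] - X[:, 0]))))
model = GradientBoostingClassifier().fit(X, y)
x = X[0]

# Shap attributes the prediction to each input variable using Shapley values
shap_values = shap.TreeExplainer(model).shap_values(x.reshape(1, -1))
print("Shap:", dict(zip(feature_names, shap_values[0])))

# Lime perturbs the input and fits a simple local surrogate model around it
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                      mode="classification")
lime_exp = lime_explainer.explain_instance(x, model.predict_proba, num_features=3)
print("Lime:", lime_exp.as_list())

# The two attributions will generally differ, which is why validators tend to
# cross-check several techniques rather than rely on one
```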

These techniques do a good job of explaining the outputs of decision tree models. Things get much more difficult when it comes to deep learning—an advanced subset of machine learning that is starting to be deployed in other business lines, such as trading. Neural networks, one of the most common forms of deep learning, can find non-linear relationships in large swathes of data, making them potentially useful in credit modeling. But they are also opaque and difficult to understand. Neural networks can contain multiple hidden layers that transform the input data in ways that are difficult to trace, let alone explain.

Some firms are now searching for new techniques that can demystify the inner workings of neural networks so they can be applied to more sensitive tasks, such as credit decisioning.

Wells Fargo has developed methods for explaining a widely used form of deep neural network with rectified linear units (ReLU). One such method decomposes the network into an equivalent set of local linear models which are easier to interpret, the bank says.

If the technique proves successful, it may be an important step in helping lenders to explain complex machine learning algorithms in credit underwriting.

Horses for courses

With regulators on both sides of the Atlantic scrutinizing the use of machine learning models, banks are adopting a ‘horses for courses’ approach, reserving the more advanced techniques for less sensitive tasks with lower explainability requirements. The more sensitive the application, the easier the model must be to explain.

Along the continuum of risk, banks tend to divide applications roughly into four broad levels. Level one applications—the riskiest—include credit scoring models. The use of poorly understood algorithms for consumer lending has the potential for serious harm to the lender as well as to the borrower. In the US, banks have a legal obligation to explain the decision to approve or decline an application for credit. The legislation is laid out in the Fair Credit Reporting Act and the Equal Credit Opportunity Act.

There are no equivalent laws in Europe, but fair credit is covered under the broader European Convention on Human Rights, with European Union member states responsible for their own specific laws on discrimination.

Level two machine learning applications include fraud alerts and anti-money laundering systems, where the machine is making decisions that need to be acted upon in real time or near real time, and that could also affect customers.

Fraud stands out as a particularly good use of machine learning given the dynamic nature of fraud attacks

David Heike, JP Morgan

Level three includes applications that could affect a firm financially but have no direct impact on customers, such as trading or internal stress testing.

The lowest risk category—level four—would include applications that have relatively little financial impact, such as product marketing. Here, banks are freer to apply sophisticated but hard-to-explain machine learning techniques, such as artificial neural networks.

“For low-risk models, the techniques, the approach, the accuracy, the correctness of explainability is less demanding. For high risk, one should employ inherently interpretable machine learning models,” says the head of model risk at a large US bank.

The first and so far only risk management application of neural networks within JP Morgan’s consumer business is for signature verification. Neural networks are especially well-suited to finding patterns in unstructured data such as images, which contain thousands of seemingly random data points. The bank uses a neural network algorithm to compare the signature on a cheque with past signatures to uncover inconsistencies. If the algorithm is unable to determine whether the signature is genuine, the case gets passed on to human operators for resolution.
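The comparison logic can be sketched in outline, though the bank’s actual model is not public: a trained network turns each signature image into an embedding, new signatures are scored against past ones, and ambiguous scores are routed to a person. In the toy version below the embedding function is a hypothetical stand-in, not a real convolutional network, and the thresholds are invented.

```python
# Illustrative sketch only: compare a new signature against past ones and
# refer ambiguous cases to a human. The embedding function is a hypothetical
# stand-in for a trained convolutional network; thresholds are invented.
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Stand-in for a trained image-embedding network (hypothetical)."""
    v = image.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def verify(new_sig: np.ndarray, past_sigs: list, accept=0.9, reject=0.6) -> str:
    """Score the new signature against past signatures by cosine similarity."""
    scores = [float(embed(new_sig) @ embed(s)) for s in past_sigs]
    best = max(scores)
    if best >= accept:
        return "consistent with past signatures"
    if best <= reject:
        return "flag as potentially fraudulent"
    return "refer to human operator"   # the model cannot decide on its own
```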

“Fraud stands out as a particularly good use of machine learning given the dynamic nature of fraud attacks. Moreover, explainability is easier because you’re explaining to sophisticated users who understand complex models,” says Heike.

Although techniques exist for explaining a neural network’s results, they are less well developed than those for other types of machine learning models. But because signature verification is a relatively low-risk application, it is not necessary to understand the intricacies of how the algorithm arrived at its decision.

Among the level three applications, banks are actively exploring the use of deep neural networks in areas such as derivatives pricing, which has traditionally relied on a combination of classical approaches, including Black-Scholes, Monte Carlo simulation and finite difference method (FDM) techniques.

For derivatives pricing, Danske Bank uses a deep neural network that learns the pricing function from data. Once the pricing function is learned, it can be evaluated in near real time under different scenarios, orders of magnitude faster than Monte Carlo or FDM, helping to resolve computational bottlenecks. To some extent, pricing by machine learning is similar to traditional analytics, except that the pricing function is not derived by human mathematicians but learned by machines from simulated data. The algorithm finds an efficient way to compute prices, but it does not determine or explain those prices in any way.
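The general pattern, not Danske Bank’s actual implementation, can be sketched as follows: generate prices for many simulated scenarios with a classical model, fit a neural network to that data offline, then reuse the fitted network for near-instant evaluation. Everything in the sketch below is invented for demonstration.

```python
# Illustrative sketch only: train a small neural network to approximate an
# option pricing function from simulated data, then evaluate it cheaply.
# The model, data ranges and network size are invented for demonstration.
import numpy as np
from scipy.stats import norm
from sklearn.neural_network import MLPRegressor

def bs_call(S, K, T, sigma, r=0.0):
    """Closed-form Black-Scholes call price, used here only to generate data."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([
    rng.uniform(50, 150, n),     # spot
    rng.uniform(50, 150, n),     # strike
    rng.uniform(0.1, 2.0, n),    # maturity in years
    rng.uniform(0.1, 0.5, n),    # volatility
])
y = bs_call(*X.T)

# Learn the pricing function offline from the simulated data...
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)

# ...then price a new scenario in near real time, with no simulation required
print(net.predict([[100.0, 100.0, 1.0, 0.2]]))
```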

“Explainability is mostly a red herring, at least in this context,” says Antoine Savine, chief quantitative analyst at Danske Bank. “Prices are explained by the pricing model that simulates data, not the algorithm that learns the pricing function from that data. In this context, machine learning is just a way to compute prices efficiently, like FDM or Monte Carlo, and like with numerical methods, the key notion is not explainability but convergence and error analysis.”

Savine adds that explainability may be critical in other applications of machine learning in finance, such as trading strategies, credit ratings or synthetic data generation.

Cutting through the noise

For the most risk-sensitive applications, such as lending, banks will choose models that are easiest to explain using post hoc techniques, or models that are inherently interpretable.

Traditional consumer credit underwriting models, such as Fico scores, rely on statistical techniques such as logistic regression. Here, the explanation can be extracted directly from the model by measuring the sensitivity of the output to changes in inputs, such as income or amount of debt outstanding.
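For a scorecard of this kind, the sensitivities fall straight out of the model’s coefficients. The sketch below is purely illustrative, with invented data and feature names.

```python
# Illustrative sketch only: a logistic regression scorecard explained directly
# from its coefficients. Data and feature names are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2_000
income = rng.normal(50, 15, n)        # hypothetical: income, in thousands
debt = rng.normal(20, 8, n)           # hypothetical: debt outstanding, in thousands
X = np.column_stack([income, debt])
y = rng.binomial(1, 1 / (1 + np.exp(-(debt / 10 - income / 25))))

lr = LogisticRegression(max_iter=1_000).fit(X, y)

# Global view: one coefficient (or odds ratio) per input
print("odds ratios:", np.exp(lr.coef_[0]))

# Local view: sensitivity of the default probability to each input for one
# applicant, which for the logistic link is p * (1 - p) * beta
x = X[:1]
p = lr.predict_proba(x)[0, 1]
print("marginal effects:", p * (1 - p) * lr.coef_[0])
```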

Explainability techniques in machine learning essentially freeze the model and take it apart in the manner a mechanic would a car. The problem is they often have a hard time separating what is substantive and statistically significant from what is driven by noise. With financial data, which is inherently noisy, the task becomes doubly difficult.

The situation is analogous to asking Google to provide driving directions from New York to Chicago. After setting out, an accident snarls traffic on the chosen route, at which point Google comes out with a whole new set of directions. With route-planner algorithms, the user knows that the new route is the result of a traffic accident. But with machine learning models in finance, the cause of a new output is not so easy to determine.

“When you’re deducing the behavior of the model, and you train the model again you get a completely different set of explanations. How can you reconcile very different techniques all pointing at different targets and none of them getting at the core issue, which is noise and whether you have enough observations for that particular explanation to be statistically significant?” says Matthew Dixon, a professor of computer science at Illinois Institute of Technology.

In addition to the riskiness of the application, the choice of explainability technique hinges on whether the explanation needs to be global or local. A global explanation describes how the model behaves across the full range of inputs, while a local explanation describes how the model arrived at a particular decision.

For example, if a bank wanted to measure the sensitivity of a consumer lending model’s outputs to factors such as disposable income or GDP, then a global technique is required. On the other hand, if the bank wanted to understand how changing a variable would impact a particular lending decision, then a local technique is needed.

Shap and Lime are the most popular techniques for local explainability.

We have zero appetite for black box models, and we do not buy or build black box AI models

Executive at a US bank

An example of a global explainability technique is ‘partial dependence plots’, which measure the sensitivity of a model’s output to a particular set of inputs. Other techniques exist, including relevance, sensitivity and neural activity analysis, as well as fitting a simple model on top of a more complex model. There are also visualization techniques for explainability that are applicable to decision tree-based models.
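For instance, a partial dependence curve can be computed by sweeping one input over its range and averaging the model’s predictions at each value. The sketch below does this with scikit-learn on an invented model and data.

```python
# Illustrative sketch only: partial dependence of a toy model's output on one
# input. Model, data and the choice of feature are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(3)
X = rng.normal(size=(2_000, 3))                        # hypothetical inputs
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))        # outcome driven by input 0
model = GradientBoostingClassifier().fit(X, y)

# Sweep input 0 over a grid and average the model's predictions at each grid
# point; the resulting curve is the model's partial dependence on that input
result = partial_dependence(model, X, features=[0])
print(result["average"])
```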

Part of the problem is that each explainability technique tends to come up with a different explanation for a model’s output. Therefore, it is common practice to use more than one technique. “All of these post hoc explainability approaches are approximations and may encounter computational problems. That’s why they often don’t agree with one another. So one needs to be careful and should apply multiple techniques,” says the head of model risk at the large US bank.

Wells Fargo has developed a technique that pares back the number of linear equations that a deep neural network produces, helping users interpret the results of the model. Simply put, a neural network attempts to mimic the activity of neurons in the human brain. Each neuron—or node—receives an input and performs a calculation to decide whether that input meets a predefined threshold. If the threshold is met, the node “fires” and produces an output that travels to the nodes in the next layer. The Wells Fargo model uses the ReLU mode of transmission—a widely used activation function for neural networks.

What Wells Fargo has done is to trace the journey that the information takes through each layer of nodes, and convert the data into linear equations—thousands or even millions of them, depending on the size of the net. But because many nodes have made the same decision, many of these linear equations are the same. So the equations can be sorted into groups, in a process known as regularization.

The sorting process enables the bank to reduce the number of equations and nodes—but without affecting the performance of the network. With a smaller number of linear equations, the structure of the network is more interpretable. At least, that’s the theory.
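The core idea can be illustrated with a toy one-layer network: for any given input, only some ReLU nodes fire, and holding that on/off pattern fixed, the network collapses to a single linear equation. The sketch below uses made-up weights and is not Wells Fargo’s actual method, which also involves the regularization step described above.

```python
# Illustrative sketch only: for a ReLU network, every input activates a fixed
# on/off pattern of nodes, and within that pattern the network is exactly a
# linear model. This toy version uses one hidden layer with made-up weights.
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)   # hidden layer: 8 ReLU nodes, 3 inputs
w2, b2 = rng.normal(size=8), rng.normal()              # output layer

def network(x):
    h = np.maximum(W1 @ x + b1, 0.0)      # ReLU: a node "fires" only above zero
    return w2 @ h + b2

def local_linear_model(x):
    """Coefficients and intercept of the linear equation the network applies at x."""
    mask = (W1 @ x + b1 > 0).astype(float)     # which nodes fired for this input
    coefs = (w2 * mask) @ W1                   # effective linear coefficients
    intercept = (w2 * mask) @ b1 + b2
    return coefs, intercept

x = rng.normal(size=3)
coefs, intercept = local_linear_model(x)
# The local linear model reproduces the network's output exactly at x
assert np.isclose(network(x), coefs @ x + intercept)
```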

Wells Fargo has since applied the technique to convolutional neural networks for natural language processing.

In determining where and how to apply machine learning techniques, modeling teams work closely with business lines to ensure the model risk management team has access to developmental code and a clear description of model features. Most banks will not buy any externally sourced models that do not meet these requirements. “We have zero appetite for black box models, and we do not buy or build black box AI models,” says an executive at a third US bank.

The policy of avoiding black box models proved prescient during the Covid pandemic, when AI-based fraud detection systems were thrown off by the changing patterns of customer behavior stemming from lockdowns. Typically, when a person uses a debit or credit card, whether in person or online, a model predicts in less than a second whether the transaction is fraudulent, and will freeze or flag the transaction. Through testing, the bank determined that the most predictive variable for fraud pre-Covid was whether the transaction was card present or card not present.

Realizing that its fraud detection models would lose their predictive power, the bank bought vendor transaction data covering states that were affected by stay-at-home orders. By mid-April, after the models had been recalibrated using that data, it became apparent that card present versus card not present was no longer one of the most predictive variables. “We performed manual and automated reviews to assess every variable and every AI model at the bank,” says the executive.
