Evariste

For Evariste’s second blog post, we’re going to show off the fancy new version of Frobenius built by our outstanding quant Oliver Vipond. The interface is significantly more user-friendly, allows for better visualisation of the data, and integrates global models, allowing for more accurate multi-parameter optimisation.

The dataset used here is from the Open Source Malaria project coordinated by Professor Mat Todd (UCL). It has already been the subject of an open competition and is one of the best publicly available datasets for testing our platform. As well as filling the gap left behind as major players in pharma have moved out of the NTD/anti-microbial space, open science projects like this are absolutely crucial for helping smaller companies get off the ground.

As with the previous post, I won’t spend any time discussing the biology in detail. To summarise briefly, growing resistance to existing antimalarial drugs represents an enormous challenge to healthcare systems in the developing world. Despite vast improvements in prevention and treatment over the last few decades, malaria still kills over 400,000 people every year, two-thirds of whom are children under 5.

As before, I’ll take you through the steps required to get Frobenius 2.0 to output some sensible suggestions. Most of this builds on the previous blog post, so if you want to know more about the fingerprints used and the data processing then take a look at that first blog.

1. Decide what data you’re going to use

As ever - garbage in, garbage out. Happily, this dataset has been really nicely curated (from a potency point of view) as it was previously used in the open competition referenced above. This makes life easier for us in terms of selecting what to include in the models. There are a few hundred compounds here, which allows us to build fairly robust models for potency.

[Figure: Comparison of our in-sample predictions vs the experimental potency data]

The usefulness of R^2 for assessing the predictive power of a model is debatable. If this were all we were using to predict the relative value of new compounds, it wouldn’t exactly be ground-breaking. The real utility of Frobenius is the statistical wizardry that means we are also predicting our own error, and therefore the likelihood of exceeding the endpoints we select, which is the actual solution to an optimisation problem.
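To make the idea of predicting our own error a bit more concrete, here is a minimal sketch of how a predicted value and its predicted error can be turned into a probability of beating a threshold. It assumes the prediction error is Gaussian, which is an assumption made purely for illustration and not a description of how Frobenius works internally. The function name and the example numbers are hypothetical.

```python
import math


def prob_exceeds(threshold, mean, std):
    """P(value > threshold) for a Gaussian prediction (an assumed error model)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (threshold - mean) / std
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Two hypothetical compounds scored against a pIC50 > 8 target.
confident = prob_exceeds(8.0, mean=7.5, std=0.2)  # higher prediction, small error
uncertain = prob_exceeds(8.0, mean=7.2, std=0.8)  # lower prediction, large error
print(f"{confident:.3f} {uncertain:.3f}")  # prints 0.006 0.159
```

Ranking on the predicted value alone would favour the first compound. Ranking on the probability of exceeding the endpoint favours the second, because its larger predicted error leaves more room above the threshold.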

2. Pick your target endpoints

As well as increasing the potency of the compounds in this series, we also want to try and design compounds with improved solubility. We have a good predictive model for solubility (at least for relative differences if not always absolute values) and we also have a fairly robust model for predicting logD, although again the caveat about relative vs absolute predictions applies. There’s a strong correlation between logD and solubility, and so by selecting both as endpoints we reduce the chance of an outlier prediction in one model overestimating the chances of success for a poor compound and vice versa.

In this case, we want to design compounds with improved potency (pIC50 > 8), moderate solubility (log S > -4, with S the aqueous solubility in mol/L), and a predicted logD between 0 and 4 (a somewhat drug-like range).
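As a rough illustration of how these three targets could be combined into a single score, the sketch below multiplies the per-endpoint probabilities. It assumes Gaussian prediction errors and treats the endpoints as independent. Both are simplifications for the sake of the example (the correlation between logD and solubility mentioned above is ignored), and all names and numbers are hypothetical rather than taken from Frobenius.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative probability of a Gaussian prediction (an assumed error model)."""
    return 0.5 * math.erfc((mean - x) / (std * math.sqrt(2)))


def prob_all_endpoints(pic50, log_s, log_d):
    """Each argument is a hypothetical (predicted mean, predicted std) pair."""
    p_potency = 1.0 - normal_cdf(8.0, *pic50)                    # pIC50 > 8
    p_solubility = 1.0 - normal_cdf(-4.0, *log_s)                # log S > -4
    p_logd = normal_cdf(4.0, *log_d) - normal_cdf(0.0, *log_d)   # 0 < logD < 4
    # Simplification: the endpoints are treated as independent here.
    return p_potency * p_solubility * p_logd


# One made-up compound: slightly short on potency and solubility, fine on logD.
score = prob_all_endpoints(pic50=(7.4, 0.6), log_s=(-4.3, 0.5), log_d=(2.5, 0.8))
print(f"probability of meeting all three endpoints: {score:.3f}")
```

Because the probabilities multiply, a compound only scores well if it has a realistic chance on every endpoint at once, which is why the combined numbers tend to be small.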

One thing we’re currently working on, but haven’t yet moved into the production code, is how best to incorporate local data into the global models. This is an extremely powerful way to supplement the predictions and relatively easy to do on a project-by-project basis. However, it’s very hard to do in a principled, automated way that works well for all models and all projects. Moreover, the global models will often be built using different units of measurement to the local data, a problem which poses its own set of questions.
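The units part of the problem is at least easy to illustrate. The sketch below converts two commonly reported local measurements onto the scales used for the endpoints in this post. The choice of input units (IC50 in nM, solubility in µg/mL) is an assumption for the example, and the function names are hypothetical. It says nothing about the harder question of how to merge local and global models in a principled way.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); 1 nM = 1e-9 mol/L."""
    return 9.0 - math.log10(ic50_nm)


def solubility_ug_ml_to_log_s(ug_per_ml, mol_weight):
    """log10 of solubility in mol/L; 1 ug/mL = 1e-3 g/L, mol_weight in g/mol."""
    return math.log10(ug_per_ml * 1e-3 / mol_weight)


# A 10 nM compound sits on the pIC50 = 8 threshold.
print(ic50_nm_to_pic50(10.0))
# 40 ug/mL for a compound of molecular weight 400 is close to log S = -4.
print(solubility_ug_ml_to_log_s(40.0, 400.0))
```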

3. Build the models and analyse the output

Having cleaned the data, selected the endpoints, and run the models, we’ve now identified the best compounds to use as starting points for further design. The new version of Frobenius also includes a snazzy visual analysis section which allows further interrogation of the data.

The t-SNE embedding on the left keeps molecules that have similar ECFP4 fingerprints close together and is coloured by potency (more yellow = more potent). This shows that the more potent molecules tend to cluster together in chemical space, although one feature of this dataset is a reasonable amount of ‘steep’ SAR, or non-additivity, where relatively similar molecules have quite different potencies. Different projects feature this sort of profile to different extents, and this is one of the more challenging datasets to predict on. The plot on the right is coloured by solubility, where yellow is more soluble. You can see that potency and solubility are to some extent negatively correlated. This can probably be explained by a number of factors, logD being the obvious one.
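One simple way to look for ‘steep’ SAR in a dataset is to flag pairs of molecules that are very similar by fingerprint but far apart in potency. The sketch below does this with Tanimoto similarity on toy fingerprints, each represented as a set of ‘on’ bit indices. The similarity measure, both cut-offs, and the compounds are illustrative assumptions, not the analysis behind the plots.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def steep_sar_pairs(compounds, min_similarity=0.7, min_potency_gap=1.0):
    """Return name pairs that are similar in structure but differ in pIC50."""
    pairs = []
    for (name_a, fp_a, p_a), (name_b, fp_b, p_b) in combinations(compounds, 2):
        similar = tanimoto(fp_a, fp_b) >= min_similarity
        if similar and abs(p_a - p_b) >= min_potency_gap:
            pairs.append((name_a, name_b))
    return pairs


# Toy data: (name, on-bits of a made-up fingerprint, made-up pIC50).
toy = [
    ("cpd_A", {1, 2, 3, 4, 5, 6, 7, 8}, 8.1),
    ("cpd_B", {1, 2, 3, 4, 5, 6, 7, 9}, 6.4),
    ("cpd_C", {1, 2, 20, 21, 22, 23}, 6.5),
]
print(steep_sar_pairs(toy))  # [('cpd_A', 'cpd_B')]
```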

Taking a look at the top compounds below, we can see that 1 and 2 contain highly unusual, not very drug-like, borocycles. These are scored highly because they are essentially guaranteed to have the level of solubility required, even if the potency isn’t predicted to be as high as the others. The next three molecules, however, all look like good starting points for further design.

[Figure: The top starting points selected by the model and measured pIC50 values]

We can also take a closer look at the molecules using a compound report feature. This is part of our effort to make our modelling as interpretable as possible and it provides a nice visual interpretation of what the model has learnt.

This feature highlights in green parts of the molecule that are contributing positively to the chosen endpoints and uses pink/red to indicate the parts that are having a negative impact. In the image below, potency is on the left and solubility is on the right. You can see that the model considers that most of 3 is contributing to potency (this makes sense as it’s one of the most potent molecules). There are darker green patches around the regions where substitution might be less tolerated, and this is reflected in the designs below. You can see from the image on the right that the more polar parts of 3 are highlighted green and the broadly lipophilic regions are in red, which is a nice confirmation that the solubility model is fairly sensible.

[Figure: Compound report feature identifying the regions of 3 that contribute positively (green) and negatively (pink) to potency (left) and solubility (right)]

4. Design some new analogues

We can now apply Frobenius’ three designers to 3, 4, and 5. The output below is based on compound 3. You can see that the changes are focused on the areas of the molecule that are a slightly paler green in the image above, the ‘least optimised’ regions.

These all seem to be pretty sensible suggestions, which interrogate the SAR of several positions and are biased towards changes that will increase solubility. There are some features which aren’t massively drug-like: the primary benzylic amines and the 3,4-fluoropyridine might pose potential issues, but I certainly wouldn’t rule these compounds out on that basis alone. The overall probabilities of the designs achieving the desired endpoints are pretty low, around 0.5 - 1% in most cases. Given that one of these endpoints (logD) isn’t something you necessarily need to optimise towards, but rather something you might want to filter the designs by, the chances of us having designed some potent, soluble molecules are a bit higher. The designs based on 4 and 5 broadly follow the same patterns.

5. Pessimistic design

When we select compounds to synthesise (or recommend that a partner makes them) we don’t just send a list of several thousand compounds. Equally, we don’t pick just the top compound and wait for the results to come back one by one, because this would be impractical. Instead, we have to select a set of compounds that maximises the probability of success whilst also sufficiently exploring chemical space. We do this using an algorithm that selects a subset of compounds (depending on the project and the chemistry) based on the principles of pessimistic design. This means that after we’ve picked the top compound, the algorithm builds a new model assuming that the top compound has failed to hit the desired endpoints, then selects the next best compound predicated on that failure.
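The selection loop described above can be sketched in a few lines. Rather than rebuilding a model after each pick, this stand-in works from a table of sampled outcomes (one row per hypothetical scenario, one column per compound) and conditions on failure by discarding the scenarios in which an already-selected compound succeeded. The sampling approach, the function names, and the numbers are all assumptions for illustration, not the algorithm used in Frobenius.

```python
def success_rate(rows, col, threshold):
    """Fraction of scenarios in which compound `col` exceeds the threshold."""
    if not rows:
        return 0.0
    return sum(row[col] > threshold for row in rows) / len(rows)


def pessimistic_select(samples, threshold, k):
    """Greedy pick of k compounds, each conditioned on all earlier picks failing."""
    n_compounds = len(samples[0])
    remaining_rows = list(samples)
    chosen = []
    while len(chosen) < min(k, n_compounds):
        candidates = [c for c in range(n_compounds) if c not in chosen]
        best = max(candidates,
                   key=lambda c: success_rate(remaining_rows, c, threshold))
        chosen.append(best)
        # Pessimistic step: keep only the scenarios where the new pick failed.
        remaining_rows = [r for r in remaining_rows if r[best] <= threshold]
    return chosen


# Made-up pIC50 scenarios for three designs. Columns 0 and 1 are close
# analogues that tend to succeed or fail together; column 2 is different.
scenarios = [
    [9.0, 9.0, 7.0],
    [9.0, 7.0, 7.0],
    [9.0, 8.5, 9.0],
    [7.0, 7.0, 9.0],
    [7.0, 7.0, 7.0],
    [7.0, 7.0, 7.0],
]
print(pessimistic_select(scenarios, threshold=8.0, k=3))  # [0, 2, 1]
```

Designs 1 and 2 have the same independent success rate (2 in 6), but once design 0 is assumed to have failed, its close analogue 1 drops to zero while design 2 keeps a 1 in 3 chance, so the more dissimilar compound is picked second.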

In this example, I took about 10 high-scoring designs from each of the three starting points. Applying the pessimistic design algorithm to select the ideal list of 10 compounds returns the targets below. Here, the algorithm is exclusively focussing on the likelihood of achieving the potency threshold as we’ve already biased the designs towards improved solubility. The independent probabilities are shown in black; the probabilities in red are the correlated probabilities assuming the failure of each molecule in turn.

6. Summary and next steps

As you can see, the newest version of Frobenius incorporates a whole bunch of interesting features and gives an interpretable, sensible output when faced with a multi-parameter optimisation problem. In future blogs, we’ll dig into our compound design algorithms, automated synthetic route planning, the integration of local data, and how we can further optimise the selection of compounds for purchase/synthesis.

