1 Introduction
Artificial intelligence and medicine have a longstanding and fruitful relationship, possibly starting with the development of the MYCIN system in the early 1970s for therapy advice (Shortliffe et al., 1973).
Medicine has provided the field of artificial intelligence with a plethora of challenging and appealing problems to be solved, particularly in clinical diagnosis (“given a set of signs collected from a patient, select the best diagnosis”) and in therapy advice (“given an established diagnosis, select the best course of actions for treatment”). Artificial intelligence, in turn, has offered promising technologies for problem solving in the medical domain (Peek et al., 2015).
The field of oncology has proven to be particularly fit for modelling and analysis based on artificial intelligence, at least prospectively (Kourou et al., 2015; Kaplan et al., 2018), due to two major reasons:

Symptoms in oncology are frequently difficult to identify before later stages of the disease, and cancer can be treated most effectively if identified at early stages of development. Signs of the disease can be diffuse and require high expertise to be selected, collected and analysed. Hence, technologies that can highlight evidence of cancer at early stages are most welcome and challenging at the same time.

Therapy in oncology is frequently highly aggressive, as it can be based on drugs with high toxicity and/or on radiotherapy, which has severe side effects. Hence, technologies that can either refine therapy plans at a personal level to minimise side effects or provide unequivocal evidence of the efficacy of novel therapies are also most welcome.
The practical use of artificial intelligence in medicine, however, has occurred more often in the management of supporting information than in the direct support of the activities of healthcare professionals: medical doctors have been able to augment their capabilities with systems for knowledge representation and processing, as well as automated assistants that process large volumes of data to aid decision making. However, the practice of automated diagnosis and therapy advice has hardly moved beyond research laboratories to reach everyday activities (Chen and Asch, 2017). Some issues can explain why this has happened:

Medicine is strongly regulated by specialised organisations such as the Food and Drug Administration in the US and the European Medicines Agency and CE Mark certifying bodies in the European Economic Area. The levels of detail and transparency required for the description of methods and techniques to be certified by these organisations have proven to be hard, costly and time consuming to achieve, and the effort to reach such levels in the description of novel techniques frequently falls beyond the scope of academic initiatives.

Empirical validation of novel methods and techniques for automated diagnosis and therapy advice requires clinical trials, which are costly, labour-intensive and time-consuming. In the medical domain, clinical trials are a required practice for risk mitigation and trust building. Similar procedures are not usual in computer science, and the mismatch in required costs and resources sets research initiatives in the medical sciences apart from those in computer science.
In order to bridge the gap between laboratory experimentation and practice in the use of artificial intelligence technologies for clinical diagnosis and therapy advice, these issues must be faced.
In recent years, the field of artificial intelligence has steered towards machine learning, given its significant advances and results (Kononenko, 2001). Following this trend, in the present article we focus on machine learning – more specifically, supervised classification learning, given that this particular class of techniques can be formally characterised and analysed in detail – and how it can be explored to build systems for automated diagnosis and therapy advice which can be used in practice. We consider three specific points related to the issues mentioned in the previous paragraphs:

One reason for the recent praise of machine learning has been the claim that it decreases subjectivity in artificial intelligence, as domain knowledge and expertise are replaced by statistically grounded data analysis. Critical aspects of domain modelling, however, are still at the core of systems development based on machine learning. Accounting for regulatory issues and transparency requirements, we unveil these aspects and clarify how and why domain knowledge and human expertise – in the medical domain as well as in computational and statistical techniques – are central to the design of systems based on machine learning for the medical issues on which we are working.

The demanding requirements of resources connected to clinical trials increase the relevance of controlling sample complexity (in other words, extracting as much information as possible from samples which are kept as small as possible in order to train learning algorithms), which in turn points to the importance of good domain modelling for the problems we are considering. For these reasons, we develop a careful analysis of our population of interest, accounting for the fact that controlled sample cardinality can be key to ensure the feasibility of proposed solutions in practice.

Our population of interest features interesting dynamics of evolution, which we believe have not been considered in detail in previous initiatives. We take these dynamics into account and characterise the population as a drifting domain, in the sense that it is permanently evolving, although almost always at a smooth rate. We explore how drifting domains can be formally characterised to build accurate predictive models based on classification learning.
The propositions and results presented in this article result from experiments under development at Autem Medical Research, focusing on the development of novel technologies for cancer diagnosis and therapy. In this article we focus on conceptual issues, and refer to our experiments mainly to illustrate our arguments. In a future article we shall present our empirical results from the perspective of advances in clinical diagnosis and therapy advice.
In section 2 we characterise a method and model to build systems based on machine learning for the problems we are considering in medicine, in which we highlight the importance of domain knowledge and expertise for model design. In section 3 we discuss specific aspects of our domain of interest, including how and why we can focus on populations with finite cardinality and how we can build lower bounds for population cardinality (aka sample complexity) given our requirements for precision and reliability. In section 4 we characterise our proposed notion of drifting domains and how it can be used to increase precision and reliability of our results. Finally, in section 5 we present a discussion and conclusions.
2 Classification learning for diagnosis and therapy advice
The task of clinical diagnosis can be characterised as a set of steps:

1. An individual patient $p$ comes to the medical doctor. The doctor selects a set of signs $S_1$ to observe and analyse. The choice of the set $S_1$ is based on:

   (a) Previous experience and scholarly knowledge.

   (b) Tacit selection of a reference population $P$, assuming of course that $p \in P$.

2. Given the observations related to signs $S_1$, the doctor builds a preliminary set of hypotheses $H$ about the diagnosis of $p$. These hypotheses are dependent upon previous experience, scholarly knowledge and the reference population, which determine an (unknown yet clearly defined) upper bound on the precision and reliability of diagnostics.

3. Hypotheses are prioritised according to

   (a) the strength of evidence indicating each hypothesis, and

   (b) the severity of the corresponding diseases.

   Following these priorities, for each hypothesis:

   (a) A second set of signs $S_2$ is selected. Again, the choice of $S_2$ is based on previous experience, scholarly knowledge and the reference population.

   (b) Given the observations related to signs $S_2$, the doctor tags the corresponding hypothesis as either possible or discarded.

4. The medical doctor then proceeds to perform information fusion about all hypotheses at hand and to make a final decision about the diagnosis.
We have focused on the automation of steps 1 to 3 in clinical diagnoses as outlined in the previous paragraphs. These steps inherently depend upon expert knowledge, regardless of the computational techniques that can be employed for automation.
Required general steps to automate this procedure using supervised classification learning can be characterised as follows:

Given an individual patient $p$, a reference population $P$ is explicitly characterised, such that $p \in P$.

Based on expert knowledge, a set of preliminary signs $S_1$ is selected.

An oracle $O$ containing information about the correlation between signs in $S_1$ and hypotheses in $H$ for diagnostics for population $P$ is retrieved from a database of oracles. An oracle is a collection of pairs $(x_i, y_i)$ in which $x_i$ is a tuple of values of signs and $y_i$ is the corresponding diagnosis as observed in a patient $p_i \in P$.
The cardinality of oracle $O$ (i.e. the number of patients $p_i$) must be sufficiently large to ensure appropriate levels of precision and reliability of diagnostics performed about any patient $p \in P$, which correspond to a sufficiently large similarity level between empirical classifiers and the best available classifier as determined by unknown upper bounds provided by human expertise.

The correlation between signs and hypotheses is characterised with respect to a family of functions that best captures how decision procedures can be optimised for this correlation. In machine learning and statistics jargon, such families of functions are called kernel functions. The choice of the appropriate family of kernels (e.g. polynomial, gaussian, sigmoid etc. (Scholkopf and Smola, 2001)) is based on visual inspection of the correlation graphs and expert knowledge about the methods and techniques used to build models for machine learning. The choice of a family of kernels determines another (unknown) upper bound on the precision and reliability of diagnostics.

Hypotheses are prioritised and, for each hypothesis, a second set of signs $S_2$ is selected, based on expert knowledge and the reference population.

A second family of kernels is selected given the observed correlation between $S_2$ and the corresponding hypothesis.

Samples of appropriate cardinality are selected from the population $P$, considering lower bounds for the cardinality of samples as a function of the required precision and reliability of diagnostics that can ensure sufficiently high similarity between empirical classifiers and the best available classifier. These lower bounds can be characterised based on existing theoretical results as detailed in section 3. If a sample has cardinality above the identified lower bounds, we have statistical guarantees that obtained classifiers will be, with high probability, sufficiently close to the best available classifier given the upper bounds determined by the expert choices as identified in the previous steps.

Automated decision procedures are built for the diagnosis of patient $p$ employing a sample of appropriate cardinality and the sets of signs $S_1$ and $S_2$.

Decision procedures are employed to build information to support the medical doctor in diagnostics.
Similarly, the task of therapy advice can be characterised as a set of steps:

An individual patient $p$ comes to the medical doctor, featuring a previously identified most likely diagnosis. The doctor selects a set of tests $T$ to perform, in order to decide on a therapy plan. As in diagnosis, the choice of the set $T$ is based on:

Previous experience and scholarly knowledge.

Tacit selection of a reference population $P$ such that $p \in P$.


Given the outcomes of $T$, the doctor builds a personalised therapy plan for patient $p$. This plan can contain additional decision points in the form of IF-THEN rules.

The effectiveness of the treatment is assessed based on empirical observation of attributes which are determined based on expert knowledge and the reference population.
General steps to automate this procedure can be characterised as follows:

Given a patient $p$ and corresponding most likely diagnosis, and given the reference population $P$ employed in the analysis, alternative therapy plans are ranked according to previous empirical results. Ranking is based on (most likely nonlinear) correlation analyses between different therapy plans and their corresponding effectiveness measurements. These analyses are built using samples of appropriate cardinality, such that precision and reliability can be ensured with respect to the best available information about therapy plans. The best available information about different plans, in turn, is based on a history of empirical results. In oncology, given the high mortality related to certain types of tumour, these empirical results can be based on relatively small numbers of cases which are, in turn, described in great detail.

The most highly ranked therapy is selected and applied to $p$.
This characterisation of both clinical diagnosis and therapy advice as sets of steps aims at the identification of the issues that impose limitations on the precision and reliability of diagnosis and therapy advice:

The choice of the sets of signs $S_1$ and $S_2$, as well as the set of therapy plans, depends upon previous experience and scholarly knowledge (expert knowledge).

The choice of the reference population to ground analyses also depends on expert knowledge.

The choice of the family of kernels to characterise the correlations between signs and hypotheses, as well as between diagnostics and therapy plans, depends upon experience and scholarly knowledge about statistical behaviour of specified random variables with respect to stochastic decision procedures in the domain of interest.
These limitations are imposed upon the best possible decision procedures for patients in the population $P$. Additionally,

The cardinality of the samples used to build empirical estimates for the best possible decision procedures is determined given lower bounds provided by statistical analysis.
An explicit account of these limitations and their reasons is useful to clarify that:

Most limitations in the quality of automated clinical diagnosis and therapy advice originate from previous experience of medical doctors, scholarly knowledge and the reference population employed to make decisions. These issues are not particular to automated systems and are equally present in standard medical practice.

Limitations in precision and reliability of decisions due to sample cardinality can be safely bounded provided that we have access to sufficiently large samples. These bounds, as detailed in the following section, are grounded on rigorous mathematical analysis.
In the following sections we discuss in detail the lower bounds for the cardinality of samples to estimate decision procedures, as well as the dynamics of the reference populations $P$.
3 Domain characterisation
The initial problem we consider, as characterised in the previous sections, is as follows: given a reference population $P$ of patients featuring sufficiently high homogeneity with respect to the correlation between observable signs $S$ and corresponding diagnoses, we wish to determine a minimal cardinality $m$ for oracles of the form $O = \{(x_i, y_i)\}$ such that, for the indices $i = 1, \ldots, m$, we have that $p_i \in P$, each $x_i$ is a tuple of values of signs in $S$ corresponding to observations about patient $p_i$, and $y_i$ is a confirmed diagnosis for patient $p_i$ with respect to a disease $d$, taking one of two values, one indicating confirmed disease and the other indicating refuted disease.
We assume a high correlation between values of signs $x_i$ and diagnostics $y_i$. We do not assume, however, that this correlation is perfect (which would correspond to a fully deterministic power, at least in a theoretical limit in which information about all patients in $P$ is available, to diagnose disease $d$ given observed values of signs $S$). The set of all pairs (signs, diagnostics) can, therefore, be partially inconsistent, accounting for an (unknown yet determined) upper bound on the precision of diagnoses as well as of the definition of therapy plans. This upper bound determines the best available classifiers. Our task is to build empirical classifiers based on oracles which are provably sufficiently similar to the best available classifiers.
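The effect of partial inconsistency on the best attainable precision can be illustrated with a minimal sketch. It assumes a toy, fully known population and an unrestricted family of classifiers, in which case a majority vote per tuple of sign values stands in for the best available classifier. The data, the label encoding (1 for confirmed, 0 for refuted) and the function names are hypothetical and are not taken from the experiments described in this article.

```python
from collections import Counter, defaultdict


def majority_classifier(oracle):
    """Map each tuple of sign values to its most frequent diagnosis."""
    counts = defaultdict(Counter)
    for signs, diagnosis in oracle:
        counts[signs][diagnosis] += 1
    return {signs: c.most_common(1)[0][0] for signs, c in counts.items()}


def agreement(classifier, oracle):
    """Fraction of pairs whose diagnosis matches the classifier output."""
    hits = sum(1 for signs, diagnosis in oracle
               if classifier.get(signs) == diagnosis)
    return hits / len(oracle)


# Hypothetical toy population: pairs (tuple of sign values, diagnosis).
population = [
    (("high", "present"), 1),
    (("high", "present"), 1),
    (("high", "present"), 0),  # same signs, different diagnosis: inconsistency
    (("low", "absent"), 0),
    (("low", "absent"), 0),
    (("low", "present"), 1),
]

best = majority_classifier(population)
best_agreement = agreement(best, population)
print(best_agreement)  # 5 of the 6 pairs can be classified correctly
```

In this toy case no classifier defined on the sign values alone can agree with more than five of the six pairs, which is the kind of upper bound that the inconsistent pairs impose.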
The theory of Probably Approximately Correct Learning (Valiant, 1984), as further extended to cope with partial inconsistencies (Haussler, 1992; Mohri et al., 2012), can provide us with lower bounds for the cardinality of the oracle $O$ as a function of:

$N$: the cardinality of the valued signs space. Assuming that each sign $s_j$, $j = 1, \ldots, n$, can have a finite set of values with cardinality $k_j$, we have that $N = \prod_{j=1}^{n} k_j$.

$\epsilon$: precision, determined as an upper bound for the acceptable disagreement between an empirical classifier and the best available classifier. That is, given a tuple of values of signs, the probability that the empirical classifier and the best available classifier provide the same diagnosis is at least $1 - \epsilon$.

$\delta$: reliability, as an upper bound for the risk of building a classifier whose disagreement with the best available classifier is above the specified value $\epsilon$. That is, there is a probability below $\delta$ of selecting, among the classifiers that can be built using any oracle, one whose disagreement with respect to the best available classifier is above $\epsilon$.
Following Mohri et al. (2012), we can define a lower bound for the cardinality of $O$ (denoted as $m$) as:
$$ m \geq \frac{1}{2\epsilon^{2}} \left( \ln N + \ln \frac{2}{\delta} \right) $$
The same lower bound applies for diagnosis and for therapy advice, if we use, for example, Support Vector Classification for diagnosis and Support Vector Correlation for therapy advice (Mohri et al., 2012). As an example, assuming , we will have . This is a realistic assumption, considering the number of parameters and corresponding values which are usually considered by a medical doctor for the tasks under consideration here.
Employing this value, estimates can be obtained for the lower bound $m$ given values for precision $\epsilon$ and reliability $\delta$ as presented in Table 1.
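Table 1: Lower bounds for the cardinality of oracles, for given values of precision (rows) and reliability (columns).

precision \ reliability    0.1    0.2    0.3
0.1                        400    366    346
0.2                        100     92     87
0.3                         45     41     39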
As a concrete illustration, according to Table 1, if we wish to have a probability below $0.1$ that a classifier will be built with disagreement above $0.1$ with respect to the best available classifier, then we need to have access to a reference population with at least $400$ patients.
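The entries of Table 1 can be checked with a short computation. The sketch below assumes the finite-case bound from Mohri et al. (2012) for learning with partial inconsistencies, in which the sample size must be at least the sum of a logarithmic term for the space and ln(2/reliability), divided by twice the squared precision. The cardinality 149 used for the logarithmic term is purely an assumption, chosen here because it reproduces every entry of the table; it is not a value stated in this article, and the function name is hypothetical.

```python
import math


def oracle_lower_bound(log_space_term, epsilon, delta):
    """Smallest integer m with m >= (log_space_term + ln(2/delta)) / (2*epsilon^2)."""
    bound = (log_space_term + math.log(2.0 / delta)) / (2.0 * epsilon ** 2)
    return math.ceil(bound)


# ASSUMPTION: 149 is not taken from the article; it is the integer for which
# this bound matches all nine entries of Table 1.
ASSUMED_SPACE_CARDINALITY = 149
log_term = math.log(ASSUMED_SPACE_CARDINALITY)

# Keys are (precision, reliability) pairs, as in the rows and columns of Table 1.
table = {
    (eps, delta): oracle_lower_bound(log_term, eps, delta)
    for eps in (0.1, 0.2, 0.3)
    for delta in (0.1, 0.2, 0.3)
}

for eps in (0.1, 0.2, 0.3):
    print(eps, [table[(eps, delta)] for delta in (0.1, 0.2, 0.3)])
```

The quadratic dependence on precision dominates: halving the acceptable disagreement roughly quadruples the required number of patients, whereas tightening reliability changes the bound only logarithmically.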
These results bring existing methods and techniques to the context of clinical diagnosis and therapy advice, and provide medical doctors with concrete parameters and specifications to allow the development of systems based on classification learning for direct support of their activities. They are based, however, on an implicit assumption that any reference population and corresponding oracles for any disease under consideration are static. This assumption is not observed in practice:

Patients pass away, and new patients appear all the time.

Environmental factors (e.g. pollution rates, dietary habits, stress levels) affect the reference population and the extent to which observed signals correlate with diagnoses.
For these reasons, it is reasonable to assume that the reference population undergoes small updates all the time. In order to take into account these updates, we introduce in the next section the concept of drifting domains and show how it can be employed to build more refined and precise estimates for clinical diagnosis and therapy advice.
4 Drifting domains
Some implicit assumptions about our domain have been used in the previous sections:

Our domain is finite, even though it can be large. This way, technical assumptions about probability distributions and logical deductions can be simplified respectively to discrete distributions and propositional reasoning.

Our domain is static and fixed, even though we may not have complete information about each and every element of the domain.
We challenge the second assumption, given our previous consideration that our domain of interest undergoes permanent updates. We assume that these updates are gradual and smooth, as we believe that this assumption is realistic and it simplifies our analyses.
In order to characterise this assumption, we call domains in which these characteristics are found drifting domains. A drifting domain is, therefore:

A finite domain whose cardinality can be unknown and is permanently updated with small random values which can be positive or negative.

Such that a fixed set of signs characterise each and all elements in the domain.

Such that each sign admits a finite set of values.

Such that the value of each sign associated to each element in the domain is permanently updated, in such a way that no “sudden jump” in a value can be observed.
Given these assumptions, and considering that an undetermined time interval elapses between the moment data is collected to build oracles and the moment oracles are used for classification of patients, one reasonable strategy to build more accurate oracles is to assume that, for each value of each sign collected from a patient, the actual present value of that sign can be a different value “around” the observed value.
One possible way to formalise this strategy is to assume that, for each observed value of a sign, the actual present value lies within an interval centred on the observed value. In order to keep calculations simple, we can assume a probability distribution around the observed value, such as a normal distribution whose mean is the observed value. If, additionally, all values are assumed to be discrete approximations of real scalar values, we can further assume that the present values of a sign are, with high probability (above $0.95$), within two standard deviations below and above the observed value.
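This way, for each observed value $v$ we build the interval $[v^-, v^+]$ such that $v^- = v - 2\sigma$ and $v^+ = v + 2\sigma$, where $\sigma$ is the standard deviation of the assumed distribution. If we consider the extreme values in this interval, we have for each observed value the two values $v^-$ and $v^+$.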
Given $n$ as the cardinality of the set of signs, and given one specific observation about one patient belonging to an oracle, reasoning based on extremes of a surrounding interval for the value of each sign builds $2^n$ alternative “versions” of that observation. Assuming that the cardinality of a sample is $m$, we then have a collection of “possible worlds” whose cardinality is defined as:
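A minimal sketch of this construction follows, under stated assumptions. Each sign contributes two extremes, taken as the observed value minus and plus two standard deviations, so one observation with $n$ signs yields $2^n$ versions. The count of possible worlds in the sketch further assumes that a world picks one version independently for each of the observations in the sample, which gives $(2^n)^m = 2^{nm}$ for a sample of cardinality $m$; this reading, the function names and the numeric values are illustrative assumptions.

```python
from itertools import product


def observation_versions(observed, sigmas):
    """All versions of one observation built from the two extremes of each sign."""
    extremes = [(v - 2 * s, v + 2 * s) for v, s in zip(observed, sigmas)]
    return list(product(*extremes))


def count_possible_worlds(n_signs, sample_size):
    """ASSUMPTION: each world chooses one version independently per observation."""
    return (2 ** n_signs) ** sample_size


# Hypothetical observation with three signs and their standard deviations.
versions = observation_versions((10.0, 4.0, 7.0), (1.0, 0.5, 0.25))
print(len(versions))                # 2**3 versions of the observation
print(count_possible_worlds(3, 2))  # worlds for a sample of two observations
```

Under this reading the number of worlds grows exponentially with both the number of signs and the sample size, so the sketch enumerates versions for a single observation only.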
If we build one empirical classifier for each possible world, we can test the observations of a new patient with respect to different classifiers. Two possibilities can occur:

All classifiers agree on the diagnostics for the patient. In this case, this diagnostics is strengthened by being tested considering all variations of the observed sample given the drifting domain under consideration.

We obtain conflicting classifiers across the possible worlds. In this case, upon final decision of the medical doctor, three different strategies can be considered:

Cautious strategy: the doctor concludes that data is inconsistent and/or insufficient for decision and requires a second cycle of observations, selection of a reference population etc. hoping to be able to resolve the conflict.

Asymmetric strategy: in diagnosis false negatives can be more harmful than false positives, i.e. it can be more damaging to diagnose an unhealthy patient as healthy than the opposite. In this case, if at least one classifier diagnoses the patient as unhealthy, following this strategy the patient can be taken for further examination as potentially unhealthy.

Uncertainty-based strategy: some heuristics can be built to assess uncertainty degrees corresponding to the conflicting outcomes of classifiers. For example, some voting procedure can be adopted, such that the confidence on a diagnostics is based on the proportion of classifiers that indicate that diagnostics.
For diagnosis, any of the three strategies can be adopted. For therapy advice, however, only the first strategy makes sense, given that the selection of the wrong therapy plan is potentially harmful in a symmetric way.
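The three strategies can be sketched as simple functions over the list of diagnoses returned by the classifiers built for the different possible worlds. The label encoding (1 for unhealthy, 0 for healthy), the function names and the use of a plain majority vote for the uncertainty-based strategy are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical label encoding.
UNHEALTHY, HEALTHY = 1, 0


def cautious(predictions):
    """Return the shared diagnosis, or None when classifiers conflict
    (signalling that a new cycle of observations is required)."""
    return predictions[0] if len(set(predictions)) == 1 else None


def asymmetric(predictions):
    """Flag the patient as potentially unhealthy if any classifier says so."""
    return UNHEALTHY if UNHEALTHY in predictions else HEALTHY


def uncertainty_based(predictions):
    """Return the most voted diagnosis and the proportion of votes it received."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical outputs of four classifiers for one patient.
predictions = [UNHEALTHY, UNHEALTHY, HEALTHY, UNHEALTHY]

print(cautious(predictions))           # conflict, so no diagnosis is returned
print(asymmetric(predictions))         # patient sent for further examination
print(uncertainty_based(predictions))  # majority diagnosis with its vote share
```

For therapy advice only the first function would apply, since a conflict among classifiers leads to a new cycle of observations rather than to a choice of plan.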

5 Conclusion
In this article we have built considerations on how to close the gap between laboratory experimentation and medical practice on using classification learning for clinical diagnosis and therapy advice, with a specific focus on oncology.
More specifically, we have provided an explicit and detailed account of how systems for classification learning can be inserted into the workflow of a medical doctor to support diagnosis and therapy advice. Given that an important barrier to the application of machine learning techniques in medicine can be the requirement of large volumes of data, which can point to the necessity of building and running prohibitively costly clinical trials, we have also developed an analysis of sample complexity estimates to build oracles to train systems based on supervised learning, and suggested a pathway to build oracles based on clinical trials of viable dimensions. Finally, we have considered the dynamics of populations from which samples can be taken, and proposed a strategy to refine the analysis of classification results that takes these dynamics into account, based on a proposed notion of drifting domains.
The considerations we have built here are based on actual experiments under development at Autem Medical Research, where we have worked on novel, less aggressive therapies for certain types of cancer and on novel, noninvasive, speedy and low cost technologies for early diagnosis of cancer. Following the guidelines presented here, we have been able to build classifiers to make diagnosis with error rates below based on oracles such that . These classifiers are, at present, undergoing rigorous analysis and shall be described in a specific article in the near future.
References
 Chen and Asch (2017) Chen, J. H., Asch, S. M., 2017. Machine learning and prediction in medicine: beyond the peak of inflated expectations. The New England journal of medicine 376 (26), 2507.
 Haussler (1992) Haussler, D., 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100 (1), 78–150.
 Kaplan et al. (2018) Kaplan, H., Berry, A., Rinn, K., Ellis, E., Birchfield, G., Wahl, T., Liu, X., Tameishi, M., Beatty, J. D., Dawson, P., Mehta, V., Holman, A., Atwood, M., Alexander, S., Bonham, C., Summers, L., Khalil, I., Hayete, B., Wuest, D., Zheng, W., Liu, Y., Wang, X., Brown, T. D., 2018. Abstract 5299: Machine learning approach to personalized medicine in breast cancer patients: Development of data-driven, personalized, causal modeling through identification and understanding of optimal treatments for predicting better disease outcomes. Cancer Research 78 (13 Supplement), 5299–5299.
 Kononenko (2001) Kononenko, I., 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine 23 (1), 89–109.
 Kourou et al. (2015) Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., Fotiadis, D. I., 2015. Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal 13, 8–17.
 Mohri et al. (2012) Mohri, M., Rostamizadeh, A., Talwalkar, A., 2012. Foundations of machine learning. MIT press.
 Peek et al. (2015) Peek, N., Combi, C., Marin, R., Bellazzi, R., 2015. Thirty years of artificial intelligence in medicine (AIME) conferences: A review of research themes. Artificial intelligence in medicine 65 (1), 61–73.

 Scholkopf and Smola (2001) Scholkopf, B., Smola, A. J., 2001. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
 Shortliffe et al. (1973) Shortliffe, E. H., Axline, S. G., Buchanan, B. G., Merigan, T. C., Cohen, S. N., 1973. An artificial intelligence program to advise physicians regarding antimicrobial therapy. Computers and Biomedical Research 6 (6), 544–560.
 Valiant (1984) Valiant, L. G., 1984. A theory of the learnable. Communications of the ACM 27 (11), 1134–1142.