1. Introduction
In the context of evidencebased medicine and guidelines, there is still a high degree of unwarranted differences in individual diseasespecific healthcare pathways. A healthcare pathway can be broadly seen as the route that a patient follows from the first contact with a medical doctor, e.g., the general practitioner, through referral to specialists or hospitals to the completion of treatment for any specific disease. It is a timeline in which all treatmentrelated events can be entered, including diagnoses, treatments, and further consultations or hospital readmissions. The novel availability of medical routine data, especially in the universityclinical context, not only makes it possible to show differences in treatment. Rather, it may also allow to uncover typical clinical healthcare pathways, i.e., typical temporal sequences of clinical interventions or readmissions into the clinic, and to make them available to other clinicians in context. This could enable to improve general standards of clinical care and thus overall health outcomes. Still, pathways of patients in a large provider as a university hospital are heterogeneous as many diagnostic and treatment options exist and patients are partly readmitted to the hospital after discharge for different reasons. A similar healthcare pathway could still be considered a typical healthcare pathway. For this purpose, key questions would be how to measure such similarity and how to decide whether two different paths are still similar and when they would be considered different. To date, very few methodological works on the measurement of healthcare similarity can be found and these are predominantly informaticsbased. For instance, assuming that healthcare pathways depend on factors such as choices made by the treating physician, huang_similarity_2014 suggest a fully unsupervised algorithmic approach based on a probabilistic graphical model representing a mixture of treatment behaviors by latent features.
From a clinical and also patientcentered perspective, it is essential to keep the care pathway as short as possible and prevent complications or diseaserelated hospital readmissions. In this paper, motivated by an application from urologic clinical practice, we would therefore like to focus on objective and universallyrecorded clinical event measures including main events ‘hospital treatment’ and ‘hospital readmission’. We understand the pathway as a temporal process with possible transitions from a single initial treatment state to hospital readmission of different type. We consider the timetofirst hospital readmission, whichever comes first, which constitutes a competing risks setting (andersen_competing_2002). Specifically, we aim to judge similarity of such pathways for samples of two different populations: group (i) patients receiving specific inhouse diagnostics before hospital treatment, and group (ii) patients not receiving
specific inhouse diagnostics before hospital treatment. Our interest in the similarity of these pathways has the following reasons: While a certain disease requires specific treatment that is often only offered in specialized clinics, diagnostical tools are often more diverse and partly offered in outpatient facilities. Therefore, treatment data including diagnostics performed are often not readily available from the nonclinical sector (at least in Germany) and can not yet or only insufficiently be used for the investigation of healthcare pathway similarities. From such a path perspective, one may ask whether it makes a difference in terms of hospital readmission whether a particular diagnostic procedure was performed in the clinic or not. From the perspective of the clinical practitioner, it may be plausible to assume that the probability of hospital readmission differs only by treatment, not by different pretreatment diagnostics. If we could statistically show a similarity of the pathways of both groups, the latter assumption would be confirmed and we may attribute similar pathways to typical pathways.
In this paper, we propose a multistate modelbased approach to reveal such path similarities of two groups of individuals. Multistate models based on counting processes for event history data have been successfully applied to analyze progression of a disease (andersen_statistical_1993; andersen_multistate_2002; manzini_advantages_2018). In the context of care pathways or similarity, however, they have been used only rarely so far and for other purposes. gasperoni_multistate_2017 investigated multistate models for evaluating the impact of risk factors on heart failure care paths involving multiple hospital admissions, admissions to home care or intermediate care units or death. gasperoni_evaluating_2020 considered potential similarities and differences among healthcare providers on the clinical path of heart failure patients.
Our approach differs from this work and we aim for testing the similarity of the transition intensities from two independent competing risks Markov models with constant intensities. Then, the problem is methodologically related to the meanwhile classical problem of
bioequivalence, which aims at demonstrating the similarity between two pharmacokinetic profiles by considering the area under the curve or the maximum concentrations of the two curves (see the monographs chowliu1992; wellek2010testingamong many others). However, none of these methods for establishing bioequivalence can be transferred to the comparison of transition intensities as they are usually developed under the assumption of normally distributed (independent) data. Further, although the asymptotic distribution could be derived for this case as well, an approach based on asymptotics would not yield satisfying power for small sample sizes or data with only few events. In fact in the following we will develop new bootstrap methodology to address this problem.
The paper is structured as follows: Section 2 describes the clinical healthcare pathways in the application example involving prostate cancer surgery. Section 3.1 introduces the competing risks notation for samples of two different populations. In Section 3.2
we describe a novel bootstrap procedure for testing similarity of transition intensities from two competing risks models. In a large simulation study created on the basis of the numbers and estimates from the application example we investigate the performance of our similarity approach with respect to different sample sizes and different similarity thresholds (Section
4). In Section 5 we briefly discuss how the results from the simulation study translate to the application example. We close the paper with a discussion in Section 6.2. Clinical healthcare pathways involving prostate cancer surgery
The application example that drove our methodological development comes from the clinical practice of the Department of Urology at the Medical CenterUniversity of Freiburg. The clinic covers the entire spectrum of urological diagnostics and therapy according to the current state of the art. As data basis, we use the German reimbursement claims dataset for inpatient healthcare, which was systematically integrated into a central database at the Medical CenterUniversity of Freiburg as part of the German Medical Informatics Initiative. For each inpatient case, the admission and discharge diagnoses (main and secondary diagnoses) are coded in the form of ICD10 (10th revision of the International Statistical Classification of Diseases and Related Health Problems) codes; in addition, all applied and billingrelevant diagnostic and therapeutic procedures are coded together with a time stamp in the form of OPS (“Operationen und Prozedurenschlüssel”) codes.
2.1. Hospital readmission after surgery with and without prior fusion biopsy
One of the most frequent reasons for inpatient admission at the Department of Urology is prostate cancer. One possible treatment option is the open or roboticassisted surgery with the resection of the prostatic gland along with the vesicular glands, also referred to as radical prostatectomy (mottet_eauestrosiog_2017). From our reimbursement claims database, we retrospectively identified all patients with prostate cancer who underwent open radical prostatectomy (ORP) at the Medical Center  University of Freiburg between 01 January 2015 and 01 February 2021. This includes all cases with OPS code 5604 (radical prostatovesiculectomy) irrespective of the concrete surgical procedure – but without the additional OPS code 5987 for robotic assistance – and resulted in a total of n=695 patients.
Prior to surgical intervention, diagnostics are performed in a variety of ways both in the clinic or in an outofhospital setting. The current diagnostic standard is a multiparametric magnetic resonance imagingbased pathway with targeted fusion biopsy (FB; OPS code 1465). However, only a part of the patients receives their biopsies at the Department of Urology, which often depends on the practice of the referring urologists in private practices. In our data n=213 (31%) patients received FB diagnostic prior to ORP, while a larger part of the patients, n=482 (69%), did not receive FB diagnostic at the Department of Urology prior to ORP. We did not place a time restriction on when exactly the FB diagnostic took place prior to ORP. Therefore, we distinguish the two populations based on the presurgery FB diagnostic obtained and are interested in their further healthcare paths regarding hospital readmission by means of the two independent competing risks models as illustrated in Figure 1.
Radical prostatectomy carries a risk of postoperative complications. One of the more common complications is lymphocele, which typically develops within a couple of weeks after surgery and can be treated at the Department of Urology. Patients may, however, be also readmitted to the clinic after radical prostatectomy for other reasons related to the initial surgery. A typical time window for surgeryrelated hospital readmission is 90 days after surgery. Therefore, competing outcomes of interest are different reasons for hospital readmission within 90days. In the data, we identified the most frequent readmission diagnoses defined by the ICD10 main diagnosis codes “C61: Malignant neoplasm of prostate” (model 1: n=18, 8%; model 2: n=60, 12%), “I89.8: Other specified noninfective disorders of lymphatic vessels and lymph nodes” (model 1: n=17, 8%; model 2: n=29, 6%), and combined all other observed diagnoses into one state “any other diagnosis” (model 1: n=6, 3%; model 2: n=31, 6%). While I89.8 is typically coded for a complication after radical prostatectomy requiring specific treatment, a readmission with a C61 diagnosis may mask diagnostic procedures after surgery only or specific followup treatment. For all patients in our data set we have at least 90 days clinical followup information available, so censoring is only administrative at 90 days after ORP. All cases had to be complete, that is, a discharge date for the initial stay with ORP as well as a potential readmission stay had to be present at the time of data retrieval.
We note here that the terms in Figure 1 describe the transition intensities to move from the initial state (ORP) into any of the competing states (hospital readmission with ICD10:I89.8, ICD10:C61, or any other ICD10) and are central in this work. They are formally defined in equation (4) in Section 3. Assuming the transition intensities to be constant over time, one may estimate them separately by dividing the sum of typejevents through the sum of persontime at risk in the initial state (see equation (10) in Section 3 for a formal definition). In our data, this yields the following estimates:
(1) 
The estimated constant intensities should be interpreted in the context of the scale in which the time was measured. In our case, the time was measured in days. Since the observation period for all patients was identical (90 days) and overall only few events were observed, the magnitude of the intensities can be appropriately converted into an approximate number of expected events using the formula: transition intensity estimate times observation period (in days) times sample size of the population under consideration (compare eq. (10) for the precise definition of the estimate). For example, for transition intensity estimate this means .
As the readmission intensities are overall low, from a pathway analytic perspective the question is whether they are sufficiently similar for patients with prior inhouse FB diagnostics versus without prior inhouse FB diagnostics w.r.t. the specific transition such that the two populations can be combined, e.g., for a common analysis on hospital readmission due to complications.
3. Similarity of competing risks processes for two populations
3.1. Competing risk models
To model the event histories as competing risks for samples of two different populations, we use two independent Markov processes
(2) 
with state spaces following andersen_statistical_1993. The processes have possible transitions from state to state with transition probabilities
(3) 
Every individual starts in state at time , i.e. . The timetofirstevent is defined as stopping time and the type of the first event is . Let
(4) 
denote the causespecific transition intensity from state to state for the th model, . The transition intensities completely determine the stochastic behavior of the process. In our application example, the two competing risk models with the initial state ORP and three competing risks each are shown in Figure 1, in which the transition intensities are assigned to the transition arrows.
3.2. Similarity of competing risk models
We are interested in the similarity between the transition intensities in the two models. In other words, we want to test the hypotheses
(5) 
versus
(6) 
Here denotes the maximal deviation between the functions and and are prespecified thresholds, defining for each pair of transition intensities the maximum deviation under which and are considered as similar.
In order to make the method easily understandable and to be able to provide closed form solutions for the estimates (for a discussion on that, see for example von2017) we will assume constant transition intensities throughout this paper. This assumption is frequently made in the literature (see for instance fay2003; choudhury2002 among many others). For the same reason we restrict ourselves to the case of no censoring (see Section 3.3 for a brief discussion of the rightcensored case).
In the following, we describe a novel bootstrap procedure for testing the hypotheses (5) and (6) for competing risk models with constant transition intensities, which is motivated by the methodology developed in detmolvolbre2015 for comparing regression curves. To be precise, assume that two independent samples and from Markov processes (2) are observed over the interval , containing the state and transition time of an individuals . We define
as the indicator that a state transition of the individual from to has occurred in the time interval (note that is either or ). We also denote by the corresponding transition time (if the transition time is undefined). Further we introduce the notation
which indicates whether at time the th individual of the th group is at risk or not. Under the assumption of constant transition intensities it then follows from andersen_multistate_2002 that the corresponding likelihood function in the th model is given by
where
(8) 
is the number of transitions from state to state in the th group,
is the total observation time of all individuals in the th group,
is the vector of transition intensities in model
and denotes the indicator of the event . The logarithm of (3.2) is given by(9) 
Taking the partial derivatives and equating to zero yields the maximum likelihood estimates (MLE)
(10) 
Via in (10) the intensity estimate depends on the time scale, as already pointed out at the end of Section 2. We now want to address the question of similarity as stated in the hypotheses (5) and (6). Due to the assumption of constant transition intensities the maximum deviation simplifies to
that is we consider the absolute difference between these intensities for all states
. In order to reject the null hypothesis in (
5) the differences between transition intensities have to be smaller than the prespecified margins for all states. Hence the test problem can be assessed by simultaneously testing the individual hypotheses(11) 
versus
(12) 
for all .
According to the intersection union principle (berger1982) the global null hypothesis in (5) can be rejected at a significance level of if the individual null hypotheses in (11) are rejected at a significance level of for all . This means in particular that there is no adjustment of the level necessary.
The following algorithm summarizes how these individual tests are performed.
3.1 Similarity of transition intensities via constrained parametric bootstrap

For both samples, calculate the MLE of the transition intensities and as given in (10
) and the corresponding test statistics
, . 
Similarity test for state : For each state do:

In order to approximate the null distribution we define constrained estimates of and minimizing the sum of the loglikelihood functions defined in (9) under the additional restriction
(13) that is we estimate the transition intensities such that the models correspond to the margin of the (individual) null hypothesis (11) for state . Further define
(14) and note that . Consequently, if the test statistic is larger or equal than the similarity threshold , which reflects the null situation, the original (and hence unconstrained) estimates and can be used.

Use the constrained estimates , derived in (14), to simulate bootstrap data and . Specifically we use the simulation approach as described in beyersmann2009, where at first for all individuals survival times are simulated with allcause hazard and then a multinomial experiment is run to decide on state with probability .

For the datasets and calculate the MLE and as in (10) and the test statistic for state as in Step (i), that is


Repeat steps (iib) and (iic) times to generate replicates of the test statistic and let denote the corresponding order statistic. An estimate of the quantile of the distribution of the statistic is then given by and the null hypotheses in (11) is rejected at the targeted significance level whenever . Alternatively a test decision can be made based on the value , where denotes the empirical distribution function of the bootstrap sample. Finally we reject the individual null hypothesis (11) for if for a prespecified significance level .

The global null hypothesis in (5) is rejected if
(15)
As stated above, the global null hypothesis (5) is rejected if all individual null hypotheses are rejected. As a consequence of this procedure the power of the test decreases with an increasing size of states in the model as these are leading to a higher number of individual tests (see berger1982 for theoretical arguments on this). More precisely, it is a well known fact that methods based on the intersection union principle can be rather conservative (see for example phillips1990), depending on the sample size, the variability of the data and the number of individual tests. It can be shown that the test is consistent and controls its level. The theoretical arguments for that follow from adapting the proofs of detmolvolbre2015 to the present situation. We will investigate the finite sample properties by means of a simulation study in Section 4.
3.3. Rightcensoring
Note that in case of rightcensoring the methodology described above can be extended by adding corresponding factors from the distribution of the censoring times to the likelihood in (3.2). This requires the assumption of independence between censoring times and survival times. Under the assumption of independence the MLE in (10) still remains valid. By estimating the censoring distribution from the data, Step (iib) in Algorithm 3.2 can be conducted by additionally simulating (bootstrap) censoring times , , and defining the observed time as the minimum of the survival time and the censoring time.
4. Simulation study
4.1. Design
In the following we will investigate the finite sample properties of the proposed methods by means of a simulation study, driven by the application example given in Section 2. We assume that individuals of two groups () are observed regarding three different outcomes over a period of days, hence we consider two competing risk models with each states over the time range . If there is no transition to one of the three states, an individual is administratively censored after these days. The data in the following simulation study is generated according to the algorithm described in beyersmann2009.
We consider in total four different scenarios, which are summarized in Table 1. For the first three scenarios we choose the transition intensities of the application example in (2.1) (compare also Figure 1). This choice results in true absolute differences of
which are also given in Table 1. In order to demonstrate the effect of different numbers of states, we start by testing for similarity of all three transition intensities simultaneously in the first scenario, whereas in the second and in the third scenario we only consider two states and hence only the difference of two transition intensities. Precisely, in Scenario 2 we only compare the transition intensities for State 1 and 3 and in Scenario 3 we only consider State 1 and 2, respectively. Finally, in the fourth scenario we choose identical models, that is and , respectively, resulting in a difference of for all transition intensities.
Intensities model 1  Intensities model 2  True absolute differences  
Scenario 1  0.001  0.0011  0.0004  0.0008  0.0017  0.0009  0.0002  0.0006  0.0005 
Scenario 2  0.001    0.0004  0.0008    0.0009  0.0002    0.0005 
Scenario 3  0.001  0.0011    0.0008  0.0017    0.0002  0.0006   
Scenario 4  0.001  0.0011  0.0004  0.001  0.0011  0.0004  0  0  0 
In other applications the number of patients ending up in one of the three states might be even smaller than the ones found in our application example. To this end, we consider a broader range of different sample sizes given by
where the choice of is the one closest to the application data in this paper and consequently the first three settings correspond to situations with less patients, particularly resulting in a smaller number of cases per state. For example, choosing results for the first model after 90 days of observation in on average 16 patients in state 1, 18 patients in state 2 and 6 patients in state 3 and for the second model in 12 patients in state 1, 26 patients in state 2 and 14 patients in state 3, respectively (note that the numbers of patients have been rounded due to an easier interpretability).
In order to simulate both the type I error and the power of the procedure described in Algorithm
3.2, we consider different similarity thresholds or, for scenarios 2 and 3, , respectively. Precisely we choose , where for the first three scenarios, the first four choices correspond to the null hypothesis (5) and the other three to the alternative in (6) (note that due to the sake of brevity not all choices are presented in the tables). Regarding the fourth scenario, we only consider as in this case we only simulate the power of the test.4.2. Type I errors
Table 2 displays the type I errors for scenarios 13. It turns out that the proportions of rejections of the null hypothesis (5) for the global test are close to zero. These findings are in line with the theoretical arguments given after Algorithm 3.2, as tests based on the intersection union principle tend to be conservative in some situations. However it also becomes visible that this effect decreases when considering only two states instead of three (see the columns corresponding to Scenario 2 and 3, respectively). Moreover we note that the individual tests yield a very precise approximation of the nominal level, as the proportion of rejections is close to in all scenarios under consideration.
The difference between type I error rates of the individual tests and the global test become in particular visible when considering the first row of Figure 2, which yields a visualization of the results presented for Scenario 1 in Table 2. Whereas the proportion of rejections are all around for the individual tests on all three states, the line indicating the results for the global test is close to zero.
Finally, the points on the left of Figure 3, corresponding to the smallest threshold, namely , display the type I errors for a sample size of in a scenario which is not on the margin but in the interior of the null hypothesis. In this situation type I errors are smaller and well below the nominal level. Considering the individual test on the first state the proportion of rejection is close to as the absolute distance , which is rather close to the chosen threshold . For the other two states we have and and hence, regarding the similarity thresholds of , these situations correspond even stronger to the null situation, resulting in lower type I errors of the individual tests, given by for state and 0.009 for state , respectively (compare Figure 3).
Scenario 1  Scenario 2  Scenario 3  

(200,200)  0.000 (0.055/0.057/0.052)  0.004 (0.047/0.048)  0.004 (0.051/0.064) 
(250,300)  0.000 (0.066/0.049/0.048)  0.002 (0.047/0.055)  0.001 (0.049/0.057) 
(300,300)  0.000 (0.064/0.053/0.047)  0.002 (0.048/0.058)  0.002 (0.046/0.042) 
(250,450)  0.000 (0.051/0.058/0.063)  0.004 (0.047/0.061)  0.000 (0.038/0.051) 
(300,500)  0.001 (0.067/0.064/0.063)  0.005 (0.038/0.062)  0.004 (0.052/0.063) 
(500,500)  0.000 (0.053/0.052/0.062)  0.005 (0.052/0.059)  0.002 (0.041/0.062) 
4.3. Power
Table 3 displays the simulated power of the global test and the individual tests for scenarios 1,2,3 and 4, respectively, as well as the two lower lines of Figure 2 visualize some of the results from Scenario 1 of Table 3. In general we observe that the test achieves a reasonable power in all scenarios under consideration and for increasing sample sizes the power converges to 1. For example, considering in Scenario 4, the simulated power lies between and , depending on the threshold under consideration (see Scenario 4 in Table 3). In particular keeping in mind the very small transition intensities (which result in only few cases in the several states, compare to the application example in Figure 1) these results are very promising. In addition, considering the first three scenarios it becomes obvious that the power increases significantly when just considering two instead of three states (compare for the same thresholds the results for Scenario 1 in Table 3 to Scenarios 2 and 3). This becomes also visible in the two lower rows of Figure 2 presenting the power of comparing states individually and simultaneously for the first scenario. When assuming the same similarity thresholds the power for the individual test on the second state is clearly below the observed values for the other two states. This results from the fact that the true absolute difference is given by and hence larger compared to and , which are given by and , respectively.
When comparing the scenarios with only two states, that is Scenario 2 and Scenario 3 in Table 3, we observe that the power of the global test is higher in the first. This holds for all sample sizes and choices of the threshold and results from the different power obtained for the individual tests, which is due to the underlying assumed transition intensities. However, as mentioned beforehand, this effect decreases with increasing sample sizes, where, for all scenarios under consideration, the power converges to 1.
Finally, Figure 3 displays the proportion of rejections for Scenario 1 in dependence of the chosen similarity threshold . We observe that for the first two choices all values, that is the proportion of rejections for the individual test and the global test, respectively, are below or close to as these situations correspond to the null hypothesis (see also the discussion at the end of Section 4.2). For the other three choices of presented in the right part of the figure simulations correspond to the alternative (6). Consequently, with increasing similarity thresholds, the proportion of rejections, which results in claiming similarity, increases.
Scenario 1  
(200,200)  0.083 (0.700/0.234/0.492)  0.217 (0.699/0.655/0.484)  0.618 (0.969/0.688/0.930) 
(250,300)  0.169 (0.839/0.307/0.655)  0.416 (0.836/0.787/0.622)  0.784 (0.995/0.800/0.987) 
(300,300)  0.192 (0.867/0.333/0.638)  0.457 (0.884/0.810/0.649)  0.820 (0.999/0.829/0.999) 
(250,450)  0.239 (0.863/0.380/0.761)  0.580 (0.852/0.893/0.753)  0.858 (0.995/0.863/1.000) 
(300,500)  0.282 (0.919/0.389/0.810)  0.701 (0.918/0.930/0.826)  0.941 (1.000/0.942/0.999) 
(500,500)  0.388 (0.981/0.467/0.845)  0.796 (0.982/0.948/0.851)  0.955 (1.000/0.957/0.998) 
Scenario 2  
(200,200)  0.382 (0.749/0.511)  0.685 (0.723/0.952)  0.932 (0.979/0.952) 
(250,300)  0.578 (0.864/0.669)  0.852 (0.859/0.992)  0.982 (0.994/0.988) 
(300,300)  0.594 (0.876/0.685)  0.869 (0.876/0.991)  0.989 (0.998/0.991) 
(250,450)  0.691 (0.887/0.782)  0.883 (0.883/0.999)  1.000 (1.000/1.000) 
(300,500)  0.765 (0.914/0.836)  0.914 (0.914/1.000)  1.000 (1.000/1.000) 
(500,500)  0.853 (0.984/0.866)  0.984 (0.984/1.000)  1.000 (1.000/1.000) 
Scenario 3  
(200,200)  0.184 (0.716/0.259)  0.496 (0.716/0.683)  0.666 (0.976/0.683) 
(250,300)  0.266 (0.842/0.309)  0.684 (0.840/0.813)  0.803 (0.991/0.810) 
(300,300)  0.264 (0.889/0.304)  0.735 (0.890/0.830)  0.827 (0.997/0.830) 
(250,450)  0.318 (0.884/0.373)  0.788 (0.884/0.900)  0.899 (0.999/0.900) 
(300,500)  0.374 (0.912/0.409)  0.831 (0.912/0.915)  0.913 (0.998/0.915) 
(500,500)  0.432 (0.974/0.446)  0.940 (0.975/0.965)  0.968 (1.000/0.968) 
Scenario 4  
(200,200)  0.514 (0.757/0.685/0.981)  0.738 (0.771/0.971/0.986)  0.954 (0.986/0.968/1.000) 
(250,300)  0.760 (0.882/0.865/1.000)  0.875 (0.881/0.994/0.991)  0.996 (0.998/0.998/1.000) 
(300,300)  0.837 (0.936/0.896/0.998)  0.930 (0.933/0.998/0.999)  0.998 (1.000/0.998/1.000) 
(250,450)  0.856 (0.939/0.910/0.999)  0.924 (0.927/0.998/0.998)  1.000 (1.000/1.000/1.000) 
(300,500)  0.925 (0.968/0.956/1.000)  0.953 (0.953/1.000/1.000)  1.000 (1.000/1.000/1.000) 
(500,500)  0.982 (0.990/0.992/1.000)  0.993 (0.993/1.000/1.000)  1.000 (1.000/1.000/1.000) 
5. Similarity of healthcare pathways involving prostate cancer surgery
We now want to address the question whether the readmission intensities for patients with prior inhouse FB diagnostic are similar to the ones of the patients without prior inhouse FB diagnostic (eq. (2.1)). Therefore we perform the test on similarity described in Algorithm 3.2 considering numerous different similarity thresholds , , on the given application example. The choice of these thresholds is motivated from the simulation studies presented in Section 4. In Table 4 we display the pvalues of the individual tests on states , and , respectively, for six different similarity thresholds. We observe that for the smallest threshold, that is , all individual pvalues are far above the nominal level of . For the individual pvalue of the test for the first state is now given by , which is below the nominal level and results in claiming similarity of transition intensities for the first state. Considering the same threshold for state 2 and 3, respectively, yields that for these states the individual null hypotheses cannot be rejected. Further, considering we observe that now similarity of the corresponding readmission intensities can be claimed for state 1 and 3, as both individual pvalues are below the nominal level. The same holds for , as the pvalues for state 1 and 3 are given by and , respectively, whereas the pvalue of the test for state 2 is given by . However, since for both thresholds, that is and , each pvalue for state 2 is larger than , the global null hypothesis in (5) cannot be rejected according to the decision rule (15). For the two largest choices of given by and respectively, all individual pvalues are well below which means that the global null hypothesis (5) can be rejected and similarity can be claimed for all three states, that is we decide for similarity of both patient populations regarding all their readmission intensities. Finally we observe that the same conclusion can be made for all thresholds fulfilling , and as this choice guarantees that all individual pvalues are below the nominal level of . In terms of difference in number of events this translates as follows: Assuming for example two samples of 350 patients each and followup of 90 days, these thresholds correspond to allowing for a difference of approximately at most 22, 38, and 25 events for transitions into states 1, 2, and 3 between both groups.
State  Similarity threshold  
0.0005  0.0007  0.0008  0.0010  0.0012  0.0015  
State 1  0.166  0.044  0.026  0.006  0.002  0.0001 
State 2  0.514  0.366  0.251  0.094  0.037  0.0001 
State 3  0.502  0.104  0.045  0.004  0.0001  0.0001 
6. Discussion
In this paper we developed a hypothesis test based on a constrained (parametric) bootstrap to assess the similarity of competing risk models with constant transition intensities. Specifically, we performed an individual test for each state and combined these individual tests by applying the intersection union principle. We examined the finite sample properties by numerous simulations motivated by an example application in urology, and demonstrated that the test properly controls its level and yields a reasonable power. It would be interesting to investigate further whether the power can be improved even more by not performing individual tests, but by defining a global test statistic that directly accounts for all states. This alternative test statistic might yield a procedure with increased power but comes at the cost of not being able to draw conclusions for each state individually as all information from the different states is summarized in one quantity.
We proposed measuring similarity by the absolute difference between transition intensities. However, instead of considering differences a similar methodology can be developed for comparing the ratios of the transition intensities. In the case of the application example, this would mean examining the following ratios to test for similarity: and . On the one hand, considering ratios would have the advantage that they are timeinvariant, i.e., not depending on the time scale anymore. On the other hand, for small transition intensities, i.e., settings with few events, differences may better communicate situations where there is no large difference in terms of intensities or events as compared to ratios. For instance, while is in fact fairly small this may by far not be assumed when examining the ratio . In general, the choice how to measure the deviation between the transition intensities depends on the goal of the study and should be carefully investigated by the researcher. This also applies to the corresponding equivalence thresholds which offer on the one hand a maximum of flexibility for our approach but on the other hand also provide the need of a very careful discussion in advance. Currently there are no guidelines fixing these thresholds in studies as considered in Section 2, which makes this decision an important topic for further research.
With respect to the application example, we were able to identify thresholds for which the global null hypothesis could be rejected and therefore the transition intensities are to be considered similar. The chosen thresholds were based on the results of the simulation study and were also chosen differently to illustrate the effect on the pvalues. This extensive procedure is not necessary for future applications of the method to clinical data, instead a careful preliminary determination of plausible equivalence thresholds is required. This can be done in such a way that one considers which difference in number of events one would still like to allow and then calculates the corresponding threshold accordingly, taking into account the examined time span and sample size. We point out here that while very stringent thresholds are often expected from an equivalence test of a therapeutic study, these would rather not be fulfilled in our application example. However, we can consider this somewhat less stringent and continue to assume similarity, since the actual goal is to use the overall data to examine an outcome that is supposed to be no longer directly related to the diagnostic procedure. A further strength of the kind of data we used in our application is that it was drawn from sets of claims data having a standardized format used for quality assurance and for the calculation of the German Diagnosis Related Groups system. The process and quality assurance measures for providing this dataset are highly standardized. The data are easily accessible and therefore provide a good source of information for this investigation.
A limitation of the methodology proposed in this article is the assumption of constant transition intensities, which may not be met in real data applications. However, our proposed approach based on parametric bootstrap allows in principle an extension to different parametric distributions of event times. This requires further extensive investigations, which are beyond the scope of this work. We therefore leave it for future research.
Comments
There are no comments yet.