1 Introduction
Networks are collections of interconnected entities, e.g. social networks of communicating actors, ecological networks of flora and fauna commensalism, and computer networks.[31, 26] Network-graphs, or simply graphs, are mathematical objects consisting of a vertex set, e.g., one vertex per network entity, and an edge set, e.g., a set of pairs of vertexes involved in a network connection, that represent the arrangements of pairwise relationships in the network.[9]
Methodology developed in the fields of social networks, network science, and graph theory provides for the analysis of relational data generated from a variety of measurements across scientific disciplines.[22, 8] Community detection is the process of identifying exceptionally dense subnetworks of mutually well-connected network entities, known as communities, that often have functional meaning in the network.[12] Notable approaches to community detection include the clique percolation method [24], spectral partitioning [5], degree-corrected stochastic block models [18], modularity optimization [21], and multi-slice network community detection [20]. These approaches are designed for the unsupervised partitioning of the vertex set of a graph into unusually cohesive subsets of vertexes, and with varied applications in sociology [33], computer architecture [13], and biology [2], illustrate that myriad methodology in community detection procedures are applied broadly in scientific research.
Community detection procedures commonly integrate network connectivity exclusively, without regard for other quantities of interest, e.g., auxiliary measurements on vertexes.[10]. We refer the reader to recent expositions on the state of the science of community detection in networks including [16, 30, 34, 27] that detail existing methods with respect to certain applications. It is essential to note that constraints are not commonly imposed on the community structure in networks by existing methods.
We are motivated to partition a nation-wide network of hospitals into subnetworks of hospitals that (i) exhibit a high level of within-group patient sharing as quantified by the network modularity quality function and (ii) consist of at least one hospital that has hosted a minimum number of implantable cardioverter defibrillator (ICD) surgeries to define as health care communities (HCCs) that provide a comparable level of cardiac care. We utilize data acquired from health insurance claims made to the Medicare national social insurance program during the period 2006-2011 in addition to the quantity of ICD surgeries at the major cardiovascular referral centers known as cardiac care facilities (CCFs). The preeminent work in this domain is the Dartmouth Atlas in which the then Center for the Evaluative Clinical Sciences, Dartmouth Medical School [28] assigned hospitals to one of 306 health referral regions (HRRs) representing markets for tertiary medical care. The significant contributions made by the Dartmouth Atlas to health services research motivates our work but our methodology, which is based on network-graph topology, is a departure from the HRRs strict adherence to geographic proximity that continues to be pursued by some efforts to modernize the designations. [29] Related work in defining regions according to network topology for the purposes of measuring health care variation across regions are nonetheless fundamentally based on geography. [14, 17, 15] Note that geographic proximity could be an alternative or additional constraint to our network-based method but is not necessary. In particular, if the analysis health care services provided by telemedicine or remote monitoring is desired then geographic information may be ignored intentionally in the discovery of network communities.
We develop in the following a recursive backtracking procedure for greedy search of high modularity communities that are each constrained by the requirement to contain at least one of the so-called special vertexes, which in the present application are the network-graph representatives of the hospital network cardiac care facilities at which a minimum number of ICD surgeries were performed and apply the method to approximate the optimal health care community assignment for each hospital in the network.
The inclusion of constraints in community detection, particularly a constraint of the nature under present consideration, is pertinent across many domains and applications. The community structure in any social network consisting of entities possessing distinct classes or attributes, e.g., standard entities versus noteworthy entities, may be more accurately characterized by our methodology. For example, a researcher seeking to partition the individuals of social communication networks that are the result of correspondences among senior and junior members of an organization, e.g., the Enron corpus [19], may enforce through our methodology the constraint that each community of organization members in the network consist of at least one senior member. (A reason for doing this is that the researcher knows how the organization structures its workforce but does not know which senior employee is grouped with which junior employees). This example is retrospective in that latent groups are being rediscovered by the researcher. The same example could apply in a prospective form. For example, given a network of professional relationships among its employees of different ranks, a company could use the algorithm to form optimal groups with at least one senior employee per group. Analogous examples might also arise in peer mentoring situations – a teacher in a classroom might use a friendship network among the students to form workgroups such that each group consists of at least one student who is performing very well (on course for the top possible grade) who can help the other students understand the material. On the other hand, a researcher investigating the community structure in a trophic network [11] may desire a partition of species in which each community contains at least one member of the Canidae taxonomic family. We emphasize that applications of our methodology abound in networks consisting of heterogeneous entities.
2 Background
We represent the nation-wide hospital network with the weighted, undirected graph $\mathcal{G}=(\mathbf{V},\mathbf{E})$ which designates one vertex $v\in \mathbf{V}$ for each hospital and a positively weighted edge $\{u,v,{w_{uv}}\}\in \mathbf{E}$ for each pair of vertexes $\{u,v\}\in {\mathbf{V}^{2}}$ that represents interacting entities in the network, e.g., patient-sharing hospitals, where ${\mathbf{V}^{2}}=\mathbf{V}\times \mathbf{V}$ is the set of vertex pairs. Note that if ${w_{uv}}=0$ then $\{u,v,0\}\notin \mathbf{E}$. The weight ${w_{uv}}$ of edge $\{v,u,{w_{uv}}\}$ reflects the quantity of shared patient visits recorded in Medicare claims data between the hospitals represented by vertexes $u,v\in \mathbf{V}$.
Suppose that the vertex subset ${\mathbf{V}^{\prime }}\subseteq \mathbf{V}$ contains the vertexes that are called special vertexes and represent network entities of a distinguishing nature. In the present scenario, ${\mathbf{V}^{\prime }}$ consists of the vertexes that represent the cardiac care facilities at which at least τ ICD surgeries, for some $\tau \ge 0$,2 were performed and define ${\mathbf{V}^{\prime }}={\mathbf{V}^{\prime }_{\tau }}$ accordingly. The information contained in the labeling of these special vertexes in ${\mathbf{V}^{\prime }_{\tau }}$ is information not encoded in the graph topology induced by the edge set E itself and, therefore, its integration into a standard community detection procedure must be made explicitly.
A useful mathematical representation of the weighted, undirected network-graph $\mathcal{G}$ in a variety of approaches to network science applications is the non-negative, symmetric $p\times p$ matrix W in which ${W_{ij}}={w_{{v_{i}}{v_{j}}}}$, for ${v_{i}},{v_{j}}\in {\mathbf{V}^{2}}$, i.e. $i,j\in \mathbb{Z}$ such that $1\le i\lt j\le p$ corresponding to some ordering of the vertexes ${v_{k}}\in \mathbf{V}$. The network modularity quality function
where $\mathbf{s}\in {_{p}}={\{1,2,\dots ,p\}^{p}}$ is a community assignment vector consisting of at most p unique community membership labels, ${d_{j}}={\textstyle\sum _{i=1}^{p}}{W_{ij}}$ is the degree of vertex ${v_{j}}$, and $2m={\textstyle\sum _{j=1}^{p}}{d_{j}}$ is the total degree of the network-graph $\mathcal{G}$. Note that ${W_{ij}}$ is the observed (true) weight of the edge connecting vertexes ${v_{i}}$ and ${v_{j}}$ whereas ${E_{ij}}=\frac{{d_{i}}{d_{j}}}{2m}$ is the expected weight of the edge connecting these vertexes under randomization via the configuration model. [3]
(2.1)
\[\begin{aligned}{}Q(\mathbf{s}|\mathbf{W})& =\frac{1}{2m}{\sum \limits_{i,j}^{p}}\left({W_{ij}}-\frac{{d_{i}}{d_{j}}}{2m}\right)1\{{s_{i}}={s_{j}}\},\end{aligned}\]The process of unconstrained modularity optimization involves the identifying of a partition $\{{\mathbf{V}_{1}},\dots ,{\mathbf{V}_{n}}\}\subseteq \mathbf{V}$ of size $n\le p$ corresponding to the community assignment vector s such that ${s_{k}}=r$ if ${v_{k}}\in {\mathbf{V}_{r}}$, for some $r\in \{1,2,\dots ,n\}$. For a partition of size $n\le p$, the Stirling number of the second kind [1]
with $n\le {p^{\prime }}$, which displays asymptotics similar to the number of unconstrained partitions. The special case when $n={p^{\prime }}$, for which $C(p,{p^{\prime }},n)={n^{p-n}}\in O({n^{p}})$, amounts to the unconstrained optimization of the modularity quality function over the set of non-special vertexes $v\in \mathbf{V}\setminus {\mathbf{V}^{\prime }}$ with each of the special vertexes $v\in {\mathbf{V}^{\prime }}$ each belonging to a unique community. This observation accordingly supports a bottom-up approach that is a part of our forthcoming procedure.
\[\begin{aligned}{}S(p,n)& =\frac{1}{n!}{\sum \limits_{k=0}^{n}}{(-1)^{k}}\left(\genfrac{}{}{0.0pt}{}{n}{k}\right){(n-k)^{p}}\end{aligned}\]
counts the number of unique community assignments. Note that if the ${p^{\prime }}=|{\mathbf{V}^{\prime }_{\tau }}|$ special vertexes are partitioned into n vertex sets then the number of feasible community assignments is
(2.2)
\[\begin{aligned}{}C(p,{p^{\prime }},n)& ={n^{p-{p^{\prime }}}}S({p^{\prime }},n),\end{aligned}\]On the extreme end of the spectrum, if $\tau =\omega $, where ω equals the maximum number of ICD surgeries performed at any cardiac care facility in the network, then ${p^{\prime }}$ achieves its minimum tenable value. In particular, if the maximum number ω of ICD surgeries performed at any hospital in the network is unique then ${p^{\prime }}=1$. Moreover, if $\tau \gt \omega $ then no feasible solution to the constrained optimization problem exists. In general, ${p^{\prime }}$ is a non-increasing function of $\tau \ge 0$.
The community assignment vector that optimizes the modularity quality function
where ${_{p}}={\{1,2,\dots ,p\}^{p}}$, is the community assignment vector which, on average, labels similarly vertexes that are more well-connected than expected to the same community among the $n\le p$ different communities.
(2.3)
\[\begin{aligned}{}{\mathbf{s}_{opt}^{\tau }}& =\underset{\mathbf{s}\in {_{p}}}{\operatorname{arg\,max}}\hspace{5.0pt}Q(\mathbf{s}|\mathbf{W}),\end{aligned}\]Suppose that $R(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})$ is the boolean function that returns true if the community assignment vector s satisfies the desired property that each community contain at least one special vertex ${v^{\prime }}\in {\mathbf{V}^{\prime }_{\tau }}$ and define the restricted community assignment vector that optimizes the modularity quality function
where, in the present scenario,
(2.4)
\[\begin{aligned}{}{\mathbf{s}_{R}^{\tau }}& =\underset{\mathbf{s}\in {_{p}}}{\operatorname{arg\,max}}\hspace{5.0pt}\left\{Q(\mathbf{s}|\mathbf{W}):R(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})\right\},\end{aligned}\]
\[\begin{aligned}{}R(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})& =\left\{\begin{array}{l@{\hskip10.0pt}l}\text{True}\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}\mathcal{U}(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})=\mathcal{U}(\mathbf{s}|\mathbf{V})\\ {} \text{False}\hspace{1em}& \hspace{2.5pt}\text{otherwise,}\end{array}\right.\end{aligned}\]
where $\mathcal{U}(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})$ is, for example, the set of unique community labels for those vertexes $v\in {\mathbf{V}^{\prime }_{\tau }}$. Defined as such, $R(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})$ returns $\text{True}$ when each community is constituted by at least one special vertex ${v^{\prime }}\in {\mathbf{V}^{\prime }_{\tau }}$.We define a health care community assignment as the designation of hospitals to communities in a community assignment vector that optimizes the network modularity quality function $Q(\mathbf{s}|\mathbf{W})$ in Equation (2.1). The quantity $Q(\mathbf{s}|\mathbf{W})$ is proportional to the difference between the fraction of (weighted) edges within communities and the expected fraction if edges were randomly distributed according to the configuration model, i.e. edge randomization with degree distribution conserved, subject to the constraint that to each community belongs at least one special vertex, e.g. a vertex representing a cardiac care facility where at least $\tau \ge 0$ ICD implantations occurred. Specifically, we seek to maximize modularity while requiring that the number of cardiac care facilities per health care community is at least $\tau \ge 0$: a restriction we encode with Boolean variable $R(\mathbf{s}|{\mathbf{V}^{\prime }_{\tau }})$ indicating the feasibility of a community assignment.
Existing work in this realm considers incorporating additional information in the form of individual entity labels and pairwise constraints, i.e., that two vertexes must be labeled similarly or differently. [7] A restriction of the type we consider here has, to our knowledge, remained unstudied. In the context of the hospital network, the HRRs defined previously associated local health care markets to the tertiary care facilities where the plurality of the residents were referred for major cardiac procedures. An HRR is a reflection of its regional health care market and, because necessarily within each is a hospital specialized in cardiac surgery, a comparison across regions is facilitated. In an effort to standardize cardiac care among health care subnetworks, we present an initial undertaking towards developing a paradigm of constrained community detection. In particular, because constraint satisfaction in optimization problems provides context, the health care communities identified by our constrained community detection approach have real-world utility.
The organization of this article is as follows. In Section 3, we mathematically formulate the constrained optimization problem. In Section 4, we define a greedy, recursively-backtracking procedure for identifying high-quality, constrained communities. Utilizing our procedure in Section 5, we estimate the health care communities of the nation-wide hospital network and illustrate in Section 6 our method on a network-graph of well-known community structure.
3 Problem Specification
Suppose that $\mathcal{G}=(\mathbf{V},\mathbf{E})$, with $p=|\mathbf{V}|$, is a network-graph and that ${\mathbf{V}^{\prime }}\subset \mathbf{V}$ is a subset of vertexes. Moreover, suppose that $\mathbf{s}\in {\{1,2,\dots ,n\}^{p}}$, for some $n\in {\mathbb{Z}_{+}}$, is a vector of labels such at each label is applied at least once. We define the collection of $\left(\genfrac{}{}{0.0pt}{}{p}{2}\right)$ binary variables ${x_{ij}}=1\{{s_{i}}={s_{j}}\}$, for $i,j\in \{1,2,\dots ,p\}$, along with the corresponding collection of edge values ${b_{ij}}={w_{ij}}-{d_{i}}{d_{j}}/2m$ and the vertex values ${u_{k}}=1\{{v_{k}}\in {\mathbf{V}^{\prime }}\}$, for $k\in \{1,2,\dots ,p\}$. Let $\mathbf{B}=[{b_{ij}}]$.
The constrained optimization problem under consideration is described as follows.
\[\begin{aligned}{}\text{Maximize:}\hspace{2.5pt}& f(\mathbf{s},\mathbf{B})={\sum \limits_{ij}^{p}}{b_{ij}}{x_{ij}}\hspace{2.5pt}\text{over}\hspace{2.5pt}\mathbf{s}\in {\{1,2,\dots ,n\}^{p}}\\ {} & \hspace{10.0pt}\hspace{2.5pt}\text{and}\hspace{2.5pt}n\in {\mathbb{Z}_{+}}\\ {} \text{Subject to:}\hspace{2.5pt}& {x_{ij}}=1\{{s_{i}}={s_{j}}\}\hspace{2.5pt}\text{for}\hspace{2.5pt}i,j\in \{1,2,\dots ,p\}\\ {} & \hspace{10.0pt}\sum \limits_{\{h:{s_{h}}=r\}}{u_{h}}\ge \eta ,\hspace{2.5pt}\text{for}\hspace{2.5pt}r\in \{1,2,\dots ,n\}\\ {} & \hspace{10.0pt}\eta \ge 0\\ {} & \hspace{10.0pt}{u_{h}}=1\{{v_{h}}\in {\mathbf{V}^{\prime }}\}\hspace{2.5pt}\text{for}\hspace{2.5pt}h\in \{1,2,\dots ,p\}.\end{aligned}\]
This discrete optimization problem is, in general, at least as difficult as the corresponding, unconstrained NP-hard decision problem which seeks an answer to whether or not a partition of the vertex set exists with a modularity quality function value of at least some minimum value. In particular, the boundary value $\eta =0$ corresponds to the special case of an unconstrained optimization problem in the description above.A greedy unconstrained method may readily agglomerate communities toward achieving a local maximum of the network modularity function $f(\mathbf{s},\mathbf{B})$. On the other hand, satisfying the constraint requires the consideration of many permutations of such mergers and any recursive backtracking procedure for identifying high-quality, with respect to the network modularity function, has potentially intractably-many paths in the search space to traverse. Note that if the graph $\mathcal{G}$ happens to be unweighted then the adjacency matrix W is binary and, up to quantities involving a $2m$ denominator, Equations (4.1) and (4.2) involve integers on similar scales. Accordingly, the various sequences of community assignment label updates through the recursive backtracking procedure are reduced. In the application of our method toward the defining of health care communities in the nationwide social network of cardiac-related Medicare referrals between hospitals we make this binary transformation $\mathbf{A}=1\{g(\mathbf{W})\gt \zeta \}$, where $g:{\mathbb{R}^{p\times p}}\mapsto {\mathbb{R}^{p\times p}}$ is defined in Section 5.1, the indicator function $1\{P\}=1$ if proposition P is true and $1\{P\}=0$ otherwise, and $\zeta \gt 0$ is determined from the data, to facilitate tractable computations.
4 Greedy, Recursive-backtracking Procedure
We begin this section by presenting the existing work on unconstrained network modularity quality function optimization upon which our generalized procedure, which seeks to maximize the network modularity quality function over the space of feasible community assignment vectors, is based. Subsequently we outline our method for constrained optimization of the network modularity quality function and provide pseudocode for pertinent procedures.
4.1 Louvain Method
The Louvain method [4] is a greedy modularity optimization procedure that proceeds in two fundamental and repeating steps: (i) the local optimization of the modularity quality function and (ii) the folding of communities into super-vertexes to create a super-graph in which the newly-created super-vertexes are representatives in the super-graph of the communities in the former graph and the newly-created edge weights between super-vertexes are the sums of the edge weights between pairs of communities in the original graph. This process is continued until the super-graph resulting from repeated local optimizations and folds no longer warrants a subsequent fold, at which point, the folding process is reversed to recover the vertex-level community assignment vector, see [4] for details.
4.2 Modifications to Base Louvain Procedures
We modify the local optimization procedure of the Louvain method to cycle through the vertex set V in turn, with the exception of vertexes $v\in \mathbf{U}$, and assigning its community label ${s_{k}}$ to
\[\begin{aligned}{}{s_{k}}& \gets \underset{{s_{k}}\in \{1,\dots ,n\}}{\operatorname{arg\,max}}\hspace{5.0pt}Q({s_{k}}|{\mathbf{s}_{-k}},\mathbf{W}).\end{aligned}\]
This process is continued until no community label has been modified in one complete pass through the vertex set. The modification to restrict the updating of vertex community assignment labels to vertexes $v\in \mathbf{V}\setminus \mathbf{U}$ permits, in the present context, the community labels of special vertexes ${v^{\prime }}\in {\mathbf{V}^{\prime }_{\tau }}$ to be updated while the community labels of non-special vertexes $v\in \mathbf{V}\setminus {\mathbf{V}^{\prime }_{\tau }}$ are held constant and vice versa. Moreover, we modify the folding procedure to trace a subset of vertexes $\mathbf{U}\subseteq \mathbf{V}$ through to the super-vertexes which ultimately represent them as part of the community folding process. This permits, in the present context, super-vertexes to inherit the special designation.4.3 Constraint Corrected Louvain Method
Perhaps the most straightforward approach toward constrained greedy optimization of the modularity quality function leads through the unconstrained greedy optimization procedure of the Louvain method. That is, one may consider first estimating the unconstrained community assignment vector and then, subsequently, modify the unconstrained solution to a feasible state. We provide pseudocode in Procedure 1 for this recursive-backtracking procedure that modifies an infeasible community assignment vector in an agglomerative approach, i.e. by merging communities, in a manner that least reduces modularity until a constraint-satisfying solution is achieved. Of course, given enough community mergers, this procedure is guaranteed to eventually halt.
Note that within Procedure 1, the Correct local function is simplified by the fact that the merging of two communities that are respectively constituted by vertexes with indices in ${\mathbf{I}_{1}}$ and ${\mathbf{I}_{2}}$ results in the modularity gain proportional to
More specifically, if vertex ${v_{j}}$ currently belongs to the same community as vertexes with indices in ${\mathbf{I}_{0}}$ and vertex ${v_{j}}$ is to be relabeled to belong to the same community as vertexes with indices in ${\mathbf{I}_{1}}$ then the change in modularity is proportional to
These two formulas provide, in general, substantial computational savings over direct $O({p^{2}})$ computation of the network modularity quality function in Equation (2.1), see Appendix A for derivations.
(4.1)
\[\begin{aligned}{}{\Delta _{comm}}Q& \propto \sum \limits_{{i_{1}}\in {\mathbf{I}_{1}}}\sum \limits_{{i_{2}}\in {\mathbf{I}_{2}}}{W_{{i_{1}}{i_{2}}}}-\frac{1}{2m}\left(\sum \limits_{{i_{1}}\in {\mathbf{I}_{1}}}{d_{{i_{1}}}}\right)\left(\sum \limits_{{i_{2}}\in {\mathbf{I}_{2}}}{d_{{i_{2}}}}\right).\end{aligned}\](4.2)
\[\begin{aligned}{}{\Delta _{vert}}Q& \propto \left({W_{jj}}-\frac{{d_{j}^{2}}}{2m}\right)+\left(\sum \limits_{{i_{1}}\in {\mathbf{I}_{1}}}{W_{j{i_{1}}}}-\sum \limits_{{i_{0}}\in {\mathbf{I}_{0}}}{W_{j{i_{0}}}}\right)\\ {} & -\frac{{d_{j}}}{2m}\left(\sum \limits_{{i_{1}}\in {\mathbf{I}_{1}}}{d_{{i_{1}}}}-\sum \limits_{{i_{0}}\in {\mathbf{I}_{0}}}{d_{{i_{0}}}}\right).\end{aligned}\]4.4 Modified Core Procedures
Many discrete optimization problems, including the present one, are computationally difficult and frequently intractable. That is, the size of the solution space is, in general, prohibitively large for a complete search and the objective function is, in general, not monotonic in a mathematically-useful manner. For these reasons, an exact solution to the modularity optimization problem is not often sought but, instead, a satisfactory solution that is encountered by an algorithmic procedure is frequently reported as is the case with, for example, the Louvain method described above.
Prior to proposing our approach to the constrained optimization problem described in Section 3, we present our modified versions of the greedy, local optimization (Procedure 2) and community folding (Procedure 3) procedures. In particular, we provide for the flexibility to exclude a subset of vertexes from the local optimization process in Procedure 2 so that, in the present context, the community labels corresponding to special vertexes ${v^{\prime }}\in {\mathbf{V}^{\prime }_{\tau }}$ may be held fixed while the community labels corresponding to regular, i.e. non-special, vertexes $v\in \mathbf{V}\setminus {\mathbf{V}^{\prime }_{\tau }}$ are locally and greedily optimized. Moreover, in the process of folding, we provide the adapted Procedure 3 which traces a set of vertexes through the folding process and reports which super-vertexes, i.e. the vertexes resulting from folded communities, are representative of any vertexes from that original set. In the present context, this permits the tracing of special vertexes through the folding process and reporting on the status of the super-vertexes that result from the folding process. Pseudocode for Procedures 2 and 3 are provided in the following.
4.5 Initialization Generalizations
The default community assignment vector ${\mathbf{s}^{(0)}}=(1,2,\dots ,p)$ in Procedure 2 is consistent with the initialization of the Louvain method and represents a bottom-up approach in the greedy optimization process. Experimentally, we have found that a top-down approach is often more effective in the context of the constraint that each community must consist of at least one special vertex.. We provide a summary of this procedure in the following pseudocode for Procedure 4.
4.6 CMOP
We finally present the pseudocode for the main Constrained Modularity Optimization Procedure (CMOP) in Procedure 5.
Among the three constrained, high-modularity community assignment vector ${\mathbf{s}_{td}}$, ${\mathbf{s}_{mr}}$, and ${\mathbf{s}_{null}}$, as computed in Algorithm 5, the vector corresponding to the greatest modularity value is ultimately returned.
5 Application: Definition of Health Care Communities
The nationwide hospital network we consider consists of $p=4734$ hospitals, as depicted in Figure 1a, among which 1388 (29.3%) are CCFs hosting at least $\tau =1$ ICD surgeries and correspond to special vertexes in ${v^{\prime }}\in {\mathbf{V}^{\prime }}$. We begin this analysis by processing the data.
Figure 1
Hospital Network-graph: (a) plot of log histogram density vs log degree for all hospitals (black points) and only CCF hospitals (green points). Lines of best fits for the head (red line) and tail (blue line) of the log degree distribution of all hospitals. (b, c) The edges retained after pruning, see Appendix B. All hospitals are marked as small yellow points whereas, in (b) the vertexes with degrees in the tail of the degree distribution are marked with red points with radii proportional to their degree and in (c) the vertexes representing cardiac care facilities (CCFs) are marked in green points with radii proportional to the total number of ICD implantations that took place during the study period at that facility. (d) Using a database of all United States zip codes, along with their respective latitude and longitude coordinate, we indicated on the map the community of the nearest hospital to that zip code, with the understanding that a patient in that zip code is likely to travel to the nearest hospital in case of cardiac emergency.
5.1 Data Processing
The weighted degree distributions of all hospitals and the CCFs in Figure 1a illustrate the different orders of magnitude in their respective quantities. We compute the Pearson chi-square test statistic for independence
\[\begin{aligned}{}{\chi _{ij}^{2}}& =\frac{{\left({W_{ij}}-{E_{ij}}\right)^{2}}}{{E_{ij}}},\end{aligned}\]
where ${E_{ij}}=\frac{{d_{i}}{d_{j}}}{2m}$, for $i,j\in \{1,2,\dots ,p\}$, and note that in Figure 1c, a rather clear trend reflecting the exponential decay of the ${\chi _{ij}^{2}}$ quantities in the nationwide hospital network-graph. We set $\zeta =266.4843$ as the 0.995 quantile threshold of the collection of $\left(\genfrac{}{}{0.0pt}{}{p}{2}\right)$ chi-square quantities above and define the unweighted adjacency matrix A of the pruned nationwide hospital network as $\mathbf{A}=1\{{\boldsymbol{\chi }^{2}}\gt \zeta \}$, where ${\boldsymbol{\chi }^{2}}\in {\mathbb{R}^{p\times p}}$ with elements ${\chi _{ij}^{2}}$ for $i,j\in \{1,2,\dots ,p\}$ and the indicator function $1\{\cdot \}$ is applied element-wise.The argument s of the objective function $Q(\mathbf{s}|\mathbf{W})$ in Equation (2.1) is of principal concern, as opposed to the weighted network itself. Accordingly, we reduce the network dataset by pruning the weighted network edges that are relatively inconsequential in the evaluation of the modularity quality function in Equation (2.1). If, for instance, ${W_{ij}}-{E_{ij}}\approx 0$, i.e., the observed weight of the edge connecting vertexes ${v_{i}}$ and ${v_{j}}$ is approximately as expected under the configuration model, then whether vertexes ${v_{i}}$ and ${v_{j}}$ have the same or different community assignments results in a small marginal change in $Q(\mathbf{s}|\mathbf{W})$. Conversely, if the observed weight ${W_{ij}}\ll {E_{ij}}$ or ${W_{ij}}\gg {E_{ij}}$ then the community assignments of vertexes ${v_{i}}$ and ${v_{j}}$ has a relatively greater marginal impact on $Q(\mathbf{s}|\mathbf{W})$. Accordingly, a large value of ${({W_{ij}}-{E_{ij}})^{2}}$ implies that the correspondence of the community assignments of vertexes ${v_{i}}$ and ${v_{j}}$ are important. The ${\chi _{ij}^{2}}$ value standardizes this value so that comparisons among the magnitudes of ${({W_{ij}}-{E_{ij}})^{2}}$ across all vertex pairs $\{({v_{i}},{v_{j}}):i,j\in \{1,2,\dots ,p\}\}$ are meaningful.
The United States map in Figure 1b indicates that the bulk of ICD surgeries occur in hospitals located in the Eastern States and have a higher frequency of shared patients with physicians associated with other hospitals compared to the entire population of hospitals in the network. Note that in Figure 1a the degree distributions of all hospitals and, separately, that of cardiac care facilities only are similar and, moreover, Figure 5c of the quantiles of respective degree distribution of regular (non-special) vertexes against special vertexes indicates that there is no significant difference between the two distributions, see Appendix A. The consequence of this fact is that the distribution of cardiac care facilities within communities is not expected to be an artifact of the network modularity quality function.
5.2 Unconstrained Hospital Network Communities
To estimate unconstrained communities in the hospital network we used the cluster_louvain function that belongs to the igraph R package. [25, 6] The resulting communities are mapped to local zip codes, and displayed in Figure 2, to reflect the hospital community into which a resident of each zip code would likely be entered upon a cardiac emergency.
Figure 2
Unconstrained Hospital Communities: (a) Histogram of base ten logarithm of the pruned network degree distribution using 0.995 quantile threshold on original patient-sharing edge weights in the hospital network. (b) Selection of threshold ${\tau _{\alpha }}=36$, where the subscript reflects the $\alpha =0.265$ quantile of the distribution of ICD procedures performed at CCF hospitals, see Figure 5a, as the change point in the trend of number of communities in violation of the constraint, i.e., communities that do not possess a hospital where at least ${\tau _{\alpha }}$ ICD implantations were performed, appears to change. Note that this trend and the corresponding location of interest depends on the graph topology. (c) The result of first employing the Louvain method on the nationwide network of hospital referrals for cardiac care and subsequently extending the communities to local zip codes via the $K=1$ nearest neighbor criterion. The unconstrained communities exhibit some contiguity. However, because the communities are not constrained to be geographically contiguous, some of the communities have highly elongated shapes as they include at least one hospital far apart from the others. (d) Zip codes labeled as belonging to a community in violation of the constraint (red) or as belonging to a feasible community (blue).
The geographical proximity of unconstrained hospital communities, as quantified by the network modularity quality function and estimated via the Louvain method, is relatively expected. In fact, if we consider the five most geographically-proximal hospitals to each zip code in the United States then we find that $72.1\% $ of zip codes are closest to five hospitals all of the same unconstrained community and $24.5\% $ are closest to five hospitals from two different unconstrained communities.
5.3 Defining Heath Care Communities
We now apply Procedure 1 to the nationwide unweighted hospital network-graph to identify the health care communities. As it turns out, the unconstrained community assignment vector ${\mathbf{s}_{opt}}$ results in a single network community that is in violation of the constraint that each community have at least one cardiac care facility (CCF) where at least one implantable cardioverter defibrillator (ICD) procedure was performed during the study period.
In order to more completely and accurately illustrate the utility of our methodology, we restrict the special vertex set to consist of vertexes corresponding to hospitals at which at least ${\tau _{0.265}}=36$ ICD procedures were performed, where ${\tau _{\alpha }}$ is the α quantile of the distribution of ICD procedure counts across all CCFs in the nationwide network, see Figure 2. This restriction tightens the constraint of the optimization problem and, practically, corresponds to the requirement that each health care community discovered by Procedure 1 has a greater lower-bound on the quantity of ICD procedures performed therein.
Our procedure considers each of the initial conditions of the recursive-backtracking Procedure 1 and ultimately selects ${\mathbf{s}_{opt}}$. Subsequently, two options are considered: (i) should a special vertex be relabeled according to the community in violation and then subsequently update the elements of the community assignment vector by applying an alternating sequence of Procedures 2 and 3 until the local optimum ${\mathbf{s}_{hcc}}={\mathbf{s}_{R}}$ is identified via many computations of the forms in Equations (4.1) and (4.2). We consider the length $p=4734$ vector ${\mathbf{y}_{0}}$ containing the number of ICD procedures taking place at each corresponding hospital, e.g. ${y_{k}}=0$ if ${v_{k}}$ does not represent a hospital where any ICD procedures were performed and otherwise ${y_{k}}\gt 0$. Define the vector ${\mathbf{y}^{(\alpha )}}$ with elements
\[\begin{aligned}{}{y_{k}^{(\alpha )}}& =\left\{\begin{array}{l@{\hskip10.0pt}l}{y_{k}}\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}{y_{k}}\gt {\tau _{\alpha }}\\ {} 0\hspace{1em}& \hspace{2.5pt}\text{otherwise,}\end{array}\right.\end{aligned}\]
for some ${\tau _{\alpha }}\ge 0$. Note that, as displayed in Figure 2b, with ${\tau _{0.96}}=\mathtt{quantile}({\mathbf{y}_{0}},0.96)$ and ${\mathbf{V}^{\prime }_{\alpha }}=\{{v_{k}}:{y_{k}}\gt {\tau _{\alpha }},\hspace{2.5pt}\text{for}\hspace{2.5pt}k\in \{1,2,\dots ,p\}\}$ then, with this subset ${\mathbf{V}^{\prime }_{\alpha }}\subset {\mathbf{V}^{\prime }}$ of special vertexes, the number of communities in violation of the constraint has risen to seven. We subsequently executed Procedure 1 with $\mathbf{W}=\mathbf{A}$ and ${\mathbf{V}^{\prime }}={\mathbf{V}^{\prime }_{\alpha }}$. Please see Appendix B for more details.The role of ${\tau _{\alpha }}$ in this application is that of τ in the preceding description of the general procedure. Note that this is the only parameter that modifies the contents of the auxiliary information contained in ${\mathbf{V}^{\prime }}\subseteq \mathbf{V}$. Other parameters, including ζ, which control the topology of the unweighted network-graph A, are pertinent to the general problem of modularity quality function optimization and are related to the base Louvain method.
Figure 3
Depiction of Health Care Communities (a) The nationwide-network of hospitals is partitioned into a constraint-satisfying communities of hospitals such that each community contains at least one of the CCFs exceeding the defined ICD procedure threshold. Because the communities are not constrained to be geographically contiguous, some of the communities have highly elongated shapes as they include at least one hospital far apart from the others. (b) A bipartite graph reflecting the overlap between unconstrained communities (bottom row) and constrained communities (top row). The width of the line segment between vertexes, each of which is representing a community consisting of a number of vertexes that is proportional to its dot radius on the plot, reflects the relative overlap of the communities in each class. The number below or above each vertex equals the number of CCFs contained within each community.
5.4 Results Using a Subset of Special Vertices
By applying Procedure 1 to the unweighted adjacency matrix A and the reduced special vertex set ${\mathbf{V}^{\prime }_{\alpha }}$, we approximate the maximizer community assignment vector ${\mathbf{s}_{hcc}}={\mathbf{s}_{R}}$ of the network modularity quality function subject to the constraint that each community include at least one vertex ${v^{\prime }}\in {\mathbf{V}^{\prime }_{\alpha }}$ and note that it corresponds to a network modularity quality function value of $Q\approx 0.6814$, whereas the unconstrained modularity value of $Q\approx 0.6805$. It turns out that the marginal adjustments subsequent to the constraint-satisfaction elements of Procedure 1 may indeed identify a yet greater local maximum than the community assignment vector identified by the Louvain procedure. We do not consider this marginal improvement as worthwhile, however, due to the computational time requirements of such adjustments. We nevertheless note that, importantly, our strategy for identifying high-quality constrained communities is able to do so effectively. A plot of the health care communities identified by Procedure 1 is provided in Figure 3. We note that $\text{nmi}({\mathbf{s}_{opt}},{\mathbf{s}_{R}})\approx 0.9432$, implying that the relative mutual correspondence between the two vertex community assignments. Note that this procedure requires approximately 14.35 minutes to halt compared with the near instantaneous computation of unconstrained communities on the same network with the Louvain method.
It is imperative for the reader to recognize the Louvain method as agnostic to vertex attributes and, in particular, to the accumulation of vertexes with particular attributes in communities discovered by the method. The modularity quality function, which the Louvain method optimizes in a greedy manner, is exclusively a function of the edge set of a graph. Accordingly, modifying the designation of special vertexes, as in the present context, does not modify the composition of the resulting communities discovered by the method. Our proposed method, by contrast, is specifically devised to address the composition of discovered network communities, that is, the assignment of special vertexes to each community. It follows that, if the number of special vertices is diminished, as was demonstrated in the present application, then the constraint that each community contain at least one special vertex becomes more stringent and the community structure discovered by our procedure is modified. We have depicted, in the context described in this section, precisely how the community structure of the network is modified by including a strict constraint. Although the computational time required to compute the constraint-satisfying community structure exceeds that of the time required to compute an unconstrained community structure, the leveraging of additional information related to vertex attributes is worthwhile, in this context and in many other contexts as discussed in the Introduction section of the present article, to facilitate meaningful comparisons across communities that are standardized by the constraint.
6 Simulation on Zachary Karate Club
In the following, we consider each k-tuple, for $k\in \{2,3,4,5,6\}$ as the set of special vertexes ${\mathbf{V}^{\prime }}$, among the set of $p=34$ vertexes in the unweighted Zachary Karate Club social network-graph. [32] We apply Procedure 1 to determine, for each of these tuples of special vertexes, to record the modularity of each constraint-satisfying community assignment vector returned by Procedure 1, the initial position selected by the procedure, and the relative length of the computational time for the procedure to halt. The results of the simulation study are presented in Figure 4.
The quantity $C(p,{p^{\prime }},n)$ in Equation (2.2) counts the number of feasible community assignments for a given $p=|\mathbf{V}|$, ${p^{\prime }}=|{\mathbf{V}^{\prime }}|$, and number of communities n. While this number reflects the number of community assignments necessary to brute-force check and, therefore, guarantee that the optimum defined in Equation (2.4) has been obtained, our greedy procedure is guided by the network topology and halts in many fewer iterations. Since the number of communities n is automatically chosen by the procedure, the computational time required for our procedure to halt is a function of (i) the number ${p^{\prime }}$ of special vertexes and (ii) the distribution of the special vertexes within the network-graph. For example, if ${n_{opt}}$ is the number of unique labels, i.e. communities, represented in ${\mathbf{s}_{opt}}$ in Equation (2.3) and ${p^{\prime }}\lt {n_{opt}}$ or if ${p^{\prime }}\ge {n_{opt}}$ but the special vertexes are frequently labeled similarly in ${\mathbf{s}_{opt}}$ then some work is necessary to compute ${\mathbf{s}_{R}}$ of in Equation (2.4).
Figure 4
Simulation: Zachary Karate Club (a) The optimal community assignments in this social network of $p=34$ vertexes. (b) The marginal relative frequency that each initial condition was selected by Procedure 5 over all $\left(\genfrac{}{}{0.0pt}{}{p}{d}\right)$ choices of special vertexes, for $d=2,3,4,5,6$. (c) The relative gain in modularity (modularity of constrained community assignment)/(modularity of unconstrained community assignment) - 1 by number of special vertexes in the network $d=2,3,4,5,6$. (d) Logarithm base 10 of the relative (to the median computational time with two special vertexes in the network) amount of computational time for Procedure 5 to halt. We did not include instances when the unconstrained community assignment vector satisfied the constraint.
7 Conclusion
Our method for identifying network communities optimizes a quality function while adhering to constraints. The results in this paper establish the utility of penalized optimization in community detection. Our method is versatile and amenable to many types of constraints on the composition of communities. We note that our procedure is valid for any constraint which is an increasing function of the variable of interest, e.g., number of CCF hospitals belonging to a community, number of cardiac surgeons, quantity of cases involving improper medical procedures, etc. The key requirement of our constrained optimization procedure is that the merging of two communities must not be the basis for the resulting community to be in violation of the constraint. A constraint that imposes a maximum ICD volume is, for example, not of this type.
There exists a disconnect between network science and health services research due in part to the incongruence between mathematical elegance and real-world constraints. We have provided an illustration of the application of both a pure (unconstrained) method and one with constraints. We solved the practical problem of partitioning a network of hospitals with the constraint that the number of ICD surgical procedures that have taken place at at least one hospital belonging to each community exceeds some threshold. Though our method advances both the community detection and heath services literature, it is not complete from the perspective of a health care policy maker since many real-world constraints remain to be incorporated. Another type of constraint is, for example, given the geographic locations of hospitals, a requirement that communities not exceed a defined geographic maximum diameter or that they satisfy a geographic congruity constraint. On the other hand, one shouldn’t necessarily seek to impose geographic contiguity constraints, for example, if one is interested in analyzing the effect of telemedicine or remote monitoring, for which the organization of health care does not need to conform as much to geography. This article is an initial exploration of a line of thinking that we anticipate will substantially advance the practical utility of community detection.
In terms of health policy, our future research involving an outcomes-based analysis of the communities discovered by our method, as constrained here by minimum ICD surgery volume and subsequently by other factors, will lead to enhanced acuity and potentially greater statistical power for studying variations in health care markets (based on patient referral patterns). Through standardizing the composition of the HCCs, our method provides the tools for such comparisons to be made meaningfully.