This section describes the pipeline for identifying an effective subset selection of sink nodes. The steps involved are as follows:

Propagate the movements of all spacecraft in the Space System Scenarios to create a contact schedule (C).

Generating data transfer networks, including a data transfer network (\(\Lambda \)) and a passed-on network (B[g]) for every ground station \(g \in G\). These networks combine to produce an estimated data transfer network (A) for a given subset selection of ground stations.

Identify an Initial eigenvector-based selection of ground stations (sink nodes) using an eigenvector embedding of a ground station relationship network (\(\Gamma \)).

Perform an Exhaustive search optimisation based on Consensus dynamics for target coverage, where the objective is to rapidly drive source (target) nodes to consensus under the influence of sink nodes (see Problem definition).
Space system scenarios
The space system studied, and the simulation used to create a contact schedule (C), are described in more detail by Clark et al. (2022), but the relevant aspects are summarised here. The space system considered is based on the orbital positions and targets of the Spire Global, Inc. constellation that collects AIS data from ships globally. All 111 spacecraft that were operated by Spire Global, Inc. as of July 2021 are included in this case study, with their Keplerian orbit elements detailed in the data set (McGrath and Clark 2021). The spacecraft are in differing orbit planes, with 74 in sun-synchronous orbits, 22 at approximately 51.6 degrees inclination, 8 at approximately 37 degrees inclination, 4 in near-polar orbits, and 3 in near-equatorial orbits.
A representative set of target locations is defined for the case studies, based on data provided by Spire Global, Inc. for the 24-h period of 11 August 2019 14:09 UTC to 12 August 2019 14:08 UTC. This dataset provides the last reported position of all ships detected in this 24-h window. From this, 250 targets are positioned to approximate the locations of ships worldwide that were tracked from space (rather than via ground-based coastal AIS receivers), with these locations visualised in the "Results" section (Fig. 3). These 250 targets define the global targets scenario, while a subset of 16 targets located near the Caribbean is taken as the basis of the Caribbean scenario. Twenty ground station sites are considered in this study, with the locations of these sites also visualised in the "Results" section (Fig. 3).
A fixed-step integrator is used to propagate the motion of spacecraft for a defined period of time (T) and time step (\(\tau \)) to identify contacts (i.e. visible ground stations or targets on the ground). These contacts are collated in a contact schedule (C), which is used to determine the data transfer networks.
Generating data transfer networks
A data transfer network (\(\Lambda \)) is created to capture the data transactions in the space system, with a set of ground targets, spacecraft in given orbits, and a set of candidate ground stations. The network is populated by propagating the satellites’ motion and simulating data transfer in the system for a defined period of time (T), during which the movement of data packets is monitored. The process of generating \(\Lambda \) is described in detail in Alg. 1 (in black text), and is summarised as follows:

Each spacecraft in the system is assigned a data buffer (db1), where data is inserted when the spacecraft is in contact with a ground target according to the contact schedule (C).

Each packet of data is associated with the target of origin (d) when inserted into the buffer db1.

When the spacecraft is in contact with a ground station, packets in the buffer db1 are removed until a downlink/packet removal limit (\(\delta \)) for a single time step is reached.

For each data packet removed from db1, the data transfer network (\(\Lambda \)) is updated with \(\Lambda _{d,n_D+g} = \Lambda _{d,n_D+g} + 1\), where d is the target of origin, g is the current ground station (in contact with the spacecraft), and \(n_D\) is the number of targets. Therefore, by the end of the simulation \(\Lambda _{d,n_D+g}\) will equal the number of packets acquired from d and downlinked to g.
In addition to generating the data transfer network (\(\Lambda \)), a passed-on network (B) must be created for each ground station to estimate where data would be transferred if that ground station were removed from the system. This allows the importance of each ground station to be better understood, since not all sink nodes in \(\Lambda \) will be present in the final subset selected. This process is intertwined with the generation of \(\Lambda \) and hence is also detailed in Alg. 1 (in blue text), but can be summarised as follows:
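The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a reproduction of Alg. 1: the contact schedule format, the function name `build_transfer_network`, the first-in-first-out buffer order, the zero-based indices, and the insertion of one packet per target contact per time step are all assumptions made for the example.

```python
from collections import deque


def build_transfer_network(schedule, n_targets, n_stations, n_spacecraft, delta):
    """Accumulate the data transfer network from a contact schedule.

    schedule: list over time steps; each entry is a list of contacts
              (spacecraft, kind, index) with kind "target" or "station".
              This layout is an assumption made for the sketch.
    delta:    downlink/packet removal limit per contact and time step.
    """
    size = n_targets + n_stations
    lam = [[0] * size for _ in range(size)]
    # One buffer (db1) per spacecraft; FIFO order is assumed here.
    db1 = [deque() for _ in range(n_spacecraft)]

    for contacts in schedule:
        for spacecraft, kind, index in contacts:
            if kind == "target":
                # Each packet is tagged with its target of origin d.
                db1[spacecraft].append(index)
            else:
                g = index
                removed = 0
                while db1[spacecraft] and removed < delta:
                    d = db1[spacecraft].popleft()
                    lam[d][n_targets + g] += 1  # Lambda[d, n_D + g] += 1
                    removed += 1
    return lam
```

With a downlink limit of one packet per step, a spacecraft that collects from two targets and then passes two ground stations delivers one packet to each station, which is the kind of ordering effect the passed-on networks are later used to correct for.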

Each spacecraft in the system is given a second data buffer (db2), which is populated with dummy data (0 entries) when in contact with targets (i.e. matching the data inserted into db1).

In addition to dummy data, a passed-on data packet [d, g] is inserted into db2 for every data packet d that is removed (i.e. downlinked) from db1 for the same spacecraft, where g is the current ground station (in contact with the spacecraft).

When a spacecraft is in contact with a ground station, the dummy (0 entry) data is removed from db2 before any of the passed-on data packets associated with ground stations. Only once all the dummy data has been removed are the passed-on data packets removed from db2.

For each passed-on data packet \([d,\gamma ]\) removed from db2, while the spacecraft is in contact with ground station g, the entry in the data transfer network \(B[\gamma ]\) is updated as \(B[\gamma ]_{d,n_D+g} = B[\gamma ]_{d,n_D+g} + 1\), where \(\gamma \) identifies the ground station of origin for the passed-on data. Therefore, by the end of the simulation \(B[\gamma ]_{d,n_D+g}\) will equal the number of packets that were originally acquired from d, but were passed on from ground station \(\gamma \) to g.

When a passed-on data packet \([d,\gamma ]\) is removed from db2, a new packet [d, g] is inserted into db2 that is associated with the current ground station contact (g). The number of times a passed-on data packet is reinserted into db2 is restricted by a packet reinsertion limit (\(\rho \)). The impact of \(\rho \) is discussed below.

As with the removal of data from db1, the downlink limit is monitored for packets removed from db2. However, the count of packets removed and this limit are monitored separately for each ground station of origin (\(\gamma \)) for passed-on data.

Note that passed-on data packets are not removed from db2 if the current ground station is either their ground station of origin (\(\gamma \)) or in close proximity to \(\gamma \) (see \(\Omega \) in Alg. 1). This is necessary to prevent the majority of passed-on data packets from travelling back and forth between nearby ground stations.
The packet reinsertion limit (\(\rho \)) is an important consideration, as it determines the number of times a data packet is passed from one ground station to another. The most accurate passed-on matrices were generated when using a \(\rho \) value similar to the average number of unselected ground stations that a spacecraft could expect to pass before it connects with a selected station. In this paper we consider a subset selection of five ground stations from a set of 20 candidates, which leaves 15 unselected stations, or three per selected station. Data packets can therefore be estimated to pass through, on average, three ground stations before alighting at a selected station. Given that a significant portion of data packets could pass through more than three ground stations, \(\rho =4\) was applied.
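One possible reading of the db2 rules above, for a single spacecraft during one time step of contact with ground station g, is sketched below in Python. It is not a transcription of Alg. 1. The function name `downlink_db2`, the packet layout `(d, gamma, k)` with k counting reinsertions, the dictionary `near` standing in for \(\Omega \), and the decision to let passed-on packets leave in the same step in which the last dummy entry is cleared are all assumptions.

```python
def downlink_db2(db2, g, B, n_targets, delta, rho, near):
    """Process one time step of contact between a spacecraft and station g.

    db2:  list of entries; 0 is dummy data, (d, gamma, k) is a passed-on
          packet from target d, last downlinked at station gamma and
          reinserted k times so far (layout assumed for this sketch).
    B:    mapping from station gamma to its passed-on matrix B[gamma].
    near: mapping from a station to the set of stations close to it.
    """
    # Dummy data leaves first, at most delta entries per time step.
    n_dummy = sum(1 for entry in db2 if entry == 0)
    if n_dummy:
        to_drop = min(n_dummy, delta)
        kept = []
        for entry in db2:
            if entry == 0 and to_drop > 0:
                to_drop -= 1
            else:
                kept.append(entry)
        db2[:] = kept
        if n_dummy > delta:
            return  # dummy data remains, so no passed-on packet leaves yet

    removed_per_origin = {}  # the limit is tracked separately per origin
    kept, reinserted = [], []
    for d, gamma, k in db2:
        blocked = gamma == g or g in near.get(gamma, set())
        count = removed_per_origin.get(gamma, 0)
        if blocked or count >= delta:
            kept.append((d, gamma, k))
            continue
        B[gamma][d][n_targets + g] += 1  # B[gamma][d, n_D + g] += 1
        removed_per_origin[gamma] = count + 1
        if k < rho:
            reinserted.append((d, g, k + 1))  # now associated with g
    db2[:] = kept + reinserted
```

Reinserted packets are appended only after the loop, so a packet cannot be passed on twice within the same contact step.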
Estimating data transfer network
The difficulty in identifying effective ground station combinations stems from the impact that one selection has on the value of other ground stations in receiving data and covering targets. For example, a ground station (GS1) may be viewed by a spacecraft that has received data from a target (T1). However, it is possible that the data transfer network (\(\Lambda \)) does not report this connection if, for instance, the spacecraft has already downlinked all of T1’s data to other ground stations prior to overflight of GS1. Therefore, we propose an approach for estimating the data received by a subset selection of ground stations, using the passedon networks (B) to identify where data would go if a ground station was removed. This approach has been formulated for the analysis of space systems, but such an approach is generalisable to combinatorial flow network problems, where the removal of a sink node results in greater traffic arriving at other sinks in the network.
To estimate the data received by a subset selection, the data transfer network (\(\Lambda \)) defined for the full set of ground stations needs to be updated according to the passedon networks (B). This process creates an estimated data transfer network (A) and is detailed in Alg. 2, with the ground station selection represented by a vector \(\mathbf{r }\) where \(r_g=1\) indicates a selected ground station g, and \(r_g=0\) denotes an unselected ground station. The process involves moving data from each unselected ground station, in turn, by using the normalised passedon matrix (K) to determine where the data goes, before removing data from the ground station’s column in the data transfer network A, and then redistributing the removed data according to K.
The logic used to determine a suitable packet reinsertion limit (\(\rho \)) for Alg. 1 is also relevant for selecting a suitable \(n_{pass}\) for Alg. 2. The \(\rho \) value determines how many times a passed-on data packet is reinserted into the data buffer (db2), while \(n_{pass}\) represents the number of times data is moved on from unselected ground stations when estimating the data transfer network (A). Since data packets can be estimated to pass through, on average, three ground stations before alighting at a selected station, \(n_{pass}\ge 3\) could be expected to allow the estimated data transfer network to capture the majority of redistributed data. As will be discussed in the "Results" section, \(n_{pass}\) cannot simply be set to a large value to capture all redistribution of data, as this can overestimate the volume of target data received by a subset selection of ground stations.
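The redistribution described for Alg. 2 can be illustrated with the following Python sketch. The algorithm itself is not reproduced in this section, so several details are assumptions: the function name `estimate_network`, normalising the passed-on matrix separately for each target row to obtain K, and discarding data for which no passed-on destination was recorded.

```python
def estimate_network(lam, B, r, n_targets, n_pass):
    """Estimate the data transfer network A for a selection vector r.

    lam: data transfer network for the full set of ground stations.
    B:   passed-on matrices, indexed by ground station.
    r:   selection vector, r[g] == 1 for a selected ground station.
    """
    A = [row[:] for row in lam]  # work on a copy of Lambda
    n_stations = len(r)
    for _ in range(n_pass):
        for g in range(n_stations):
            if r[g] == 1:
                continue  # selected stations keep their data
            col = n_targets + g
            for d in range(n_targets):
                moved = A[d][col]
                if moved == 0:
                    continue
                # Row-wise normalisation of B[g] (assumed form of K).
                total = sum(B[g][d][n_targets:])
                A[d][col] = 0  # remove data from the station's column
                if total == 0:
                    continue  # no recorded destination: data is dropped
                for h in range(n_stations):
                    share = B[g][d][n_targets + h] / total
                    A[d][n_targets + h] += moved * share
    return A
```

Because unselected stations can pass data to each other, some data may still sit in unselected columns after the final pass. Under the assumptions of this sketch those columns would simply be ignored when the selected sinks are evaluated.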
Consensus dynamics for target coverage
An effective way of evaluating a subset selection of ground stations in terms of target coverage and data throughput, for a network \( G=(V,E) \) of targets and ground stations, is through the use of consensus dynamics. Specifically, we use consensus leadership, where ground station selections are assessed by their ability to lead targets to consensus (according to the consensus protocol below) when the connections are defined by the estimated data transfer network (A).
We consider a system where each node \(v_i\) has a state \(x_i \in \mathrm{I\!R}\) and continuous-time integral dynamics, \({\dot{x}}_i[t] = u_i[t]\) where \(u_i \in \mathrm{I\!R}\) is the control input for agent i. The linear consensus protocol is
$$\begin{aligned} u_i[t] = \sum _{j\in N_i} a_{ij}(x_j[t] - x_i[t]) \end{aligned}$$
(1)
and describes how node \(v_i\) adjusts its state at time step (t) based on the estimated data transfer matrix (\(A=[a_{ij}]\)) and the node state (x) of its neighbours (\(N_i\)). Given this protocol, the state of the network develops according to \({\dot{x}}[t] = -Lx[t]\) with the graph Laplacian matrix, L, defined as \( L = D - A \) where \(D=\)diag(out\((v_1),\ldots ,\)out\((v_n))\) is a diagonal matrix composed of the out-degrees of each node, i.e. out\((v_i)=\sum _j a_{ij}\).
Given the definitions for the continuous-time integral dynamics and \({\dot{x}}_i[t]\), the discrete-time agent dynamics are given in Di Cairano et al. (2008) as
$$\begin{aligned} x_i[t+1] = x_i[t] + \epsilon u_i[t] \end{aligned}$$
(2)
provided that \(0<\epsilon <\frac{1}{\text {max}_i \, d_{ii}}\), where \(d_{ii}\) is an element of the degree matrix D. The choice of \(\epsilon \) affects the number of steps required for nodes to reach convergence; setting \(\epsilon = 0.999 \times \frac{1}{\text {max}_i\, d_{ii}}\) therefore allows the number of computational steps to be reduced while still guaranteeing convergence of the system (see Di Cairano et al. 2008). Convergence is defined here as \({\bar{x}}_i>0.99 ~\forall ~i \in D\), where D now denotes the set of all target (source) nodes rather than the degree matrix, when \(x_j=1 ~\forall ~j\in G\) with G the set of all ground station (sink) nodes.
The most effective ground station selections, in terms of target coverage, are those that achieve the fewest steps until all of the targets reach consensus. Such a selection would demonstrate a strong connection to all of the targets in the system. If, in contrast, a selection had no connectivity to a given target then consensus would never be reached.
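The step count used to compare selections can be illustrated with a short Python sketch of the discrete-time update in Eq. 2. Several modelling choices here are assumptions rather than details taken from the source: target states start at 0, selected ground stations are held at 1, only target-to-selected-station weights of A contribute, the maximum degree is taken over the target rows, and the name `steps_to_consensus` is illustrative.

```python
def steps_to_consensus(A, r, n_targets, tol=0.99, max_steps=100000):
    """Count update steps until every target state exceeds tol.

    Returns None when a target has no link to any selected station,
    since consensus would then never be reached.
    """
    sinks = [n_targets + g for g, selected in enumerate(r) if selected == 1]
    weights = [[A[i][j] for j in sinks] for i in range(n_targets)]
    degrees = [sum(row) for row in weights]
    if not degrees or min(degrees) <= 0:
        return None
    eps = 0.999 / max(degrees)  # epsilon just below 1 / max_i d_ii
    x = [0.0] * n_targets       # assumed initial target states
    for step in range(1, max_steps + 1):
        # x_i[t+1] = x_i[t] + eps * sum_j a_ij * (x_j[t] - x_i[t]), x_j = 1
        x = [xi + eps * sum(w * (1.0 - xi) for w in row)
             for xi, row in zip(x, weights)]
        if all(xi > tol for xi in x):
            return step
    return None
```

In this simplified setting a weakly connected target dominates the step count, which matches the observation that an effective selection must be strongly connected to all targets.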
Problem definition
An objective function is required to optimise the ground station selection. The number of steps to convergence can be used, but it creates a discontinuous search space. Therefore, the mean consensus leadership,
$$\begin{aligned} m = \frac{\sum _{i \in D} (1 - x_i[t])}{n_D} \end{aligned}$$
(3)
provides a continuous alternative, whose minimisation maximises the mean consensus state of all target nodes, where \(n_{D}\) is the number of targets (source nodes) and D the set of all targets. The target (source) node states, \(x_i[t]\), are evaluated according to Eq. 2 where t is taken as a point prior to convergence, defined as the closest step to \(0.9\times s_{ref}\) where \(s_{ref}\) is the number of steps to convergence. The reference number of steps, \(s_{ref}\), is defined using the number of steps to convergence required for the Initial eigenvector-based selection.
The optimisation can then be defined as follows,
$$\begin{aligned} \begin{aligned} \min \quad&\frac{\sum _{i \in D} (1 - x_i[t])}{n_D}\\ \text {s.t.} \quad&r_g = 1 ~\forall ~ g \in \Phi \\&r_g = 0 ~\forall ~ g \in G \setminus \Phi \\&\sum _{j \in G} r_j = n_{select} = |\Phi |, ~n_{select} \in {\mathbb {Z}}^+ \end{aligned} \end{aligned}$$
(4)
where \(\mathbf{r }\) is the ground station selection vector, \(\Phi \) is the subset selection of ground stations, G is the set of all ground station candidates and \(n_{select}\) the cardinality of the subset \(\Phi \).
Initial eigenvector-based selection
The optimisation of ground station selections is a highly combinatorial problem and as such is susceptible to local optima far from the global optimum. This issue is exacerbated by the need to update the data transfer network for every possible selection. We propose an eigenvector embedding-based selection to act as an effective initial selection, providing an alternative to a more exhaustive search. Brute-force evaluation of all combinations is often intractable for sufficiently large numbers of candidates and selection sizes.
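As an illustration of how the objective in Eq. 3 might be evaluated for one selection, the Python sketch below iterates Eq. 2 for t steps and averages \(1 - x_i[t]\) over the targets. The function names, the zero initial target states, the selected stations held at 1, and passing \(\epsilon \) in as an argument are assumptions made for the example; the source does not state how \(\epsilon \) is fixed across selections.

```python
def evaluation_step(s_ref):
    """Step closest to 0.9 * s_ref, never below 1 (helper for the sketch)."""
    return max(1, round(0.9 * s_ref))


def mean_consensus_leadership(A, r, n_targets, t, eps):
    """Evaluate m = sum_i (1 - x_i[t]) / n_D for the selection r."""
    sinks = [n_targets + g for g, selected in enumerate(r) if selected == 1]
    x = [0.0] * n_targets  # assumed initial target states
    for _ in range(t):
        # Selected ground stations are held at state 1.
        x = [xi + eps * sum(A[i][j] * (1.0 - xi) for j in sinks)
             for i, xi in enumerate(x)]
    return sum(1.0 - xi for xi in x) / n_targets
```

A selection with no connectivity to any target leaves every state at 0 and therefore gives m = 1, the worst value under these assumptions, while well-connected selections drive m towards 0.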
The relationship of interest, when optimising a system for target coverage, is that between targets and ground stations. However, it is not possible to directly capture this relationship in a static network. Instead, a ground station relationship network (\(\Gamma \)) is introduced, based on the passed-on networks B, which details the volume of data that each ground station passes on to every other ground station when removed from the system. While the passed-on networks, \(B[g] ~\forall ~g \in G\), detail the movement of data from targets to ground stations, the network \(\Gamma \) details the connections between ground stations, where
$$\begin{aligned} \Gamma _{g,\gamma } = \sum _{d\in D} B[g]_{d,\gamma }\,. \end{aligned}$$
(5)
The \(\Gamma \) network is useful in identifying influential ground stations. This is despite \(\Gamma \) only detailing the relationships between ground stations, since these relationships are a product of connectivity to spacecraft that have collected target data. Therefore, \(\Gamma \) highlights whether ground stations are connected to spacecraft in similar or different orbits. Differing spacecraft orbits result in different target contacts, where these differences lead to different patterns of target coverage. Hence, selecting ground stations that cover different sets of spacecraft will also likely provide a selection that covers differing communities of targets.
The process of ground station selection takes inspiration from work on communities of dynamical influence (CDI), introduced in Clark et al. (2019), that are shown to highlight effective leadership in networks under consensus dynamics. The selection is based on the eigenvectors of \(\Gamma \), where the dominant eigenvectors (those associated with the largest eigenvalues) are used to embed the network in a Euclidean space. The nodes in this space that are furthest from the origin, along the direction of their position vector, are defined as leaders of separate ground station communities. This is assessed by comparing the magnitude of each node’s position vector with the scalar projection onto this vector from all other node position vectors.
The explicit objective of the optimisation is to improve target coverage and data throughput, therefore CDI analysis of \(\Gamma \) can facilitate the selection of ground stations. Specifically, an effective combination of ground stations can be expected to involve nodes in multiple different communities to ensure target coverage, while the nodes with the largest first left eigenvector (\(\mathbf{v }_1\)) entries are more likely to ensure high data throughput. Therefore, an initial selection composed of ground stations from different CDIs, each having the largest \(\mathbf{v }_1\) entry in its community, will provide a good initial guess.
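The construction of \(\Gamma \) in Eq. 5 and the first left eigenvector \(\mathbf{v }_1\) can be sketched in Python as below. This covers only those two ingredients and does not implement the CDI community detection itself. The function names are illustrative, the ground station columns of each B[g] are assumed to be offset by \(n_D\) as in the update rule for B, and power iteration is a technique swapped in here for brevity rather than the method used in the source.

```python
def relationship_network(B, n_targets, n_stations):
    """Gamma[g][h]: data passed on from station g to station h, summed over targets."""
    gamma = [[0.0] * n_stations for _ in range(n_stations)]
    for g in range(n_stations):
        for h in range(n_stations):
            # Station columns of B[g] are assumed to start at index n_targets.
            gamma[g][h] = sum(B[g][d][n_targets + h] for d in range(n_targets))
    return gamma


def dominant_left_eigenvector(gamma, iters=500):
    """Approximate v_1 with v <- v * Gamma, normalised to unit 1-norm."""
    n = len(gamma)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(v[g] * gamma[g][h] for g in range(n)) for h in range(n)]
        norm = sum(abs(c) for c in w)
        if norm == 0:
            return v  # no passed-on data at all: keep the uniform vector
        v = [c / norm for c in w]
    return v
```

Power iteration is only reliable when the largest eigenvalue is well separated from the rest, so a library eigensolver would be the safer choice for real passed-on data.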
Exhaustive search optimisation
An optimal selection of ground stations, in terms of convergence to consensus for the space system modelled using consensus dynamics, can be obtained by simulating all subset combinations from a set of candidates. For the scenarios explored in this paper, this involves simulating all combinations of five ground stations from 20 possible options (15,504 combinations in total). This is a computationally intensive process that required approximately 10 days (60 seconds per simulation) of computation time for the global targets scenario on a desktop machine (Intel Xeon Processor with \(12\times \) 3.39 GHz and 46.7 GB RAM). By contrast, using the presented method, an effective selection can be obtained in minutes through the following steps:

A single simulation of data, including all 20 candidate ground stations.

An initial selection based on eigenvector embedding of the ground station relationship network (\(\Gamma \)).

A simple exhaustive search optimisation, requiring the estimation of data transfer networks as described in Alg. 2.
The simple exhaustive search is described in Alg. 3 and can be summarised as follows:

Identify an initial selection from eigenvector embedding.

If necessary, add to the initial selection by performing an exhaustive search for ground stations that minimise the mean consensus leadership (Eq. 3).

Review each selection, in turn, using an exhaustive search until the mean consensus leadership is minimised.
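The three steps above might be sketched in Python as follows. This is an interpretation of the summary rather than a transcription of Alg. 3: the function name `refine_selection` is illustrative, the objective is passed in as a callable standing in for the mean consensus leadership of Eq. 3, and repeating the review sweep until no single replacement improves the objective is an assumed stopping rule.

```python
def refine_selection(initial, candidates, n_select, objective):
    """Grow and then review a ground station selection.

    initial:    stations from the eigenvector embedding (may be short).
    candidates: all candidate ground stations.
    objective:  callable mapping a list of stations to a score to minimise.
    """
    selection = list(initial)

    # Step 2: add stations one at a time, each chosen by exhaustive search.
    while len(selection) < n_select:
        rest = [c for c in candidates if c not in selection]
        best = min(rest, key=lambda c: objective(selection + [c]))
        selection.append(best)

    # Step 3: review each position in turn, trying every replacement.
    improved = True
    while improved:
        improved = False
        for pos in range(len(selection)):
            current = objective(selection)
            for c in candidates:
                if c in selection:
                    continue
                trial = selection[:pos] + [c] + selection[pos + 1:]
                score = objective(trial)
                if score < current:
                    selection, current, improved = trial, score, True
    return sorted(selection)
```

Each review sweep evaluates at most the selection size multiplied by the number of remaining candidates, far fewer selections than enumerating every subset, but the result is only guaranteed to be a local optimum with respect to single replacements.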