In this section, we will lay out the steps involved in producing the graphs used for analysis, and how we processed the CSVs used for machine learning.

### The USPTO database

Data was downloaded in TSV format from the USPTO PatentsView website^{Footnote 1}, specifically the *patent, ipcr, patent_inventor* tables and tables relating to company assignments.

Using the company assignee data for IBM, all the IBM patents were extracted. These were cross-referenced with the other tables, and Python’s Pandas package was used to merge the relevant data into a final working table with the following headings: *patent_id, inventor_id, full_class_id*. The *full_class_id* used is the International Patent Classification^{Footnote 2}.
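The merge step can be sketched as follows. This is a minimal illustration on toy data, not the exact script used: the frame names, the *organization* column and all values are assumptions, and only the three output headings come from the text.

```python
import pandas as pd

# Toy stand-ins for the PatentsView tables (all names and values are assumed).
assignee = pd.DataFrame({
    "patent_id": ["p1", "p2", "p3"],
    "organization": ["IBM", "IBM", "Other Corp"],
})
patent_inventor = pd.DataFrame({
    "patent_id": ["p1", "p1", "p2", "p3"],
    "inventor_id": ["a", "b", "a", "c"],
})
ipcr = pd.DataFrame({
    "patent_id": ["p1", "p2", "p2", "p3"],
    "full_class_id": ["G06F", "G06F", "H04L", "A61K"],
})

# Keep only the company's patents, then attach inventors and classes.
company_patents = assignee.loc[assignee["organization"] == "IBM", ["patent_id"]]
working = (
    company_patents
    .merge(patent_inventor, on="patent_id", how="inner")
    .merge(ipcr, on="patent_id", how="inner")
    [["patent_id", "inventor_id", "full_class_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Each row of `working` is one (patent, inventor, class) triple, so a patent with several inventors or several classes appears on several rows.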

Our previous application of this method focused on the years 1976–2011 (Keane et al. 2019). However, the USPTO now provides an updated database, and the records we use now run until August 2019. This has allowed us to test the method on an updated data set including more recent patents.

In an analogous manner, we also extracted Samsung’s patents for the same timeframe from 1976–2019. This allowed us to compare the structures of the collaboration graphs of each company, and to see if our proposed framework, which was originally tested using IBM data, would also work on another company’s collaboration graph.

### Graph creation and investigation

The IBM graph is a collaboration graph *G*=(*V*,*E*), where *V* contains the inventors of IBM patents, and *E* contains edges representing any collaboration between the inventors on a patent. The extracted patent collaboration graph has 61,868 nodes and 285,262 edges, with an average degree of 9.22 and a clustering coefficient of 0.62. A weight was assigned to each edge to reflect the strength of the collaboration. It was calculated as the inverse of the number of co-inventions, so that it indicates the cost between two inventors: the more patents two inventors share, the lower the cost. Note that these original weights were used mainly for exploring the graph before analysis and are not used in our prediction scheme.
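As an illustration of this construction, the sketch below builds a small collaboration graph with NetworkX and sets each edge weight to the inverse of the number of co-inventions. The function name and the toy input are hypothetical. Keeping sole inventors as isolated nodes is an assumption, made so that nodes of degree 0 can occur.

```python
from collections import Counter
from itertools import combinations

import networkx as nx


def build_collaboration_graph(patent_inventors):
    """patent_inventors: dict mapping a patent id to its inventor ids."""
    co_inventions = Counter()
    graph = nx.Graph()
    for inventors in patent_inventors.values():
        team = sorted(set(inventors))
        graph.add_nodes_from(team)  # sole inventors stay as isolated nodes
        for u, v in combinations(team, 2):
            co_inventions[(u, v)] += 1
    for (u, v), count in co_inventions.items():
        # Weight is a cost: more shared patents means a cheaper edge.
        graph.add_edge(u, v, weight=1.0 / count)
    return graph


toy = {"p1": ["a", "b"], "p2": ["a", "b", "c"], "p3": ["d"]}
G = build_collaboration_graph(toy)
average_degree = 2 * G.number_of_edges() / G.number_of_nodes()
```

The average degree is computed as 2|*E*|/|*V*|, which is consistent with the figures reported for the full graphs.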

The Samsung collaboration graph is slightly smaller than the IBM graph, with 46,268 nodes and 252,548 edges. It has an average degree of 10.92 and an average clustering coefficient of 0.55. These figures compare quite closely to the IBM collaboration graph's average degree and average clustering coefficient of 9.22 and 0.62, respectively.

These graphs were created from the table generated by Algorithm 1. Python’s Pandas package was used for grouping the data. As noted above, patent classes are used to represent the skills required on a team. A very small scale example of such a graph is shown in Fig. 3, where the weight is the cost of traversing the edge.

The initial IBM graph has 61,868 nodes and 285,262 edges. As such, it would be computationally expensive to perform many calculations with it. Consequently, for efficiency, the graph was reduced by selecting a subgraph relevant to specific patent class ids. This was achieved by inducing the subgraph on all nodes (inventors) holding patents in those classes. The same procedure was performed on the Samsung collaboration graph, which is also large enough for computation to be potentially costly. Pseudocode for the method used to split the graph is outlined in Algorithm 2.
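The reduction step can be illustrated with a short NetworkX sketch. It is not a transcription of Algorithm 2. The helper name and the `inventor_classes` mapping (inventor to the set of classes in which they hold patents) are assumptions made for the example.

```python
import networkx as nx


def subgraph_for_classes(graph, inventor_classes, wanted_classes):
    """Induce the subgraph on inventors holding a patent in any wanted class."""
    wanted = set(wanted_classes)
    keep = [v for v in graph.nodes if wanted & set(inventor_classes.get(v, ()))]
    # .copy() detaches the result from the original graph.
    return graph.subgraph(keep).copy()


G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])
inventor_classes = {
    "a": {"G06F"},
    "b": {"G06F", "H04L"},
    "c": {"A61K"},
    "d": {"H04L"},
}
H = subgraph_for_classes(G, inventor_classes, {"G06F", "H04L"})
```

Because the subgraph is induced, only edges with both endpoints in the selected set survive. Here inventor *d* is kept but becomes isolated once *c* is removed.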

The left of Fig. 4 shows the distribution of degrees among IBM inventors with a bin width of five. It is clear that the majority of nodes have a small degree, somewhere between 0 and 5. It is worth looking at the whole set of counts on a log scale, so that we can see what is going on at the tail of the histogram. This is shown on the right of Fig. 4, also with a bin width of five.

Interestingly, the majority of inventors have degrees between zero and five, which is well below the average degree of 9.2. This suggests that there are, indeed, some IBM inventors with large collaborative networks that pull the average up. A cursory look at the degree data shows that the minimum degree is 0, as expected, since some inventors work alone, and the maximum is 332. For comparison, the maximum degree for the Samsung graph is 464, and the minimum degree is again 0.
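The binning behind such a histogram, and the way a few well-connected nodes pull the mean above the median, can be reproduced on a toy graph. The sketch below assumes NumPy and NetworkX. The star graph is purely illustrative and bears no relation to the real data.

```python
import networkx as nx
import numpy as np


def degree_histogram(graph, bin_width=5):
    """Return the degrees, the per-bin counts and the bin edges."""
    degrees = np.array([d for _, d in graph.degree()])
    upper = (degrees.max() // bin_width + 1) * bin_width
    edges = np.arange(0, upper + bin_width, bin_width)
    counts, edges = np.histogram(degrees, bins=edges)
    return degrees, counts, edges


G = nx.star_graph(12)  # one hub of degree 12 and twelve leaves of degree 1
G.add_node("solo")     # an inventor who works alone (degree 0)
degrees, counts, edges = degree_histogram(G)
mean_degree = degrees.mean()
median_degree = np.median(degrees)
```

Plotting `counts` on a logarithmic vertical axis makes the sparsely populated high-degree bins visible, which is the purpose of the right-hand panel described above.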

It is worth taking a look at the degree distributions of the graph after we have trimmed it by class, as per Algorithm 2.

### Machine learning model training

Next, using the collaboration graphs *G* created in the previous section, new datasets were created for the purpose of training machine learning models. The aim is to produce a set of features that might predict if two inventors are likely to collaborate, even if that is not reflected in the present patent data sets.

Specifically, datasets were created with the columns: *vi, vq, bonding_vi, bridging_vi, bonding_vq, bridging_vq, vi_collab_ratio, vq_collab_ratio, vi_cluster, vq_cluster, vi_patents, vq_patents, common_neighbors, expert_jaccard, resource_allocation, connected*. We will explain each of these in turn, but note that each was selected to help with the prediction of potential collaborations while being relatively inexpensive to calculate from the graph. For each pair of inventors, *vi* and *vq* refer to each inventor of the pair. Some of the measurements relate to each inventor individually while others depend on both *vi* and *vq*. The individual metrics are indicated by having *vi* or *vq* in their name.

The bonding and bridging capital (Sharma et al. 2018) of each inventor are designed to measure the homogeneity and heterogeneity of an inventor's ties. They are calculated as follows. First, the communities of the graph *G* were extracted using NetworkX’s best partition function (in the community module). We then considered how many neighbors of each inventor are in the same community. The bonding capital is then given by:

$$ \text{Bonding Capital} = \frac{\left|\text{Neighbors in same community}\right|}{\left|\text{Total number of neighbors}\right|}. $$

(1)

Similarly, the bridging capital is given by:

$$ \text{Bridging Capital} = \frac{\left|\text{Neighbors \textbf{not} in same community}\right|}{\left|\text{Total number of neighbors}\right|}. $$

(2)
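Equations (1) and (2) can be computed directly once each inventor has a community label. In the sketch below the labels are supplied by hand so that the example is deterministic. In practice they would come from a community detection step. Returning zero for both values when an inventor has no neighbors is an assumed convention, since the ratios are undefined in that case.

```python
import networkx as nx


def bonding_bridging(graph, community_of, node):
    """Return (bonding, bridging) capital of a node as in Eqs. (1) and (2)."""
    neighbors = list(graph.neighbors(node))
    if not neighbors:
        return 0.0, 0.0  # assumed convention for isolated inventors
    same = sum(1 for n in neighbors if community_of[n] == community_of[node])
    bonding = same / len(neighbors)
    bridging = (len(neighbors) - same) / len(neighbors)
    return bonding, bridging


# Two triangles joined by the single edge c-d.
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
              ("c", "d"), ("d", "e"), ("d", "f"), ("e", "f")])
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

bonding_c, bridging_c = bonding_bridging(G, community_of, "c")
bonding_a, bridging_a = bonding_bridging(G, community_of, "a")
```

For any inventor with at least one neighbor, the two values sum to one, so only one of them carries independent information for a given node.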

The collaborative ratio (Wu et al. 2013) of each inventor is given by:

$$ \text{Collaborative Ratio} = \frac{\left|\text{Collaborative patents of inventor}\right|}{\left|\text{Patents of inventor}\right|}, $$

(3)

where a *collaborative patent* is any patent which has more than one inventor. This measure is included to capture inventors who are likely to work collaboratively.
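As a sketch of how Eq. (3) might be computed with Pandas, the example below assumes a table with one row per (patent, inventor) pair. The toy rows are invented. On a table that also carries class ids, duplicate pairs would need to be dropped first.

```python
import pandas as pd

# One row per (patent, inventor) pair; values are illustrative only.
rows = pd.DataFrame({
    "patent_id": ["p1", "p1", "p2", "p3", "p4", "p4"],
    "inventor_id": ["a", "b", "a", "b", "a", "c"],
}).drop_duplicates()

# A patent is collaborative when it has more than one inventor.
team_size = rows.groupby("patent_id")["inventor_id"].transform("nunique")
rows["is_collaborative"] = team_size > 1

# Fraction of each inventor's patents that are collaborative, Eq. (3).
collab_ratio = rows.groupby("inventor_id")["is_collaborative"].mean()
# Total number of patents per inventor.
patent_count = rows.groupby("inventor_id")["patent_id"].nunique()
```

The same grouping also yields the per-inventor patent totals, so both quantities can be obtained in one pass over the table.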

The values *vi_cluster* and *vq_cluster* are simply the clustering coefficient of each inventor. The *clustering coefficient* of a node or vertex *v* is defined as:

$$ c_{v} = \frac{2T_{v}}{d(v)(d(v)-1)} $$

(4)

where *T*_{v} is the number of triangles formed through node *v* and *d*(*v*) is the degree of *v*. These values were included as a measure of collaboration between an inventor’s immediate neighbours. Likewise, *vi_patents* and *vq_patents* are the total number of patents held by each inventor, and give an indication of each inventor’s current success in patenting. The value *common_neighbors* is the number of neighbours common to both *vi* and *vq*, and gives a simple measure of how much the inventors’ collaborators overlap.
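Both the clustering coefficient of Eq. (4) and the common neighbor count are available in NetworkX. The sketch below checks the library value against a direct evaluation of Eq. (4), taking *d*(*v*) to be the degree of *v*. The toy graph and the helper name are illustrative, and returning 0 for nodes of degree below two is the convention NetworkX itself follows.

```python
import networkx as nx

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
              ("a", "d"), ("b", "d"), ("d", "e")])


def clustering_by_hand(graph, v):
    """Evaluate Eq. (4): 2 * T_v / (d(v) * (d(v) - 1))."""
    d = graph.degree(v)
    if d < 2:
        return 0.0  # no pair of neighbors, so no triangle is possible
    triangles = nx.triangles(graph, v)
    return 2 * triangles / (d * (d - 1))


vi_cluster = nx.clustering(G, "a")
vq_cluster = nx.clustering(G, "b")
common_neighbors = len(list(nx.common_neighbors(G, "a", "b")))
```

Node *a* has three neighbors and sits in two triangles, giving a coefficient of 2/3, and it shares the two neighbors *c* and *d* with *b*.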

The Jaccard Index of two sets is given by:

$$ J(A,B)=\frac{\mid A \cap B \mid}{\mid A \cup B \mid}. $$

(5)

To find the *expert_jaccard* of *vi* and *vq*, we take the Jaccard Index of the set of patent classes in which each inventor held expertise. We define expertise by an inventor being in the top quartile of inventors in a patent class, as measured by number of patents held in that class. We consider this a simplistic measure of the common interests of the inventors.
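One possible reading of this definition is sketched below. Treating "top quartile" as a patent count at or above the 75th percentile within the class, with ties counted as experts, is an assumption, as are the column name *n_patents* and the toy counts.

```python
import pandas as pd


def expert_classes(counts, q=0.75):
    """Map each inventor to the set of classes in which they are an expert."""
    threshold = counts.groupby("full_class_id")["n_patents"].transform(
        lambda s: s.quantile(q)
    )
    experts = counts[counts["n_patents"] >= threshold]
    return experts.groupby("inventor_id")["full_class_id"].apply(set).to_dict()


def jaccard(a, b):
    """Jaccard Index of two sets, Eq. (5); 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


counts = pd.DataFrame({
    "inventor_id":   ["a", "b", "c", "d", "a", "b", "c", "b"],
    "full_class_id": ["G06F"] * 4 + ["H04L"] * 3 + ["A61K"],
    "n_patents":     [8, 4, 2, 1, 3, 3, 1, 5],
})
experts = expert_classes(counts)
expert_jaccard = jaccard(experts.get("a", set()), experts.get("b", set()))
```

In this toy data, *a* is an expert in G06F and H04L and *b* in H04L and A61K, so they share one of three classes. An inventor with no expertise anywhere, such as *c*, is simply absent from the mapping and is treated as having an empty set.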

The *resource_allocation* value is the *resource allocation index* of a pair of nodes *u* and *v*, as provided by Python’s NetworkX module, and is defined as:

$$ \sum_{w\in\Gamma(u)\cap\Gamma(v)} \frac{1}{\mid{\Gamma(w)}\mid} $$

(6)

where *Γ*(*u*) denotes the set of neighbors of *u*. The resource allocation index has previously been identified as useful in missing link prediction (Zhou et al. 2009).
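Eq. (6) can be evaluated by hand and compared with NetworkX's `resource_allocation_index`. The toy graph below is illustrative: *u* and *v* share a low-degree neighbor *w1* and a higher-degree neighbor *w2*.

```python
import networkx as nx

G = nx.Graph([("u", "w1"), ("v", "w1"),
              ("u", "w2"), ("v", "w2"),
              ("w2", "x"), ("w2", "y")])


def resource_allocation(graph, u, v):
    """Sum of 1 / degree over the common neighbors of u and v, Eq. (6)."""
    return sum(1 / graph.degree(w) for w in nx.common_neighbors(graph, u, v))


by_hand = resource_allocation(G, "u", "v")

# NetworkX yields (u, v, score) triples for the requested pairs.
from_networkx = next(iter(nx.resource_allocation_index(G, [("u", "v")])))[2]
```

The shared neighbor *w1* (degree 2) contributes 1/2 and *w2* (degree 4) contributes 1/4, so common neighbors with few other ties count for more.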

Finally, the *connected* value is 1 if there is an edge between *vi* and *vq* in the collaboration graph, and 0 if there is no edge.

We now have a dataset for IBM and Samsung which we can use to predict if two nodes are connected based on the features described above.

### Machine learning model

We now use Python’s XGBoost module to create models, with the *connected* column providing the binary classification target for prediction. Figure 5 shows the parameters used with XGBoost to create the model. We also experimented with other machine learning models, such as logistic regression (LR), support vector machines (SVM), and classification and regression trees (CART). We chose XGBoost as it provided the most accurate model.

As there is no link between most pairs of inventors, the *connected* column exhibits substantial skew that might impact predictions. To counteract this, SMOTE oversampling (Chawla et al. 2002) was employed so that a more accurate model could be trained. SMOTE works by creating synthetic points on the lines between points in the minority class, and using these to provide more training points.
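The interpolation idea behind SMOTE can be shown in a few lines of NumPy. This is a simplified sketch for illustration only, not the implementation used in our pipeline or in any library. The function name, the choice of *k* and the toy points are all assumptions.

```python
import numpy as np


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their
    k nearest minority-class neighbours (requires k < len(minority))."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()  # position along the segment from point i to point j
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a segment joining two existing minority points, so the oversampled class stays within the region already occupied by real examples.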

Following the training of this model, a second graph split between random domains was created as before, and a dataset of the same format was created. The model was tested on this new, unseen, dataset and the predicted *connected* column was compared against the actual one. From this, the accuracy of the model was measured. This was repeated for the Samsung data.

Figure 6 (left) shows that the model provides good accuracy across all classifications for the IBM data. Where an edge exists between inventors, the model correctly predicts it 92% of the time and incorrectly predicts no edge (a zero) 8% of the time. Of particular interest to us is the top right quadrant. Where no edge exists, the model correctly predicts this 99% of the time and incorrectly predicts an edge just 1% of the time. We will use this 1% of *false positives* to create additional links in our graph. The results show similar accuracy for the Samsung graph, as shown in Fig. 6 (right). These results are similarly promising: the 8% of instances in which the model incorrectly predicted an edge provide the edges we will use to augment our Samsung graph.
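For readers who wish to reproduce this style of summary, the sketch below computes a row-normalised confusion matrix and picks out the false positives. It assumes rows index the actual value and columns the predicted value, so that the top right cell holds pairs with no actual edge but a predicted one. The labels are toy data, not our results.

```python
import numpy as np


def row_normalised_confusion(actual, predicted):
    """2x2 matrix of rates; rows are actual labels, columns are predictions."""
    matrix = np.zeros((2, 2))
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1
    return matrix / matrix.sum(axis=1, keepdims=True)


actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 0]
rates = row_normalised_confusion(actual, predicted)

# Pairs with no edge in the data for which the model nevertheless predicts one.
false_positive_pairs = [
    i for i, (a, p) in enumerate(zip(actual, predicted)) if a == 0 and p == 1
]
```

Under this layout each row sums to one, which is why the quoted percentages pair up within a row.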

### Creating new links

Having built a model to predict links, we are now ready to introduce the graph that is augmented with extra edges that are predicted by that model. This is to overcome a limitation of using the Enhanced Steiner Tree Algorithm (Lappas et al. 2009) directly on the collaboration graph, in that it will not consider collaborations that sit in unconnected components.

As such, we propose that artificial links are created on the graph between pairs of inventors for which our machine learning model “incorrectly” predicted there was a link, corresponding to the top right quadrant of the confusion matrices shown in Fig. 6. For such a pair, the model's prediction, based on the features discussed in the “Machine learning model training” section, indicates that the two inventors have attributes suggesting a potential collaborative relationship. Consequently, we will add these potential links using the procedure in Algorithm 3.

The graph was augmented by creating an edge between every pair of inventors for which the model predicted an edge but no such edge already existed. We also need to provide a weight for each edge. This was set to 1−*P*, where *P* is the probability of an edge existing as predicted by the machine learning model. Note that this weight is applied to *all* edges, including those in the original collaboration graph. This ensures that all weights are comparable.
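The augmentation can be sketched as follows. This is an illustration rather than a transcription of Algorithm 3. The function name, the probability dictionary and the 0.5 decision threshold are assumptions.

```python
import networkx as nx


def augment_graph(graph, edge_probability, threshold=0.5):
    """Add predicted edges and set every scored edge's weight to 1 - P.

    edge_probability maps a pair (u, v) to the model's probability P of an
    edge. The threshold used to call a prediction an edge is assumed.
    """
    augmented = graph.copy()
    for (u, v), p in edge_probability.items():
        already_linked = augmented.has_edge(u, v)
        if already_linked or p >= threshold:
            # Existing edges are re-weighted; predicted ones are created.
            augmented.add_edge(u, v, weight=1.0 - p)
    return augmented


G = nx.Graph()
G.add_edge("a", "b", weight=1.0)
G.add_edge("b", "c", weight=0.5)
G.add_node("d")
probabilities = {("a", "b"): 0.9, ("b", "c"): 0.3, ("a", "c"): 0.8, ("c", "d"): 0.1}
A = augment_graph(G, probabilities)
```

Existing edges are kept even when the model assigns them a low probability. They simply receive a higher cost, so the original collaborations are never removed.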

This provides us with our augmented graph on which we will run the Enhanced Steiner Tree Algorithm to identify a team.

To show how this changes the graph, we looked at the degree distribution of the collaboration graph, as we did for the original graph in the “Graph creation and investigation” section. We look at the degrees of the graph trimmed for a subset of classes, both before and after augmentation. This is shown in Fig. 7. As expected, the average degree increases, since edges are only ever added. More notably, in the augmented graph the base of the histogram is much wider, i.e., there are more occurrences of degrees across a wide range.

### Testing and evaluation

Given the relation between team size and cost described above, our main criterion for testing will be the team size required to cover a particular set of skills.

This raises an interesting point: having added the edges predicted by the machine learning model, the cardinality of the team will almost certainly be smaller, as the augmented collaboration graph is now better connected. In fact, even if we added edges at random, we would expect teams to be smaller. We take advantage of this intuition to evaluate our scheme. For comparison, we use teams generated from a randomly augmented graph. To generate this randomly augmented graph, we add edges at random between pairs of nodes until its number of edges equals the number of edges of the machine learning augmented graph. This is outlined in Algorithm 4.
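A minimal sketch of this baseline is given below. It is not a transcription of Algorithm 4. The function name, the uniform sampling of node pairs and the fixed seed are assumptions.

```python
import random

import networkx as nx


def randomly_augment(graph, target_edges, seed=0):
    """Add uniformly random new edges until the graph has target_edges edges."""
    rng = random.Random(seed)
    augmented = graph.copy()
    nodes = list(augmented.nodes)
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    if target_edges > max_edges:
        raise ValueError("target exceeds the number of possible edges")
    while augmented.number_of_edges() < target_edges:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        if not augmented.has_edge(u, v):
            augmented.add_edge(u, v)
    return augmented


G = nx.path_graph(6)  # 5 original edges
R = randomly_augment(G, target_edges=9)
```

Matching the edge count means that any difference in team size between the two augmented graphs can be attributed to where the edges are placed rather than to how many were added.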

We note that this produces a substantially different distribution of edges to that produced by our machine learning scheme. Figure 8 shows the degree distribution for a randomly augmented graph. When compared to the distributions shown in Fig. 7, we see that the randomly augmented graph has a larger number of high-degree nodes. The edge weights of all three graphs (original, ML augmented, and randomly augmented) are set to 1−*P*, where *P* is the probability of there being an edge as per our machine learning model. This ensures an even comparison between all three graphs where edge weights are considered.

In our evaluation, we will compare the average team size required for different skillsets (i.e. sets of required skills). Since there are many skillsets, we compare the average team size as a function of the number of skills.

Specifically, a task *T* is a set of patent classes that can be interpreted as the skillset required for the task at hand. For this task, the Enhanced Steiner Tree Algorithm (Lappas et al. 2009) can be run on (1) the original graph, (2) the machine learning augmented graph, and (3) the randomly augmented graph. For completeness, we restate the Enhanced Steiner Tree Algorithm from (Lappas et al. 2009) in Algorithms 5 and 6. The line *“EnhanceGraph(G,T)”* in Algorithm 6 makes a pass over the graph *G*. An additional node *Y*_{j} is created for every skill in *T*, and each of these nodes is connected by an edge to an inventor if and only if that inventor has expertise in that skill. The distance between inventors and these new nodes is set to be greater than the sum of pairwise distances of nodes in *G*.
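The EnhanceGraph step can be sketched as below. This is our own illustration under stated assumptions, not the code of Lappas et al. (2009). The "sum of pairwise distances" is taken to be the sum of weighted shortest-path lengths over all connected pairs, the skill nodes are represented as tuples, and the `inventor_skills` mapping is hypothetical. Computing all-pairs shortest paths is only practical on small graphs such as this one.

```python
import networkx as nx


def enhance_graph(graph, task, inventor_skills):
    """Add one node per required skill, linked to every inventor holding it."""
    enhanced = graph.copy()
    lengths = dict(nx.all_pairs_dijkstra_path_length(graph, weight="weight"))
    # Each unordered pair appears twice in the all-pairs result.
    total = sum(d for per_source in lengths.values() for d in per_source.values()) / 2
    big = total + 1.0  # strictly greater than the sum of pairwise distances
    for skill in task:
        skill_node = ("skill", skill)
        enhanced.add_node(skill_node)
        for inventor, skills in inventor_skills.items():
            if skill in skills and inventor in graph:
                enhanced.add_edge(skill_node, inventor, weight=big)
    return enhanced


G = nx.Graph()
G.add_edge("a", "b", weight=0.2)
G.add_edge("b", "c", weight=0.3)
inventor_skills = {"a": {"G06F"}, "b": {"H04L"}, "c": {"G06F", "H04L"}}
E = enhance_graph(G, {"G06F", "H04L"}, inventor_skills)
```

Because every skill edge costs more than any path among inventors, a minimum-cost tree spanning the skill nodes uses exactly one such edge per skill and otherwise minimises the cost of connecting the chosen inventors.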