In the previous section we introduced the theory and metrics behind the measurement of “cities”, “formal employment” and “industry complexity”. We showed that large cities generate more formal employment and have higher levels of industry complexity. Our task now is to explain the mechanism that produces such relationships. For that purpose, here we review and extend the model developed by O’Clery et al. (2016) to measure the “skill-distance” between a city’s current industrial base and new complex industries needed to increase its formality rate. This approach relies on a methodology proposed by Neffke and Henning (2013) to estimate the similarity between industry pairs in terms of skills or capabilities.

### Labour flow industry network

An industry is considered skill-related to another industry if the number of job switches (i.e., worker flows) between these industries is larger than what would be expected from randomising all switches among all pairs (Neffke and Henning 2013).

Formally, if *ϕ*_{i,j} is the number of job switches between industry *i* and industry *j* (during a given time period), then the “skill-relatedness” can be computed as a matrix with entries:

$$ S_{i,j}=\frac{\phi_{i,j}/\sum_{j}{\phi_{i,j}}}{\sum_{i}{\phi_{i,j}}/\sum_{i,j}{\phi_{i,j}}}. $$

(1)
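As a minimal sketch of Eq. (1), the ratio can be computed directly from a matrix of job switches. The function name `skill_relatedness` and the toy flow counts below are illustrative assumptions, not data or code from the study; the sketch also assumes that every industry has at least some outgoing and incoming switches.

```python
import numpy as np

def skill_relatedness(flows):
    """Observed-to-expected flow ratio, following Eq. (1).

    flows[i, j] is the number of job switches from industry i to industry j.
    """
    flows = np.asarray(flows, dtype=float)
    row_sums = flows.sum(axis=1, keepdims=True)   # sum_j phi_ij
    col_sums = flows.sum(axis=0, keepdims=True)   # sum_i phi_ij
    total = flows.sum()                           # sum_ij phi_ij
    # Flow expected if switches were spread in proportion to the marginals.
    expected = row_sums * col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs with zero expected flow are set to 0 (an assumption of this sketch).
        return np.where(expected > 0, flows / expected, 0.0)

# Hypothetical flows between three industries (zero diagonal for simplicity).
flows = np.array([[0, 8, 2],
                  [6, 0, 4],
                  [1, 3, 0]])
S = skill_relatedness(flows)
print(np.round(S, 3))
```

Values above 1 mark industry pairs with more switches than the random benchmark would suggest, and values below 1 mark pairs with fewer.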

The skill-relatedness *S*_{i,j} is not symmetric in general: the flow of workers observed from industry *i* to *j*, relative to expectation, can differ from that observed from *j* to *i*. The matrix *S*_{i,j} is therefore made symmetric by averaging it with its transpose, and the values are re-scaled so that they range from -1 to 1:

$$ A_{i,j}=\frac{S_{i,j}+S_{j,i}-2}{S_{i,j}+S_{j,i}+2}. $$

(2)

We can consider *A*_{i,j} as the adjacency matrix of an undirected weighted network where the nodes are the industries and the weight *A*_{i,j} of the edge between nodes *i* and *j* represents how far, in relative terms, the labour flow between those two industries is from the random expectation. Note: only positive values of this matrix (that is, more job switches than expected) are preserved^{Footnote 3}. Full detail on these methodological considerations can be found in Neffke and Henning (2013).
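The symmetrisation in Eq. (2) and the retention of positive entries can be sketched as follows. The function names and the toy matrix are assumptions made for illustration; removing self-loops from the network is also an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def symmetrised_relatedness(S):
    """Apply Eq. (2): A_ij = (S_ij + S_ji - 2) / (S_ij + S_ji + 2)."""
    S = np.asarray(S, dtype=float)
    pair_sum = S + S.T
    # S is non-negative, so the denominator is at least 2.
    return (pair_sum - 2.0) / (pair_sum + 2.0)

def positive_adjacency(A):
    """Keep only positive entries (more switches than expected)."""
    W = np.where(A > 0, A, 0.0)
    np.fill_diagonal(W, 0.0)  # assumption: no self-loops in the network
    return W

# Hypothetical skill-relatedness values for three industries.
S_toy = np.array([[0.0, 3.0, 0.5],
                  [1.0, 0.0, 1.0],
                  [0.5, 1.0, 0.0]])
A_toy = symmetrised_relatedness(S_toy)
W_toy = positive_adjacency(A_toy)
print(np.round(A_toy, 3))
print(np.round(W_toy, 3))
```

In this toy case only the pair (0, 1) keeps an edge: its average relatedness is 2, which maps to a weight of 1/3, whereas an average of exactly 1 maps to 0 and anything lower maps to a negative value that is dropped.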

The industry network is visualized in Fig. 6, where industries (the nodes of the network) which belong to the same sector (official industry classification) have the same colour, and pairs of industries which have more switches than expected (*A*_{i,j}>0) are connected with an edge (so that the edges of the network represent pairs of industries with a high number of job switches or high skill-relatedness). The size of the node is proportional to its industry complexity. We observe natural clustering of industries based on shared skills or required know-how, in many cases along sectoral lines. For example, on the left-hand side, it is apparent that both social services (purple) and financial services (red) tend to share workers, while manufacturing industries (blue) exhibit a number of distinct clusters.

We can quantify the observed clustering of industries within sectors by comparing edge connectivity within sectors relative to what would be expected if edges were distributed randomly. The positive entries of the adjacency matrix *A* are shown in Fig. 7. Nodes are ordered in terms of sector, and sectors delineated by vertical and horizontal lines. The diagonal blocks correspond to within-sector edges. We observe clear within-sector clustering (high density of edges) in many cases, e.g., social services, transport and communications. Our measure of the density of edges between (distinct) sectors *G* and *G*^{′} is given by:

$$ \mu(G, G') =\frac{e_{G,G'}}{e}\frac{n(n-1)}{n_{G} n_{G'}}, $$

(3)

where *n* is the total number of industries, *n*_{G} is the number of industries in sector *G*, *e* is the total number of edges, and *e*_{G,G'} is the number of edges between nodes in sectors *G* and *G*^{′}. In essence, this computes the ratio of actual edges to possible edges between the industries in sector *G* and those in sector *G*^{′}, divided by the ratio of total edges to all possible edges (i.e., those of a complete graph). A value of *μ*(*G,G*^{′})>1 means that sectors *G* and *G*^{′} are more densely connected than expected if edges were distributed randomly.

For a single sector *G*, this expression is slightly modified:

$$ \mu(G) =\frac{e_{G,G}}{e}\frac{n(n-1)}{n_{G} (n_{G}-1)}. $$

(4)

As before, a value of *μ*(*G*)=*μ*(*G,G*)>1 indicates that switches within sector *G* are more frequent than expected.
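Eqs. (3) and (4) can be sketched on a small hypothetical network. The function name `sector_density`, the sector labels and the edge list are illustrative assumptions. The sketch also assumes that edges are counted as non-zero entries of the symmetric adjacency matrix (ordered pairs), which makes both expressions a ratio of block density to overall density; the text does not state the counting convention.

```python
import numpy as np

def sector_density(W, sectors, g, g_prime=None):
    """Relative edge density within sector g, or between g and g_prime."""
    edges = np.asarray(W) > 0            # new boolean array, W is untouched
    np.fill_diagonal(edges, False)       # ignore self-loops
    sectors = np.asarray(sectors)
    n = len(sectors)
    e = edges.sum()                      # ordered pairs: each edge counted twice
    in_g = sectors == g
    n_g = in_g.sum()
    if g_prime is None or g_prime == g:
        e_gg = edges[np.ix_(in_g, in_g)].sum()
        return (e_gg / e) * n * (n - 1) / (n_g * (n_g - 1))   # Eq. (4)
    in_gp = sectors == g_prime
    n_gp = in_gp.sum()
    e_ggp = edges[np.ix_(in_g, in_gp)].sum()
    return (e_ggp / e) * n * (n - 1) / (n_g * n_gp)           # Eq. (3)

# Hypothetical network: five industries in two sectors, five undirected edges.
sectors = ["a", "a", "a", "b", "b"]
W_net = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (2, 3)]:
    W_net[i, j] = W_net[j, i] = 1.0

mu_a = sector_density(W_net, sectors, "a")
mu_b = sector_density(W_net, sectors, "b")
mu_ab = sector_density(W_net, sectors, "a", "b")
print(mu_a, mu_b, mu_ab)
```

Here both sectors are fully connected internally while the network as a whole has half of its possible edges, so each within-sector value is about 2; only one of the six possible cross-sector pairs is linked, giving a between-sector value of about 1/3.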

We find that *μ*(Construction)=39.6, and hence workers are significantly more likely to switch between different construction industries than other industries. On the other hand, *μ*(Manufacturing)=0.8 and hence manufacturing industries do not form a single tightly connected cluster, but instead form connections with other industries that share specific skills and competencies.

Considering off-diagonal blocks, we observe significant connections between sectors. For example, switches between construction and mining and oil industries are denser than average connectivity in the network. It is clear that, while some of the structure of the network follows sector groupings of industries, much relatedness between industries is not captured by official sector classifications.

### Local industrial structure

We can think of the industry network as an ‘economic landscape’ that describes the possible diversification paths open to a city. Specifically, the location of the set of industries present in a city constrains its future development: cities tend to move into new economic activities that are proximate (in a skill - and consequently network - sense) to those already present (Nelson and Winter 1982; Hausmann et al. 2007a; Frenken and Boschma 2007; Neffke et al. 2011). Before we explore growth paths, we need to formally define “industry presence” with respect to changing city definitions.

For a given city definition, we will use a commonly deployed measure which captures the relative importance of an industry in the city given its overall distribution across the country. The *location quotient* *LQ*_{c,i}, also known as revealed comparative advantage, is defined as:

$$ {LQ}_{c,i} = \frac{femp_{c,i}/{wpop}_{c}}{\sum_{c} {femp}_{c,i}/ \sum_{c} {wpop}_{c}}. $$

(5)

An industry *i* is “present” in city *c* if the corresponding location quotient *LQ*_{c,i}>1, which means that a larger share of the population of city *c* works in industry *i* than at the national level (see the Appendix for details).
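A minimal sketch of Eq. (5) and the presence rule follows. The arrays `femp` (formal employment by city and industry) and `wpop` (the population term of each city) hold made-up numbers, and the function name is an assumption; the sketch also assumes that every industry has some employment nationally, so the denominator is never zero.

```python
import numpy as np

def location_quotient(femp, wpop):
    """Eq. (5): city share of employment in an industry over the national share."""
    femp = np.asarray(femp, dtype=float)   # shape (cities, industries)
    wpop = np.asarray(wpop, dtype=float)   # shape (cities,)
    city_share = femp / wpop[:, None]                  # femp_ci / wpop_c
    national_share = femp.sum(axis=0) / wpop.sum()     # sum_c femp_ci / sum_c wpop_c
    return city_share / national_share

# Hypothetical data: two cities, two industries.
femp = np.array([[50, 10],
                 [50, 90]])
wpop = np.array([100, 400])

LQ = location_quotient(femp, wpop)
present = LQ > 1   # industry i is "present" in city c when LQ_ci > 1
print(LQ)
print(present)
```

In this toy example the first city specialises in the first industry (location quotient 2.5) and the second city in the second industry (1.125), so each city has exactly one industry present.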

As a case study, we will explore the industrial structure of Medellín and its surrounding municipalities. Medellín is the second largest city in Colombia and is located in the Aburrá Valley. Several municipalities can be considered part of Medellín, including Envigado, Itagüí, La Estrella and Bello, which are connected to the core of the city by the Medellín Metro system. A clear definition of the limits of Medellín, however, is not obvious. For example, Rionegro is a municipality located to the east of Medellín. Although there is a considerable distance between Rionegro and Medellín, Rionegro is host to the *José María Córdova International Airport*, the second busiest airport in Colombia, which functions as an air hub for Medellín.

The set of industries present in Medellín is shown in Fig. 8. This set changes as the definition of the city changes (as we increase the commuting time threshold *τ*). Municipalities such as Copacabana, La Estrella, Envigado, Bello and Rionegro become part of the city and, as a result, some new industries develop location quotients higher than 1, thus becoming “present”.

Several examples are worth mentioning. The glass industry becomes present when Envigado joins Medellín. The firm in question is Peldar, which employs around 200 people in Envigado (and 1,200 nationwide) and is owned by one of the largest business groups in Colombia. Peldar manufactures crystal products and glass containers that are highly regarded in Colombia. The vehicle industry is also added to the industrial portfolio of the city as a result of Envigado, which hosts the Colombian subsidiary of Renault, the only big exporter of cars from Colombia. Renault has 1,100 workers, making it the second largest employer in Envigado - and by far the largest exporter: USD 317m in 2017 (or nearly 85% of the municipality’s exports). The electric motors and transformers industry is added when the small municipality of Copacabana is included in the city. This is due to one firm (Rymel) that produces electrical equipment and installations. Not all of the high complexity industries that are added produce manufactured goods. When the commuting radius is extended to *τ*=49 minutes, Rionegro becomes part of the city and “activities of airports” are added to the industry mix. Although direct employment generated by the Rionegro airport is small (180 workers), it makes the Medellín area a hub for air travel and, more significantly, gives the city an important advantage for exports of high-value products.

### Complexity potential

As Fig. 8 suggests, clusters of skill-related industries tend to be present in a city due to the fact that industries that use similar skills tend to co-locate, and new industries emerge that share similar skills to existing competencies. The question arises: can we capture, in a measure, the likelihood that a city moves into new industries more complex than those already present?

The *complexity potential* of a city, introduced by O’Clery et al. (2016), is a measure of the possibilities for the city to move to more complex industries that are not yet present in the city, taking into account the existing skills of the local labour force. To compute the complexity potential in the initial period (2008), we need to estimate the distance between city *c* and each missing industry *i*. This distance weighting factor *d*_{c,i}, also known as “density” in the literature (Hausmann et al. 2007b; Hausmann et al. 2007a; Hausmann et al. 2014a), is defined as

$$ d_{c,i}=\frac{\sum_{j \in N_{c}} A_{i,j}}{\sum_{j} A_{i,j}}, $$

(6)

where *N*_{c} is the set of industries that are present in city *c*. Effectively, it is the ratio of the edge weights connecting *i* to industries present in city *c* to the total edge weight (for edges connected to node *i*). Computed below for the set of currently missing industries, this quantity can be thought of as the likelihood that an industry appears, based on the current presence of industries with similar skills. The density of a city-industry pair varies with the commuting threshold *τ* through variation in the set of present industries in a city, *N*_{c}.
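Eq. (6) reduces to a matrix product between the presence indicator and the adjacency matrix. The sketch below uses a hypothetical four-industry network and a single city. The function name `density` and all numbers are assumptions, and industries with no edges are given a density of 0 by convention in this sketch.

```python
import numpy as np

def density(W, present):
    """Eq. (6): share of industry i's edge weight that links to present industries."""
    W = np.asarray(W, dtype=float)              # (industries, industries), non-negative
    present = np.asarray(present, dtype=bool)   # (cities, industries)
    # connected[c, i] = sum over j in N_c of W[i, j]
    connected = present.astype(float) @ W.T
    total = W.sum(axis=1)                       # sum_j W[i, j]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Isolated industries (zero total weight) get density 0 (assumption).
        return np.where(total > 0, connected / total, 0.0)

# Hypothetical positive adjacency matrix for four industries.
W_adj = np.array([[0.00, 0.50, 0.50, 0.00],
                  [0.50, 0.00, 0.00, 0.25],
                  [0.50, 0.00, 0.00, 0.75],
                  [0.00, 0.25, 0.75, 0.00]])
# One city in which industries 0 and 1 are present.
present_toy = np.array([[True, True, False, False]])

d = density(W_adj, present_toy)
print(np.round(d, 3))
```

For the missing industry 2, 0.5 of its total edge weight of 1.25 links to present industries, giving a density of 0.4; for industry 3 the corresponding figure is 0.25. If a boundary change made industry 2 present, the density of industry 3 would rise to 1.0, which is the mechanism through which the commuting threshold affects density.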

The complexity potential for a city is the weighted mean of the complexity of its missing industries, where the weight corresponds to the density above:

$$ {CP}_{c} = \frac{1}{|M_{c}|} \sum_{i\in M_{c}} d_{c,i} C_{i}, $$

(7)

where *M*_{c} denotes the set of “missing” industries for city *c* (where *LQ*<1), and *C*_{i}∈[0,1] is the normalised complexity of industry *i*. The complexity potential varies with the commuting threshold *τ* through variation in the density (and not through the complexity, which we fix).
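Eq. (7) can be sketched as an average over missing industries. The function name and the toy values for density, presence and complexity are assumptions made for illustration. The sketch treats every industry that is not present as missing, which is a simplification for the boundary case of a location quotient exactly equal to 1.

```python
import numpy as np

def complexity_potential(d, present, complexity):
    """Eq. (7): mean of density-weighted complexity over missing industries."""
    d = np.asarray(d, dtype=float)                    # (cities, industries)
    missing = ~np.asarray(present, dtype=bool)        # assumption: missing = not present
    complexity = np.asarray(complexity, dtype=float)  # (industries,), values in [0, 1]
    weighted = (d * complexity * missing).sum(axis=1) # sum over i in M_c of d_ci * C_i
    n_missing = missing.sum(axis=1)                   # |M_c|
    # Cities with no missing industries get a potential of 0 (assumption).
    return np.divide(weighted, n_missing,
                     out=np.zeros_like(weighted), where=n_missing > 0)

# Hypothetical values for one city and four industries.
d_toy = np.array([[0.5, 0.6, 0.4, 0.25]])
present_cp = np.array([[True, True, False, False]])
complexity_toy = np.array([0.2, 0.4, 0.8, 1.0])

cp = complexity_potential(d_toy, present_cp, complexity_toy)
print(cp)
```

With two missing industries, the result is (0.4 × 0.8 + 0.25 × 1.0) / 2 = 0.285. Because the complexity values are held fixed, any change in this figure across commuting thresholds comes from changes in the density and in the set of missing industries.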

The complexity potential of a city varies when more municipalities are added and the set of industries considered present changes. Continuing with our Medellín case study, Fig. 8 shows that the complexity potential increases when first Bello and then Envigado, Itagüí and La Estrella are added (the three municipalities located in the South of Medellín). Finally, it increases again at *τ*=49 which is when Rionegro (which hosts the airport) is added. The fact that the complexity potential of Medellín grows as more municipalities are added implies that the industries that are added increase the density - or likelihood of an appearance - of complex industries (i.e., the added industries increase proximity to complex industries in the industry network).

In the next section, we will investigate the extent to which the skills embedded in surrounding municipalities, such as those observed in Medellín, are predictive of a city’s ability to grow its formal employment rate.