The nested structure of urban business clusters

Although the cluster theory literature is abundant in economics and regional science, there is still a lack of understanding of how the geographical scales of analysis (neighbourhood, city, region) relate to one another and affect the observed phenomenon, and to what extent clusters are industrially bound or geographically consistent. In this paper, we cluster spatial economic activities through a multi-scalar approach following percolation theory. We consider both the industrial similarity and the geographical proximity of firms, through their joint probability function, which is constructed as a copula. This gives rise to an emergent nested hierarchy of geoindustrial clusters, which enables us to analyse the relationships between the different scales and specific industrial sectors. Using longitudinal business microdata from the Office for National Statistics, we look at the evolution of clusters, which span from very local groups of businesses to the metropolitan level, in 2007 and in 2014, so that the changes stemming from the financial crisis can be observed.


Introduction
According to Malmberg and Maskell (2002, p.430-1), "there are several reasons to take the issue of spatial clusters seriously. One is that spatial clustering is at the very core of what research in economic geography is all about. [...] There is a lot to learn about the role of proximity and place in economic processes by trying to pinpoint the driving forces that make for the agglomeration in space of similar and related economic activities [...] Second, this task has obvious policy relevance today". Interestingly though, the economic drivers and rationale behind the clustering of businesses might be at odds with the policy incentives to promote particular locations for institutionalised clusters. For example, the eastern Fringe of the City in London witnessed the rapid clustering of start-ups and businesses from the 'digital creative', tech and advertising industries in the aftermath of the 2008 crisis, around Shoreditch and Old Street (Foord, 2013). However, from the moment the digital cluster was recognised, labelled and institutionalised as 'Tech City' by local actors and eventually by the government (in 2011), the hype and investments by big players of the sector (Google, Cisco, Vodafone) contributed to pushing away the endogenous small actors of the cluster, who started relocating to (cheaper) neighbouring locations (Nathan and Vandore, 2014), following the spatial development of key amenities such as semipublic spaces and a diverse mix of building types and empty sites (Martins, 2015). Moreover, if the "current vitality emerges from the risky experimentation across co-located sectors in which hitherto unrelated knowledge and activities (for example, software and advertising) are being combined" (Foord, 2013, p.52), this suggests that a sectoral combination that is successful at present might not be so in the future, which should instead benefit newer risky combinations.
This highlights the need for a better understanding of the inner (industrial and spatial) dynamics of clusters and of the overarching organisation of urban economies driving individual firms' relocation strategies, for analytic purposes as well as for policy efficiency. In particular, identifying clusters and drawing up cluster policies has become mainstream since the influential contribution of Michael Porter in the 1990s (Porter, 1998). Nevertheless, there is no unique way to define a cluster, and the fuzziness of its original definition has made it "confusing" (Martin and Sunley, 2003). Within the literature, the term is used to refer to very local phenomena (e.g. the eastern Fringe of the City in London) as well as their regional counterparts (e.g. the South-East of the UK, which includes Greater London and the surrounding local authorities). On the one hand, there is no systematic way to define clusters, and on the other, the methods employed might contain hidden assumptions. Our contribution thus aims at making the delineation process explicit and transparent, but also at introducing tools beyond the traditional ones, such as percolation and network theory, allowing us to bridge the scale gap. Within the economic geography literature, one can identify two recurrent elements in the definition of clusters, discussed below.
The first one considers clusters as networks of inter-dependent firms and industries. Iammarino and McCann (2016, p.1023) summarise this idea by stating that industrial clusters are distinguished in terms of the nature of firms in the clusters and of their relations and transactions undertaken within clusters. In more classical definitions, we find similar descriptions of networks of firms. For example, Porter (1998, p.199) mentions interconnected companies and associated institutions as the core of clusters. Rosenfeld (1997, p.4) talks about the interdependence of firms, while Feser (1998, p.26) and Swann et al. (1998, p.139) talk about their relatedness. Simmie and Sennett (1999, p.51) insist on service companies being interconnected, while Roelandt et al. (1999, p.9) and Van den Berg et al. (2001, p.187) use the figure of the network to define clusters, even though they refer to producing firms in the first case (Roelandt et al., 1999) and to specialised organisations in the second case (Van den Berg et al., 2001). All in all, the element of networked firms of similar or interrelated industries is a constant of most definitions of industrial clusters.
The second broad element that we find in most definitions of clusters is a spatial reference. However, the concrete specification of this spatial reference is anything but precise and homogeneous across authors. For example, some definitions of clusters mention the geographical proximity of the connected firms (Porter, 1998; Rosenfeld, 1997; Enright, 1996), or the fact that they are closely located (Crouch and Farrell, 2001, p.163). Swann et al. (1998, p.139) define clusters as "a large group of firms in related industries, at a particular location", thus avoiding any precision about the scale and spatial extent of this agglomeration. Finally, the question of scale is also avoided by Van den Berg et al. (2001, p.187), as they allow networks of firms to have a "local or regional dimension". To confuse the picture a little more, Bergman and Feser (1999, p.2) "make a key distinction between clusters in economic space and clusters in geographic space". However, in the vast majority of cases, spatial industrial clusters tend to be identified first by the co-location of a set of interdependent firms or activities of a given industry, and second by the enclosing geographical unit in which they are located. More precisely, if this network happens to correspond to a territorial entity, the cluster becomes a local or regional cluster; otherwise it is left to other branches of economics to study. Unfortunately, these practices are not systematic and do not constitute reproducible methods.
In Park et al. (2019), the authors propose an ambitious systematic approach to hierarchical firm clustering, using labour flows estimated from LinkedIn profiles over the past twenty years in the US. They are thus able to compare the geographical and industrial aspects of firm clustering through labour flows. In general, they show that homogeneity regarding the dominant industrial specialisation of firms tends to be stronger than homogeneity regarding their dominant geographical location, although both are significant. They can also match market capitalisation at the firm level and skills at the individual level to assess the profiles of dynamic clusters. However, what they call "geoindustrial clusters" diverges from our own usage of the term: theirs refers to network communities of firms that have geographical and industrial attributes attached to them, whereas we call a "geoindustrial cluster" a group of geographical units which are close in terms of industry mix as well as in terms of travel proximity. In addition, their approach focuses on networks of firms given by labour transitions, which defines a very precise subset of firms, while we are interested in the spatial evolution of economic activities and hence consider all firms at the level of local units. We use exhaustive administrative data at the establishment level, allowing us to refine the industrial description of businesses in London. The data consists of plant-level business organisations, which contain a large diversity of activities, from which only the dominant industry is considered and scaled down to the establishment (plant) of each firm. Finally, through the percolation method, unlike clustering using community detection, local units can be dropped rather than included in a loose cluster, and the multi-scalar economic organisation of businesses in cities can be revealed.
Approaches based on network theory can be widely found in the literature. Among those, Catini et al. (2015) suggest a graph-based method of cluster definition which "takes into account the relational patterns among co-located activities". Using geolocated PubMed scientific publications as an indication of activity for the biomedical sector, they apply the City Clustering Algorithm (CCA; Rozenfeld et al., 2008) at a fixed distance of 1 km, and then identify clusters through k-shell decomposition. This amounts to a percolation process based on a single dimension (physical distance) between firms of a single industry (the biomedical sector), using a single threshold (1 km). In this paper, we present a network-based method which is similar to this framework but extends it to all sectors and all relevant thresholds for a multi-scalar approach. The percolation process is applied to small geographical units based on the copula of two measures: the travel time distance and the similarity of the industrial composition between each pair of units. This allows us to consider different resolutions of clusters, which give rise to a nested structure across geographical and industrial scales. Our approach provides insight into the relation between different cluster scales, which can help in understanding spillover effects for policy making. This piece of research is developed using longitudinal business microdata for London (see section 1), focusing on two years: one before and one after the financial crisis of 2008.

1 Materials and Methods

1.1 London's microdata and economic geography
According to the Greater London Authority, i.e. the metropolitan institution which comprises the City of London, 13 inner boroughs and 19 outer boroughs, there were 8.825 million residents in London in 2017, about 5.5 million jobs and just short of half a million local establishments. "Across London, the vast majority (86 per cent) of workplaces are part of very small firms; "micro-enterprises" employing less than 10 employees. [...] The London economy has specialisations in Professional, scientific and technical services; Finance and insurance; and Information and communication. Employment in these three industries is particularly concentrated in inner London, accounting for more than 33 per cent of jobs in Camden, Islington, Southwark and Westminster, almost 50 per cent of jobs in Tower Hamlets and over 70 per cent of jobs in the City of London in 2014. [...] By drawing in workers, tourists, and other visitors, central London areas also support jobs in accommodation, food, arts, entertainment, and retail services in the surrounding areas of inner London. In 2014, the combined Retail, and Accommodation and food services sectors for example accounted for around one in three employee jobs in Kensington and Chelsea, around one in four jobs in Newham and one in five in Haringey, with some evidence of recent growth in the number of jobs around the shopping centre developments in Stratford." (Girardi and Marsden, 2017, p.2-35). In order to draw a finer picture of the London economy, at the level of workplaces across the city at different points in time, we turned to a micro dataset recording business organisations in the UK, as well as their different establishments if they are based in multiple sites.
The Business Structure Database (BSD) "is derived primarily from the Inter-Departmental Business Register (IDBR), which is a live register of data collected by HM Revenue and Customs via VAT and Pay As You Earn (PAYE) records. [...] In 2004 it was estimated that the businesses listed on the IDBR accounted for almost 99 per cent of economic activity in the UK". This database is provided free of charge by the Office for National Statistics (ONS), although under secure access to preserve anonymity and non-disclosure. The ONS non-disclosure rule means that no information which could allow the identification of a particular enterprise can be extracted from the secure environment. Most of the time, this means that information needs to be aggregated over at least 10 enterprises or local establishments. However, a significant advantage of this database compared to any free-access source which allows singling out individual enterprises (such as Companies House) is that it is longitudinal, and hence we are able to look at changes in the distribution of firms over time.
In this paper, we use all the active local units of Greater London in 2007 and in 2014. They represent 550,000 active units in 2014, up from slightly over 400,000 in 2007. These local units were extracted from the 7.5 million units active at one point in time in the UK, using London postcodes as a filter (table 1). When local units are aggregated into administrative zones such as Lower Layer Super Output Areas (LSOAs), it becomes possible to compute some diversity and specialisation measures. For example, the entropy of local units (described by their 5-digit SIC codes) shows a heterogeneous picture (fig. 1).
The most diverse places in terms of industries are Central London (including the areas of Temple, the City, the South Bank and the West End) and the subcentres around Heathrow airport, Croydon and Wembley.
In terms of industrial specialisation, the Herfindahl-Hirschman Index (HHI) highlights the areas which have an industrial profile that differs strongly from the overall proportion of sectors. Areas with high HHI values thus have very specific profiles and concentrate some sectors in a relatively strong manner. For example, Temple appears as an outlier in the distribution of activities, whereas the other parts of Central London are very representative of the distribution of activities in London overall (figure 2).
In terms of industrial diversification, the Krugman index (K) "calculates the share of employment which would have to be relocated to achieve an industry structure equivalent to the average structure of the reference group" (Palan, 2010). It highlights zones (figure 3), mainly away from the main centres, which would need to relocate a large share of firms to achieve the reference profile. In reality, these zones correspond to residential areas with few firms, whereas dense economic centres have a large diversity of industries and would thus need a lower proportion of changes to match the profile of London as a whole, to which they each contribute more.
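The three measures above can be sketched as follows for a single zone described by its counts of local units per sector. This is a minimal illustration under stated assumptions: it uses the textbook forms (Shannon entropy with natural logarithm, HHI as the sum of squared shares, Krugman index as the sum of absolute differences between the zone's shares and the reference shares), which may differ in detail from the exact variants computed for figures 1 to 3. The function names and the toy counts are hypothetical.

```python
import math


def shares(counts):
    """Convert raw counts per sector into shares summing to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy(counts):
    """Shannon entropy of the sector shares (empty sectors contribute 0)."""
    return -sum(p * math.log(p) for p in shares(counts) if p > 0)


def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared sector shares."""
    return sum(p * p for p in shares(counts))


def krugman(counts, reference_counts):
    """Krugman index: sum of absolute share differences with a reference profile."""
    zone, reference = shares(counts), shares(reference_counts)
    return sum(abs(a - b) for a, b in zip(zone, reference))


# Toy example (made-up counts): a zone with two sectors out of four,
# compared with a perfectly even reference profile.
zone = [5, 5, 0, 0]
city = [25, 25, 25, 25]
print(entropy(zone), hhi(zone), krugman(zone, city))
```

With these definitions, a zone concentrated in few sectors has low entropy and high HHI, and a zone whose profile is far from the reference has a Krugman index approaching its maximum of 2.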

Defining geoindustrial proximity
In order to account simultaneously for geographical closeness and industrial similarity between LSOAs, we need to pick a measure of geographical distance and a measure of industrial similarity, to apply them to all pairs of LSOAs within a given city and then to combine them into a single measure of proximity.
Geographical distance. There are many ways to account for geographical distance. In the context of a city, the connectivity between two areas is better represented by the availability of public transport between them than by physical distance. We use the transportation network developed under the QUANT project for the following modes of transport: underground, buses and rail. The network collapses the three modes onto one layer, where the weight of each link is given by the fastest time it takes to go from one LSOA to another. The walking time (at 5 miles/hour) required when changing modes of transport is also taken into account. If, for any reason, there is no public transport connecting two LSOAs, we use walking time.
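As an illustration of this construction, the sketch below collapses several modal layers into one graph (keeping the fastest link between two zones), computes fastest travel times with Dijkstra's algorithm, and falls back to walking time when no public transport path exists. It is a simplified sketch, not the QUANT implementation: the data layout, the function names and the toy network are assumptions, and interchange walking times are not modelled separately (they could be encoded as additional links).

```python
import heapq

WALK_SPEED_MPH = 5.0  # walking speed quoted in the text


def collapse_modes(layers):
    """Merge per-mode link lists into one undirected graph, keeping the fastest link."""
    graph = {}
    for links in layers.values():
        for (a, b), minutes in links.items():
            for u, v in ((a, b), (b, a)):
                neighbours = graph.setdefault(u, {})
                neighbours[v] = min(minutes, neighbours.get(v, float("inf")))
    return graph


def fastest_times(graph, source):
    """Dijkstra: fastest time in minutes from source to every reachable node."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if elapsed > best[node]:
            continue  # stale queue entry
        for neighbour, minutes in graph.get(node, {}).items():
            candidate = elapsed + minutes
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


def travel_time(graph, walk_miles, origin, destination):
    """Public transport time if a path exists, otherwise walking time."""
    reachable = fastest_times(graph, origin)
    if destination in reachable:
        return reachable[destination]
    return 60.0 * walk_miles[(origin, destination)] / WALK_SPEED_MPH


# Toy network (hypothetical zones A-D and made-up times in minutes).
layers = {
    "underground": {("A", "B"): 4},
    "bus": {("A", "B"): 9, ("B", "C"): 6},
    "rail": {},
}
graph = collapse_modes(layers)
walk_miles = {("A", "D"): 1.0}  # D has no public transport link
print(travel_time(graph, walk_miles, "A", "C"))  # 10.0
print(travel_time(graph, walk_miles, "A", "D"))  # 12.0
```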
Industrial similarity. In order to compute the industrial similarity, we start by aggregating the business units by 2-digit SIC category (SIC2 level) for each LSOA, and by removing from the analysis all LSOAs containing fewer than 10 business units (due to the ONS non-disclosure condition). In doing so, we assume that "SIC categories are a reasonable measure of relatedness" (Bishop and Gripaios, 2007, p.1746). Local units are attributed to one of the 88 distinct 2-digit SIC categories. We then use the cosine similarity, $s_{ij} = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$ (equation 4), where $V_i$ and $V_j$ are the vectors of LSOAs $i$ and $j$ respectively, defined in the $n = 88$ dimensional space of the industrial categories.
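The similarity computation can be sketched as below, with each LSOA represented by its vector of unit counts per sector. The sketch uses three sectors instead of 88 for readability; the LSOA labels and counts are made up, and the function names are hypothetical.

```python
import math

MIN_UNITS = 10  # LSOAs with fewer units are excluded (non-disclosure condition)


def eligible(counts_by_lsoa):
    """Keep only LSOAs containing at least MIN_UNITS business units."""
    return {k: v for k, v in counts_by_lsoa.items() if sum(v) >= MIN_UNITS}


def cosine_similarity(v_i, v_j):
    """Cosine of the angle between two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    return dot / (norm_i * norm_j)


# Toy counts per sector (made-up values, 3 sectors instead of 88).
counts = {
    "lsoa_1": [6, 4, 0],
    "lsoa_2": [12, 8, 0],  # same industry mix as lsoa_1, twice the size
    "lsoa_3": [0, 0, 15],  # no sector in common with lsoa_1
    "lsoa_4": [2, 1, 0],   # fewer than 10 units: dropped
}
kept = eligible(counts)
print(sorted(kept))
print(cosine_similarity(kept["lsoa_1"], kept["lsoa_2"]))
print(cosine_similarity(kept["lsoa_1"], kept["lsoa_3"]))
```

Because the cosine only depends on the direction of the vectors, two LSOAs with the same industry mix but different sizes are maximally similar, and with non-negative counts the value always lies in [0, 1].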
Geoindustrial proximity. Instead of considering either geographical proximity or industrial similarity alone, we construct a probability function that takes both into account. Given that we use travel time as a measure of proximity, we need to transform the time $t$ into $x_t = 1/t$, so that a smaller time reflects a stronger connection. In addition, we normalise $x_t$ so that both variables lie within the same interval [0, 1]; by construction, the similarity $s$ given by eq. 4 already does.
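A possible version of this transformation is sketched below. The inversion x_t = 1/t comes from the text; the normalisation formula is not recoverable from it, so the min-max rescaling used here is an assumption made for illustration only, as are the function names and the toy travel times.

```python
def inverse_time(times):
    """Transform travel times t into x_t = 1/t (shorter time, stronger link)."""
    return [1.0 / t for t in times]


def min_max(values):
    """Rescale values linearly to [0, 1] (ASSUMED normalisation, not from the paper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant series")
    return [(v - lo) / (hi - lo) for v in values]


# Toy travel times in minutes between pairs of zones (made-up values).
times = [5.0, 10.0, 20.0]
x_t = min_max(inverse_time(times))
print(x_t)  # the shortest trip maps to 1.0, the longest to 0.0
```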
The joint probability function for $s$ and $x_t$ is constructed using a copula, a tool widely used in quantitative finance to model multivariate dependencies (Low et al., 2013). A copula of random variables $(X_1, \ldots, X_d)$ corresponds to the joint cumulative distribution function (CDF) $C : [0, 1]^d \to [0, 1]$ of the uniformly distributed marginals $(U_1, \ldots, U_d)$; see Joe (1997) for details. This means that we first need to transform $(x_t, s)$ into a bivariate vector $(U_1, U_2)$ with uniform marginals. We then construct the copula $C(u_1, u_2) = P(U_1 \le u_1, U_2 \le u_2)$ using the VineCopula package in R. We obtain similar results for both years; see fig. 4 for 2007. The main network of geoindustrial proximity between LSOAs is constructed as a fully connected network, weighted by the intensity of their geoindustrial proximity, which is encoded in the copula. It is defined as $G = (V, L)$, where the nodes $V = \{n_1, \ldots, n_N\}$ correspond to the $N$ LSOAs, and the links $L = \{p_{ij}\}$, for $i, j \in [1, N]$, carry the copula probabilities given by the CDF.
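To make the two steps concrete (transformation to uniform marginals, then evaluation of the joint CDF), the sketch below uses a rank transform and the empirical copula. This is a deliberately simpler stand-in for the parametric copula fitted with VineCopula in R, and it is not the authors' code: the function names and the toy values are assumptions.

```python
def pseudo_observations(values):
    """Rank-transform a sample to approximately uniform values in (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    u = [0.0] * n
    for rank, k in enumerate(order, start=1):
        u[k] = rank / (n + 1)
    return u


def empirical_copula(u1, u2):
    """Return C(a, b): the fraction of observations with U1 <= a and U2 <= b."""
    n = len(u1)

    def cdf(a, b):
        return sum(1 for x, y in zip(u1, u2) if x <= a and y <= b) / n

    return cdf


# Toy pairs of LSOAs: normalised inverse travel time and cosine similarity
# (made-up values, one entry per pair of zones).
x_t = [0.9, 0.1, 0.5, 0.7]
s = [0.8, 0.2, 0.9, 0.4]
u1, u2 = pseudo_observations(x_t), pseudo_observations(s)
copula = empirical_copula(u1, u2)

# Link weight p_ij of each pair: the copula CDF evaluated at its own (u1, u2).
weights = [copula(a, b) for a, b in zip(u1, u2)]
print(weights)
```

A pair of zones receives a weight close to 1 only when it ranks high on both travel proximity and industrial similarity, which is the property the thresholding step relies on.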

Clustering method
One of the main novelties of the proposed approach is that, instead of obtaining a single configuration of clusters in space, we derive a hierarchical structure by looking at the nested configuration at different scales. We do this by applying percolation theory, which has been successfully used in the past for this purpose. For example, Gallos et al. (2012) used obesity rates to cluster US states and identify spatial clusters of similar health behaviour. Arcaute et al. (2016) used the metric distance of road segments to produce a hierarchical clustering of the UK, showing that different distance thresholds highlight different spatial discontinuities in the road network. Molinero et al. (2017) extended the method using the angular distance, and obtained a classification of the importance of streets in the road network, in addition to deriving the main skeleton of urban systems without further assumptions. All these methods are based on the CCA developed in Rozenfeld et al. (2008, 2011). In this paper, we apply the algorithm derived in Arcaute et al. (2016) to the network $G$, and obtain the hierarchy from the multiplicity of transitions. The clusters are the result of a thresholding process such that $p_{ij} > p$, where the value of the threshold probability $p$ carries no direct interpretation other than that the higher $p$, the stronger the geoindustrial proximity. It is important to note that the copula obtained has an extremely small variance, which can be observed in fig. 4. This causes the main transitions of interest to occur right at the tip of the distribution, which corresponds to values of $p > 0.9$ for the CDF of the copula. Below this threshold, all LSOAs are close and similar enough to form a single giant cluster for the whole city.
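The thresholding step can be sketched as follows: for a given p, only links with p_ij > p are kept, and clusters are the connected components of the remaining graph. This minimal sketch uses a union-find structure and is not the implementation of Arcaute et al. (2016); the function name, the node indices and the link weights are made up for illustration.

```python
def clusters_at(n_nodes, links, p):
    """Connected components of the graph restricted to links with p_ij > p."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), p_ij in links.items():
        if p_ij > p:
            parent[find(i)] = find(j)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), set()).add(node)
    return sorted((frozenset(g) for g in groups.values()), key=min)


# Toy network of four zones with made-up copula weights.
links = {(0, 1): 0.9995, (1, 2): 0.995, (2, 3): 0.9992, (0, 3): 0.95}

for threshold in (0.9999, 0.999, 0.994):
    print(threshold, clusters_at(4, links, threshold))
```

In this toy case the strictest threshold leaves four isolated zones, 0.999 yields two clusters, and 0.994 merges everything into one. Isolated zones are returned as singletons here; they can be filtered out by size, which corresponds to dropping local units rather than forcing them into a loose cluster.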

Results
In the following, we look at the structure of businesses and its evolution in London between 2007 and 2014, that is, before and after the financial crisis. We first analyse how clusters are located and structured in London in 2014 and how they nest across scales using different thresholds (section 2.1). We then present the evolution of clusters between 2007 and 2014 (section 2.2). Finally, we show how these clusters specialise in two key sectors of the London economy, namely the knowledge-intensive sector and the retail and leisure industry (section 2.3).

The nested structure of businesses in London in 2014
In 2014, the tech sector and the post-Olympic industry were flourishing in the eastern part of central London. They are reflected in the clusters formed at the threshold of 0.994, shown in figure 5B. Among the largest ones in terms of the number of firms included, the City of London and its technological fringes appear as one cluster (in blue). So does Stratford (in mint green). We can also spot the finance cluster of Canary Wharf and the Docklands (in apple green). Other clusters feature in central London: the law and administrative cluster of Temple, the area around Kensington and Chelsea, and a banana-shaped cluster following the Thames and the train lines in South-West London. With a more restrictive threshold, for example 0.999 (figure 5A), LSOAs have to be very close and very similar to be aggregated into clusters. Therefore, clusters are much smaller. For example, the areas of Shoreditch and Aldgate appear as two different clusters (although they belong to the same cluster at the looser threshold of 0.994). We also identified London Bridge and Tottenham Court Road as small independent clusters. The exception is the cluster of Temple, which keeps roughly the same size and extent regardless of the threshold chosen, meaning that this cluster is coherent but consistently dissimilar to neighbouring clusters.
The hierarchical tree in figure 6 visualises the nested structure of the clusters, by showing how clusters at one level (of threshold value) are merged into bigger clusters at the level above (with looser thresholds). We can thus relate the clustering of London businesses across scales and identify the proximity not only between LSOAs but also between clusters, by looking at which clusters are fused sooner than others. For example, although Aldgate and Shoreditch are neighbouring clusters, they do not merge until the threshold of 0.994. Instead, Aldgate merges with the City of London and Shoreditch merges with Farringdon at the threshold of 0.996 (cf. figure 6, left-hand branches). The two merged clusters are fused with the South Bank cluster into the East Central cluster at the threshold of 0.994. Further on, this central cluster fuses with Kensington and Chelsea, the West Thames, the Docklands and other smaller clusters at the threshold of 0.993. The Olympic area of Stratford joins this giant cluster only at the level of 0.991. This structure thus highlights proximities between clusters which are usually absent from standard analyses. Let us now look at how this hierarchical organisation changes between 2007 and 2014.
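The tree can be derived from the partitions obtained at successive thresholds: each cluster at a stricter threshold is attached to the cluster that contains it at the next looser threshold. The sketch below illustrates this with toy partitions that loosely mirror the merges described above. Named areas stand in for groups of LSOAs, and the memberships are illustrative rather than taken from the data; the function name is hypothetical.

```python
def nesting(levels):
    """Link each cluster to the cluster containing it at the next looser threshold.

    `levels` is a list of (threshold, clusters) ordered from strictest to loosest.
    Returns (child_threshold, child, parent_threshold, parent) tuples.
    """
    edges = []
    for (t_child, children), (t_parent, parents) in zip(levels, levels[1:]):
        for child in children:
            for parent in parents:
                if child <= parent:  # subset test: child is nested in parent
                    edges.append((t_child, child, t_parent, parent))
                    break
    return edges


# Illustrative partitions only (not the paper's data).
areas = ["Aldgate", "City", "Shoreditch", "Farringdon", "South Bank"]
levels = [
    (0.999, [frozenset({a}) for a in areas]),
    (0.996, [frozenset({"Aldgate", "City"}),
             frozenset({"Shoreditch", "Farringdon"}),
             frozenset({"South Bank"})]),
    (0.994, [frozenset(areas)]),
]

for t_child, child, t_parent, parent in nesting(levels):
    print(t_child, sorted(child), "->", t_parent, sorted(parent))
```

Reading the edges from the strictest level upwards reproduces the branches of a tree such as the one in figure 6: clusters sharing a parent at a high threshold are closer to each other than clusters that only meet near the root.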

The evolution of clusters through the financial crisis
First of all, the 2007 tree has an overall different structure. In figure 6, a few branches on the left-hand side merge separately before being merged into a single giant cluster. In figure 7, a reduced version of this phenomenon occurs, but a giant cluster then takes over quickly and small clusters are gradually added to it at the threshold of 0.995 and over. Other differences refer to the "Tech cluster" [...] It is also interesting to notice that Temple has remained a coherent and differentiated cluster for all thresholds in both years (one could call it robust), whereas Canary Wharf only emerges as a top 10 cluster at the threshold of 0.992 in 2007, while it is already at the top at 0.997 in 2014. The redevelopment of the Docklands dates back to the 1980s; the area later became a major player in London finance, in addition to the City. After the financial crisis, many banks relocated from the City to Canary Wharf, taking advantage of lower rents, which also allowed Fintech startups to set up. In addition, when banks move, they bring with them the firms that provide them with different services. Such a move generates the relocation of a few dozen firms. This, together with the fact that existing banks in Canary Wharf dissolved to become financial outfits, explains the structural change, namely the surge of firms in the area in 2014.
Finally, some clusters which look prominent in 2007 have disappeared from the hierarchical tree in 2014. A notable example is that of Croydon, which experienced a decrease in job density, partly caused by the crisis in the finance and insurance sector "including Allianz Global Assistance, RA Insurance Brokers, and AIG Europe" (Girardi and Marsden, 2017, p.), as well as by urban redevelopments (the Nestlé tower for example). These changes are better interpreted by looking at the specialisation of each cluster throughout the percolation tree. The following section presents these results.

Industrial specialisation of clusters in the city
In terms of specialisation, we have looked at two broad sectors of the economy. The first sector aggregates knowledge-based industries (KBI), that is, plants whose dominant industry (in terms of [...] The comparison of all four trees shows two interesting areas. The unique cluster of Temple (which remains unchanged between 2007 and 2014 and throughout the thresholds) shows very low shares of both KBI and RAL sectors (figure 10). Indeed, this cluster is very coherent and similar, but in a different industry from these two. The same is true, to a lesser extent, of other very central areas in West London, around Hyde Park for example. On the other hand, a large mix of KBI and RAL characterises the geoindustrial cluster around Tottenham Court Road, where we find around 30% of KBI businesses and around 20% of RAL businesses in 2014, which is an over-representation of both sectors compared to the London average. This area is historically a retail one, but the presence of universities (among which UCL) has attracted publishing and science services companies.
We map in figure 10 the KBI and RAL concentration level of clusters at the threshold of 0.997, in 2007 and 2014. The highest percentage of KBI firms in clusters in 2007 seems to be found in the first ring around central London (including the clusters related to publishing in Southwark and Camden). In 2014, the clusters specialised in knowledge-based industries have expanded to Outer London. For example, KBI-dense clusters can be found around Hounslow, Harrow or Richmond. "Examples of technology companies in these areas include: IBM, Sega Europe, Cisco Systems and SAP offices at Bedfont Lakes Business Park in Hounslow [... They] also show a high level of specialisation in Professional, scientific and technical activities. Within the sector, Richmond upon Thames is particularly specialised in scientific research and development (1,700 jobs, IOS = 5.3). Examples of related employment sites in the area include the scientific parks and research centres associated with Kew Gardens, the National Physical Laboratory and LGC Group" (Girardi and Marsden, 2017, p.22-4). This expansion reflects the increasing share of KBI firms in London and their new location strategies in the outer boroughs, where office space is cheaper, but also the fact that, where Central London hosts KBI companies, it also hosts a more diverse set of companies.
Regarding the spatial pattern of Retail and Leisure (RAL) specialisation, we find two main areas of high concentration across the years: Kensington on the one hand, and Stratford and the Lea Valley on the other hand. "In Kensington and Chelsea, the main employers are in Retail (23,000 jobs) and Accommodation and Food services (19,000 jobs), likely reflecting the area's role in attracting visitors to London. [...] Examples of major employers in the sector within the borough include the department stores: Harrods, Peter Jones and Harvey Nichols in Knightsbridge" (Girardi and Marsden, 2017, p.12-27).
Our method shows that this specialisation holds at the borough level for the lower scale of 0.997 clusters, although with varying intensities between Kensington and Chelsea for example.
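The sector shares reported for each cluster can be computed by pooling the firms of its member LSOAs, as in the sketch below. The sector labels "KBI", "RAL" and "other" are used as opaque tags (the mapping from SIC codes to these groups is not reproduced here), and the counts and zone names are made up for illustration.

```python
def sector_shares(cluster, firms_by_lsoa):
    """Share of each sector among all firms located in the cluster's LSOAs."""
    totals = {}
    for lsoa in cluster:
        for sector, n_firms in firms_by_lsoa[lsoa].items():
            totals[sector] = totals.get(sector, 0) + n_firms
    all_firms = sum(totals.values())
    return {sector: n / all_firms for sector, n in totals.items()}


# Toy cluster of two LSOAs (hypothetical labels and counts).
firms_by_lsoa = {
    "lsoa_a": {"KBI": 30, "RAL": 10, "other": 60},
    "lsoa_b": {"KBI": 30, "RAL": 30, "other": 40},
}
cluster_shares = sector_shares({"lsoa_a", "lsoa_b"}, firms_by_lsoa)
print(cluster_shares)
```

Comparing these shares with the same quantities computed over all London LSOAs indicates whether a sector is over- or under-represented in a given cluster.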

Conclusion
With a multidimensional view of proximity which includes time distance and industrial similarity, this paper has offered a renewed take on geoindustrial clusters in London, one that pays particular attention to scales through the use of percolation theory. It has uncovered an evolution of the London economic geography which was not available through other methods, such as the reorganisation of the central London business structure post-crisis, allowing different clusters to co-exist alongside one another (City-Aldgate, Shoreditch-Farringdon, Notting Hill, Tottenham Court Road for example) rather than a hierarchical central cluster absorbing peripheral extensions as in 2007. We have highlighted changes regarding the structure and the specialisation of clusters in London. It should be noted that this work, through the methodological choices made, is limited firstly to an analysis of aggregate small areas rather than the network of firms through business links or workforce transitions. Secondly, the analysis of the present paper does not include economic links to external places, within national boundaries and more generally within the Global Value Chain (Sturgeon et al., 2008): "The processes of dispersal are not confined to the re-location of economic activity to some newly dynamic center where the agglomeration process can begin anew (Storper and Walker, 1989), but also include the unfolding - and perhaps historically novel - dynamics that are presently driving deep functional integration across multiple clusters (Dicken, 2003, 12), a process we refer to as global integration" (Sturgeon et al., 2008, p.299). Finally, we have restricted our view to the main sectors of KBI and RAL, leaving large parts of the service sector untouched by the analysis.
Despite these limitations, our hope is that, by providing a methodology for multiscale cluster analysis, we can encourage comparative work in other regional and national contexts, and unveil different nested structures to inform economic analysis.