Comparative analysis of course prerequisite networks for five Midwestern public institutions

Yang, Bonan; Gharebhaygloo, Mahdi; Rondi, Hannah Rachel; Hortis, Efrosini; Lostalo, Emilia Zeledon; Huang, Xiaolan; Ercal, Gunes

doi:10.1007/s41109-024-00637-z

Research
Open access
Published: 26 June 2024

Applied Network Science volume 9, Article number: 25 (2024)

Abstract

We present the first formal network analysis of curricular networks for public institutions, focusing on five Midwestern universities. As a first such study of public institutions, our analyses are primarily macroscopic in nature, observing patterns in the overall course prerequisite networks (CPNs) and Curriculum Graphs (CGs). An overarching objective is to better understand CPN variability and patterns across different institutions and how these patterns relate to curricular outcomes. In addition to computing well known network centrality measures to capture courses of importance in the CPNs studied, we have also formulated some newer methods with specific relevance to the curricular domains and corresponding graph types at hand. We have discovered that a new graph theoretic measure of node importance which we call reach, based on the well-known concept of reachability, is needed to more accurately express the critical nature of some introductory courses in a university. Another analytical novelty that we introduce and apply to the subject of CPNs is the Longest Paths Induced sub-Graph (LPIG) of the CPN, which yields information on relatively constrained programs and pathways. Finally, we have established a new connection between clustering of the CG and meta-majors at Southern Illinois University Edwardsville (SIUE), providing clusterings of the other public institution CGs as useful heuristics of major groupings as well. This work is born of collaboration between academic units and academic advising, with hopes of practical benefits towards aiding student advising.

Introduction

Prospective students of an institution may often find the course prerequisite graphs of Science, Technology, Engineering, and Mathematics (STEM) degree programs provided alongside curricular descriptions (SIUE Civil Engineering Course 2024; SIUE Computer Engineering Course 2024; SIUE Electrical Engineering Course 2024; BYU-Idaho 2024; Macalester College 2024; Wellesley College 2024; Washington and Lee University 2024). Computer Science students may once again come across the course prerequisite graph of required courses in their program in their data structures and algorithms classes in the context of Directed Acyclic Graph (DAG) algorithms. Providing their relevant course prerequisite subnetwork as an example not only serves to motivate the students in the topic of DAGs but also often serves as a great visualization and analysis tool to help them navigate their own course planning and scheduling. Despite the prevalent availability of DAGs representing the dependency structure of STEM degree programs, it is surprisingly difficult to find public datasets representing the entire course prerequisite network (CPN) of any institution, much less academic works analyzing such networks. To our knowledge, there are two prior works in the academic literature analyzing institutional CPNs: Stavrinides and Zuev (2023) analyzes the CPN of the California Institute of Technology (CalTech), and Aldrich (2015) analyzes the CPN of Benedictine University. Whereas the CalTech CPN is made public by the authors (Stavrinides and Zuev 2023), the Benedictine CPN was not provided publicly by Aldrich (2015).

In Aldrich (2015) the course prerequisite network at Benedictine University is encoded as a DAG visualized in Gephi (Heymann and Le Grand 2013), and some well known network science statistics are presented in relation to corresponding curricular questions. For example, node centralities express the roles of courses acting as hubs (degree centrality) or bridges (betweenness centrality) in the overall curriculum structure, while path lengths of prerequisite chains within a program yield lower bounds for completion time. The work Stavrinides and Zuev (2023) significantly extends CPN analyses for the case of the California Institute of Technology (CalTech) to additionally provide topological stratification of the CPN and interdependence analysis upon the derived curricular networks corresponding to university programs and divisions. Inter-subject relationships within the curriculum graph are implied to correspond to the fundamental relationships between the knowledge areas themselves, with high betweenness subjects appearing more interdisciplinary.

The CalTech and Benedictine CPN analyses of Stavrinides and Zuev (2023); Aldrich (2015) serve as important seminal works demonstrating the effectiveness of graph theoretic methods in understanding curricular questions. Although both CalTech and Benedictine are private institutions, the distinctions between their CPNs highlighted by Stavrinides and Zuev (2023) provide a glimpse of CPN variability. As the vast majority of undergraduate students in the United States are enrolled in public institutions (see Fig. 4 of IES NCES (2024)), a more complete picture of CPN variability and extractions of curricular patterns necessitates consideration of public institutions as well. That is the starting point for the present work.

We analyze the CPNs and derived curriculum graphs for 5 Midwestern public universities: Southern Illinois University Edwardsville (SIUE), Southern Illinois University Carbondale (SIUC), University of Illinois Urbana Champaign (UIUC), Missouri University of Science and Technology (MST), and University of Missouri Kansas City (UMKC). We include the CalTech CPN and curriculum graph in our comparative analyses as well both for context and to include additional, updated analyses of that network.

Our overarching objective is to better understand CPN variability and patterns across different institutions and how these patterns relate to curricular outcomes. As a first step towards that objective, several basic network statistical measures are compared across the different CPNs considered. Some of these measures like degrees, betweenness centralities, and diameter are immediately extracted via graph visualization tools such as Gephi (Heymann and Le Grand 2013) and relate approximately to curricular properties such as critical or important courses and critical course sequences respectively, as noted in Aldrich (2015) and Stavrinides and Zuev (2023). Upon extracting the course nodes achieving highest degrees and highest betweenness centralities across the different CPNs, we illustrate similarities, dissimilarities, and general patterns.

However, the importance or criticality of a course as expressed via high betweenness is of a different nature than the importance or criticality of a course as expressed via high out-degree, also noted by Stavrinides and Zuev (2023). As a more general objective, we wish to more deeply explore how different notions of node importance in a network (CPN) translate to specific notions of course criticality in the curricular landscape.

In the process of this exploration, we have discovered that a new graph theoretic measure of node importance is needed to more accurately express the critical nature of some introductory courses in a university. This notion, which we call reach, is simply the size of the breadth-first-search tree (reachability set) rooted at a node. In Stavrinides and Zuev (2023), PageRank centrality, which is the PageRank of the transpose network, was noted to better capture the critical nature of fundamental introductory courses compared to out-degrees and betweenness centralities. Whereas the application of PageRank centrality to the CPN has a similar motivation to reach and acts very similarly in many cases, it does not necessarily produce the same importance orderings. Reach is a meaningful notion of node importance in a DAG but less meaningful for general directed graphs and completely meaningless for undirected graphs, perhaps hinting at why reach has not been used as a measure previously. We demonstrate the importance of reach as a measure in extracting well-known critical introductory courses such as College Algebra (Goonatilake et al. 2013), and we compare the node importance rankings yielded by reach with those yielded by the PageRank centrality.

Another analytical novelty that we introduce and apply to the subject of CPNs is the Longest Paths Induced sub-Graph (LPIG) of the CPN. Given a length parameter d, $LPIG_d$ is the subgraph of the CPN induced by all nodes which lie on paths of length d or longer in the CPN. The LPIG is also a structure whose meaningful computation is highly dependent on the acyclic nature of the DAG: Whereas the longest path problem in general graphs is well known to be NP-complete (Garey et al. 1974), it is linear-time computable for DAGs (Sedgewick and Wayne 2011; Cormen et al. 2022). Given that each course along a prerequisite chain must be completed in a different term, the $LPIG_6$ gives information about highly constrained degree programs in a university. Comparison and contrast of LPIGs across different institutions provide further information about relative constraints of categories of degrees in addition to motivating discussion on corresponding student outcomes.

Our final novel application of graph theoretic algorithms and modeling towards understanding curricular outcomes concerns the structure and distribution of meta-majors. As stated in SIUE’s advising website (SIUE Meta-Majors 2024), instead of declaring a major up front, first-year students are grouped into 8 meta-majors according to their stated interests, for purposes of advising and tracking. As student retention, persistence, and timely graduation are amongst the important issues that the institution continually examines, a hope concerning meta-majors is that there should be sufficient connectedness between majors of a given meta-major so that a student starting out with unofficial declaration in one major of a meta-major may have the opportunity to switch to another major in the same meta-major without great waste of time and credits if the change of heart is detected soon enough (Burke 2020). We model this property in the language of complex networks as the problem of community detection, also called graph partitioning or clustering, in the Curriculum Graph of majors derived from the CPN. This modeling is motivated by the fact that the intra-meta-major connectivity requirement is precisely captured by the community detection objective that the connectivity within a community be notably higher than the connectivity between communities (Girvan and Newman 2002). This brings us to our last investigation: Upon applying modularity based clustering to the Curriculum Graph, examine the relationship between the resultant clusters and the meta-major subdivisions.

While we have stated our disparate research objectives, we wish to clarify aspects of the broader motivation for this work prior to proceeding to technical aspects and results. This work represents the first step by the authors towards addressing curricular and institutional questions that have arisen in various departmental committees and university working groups over the years at the authors’ respective institutions. A primary SIUE author chairs the Undergraduate Curriculum Committee in the Computer Science department and another SIUE author directs the SIUE Office of Academic Advising and architected the meta-majors at that institution: This collaboration arose during their work in a university-wide working group on Improving Persistence and Timely Graduation (IPTG). Both the institutional directives which initiated the IPTG working group and the content of the IPTG final report indicated a need to formally study prerequisite structure both within programs and across the institutional landscape with respect to properties of rigidity versus flexibility in addition to analyzing the composition of meta-majors with respect to questions of cohesiveness and minimization of excess credits upon intra-meta-major switching. Prerequisite relationships have a combination of artificially constructed and fundamental aspects, where some dependency relationships might be universally agreed upon inherent knowledge dependencies while other prerequisite dependencies may serve practical institutional advising purposes. Therefore, in the course of our investigations we discovered that it is best to first analyze the pattern and variation in prerequisite structure across relevant institutions, which forms the major emphasis of the present study.

Towards the question of selecting relevant institutions to study: This collaboration further expanded to involve existing collaborators from SIUC Computer Science and MST Mathematics departments, hence including those neighboring institutions as well. Already with SIUE, SIUC, and MST we cover some different institutional characteristics with respect to graduation rates, STEM versus general emphases, graduate versus undergraduate emphases, rural versus suburban environment, size, and selectivity. However, as we wished to include the consistently highest ranked public university across the Illinois and Missouri regions, we include UIUC in our study. The inclusion of UMKC in our study is originally due to the implementation of meta-majors in that institution, though we were subsequently unable to obtain data on specific meta-major composition there. Nonetheless, due to its student composition and persistence problems, meta-majors have generally been used as an advising method at UMKC, yielding some similarity to SIUE despite other institutional differences between the schools with respect to selectivity, graduate research orientation, and urbanicity. Upon selecting SIUE, SIUC, MST, UIUC, and UMKC, in addition to comparing with the previous work on CalTech, our sample incorporates sufficient variation in institutional profiles to form meaningful comparisons. With the caveat that much more work yet remains to answer many of the persistence related questions forming our original motivations, we now attempt to shed some light on broad patterns and variation within and across institutional CPN and curriculum networks for a meaningful sample of Midwestern public institutions.

Description of datasets, definitions, and methods

Datasets

The course information for the public institutions in this work is obtained from each school’s online course catalog. For the CalTech data, we used the dataset provided by Stavrinides and Zuev (2023). Such data is used to find prerequisites, co-requisites, cross-listings, and other dependency relationships. The outcome of this process is used as raw data towards generating graphs connecting the courses (CPN) and programs (CG).

Definitions and notations

CPN formation

All analyses in this work are based on the Course Prerequisite Networks (CPNs) extracted from the university catalogs mentioned above. As indicated in Stavrinides and Zuev (2023); Aldrich (2015), the CPN graph $G = (V, E)$ essentially captures the prerequisite relationships between courses by including, for each prerequisite $X \in V$ of course $Y \in V$, a directed edge $(X, Y) \in E$. Since prerequisites must be satisfied prior to the course itself, the CPN is a dependency graph and must be acyclic, forming a directed acyclic graph (DAG). Adopting the convention of Stavrinides and Zuev (2023), $X \prec Y$ denotes that course X is a prerequisite for course Y.

In this work, due to the discovery of a sizeable number of co-requisites, cross-listed courses, and other indicators of equivalent courses in the various course catalogs we have parsed, we need to modify and clarify our CPN to allow the vertex set V to be a partition of the course set. The vast majority of members of V are singleton sets whose correspondence with individual courses and the prerequisite relationship is straightforward. However, due to the existence of non-singleton sets in V we must generalize the prerequisite relationship to act between sets of courses in order to now properly define our CPN: Let S and T be disjoint sets of courses. Then

$$\begin{aligned} S \prec T \leftrightarrow \exists s \in S, t \in T \ni s \prec t \end{aligned}$$

(1)

Prior to specifying the graph construction notation, we take a moment to elaborate upon a few issues surrounding the parsing of the course catalog with respect to extracting prerequisite information. First, there is the issue of different conjunctions used in expressing prerequisite information. While the vast majority of courses have standard prerequisite listings connected by the conjunction AND, there are also situations in which the prerequisite list is a more general logical expression involving both AND and OR connectives. We acknowledge the differing semantics induced by OR versus AND connectives acting on the prerequisite courses, as prerequisites connected via the conjunction operator are absolute requirements while the others need not be. Nonetheless, we adopt the convention in Stavrinides and Zuev (2023) in which we do not distinguish between the different types of prerequisites listed for a course in forming the CPN DAG.

As a further detail concerning CPN formation, we note the allowance of corequisites and course equivalencies in the course catalogs. Co-requisites are instances in which a course X is permitted to be taken concurrently with course Y. In many cases, the purpose of stating co-requisites is to allow more scheduling flexibility for students despite the existence of some degree of knowledge dependence between the respective courses. Such situations are signified in the course catalog by the listing of a course Y as “prerequisite or corequisite” for course X without the mention of X in the prerequisite listing of Y. As we are considering the underlying dependence structure without solving scheduling in this work, we treat this type of situation as signifying a directed edge from Y to X but not vice versa, hence maintaining acyclicity. Namely, $Y \prec X$ but not $X \prec Y$.

The other complicating situations involve true corequisites in addition to generalized equivalencies of course sets. The vast majority of true corequisites comprise lecture and corresponding lab pairs which must be taken in the same term such that the courses in the pair share the identical course code excepting an additional “L” following the corresponding lab. For such lecture and lab pairs of corequisites, in our CPN graph we consider the pair as a merged course node with the common course code excluding the “L” suffix of the lab code.

The last situation, which was more difficult to parse automatically from the distinct course catalogs, is the situation of courses which are treated as equivalent or cross-listed as indicated by catalog terms such as “Same as”, “co-listed with”, or “cross-listed with”. In these cases too, consistent with the dependency characterization of the CPN structure, we have adopted the convention of merging sets of courses which are indicated to be equivalent in some catalog context. Given a set of equivalent courses $S = \{ C_1, C_2, \dots , C_k \}$, we consider the set of courses in S as a single merged course node in the CPN graph.

We note that the merging of course sets in the CPN based on lab-lecture co-requisite relations, cross-listings, co-listings, and other contexts of similarity induces an equivalence relation upon courses which are merged. Therefore, let us denote this relationship with $\equiv _C$ as follows given a pair of courses $C_1$ and $C_2$: $C_1 \equiv _C C_2 \longleftrightarrow$ $C_1$ and $C_2$ are represented by the same merged vertex in the CPN. Letting ${\mathfrak {C}}_I$ denote all the courses in a given institution I, the equivalence relation $\equiv _C$ induces a partition on ${\mathfrak {C}}_I$ which we denote as $V_I$:

$$\begin{aligned} V_I = \{ \{ x \mid x \equiv _C c \} \mid c \in {\mathfrak {C}}_I \} \end{aligned}$$

(2)

Clearly, each member of $V_I$ is an equivalence class $[c]^{\equiv _C}$ of some course c. Courses which were not involved in any merging in the CPN have singleton equivalence classes.

Now we may denote the CPN graph for each institution I, as $CPN_I = (V_I, E_I)$ where $V_I$ is the set of equivalence classes of courses, and the directed, unweighted edge set $E_I$ defines the set-generalized prerequisite relationship $\prec$. As the $CPN_I$ is a directed acyclic graph (DAG):

i. For any $c \in V_I$, $(c,c) \notin E_I$.
ii. For any $x, y \in V_I$, if a path $x \rightsquigarrow y$ exists in $CPN_I$, then there is no path from y to x.
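The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' parsing pipeline: the function names `build_cpn` and `is_acyclic`, the input format (course codes plus prerequisite and equivalence pairs), and the sample course codes are all hypothetical. Equivalent courses are merged with a union-find structure, edges follow the set-generalized relation of Eq. (1), and prerequisite pairs that fall inside a single merged class are dropped, since Eq. (1) is stated for disjoint sets.

```python
from collections import defaultdict


def build_cpn(courses, prereqs, equivalences):
    """Return (vertex_of, edges) for a CPN whose vertices are merged course classes.

    courses      -- list of course codes (hypothetical input format)
    prereqs      -- (x, y) pairs meaning x is a prerequisite of y
    equivalences -- (a, b) pairs of courses to merge (labs, cross-listings, ...)
    """
    parent = {c: c for c in courses}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in equivalences:
        parent[find(a)] = find(b)

    # Each equivalence class of courses becomes one CPN vertex (a frozenset).
    classes = defaultdict(set)
    for c in parent:
        classes[find(c)].add(c)
    vertex_of = {c: frozenset(classes[find(c)]) for c in parent}

    # Eq. (1): S precedes T if some s in S is a prerequisite of some t in T.
    # Assumption: pairs inside one merged class are skipped (no self-loops).
    edges = {(vertex_of[x], vertex_of[y]) for x, y in prereqs
             if vertex_of[x] != vertex_of[y]}
    return vertex_of, edges


def is_acyclic(vertices, edges):
    """Kahn's algorithm: True iff all vertices can be removed in topological order."""
    indeg = {v: 0 for v in vertices}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in indeg if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed == len(indeg)


# Hypothetical catalog fragment: a lecture/lab pair is merged into one vertex.
courses = ["MATH 101", "MATH 102", "CHEM 101", "CHEM 101L", "CHEM 201"]
prereqs = [("MATH 101", "MATH 102"), ("MATH 101", "CHEM 101"),
           ("CHEM 101L", "CHEM 201")]
vertex_of, edges = build_cpn(courses, prereqs, [("CHEM 101", "CHEM 101L")])
```

In this toy input the lecture and lab share one vertex, so the edge out of the lab and the edge into the lecture both attach to the merged node. Running `is_acyclic` on the result checks conditions i and ii together.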

In addition to the above notations for the CPN graph and the corresponding node and edge sets, we will refer to the adjacency matrix of a graph G as A(G) or simply A when the context is clear. When discussing node centralities, especially PageRank centrality, sometimes it will be useful to refer to the transpose CPN, whose adjacency matrix $A^T = A^T(CPN_I)$ is the transpose of the adjacency matrix of $CPN_I$. Recall that the transpose matrix $A^T$ is defined as follows, for any matrix A:

$$\begin{aligned} A_{rc}^T = A_{cr} \end{aligned}$$

(3)

Likewise, the transpose $E^T$ of an edge set E is simply defined as the set of edges with all arrow directions reversed, which formally may be represented as:

$$\begin{aligned} E^T = \{ (x,y) \mid (y,x) \in E \} \end{aligned}$$

(4)

Curriculum graph formation

As detailed in Stavrinides and Zuev (2023), in order to perform a more macro level analysis of the relationships between departments and units in the curricular landscape, we derive the Curriculum Graph (CG) from the CPN where each node represents a major code M and directed edges $(M_1, M_2)$ between major codes $M_1$ and $M_2$ are weighted according to the number of edges in the CPN from nodes with major code $M_1$ into nodes with major code $M_2$. For example, if there are exactly 5 prerequisite edges in the CPN from courses with major code MATH into courses with major code PHIL, then there is an edge in the CG of weight 5 from node MATH to node PHIL. Note that the CG need not be acyclic although it is derived from an acyclic CPN, as different pairs of courses contribute to the existence and weights of edges in the CG. For example, an introductory computer science (CS) course may be a prerequisite to an upper level mathematics (MATH) course in numerical methods, while other introductory mathematics courses might be prerequisites to intermediate computer science courses, forming anti-parallel edges of different weights from CS to MATH and from MATH to CS separately, inducing a simple cycle in the CG.

As in the case of the CPNs as detailed in the prior section, we must address the treatment of courses that are in the same equivalence class but in different majors. Recall that two courses are only in the same equivalence class if they are either co-requisites, co-listed, crosslisted, or described to be equivalent with statements such as “same as” in the catalog. As such, the existence of courses in the same equivalence class but different majors is an indication of some level of symmetric relationship between the majors. Therefore, every instance of such an equivalent pair of courses $c_1$ and $c_2$ involving distinct majors $M_1$ and $M_2$, respectively, will contribute an additional weight of $+1$ to both the edge $(M_1, M_2)$ and the edge $(M_2, M_1)$.

We may refer to the curriculum graph for institution I as $CG_I = (M_I, {\hat{E}}_I, w_I)$ where $M_I$ is the set of majors with distinct major codes, and ${\hat{E}}_I$ is the set of directed, weighted edges between majors derived from $CPN_I$ with weight function $w_I$. It will be useful to overload the notation to also define the major function applied to any course $c \in {\mathfrak {C}}$ as $M_I(c)$, meaning the major code $m \in M_I$ associated with the course. We note that the major codes also necessarily partition the course set ${\mathfrak {C}}$ but not the vertices of the CPN. We may therefore also define another equivalence relation $\equiv _M$ upon courses such that $c_1 \equiv _M c_2$ if and only if $M_I(c_1) = M_I(c_2)$. Similarly, denote by equivalence class $[c]^{\equiv _M}$ the set of courses in the same major as course c. And, for any $m \in M_I$, and any c such that $m = M_I(c)$, let $C(m) = [c]^{\equiv _M}$.

For any major code, $m \in M_I$, we also overload our notation to extend to function $V_I(m) \subset V_I$ as the set of vertices in the $CPN_I$ corresponding to m, namely:

$$\begin{aligned} V_I(m) = \{ v \in V_I \mid \exists c \in v \ni M_I(c) = m \} \end{aligned}$$

(5)

Consider again the situation of courses which are equivalent with respect to $\equiv _C$ but not equivalent with respect to $\equiv _M$: Sometimes pairs of courses in different majors that are nonetheless cross-listed with each other exist. Due to such situations, note that the sets $V_I(m)$ need not be disjoint, and in fact overlap between $V_I(m_1)$ and $V_I(m_2)$ signifies a strength of connection between $m_1$ and $m_2$ in the Curriculum Graph.

Now we may exactly define ${\hat{E}}_I$ and $w_I$. For any distinct $m_1, m_2 \in M_I$:

$$\begin{aligned} (m_1, m_2) \in {\hat{E}}_I \Longleftrightarrow (\exists c_1 \in C(m_1), c_2 \in C(m_2) \ni c_1 \prec c_2) \vee (V_I(m_1) \cap V_I(m_2) \ne \emptyset ) \end{aligned}$$

(6)

Regarding weight function $w_I$, $w_I(x,y) = 0$ if and only if $(x,y) \notin {\hat{E}}_I$. For any $m_1, m_2 \in M_I$ such that $(m_1, m_2) \in {\hat{E}}_I$:

$$\begin{aligned} w_I(m_1,m_2) = |\{ (c_1, c_2) \in C(m_1) \times C(m_2) \mid c_1 \prec c_2 \} |+ |(V_I(m_1) \bigcap V_I(m_2)) |\end{aligned}$$

(7)

The adjacency matrix $A = A(CG_I)$ holds both edge and weight information as follows:

$$\begin{aligned} \forall x, y \in M_I, A_{xy} = w_I(x, y) \end{aligned}$$

(8)
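As a minimal sketch of Eqs. (6) to (8), the weights can be accumulated in two passes: one over course-level prerequisite pairs that cross major codes, and one over merged CPN vertices that contain courses from more than one major. The function name `curriculum_graph`, the input dictionaries, and the sample course codes are hypothetical, and the first term is counted over course pairs as written in Eq. (7).

```python
from collections import Counter


def curriculum_graph(major_of, prereqs, vertex_of):
    """Return {(m1, m2): weight} for the Curriculum Graph, following Eq. (7).

    major_of  -- dict: course -> major code
    prereqs   -- course-level (c1, c2) pairs, c1 a prerequisite of c2
    vertex_of -- dict: course -> label of its merged CPN vertex
    """
    weights = Counter()

    # First term of Eq. (7): prerequisite pairs going from major m1 into m2.
    for c1, c2 in set(prereqs):
        m1, m2 = major_of[c1], major_of[c2]
        if m1 != m2:
            weights[(m1, m2)] += 1

    # Second term: every merged vertex shared by two majors adds 1 both ways.
    majors_in_vertex = {}
    for course, v in vertex_of.items():
        majors_in_vertex.setdefault(v, set()).add(major_of[course])
    for majors in majors_in_vertex.values():
        for m1 in majors:
            for m2 in majors:
                if m1 != m2:
                    weights[(m1, m2)] += 1

    return dict(weights)


# Hypothetical example: anti-parallel edges between MATH and CS.
major_of = {"MATH 150": "MATH", "MATH 465": "MATH", "MATH 456": "MATH",
            "CS 140": "CS", "CS 240": "CS", "CS 340": "CS", "CS 456": "CS"}
prereqs = [("MATH 150", "CS 240"), ("MATH 150", "CS 340"),
           ("CS 140", "MATH 465")]
vertex_of = {c: c for c in major_of}
vertex_of["CS 456"] = vertex_of["MATH 456"] = "CS 456 / MATH 456"  # cross-listed

cg = curriculum_graph(major_of, prereqs, vertex_of)
# cg == {("MATH", "CS"): 3, ("CS", "MATH"): 2}
```

The result has edges in both directions between MATH and CS, so the CG contains a cycle even though the underlying CPN is acyclic. Pairs of majors that never appear as keys have weight 0, matching the convention that $w_I(x,y) = 0$ exactly when the edge is absent.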

Node centralities

Various centrality measures are applied to rank nodes in the CPN and CG to extract information about the relative criticality of courses and majors in the curricular landscape. We follow the standard definitions of centrality measures such as betweenness centrality (BC) (Brandes 2001) and out degree centrality (Freeman 1977). In applying PageRank (Page et al. 1999) to analyze the CPN, we follow the convention of Stavrinides and Zuev (2023) in denoting the PageRank centrality of a node (course) in the CPN to be the node’s PageRank in the transpose of the CPN, which we also refer to as the transpose PageRank for clarity. The reason for taking the transpose of the CPN prior to application of PageRank for the purposes of extracting relative node importance is due to the meaning of edges in the CPN versus their meaning in the World Wide Web (WWW) in the original PageRank paper (Page et al. 1999): A course Y depends on a course X when X is a prerequisite for Y, denoted by the edge (X, Y) in the CPN. But, a website Y depends on another website X when the direct link (Y, X) exists in the WWW. Therefore, PageRank centralities correspond to the PageRank values of the transpose CPN, namely $CPN^T_I$ as described in Sect. 2.2.1. We elaborate on the computation of PageRank centrality in the Methods Sect. 2.3. Presently, we continue precisely defining other commonly used centrality measures.
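To make the role of the transpose concrete, the following sketch reverses every CPN edge as in Eq. (4) and then runs a plain power-iteration PageRank. It is not the computation described in the Methods section: the damping value 0.85, the fixed iteration count, the uniform redistribution of rank from nodes without out-links, and the function names are all assumptions chosen for illustration.

```python
def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; nodes without out-links spread rank uniformly.

    The damping value and the dangling-node rule are illustrative assumptions.
    """
    nodes = list(nodes)
    n = len(nodes)
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not succ[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in succ[u]:
                new[v] += damping * rank[u] / len(succ[u])
        rank = new
    return rank


def transpose_pagerank(nodes, edges, **kwargs):
    """PageRank centrality of a CPN: PageRank on the edge-reversed graph."""
    reversed_edges = [(y, x) for x, y in edges]  # Eq. (4)
    return pagerank(nodes, reversed_edges, **kwargs)


# Hypothetical prerequisite chain A -> B -> C (A is the introductory course).
chain_nodes = ["A", "B", "C"]
chain_edges = [("A", "B"), ("B", "C")]
tpr = transpose_pagerank(chain_nodes, chain_edges)
pr = pagerank(chain_nodes, chain_edges)
```

On this chain the transpose PageRank ranks the introductory course A highest, whereas PageRank on the untransposed CPN ranks the terminal course C highest, which is why the transpose is the variant of interest for course criticality.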

Given a directed graph $G = (V, E)$, we use $k_{in}(i)$ and $k_{out}(i)$ to denote the in-degree and out-degree of node i, respectively. In terms of adjacency matrix A, $k_{in}(i)$ and $k_{out}(i)$ are given by (9).

$$\begin{aligned} k_{\text{in}}(i)=\sum _{j=1}^n A_{j i} \quad \text{and} \quad k_{\text{out}}(i)=\sum _{j=1}^n A_{i j} \end{aligned}$$

(9)
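Equation (9) amounts to column sums and row sums of the adjacency matrix. A small sketch, with the matrix stored as a list of rows and the helper name `degrees` chosen only for illustration:

```python
def degrees(A):
    """Return (k_in, k_out) for a square adjacency matrix A given as a list of rows."""
    n = len(A)
    k_out = [sum(A[i][j] for j in range(n)) for i in range(n)]  # row sums
    k_in = [sum(A[j][i] for j in range(n)) for i in range(n)]   # column sums
    return k_in, k_out


# Edges 0 -> 1, 0 -> 2, 1 -> 2 of a three-course toy CPN.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
k_in, k_out = degrees(A)
# k_in == [0, 1, 2], k_out == [2, 1, 0]
```

Applied to the weighted adjacency matrix of a CG, the same sums give weighted in- and out-degrees (strengths) rather than neighbor counts.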

The betweenness centrality of node $i \in V$ can be written as (10), in which $\sigma (s,t)$ is the total number of shortest paths from node s to node t, and $\sigma (s,t|i)$ is the number of these shortest paths that pass through i.

$$\begin{aligned} \beta (i) = \sum _{s \ne i,t \ne i } \frac{\sigma (s,t|i)}{\sigma (s,t)} \end{aligned}$$

(10)
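The sum in Eq. (10) can be evaluated for every node with the accumulation scheme of Brandes (2001). The sketch below is an unnormalized version for directed, unweighted graphs, written from the standard description of that algorithm. The function name and the toy graph are hypothetical, and pairs with no connecting path are taken to contribute zero.

```python
from collections import deque


def betweenness(nodes, edges):
    """Unnormalized betweenness of Eq. (10) on a directed, unweighted graph."""
    nodes = list(nodes)
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    bc = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in succ[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)

        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Toy CPN: A -> B -> D -> E and A -> C -> D.
bc = betweenness("ABCDE", [("A", "B"), ("A", "C"), ("B", "D"),
                           ("C", "D"), ("D", "E")])
# bc == {"A": 0.0, "B": 1.0, "C": 1.0, "D": 3.0, "E": 0.0}
```

Node D acts as the bridge here: every shortest path into E passes through it, while B and C each carry half of the two shortest paths from A to D and from A to E.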

Reach

A simple measure of importance for a node x is the number of nodes that are reachable from x, where the reachability set is computable in linear time $\Theta (|E|+|V|)$ using breadth-first search (BFS) or depth-first search (DFS) rooted at x (Cormen et al. 2022). Node y is reachable from node x in graph $G = (V,E)$ if either $y = x$ or there exists a path $x \rightsquigarrow y$ from x to y in G. We extend this definition naturally towards a useful graph statistic named reach as follows: Given graph $G = (V,E)$ and vertex $v \in V$

$$\begin{aligned} reach(v) = |\{ u \in V \mid \exists v \rightsquigarrow u \} |\end{aligned}$$

(11)

Equivalently, note that,

$$\begin{aligned} reach(v) = |BFS(v) |\end{aligned}$$

(12)

where BFS(v) is the BFS tree rooted at v.

In the context of a CPN, if a course d is reachable from course c, then either d = c or c lies on a prerequisite chain leading to d. Therefore, the reach of a course c in the CPN counts c itself together with every course for which c is a direct or indirect prerequisite. This precise meaning yields the high relevance of reach as a measure of study in the CPN context.
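A direct way to compute reach is a breadth-first search that counts visited nodes, as in Eq. (12). The sketch below assumes the CPN is given as a successor dictionary; the function name `reach` follows the text, while the course labels are hypothetical. Because the definition treats a node as reachable from itself, the count includes the starting course, and subtracting one gives the number of strictly dependent courses.

```python
from collections import deque


def reach(succ, v):
    """Number of nodes reachable from v (v included), via breadth-first search."""
    seen = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in succ.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)


# Hypothetical CPN fragment: successor lists map a course to its dependents.
succ = {
    "ALG": ["PRECALC", "STAT"],
    "PRECALC": ["CALC1"],
    "CALC1": ["CALC2", "PHYS"],
}
reach_alg = reach(succ, "ALG")      # 6: ALG, PRECALC, STAT, CALC1, CALC2, PHYS
reach_calc1 = reach(succ, "CALC1")  # 3: CALC1, CALC2, PHYS
```

Both ALG and CALC1 have out-degree 2 in this fragment, yet their reach differs by a factor of two, which illustrates how reach separates introductory courses from mid-chain courses that out-degree alone cannot distinguish.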

While reachability is a well-known concept in classical graph theory, we are unaware of any mention of the usage of reach, or any equivalent variation under a different name, as a network statistic or centrality measure. This may be due to the relative emphasis on undirected graphs in network science due to symmetries in many complex networks. In fact, reach is not a distinguishing characteristic of a node in an undirected graph, as any two nodes in the same component will have the same reach, namely their component size. Likewise, for directed graphs that are not DAGs, the Strongly Connected Component (SCC) size of a node is a lower bound for its reach, again relating all nodes in the same SCC. It is really in DAGs that reach is more meaningful as a distinguishing measure of node importance, hence the usage of reach in this work.

We conclude our introduction of the measure reach by noting the uniqueness of the information conveyed by reach in a DAG compared to all other known centrality measures considered. While we shall observe some correlation between the lists of highest reach nodes and highest transpose PageRank (tPR) nodes in some results, it is not difficult to construct an infinite class of DAGs for which the highest reach and highest tPR nodes differ for every setting of the damping factor. A simple example of such an extremal graph is given in Fig. 1. The highest reach of that network is achieved by node 1 whereas the highest tPR is achieved by node 7, independent of the damping factor.

Longest paths induced sub-graph

As paths in the CPN represent pre-requisite chains, the length of the longest path leading to a course corresponds to the number of terms required to complete that course in the curriculum.^{Footnote 1} Courses that are sink nodes of relatively long prerequisite chains constrain the schedules of their respective degree programs. And, degree programs that have higher numbers of such constraining courses (and the chains that lead to them) are likely candidates for further analysis of how to aid student persistence throughout the completion of the curricula. Noting that 4 years is the standard time for undergraduate degree completion at a university, sample advertised curricula for undergraduate degree programs are all based on the 4 year degree goal. Moreover, with the exception of CalTech which is on the quarter system, all of the public institutions involved in this study operate on the semester system, where the standard number of terms per year (excluding the summer term) is 2, leading to 8 term graduation goals. In general, given a standard graduation goal of n terms, we wish to identify and analyze courses and degree programs which are involved in paths of length $t = n-2$ or more. Noting that path lengths are traditionally expressed as the sum of edge weights, which for an unweighted graph is the number of edges along the path, the number of nodes along the path is one more than the path length. We formulate a new graph theoretic construct called longest paths induced sub-graph $LPIG_t$ that permits exactly such identification and analysis. As the $LPIG_t$ is based on the computation of longest paths in a graph, we first discuss the computation of longest paths.

Like reach, the longest paths problem has limited applicability for general graphs but high relevance for DAGs (like the CPNs). Unlike reach, however, the limited applicability of longest paths in general graphs arises from computational concerns: the longest path problem in both general directed graphs and undirected graphs is NP-complete by a reduction from the Hamiltonian Path problem (Cormen et al. 2022; Sedgewick and Wayne 2011). In DAGs, however, the longest path problem is solvable in linear time by updating path estimates from vertices considered in topological sort order, which is obtained as the reverse finish-time order of a depth-first search (DFS) of the graph (Cormen et al. 2022). Likewise, deciding the existence of a path of length at least k (for some given k) is in the class P when restricted to DAGs. However, the problem of enumerating all sufficiently long paths (of length at least k) is not solvable in polynomial time even under the DAG restriction, because the number of such paths may be exponential. Nonetheless, finding all nodes involved in sufficiently long paths in a DAG is solvable in polynomial time with the same kind of dynamic programming that computes a single sufficiently long path: ties among predecessor nodes can be memoized and then followed during backtracking. In practice, when the number of sufficiently long paths in a DAG is not overwhelming, it may also be useful to simply compute all such paths, as that computation is polynomial in the number of such paths. Regardless of the choice of method, we emphasize that computing all nodes involved in sufficiently long paths in a DAG is a feasible problem, and this is precisely the set of nodes in the LPIG.
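The linear-time computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it obtains the topological order with Kahn's algorithm rather than the DFS finish-time order mentioned in the text, and the course codes and the function names `topological_order` and `longest_path_into` are hypothetical.

```python
from collections import deque


def topological_order(adj, nodes):
    """Return the nodes of a DAG in topological order (Kahn's algorithm)."""
    indeg = {v: 0 for v in nodes}
    for u in nodes:
        for w in adj.get(u, ()):
            indeg[w] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj.get(u, ()):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


def longest_path_into(adj, nodes):
    """Map each node v to the number of edges on the longest path ending at v."""
    longest_in = {v: 0 for v in nodes}
    # Every predecessor of w is processed before w, so the estimate for
    # w is final by the time w itself is visited.
    for u in topological_order(adj, nodes):
        for w in adj.get(u, ()):
            longest_in[w] = max(longest_in[w], longest_in[u] + 1)
    return longest_in


# Hypothetical prerequisite edges (prerequisite -> course).
adj = {
    "MATH1": ["MATH2"],
    "MATH2": ["MATH3"],
    "MATH3": ["PHYS2"],
    "PHYS1": ["PHYS2"],
}
nodes = ["MATH1", "MATH2", "MATH3", "PHYS1", "PHYS2"]

longest_in = longest_path_into(adj, nodes)
print(longest_in)
```

Each node and each edge is handled a constant number of times, so the running time is linear in $|V| + |E|$.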

Notationally: Given an unweighted DAG $G = (V,E)$ with $n = |V|$, let $k$ be given such that $1 \le k \le n-1$. A node $v \in V$ is said to lie on a sufficiently long path with respect to $k$ iff there exist $x, y \in V$ and a path

$p = \langle e_1, e_2, e_3, \dots , e_{k+d} \rangle = \langle (x, u_1), (u_1, u_2), (u_2, u_3), \dots , (u_{k-1+d}, y) \rangle$

such that $|p| \ge k$, meaning $d \ge 0$, and $v \in \{ x, y, u_1, u_2, u_3, \dots , u_{k-1+d} \}$.

Given DAG $G = (V, E)$, let

$$\begin{aligned} V^k = \{ v \in V \mid v \text { lies on a sufficiently long path w.r.t. } k \} \end{aligned}$$

(13)

Then the induced sub-graph is $LPIG_k(G) = (V^k, E^k)$, where

$$\begin{aligned} E^k = \{ (x,y) \in V^k \times V^k \mid (x,y) \in E \} \end{aligned}$$

(14)
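One way to compute $V^k$ and $E^k$ without enumerating paths is sketched below. It relies on the observation that, in a DAG, the longest path through a node $v$ is the longest path ending at $v$ followed by the longest path starting at $v$, so $v \in V^k$ exactly when the sum of those two lengths is at least $k$. The function name `lpig` and the toy graph are illustrative assumptions and not the authors' code.

```python
from collections import deque


def lpig(nodes, edges, k):
    """Return (V_k, E_k) for an unweighted DAG given as node and edge lists."""
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for x, y in edges:
        succ[x].append(y)
        indeg[y] += 1

    # Topological order via Kahn's algorithm.
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in succ[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")

    # into[v]: edges on the longest path ending at v (forward pass).
    into = {v: 0 for v in nodes}
    for u in order:
        for w in succ[u]:
            into[w] = max(into[w], into[u] + 1)

    # out[v]: edges on the longest path starting at v (backward pass).
    out = {v: 0 for v in nodes}
    for u in reversed(order):
        for w in succ[u]:
            out[u] = max(out[u], out[w] + 1)

    # v lies on a sufficiently long path iff into[v] + out[v] >= k.
    v_k = {v for v in nodes if into[v] + out[v] >= k}
    # Induced sub-graph: keep every original edge with both endpoints in V_k.
    e_k = {(x, y) for x, y in edges if x in v_k and y in v_k}
    return v_k, e_k


# Toy DAG: chain a -> b -> c -> d, a side entry e -> c, and a short chain f -> g.
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "c"), ("f", "g")]

v3, e3 = lpig(nodes, edges, k=3)
v2, e2 = lpig(nodes, edges, k=2)
print(sorted(v3), sorted(e3))
```

With $k = 3$ only the chain from a to d survives. Lowering the threshold to $k = 2$ also admits e, since the path from e through c to d has length 2. Because the sub-graph is induced, $E^k$ may contain edges that do not themselves lie on a sufficiently long path.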

Implementation of methods

The pipeline of our methods is as follows: (i) extraction of course catalog information to form the CPN graphs, (ii) construction of the Curriculum Graphs from the CPN and catalog data, (iii) computation of network centrality and importance measures on both types of networks, (iv) construction of the LPIG network from the CPN, and (v) clustering of the Curriculum Graphs. All parts of this pipeline have been implemented in Python, with Gephi additionally used to aid in the visualization and analyses of parts (iii) and (v).

For part (i), the Python libraries requests and beautifulsoup4 were used to extract each school’s course information from its official website and organize it into the adjacency list of its CPN. For the graph construction in parts (i), (ii), and (iv), we mainly used the Python NetworkX library (Hagberg et al. 2008). For part (iii), we used Gephi to compute degree and betweenness centrality distributions and NetworkX to compute other measures such as reach and transpose PageRank (Page et al. 1999). For part (v), we primarily used Gephi for the clustering computation and visualization, using modularity clustering based on the Louvain method (Blondel et al. 2008).

PageRank and modularity clustering are both parametrized methods. As prior work (Boldi et al. 2009; Page et al. 1999) indicates $\delta = 0.85$ to be an empirically reliable setting for the dampening factor of PageRank, that is the setting that we adopt. Regarding the resolution parameter for modularity clustering in Gephi, the default setting of $r = 1.0$ is commonly used and generally recommended when the number of desired clusters is unknown. That is the setting that we also adopt for our experiments in part (v).

Results

Comparison of basic network statistics

An overview of the basic network statistics is given in Table 1. This table also yields approximate institutional information about the number of courses and the number of majors, which correspond to the size of the CPN and the size of the CG, respectively. The size of the CPN is a lower bound on the actual number of distinct courses, as equivalent courses are merged into one node as described in Sect. 2.2.1. On the other hand, the size of the CG is an upper bound on the number of distinct majors, as it may include some codes for programs that are not currently majors. As the percentages of course equivalences and non-major codes are very low, the approximations provided by the CPN size and CG size are very near to the actual numbers of courses and majors, respectively. This table therefore captures the immediate variation in institutional size, with UIUC and CalTech standing out as the largest and smallest outliers, respectively.

Table 1 General data at a glance
