--- title: "Getting Started with bibnets" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with bibnets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(bibnets) ``` ## Introduction `bibnets` constructs bibliometric networks from scholarly metadata. It imports the export formats of the major bibliographic databases, converts them internally to a common tabular representation, and projects that representation into networks through a single function per network type. The package covers the standard constructions and adds several that are less commonly available: position-based attention weighting, aggregation of entities into higher-level networks, a range of counting and similarity weights, and temporal construction over time windows. ### Data import `bibnets` reads Scopus, Web of Science, OpenAlex, Lens.org, Dimensions, Crossref, BibTeX, and RIS exports. `read_biblio()` detects the format from file content and dispatches to the corresponding reader; all readers return an identical schema, so records from different databases can be combined without manual reconciliation. Multi-valued fields — authors, references, and keywords — are parsed into list-columns. A data frame already in this form is used directly, by naming the relevant column, without a reader. ### Network builders Dedicated builders construct co-authorship, co-citation, bibliographic coupling, keyword co-occurrence, direct citation, and historiograph networks; a generic builder covers other projections. The builders share one interface and return the same edge list, so the network type is determined by the function name. ### Weighting and aggregation Counting methods determine each publication's contribution to an edge. 
They range from full and fractional counting to position-aware schemes (harmonic, geometric, golden-ratio, first, last, and first–last), and five similarity measures (cosine, association strength, Jaccard, inclusion, and equivalence) rescale the projected weights. Attention weighting assigns the authors of a publication positional weights that sum to one across the byline, so credit is distributed by byline position rather than equally and a large author list does not dominate the network. Aggregation pools the references or members of a group to construct collaboration and coupling networks among countries, institutions, or sources rather than individuals. Temporal construction applies any builder over fixed, sliding, or cumulative windows, and a disparity-filter backbone retains the edges that are significant relative to each node's strength.

### Implementation

The incidence matrix is stored as a sparse `dgCMatrix` and projected with `crossprod()` or `tcrossprod()`; edges are extracted without forming a dense node-by-node matrix, so memory scales with the number of non-zero co-occurrences rather than with the square of the number of nodes. The package imports only Matrix, stats, and utils.

### Export formats

Constructed networks are exported to igraph, tidygraph, cograph, Gephi, GraphML, and sparse-matrix representations.

### Output

Every builder returns a **`bibnets_network`**: a tidy data frame with four columns:

- `from`, `to`: the two endpoints of an edge,
- `count`: the raw binary co-occurrence count for that pair,
- `weight`: the analytical weight after counting and optional similarity normalization.

With `counting = "full"` and `similarity = "none"`, `weight` equals `count`. They diverge once fractional counting or a similarity measure is applied.

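To make the difference between `count` and `weight` concrete, the base-R sketch below projects a tiny paper-by-author incidence matrix by hand. It assumes the pairwise fractional rule in which a paper with `n` authors adds `1 / (n - 1)` to each of its pairs. The data and object names are illustrative, and this is not the package's internal code.

```r
# Four hypothetical papers and their bylines (illustrative data only).
bylines <- list(
  p1 = c("Smith J", "Doe A", "Lee K"),
  p2 = c("Smith J", "Lee K"),
  p3 = c("Doe A", "Lee K"),
  p4 = c("Smith J", "Doe A")
)

authors <- sort(unique(unlist(bylines)))

# Binary paper-by-author incidence matrix (rows = papers, columns = authors).
B <- t(vapply(bylines, function(a) as.numeric(authors %in% a),
              numeric(length(authors))))
colnames(B) <- authors

# Full counting: every shared paper adds 1 to the pair.
count_mat <- crossprod(B)

# Fractional counting (assumed rule): a paper with n authors adds 1 / (n - 1)
# to each of its pairs. Every paper here has at least two authors.
n_auth <- rowSums(B)
weight_mat <- crossprod(B * (1 / (n_auth - 1)), B)

# Keep each unordered pair once (upper triangle, diagonal excluded).
idx <- which(upper.tri(count_mat), arr.ind = TRUE)
edges <- data.frame(
  from   = authors[idx[, "row"]],
  to     = authors[idx[, "col"]],
  count  = count_mat[idx],
  weight = weight_mat[idx],
  stringsAsFactors = FALSE
)
edges
```

Each pair shares two papers, so `count` is 2 throughout. Under the assumed fractional rule the three-author paper contributes only 0.5 to each of its pairs, which brings `weight` down to 1.5.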
The builders, at a glance: | Function | Nodes | An edge means | |---|---|---| | `author_network()` | authors | co-authorship, author coupling, or co-citation | | `reference_network()` | cited references | two references cited together | | `document_network()` | documents | shared references, shared citers, or direct citation | | `keyword_network()` | keywords | two keywords appear together | | `source_network()` | journals | sources share references or are co-cited | | `country_network()` | countries | countries collaborate or share references | | `institution_network()` | institutions | institutions collaborate or share references | | `conetwork()` | any field | entities co-occur, or share values of another field | | `local_citations()` | documents | within-corpus citation counts | | `historiograph()` | documents | directed citation history among top-cited papers | | `temporal_network()` | any builder's nodes | the same network over time windows | ## Quick start You do **not** need a special reader. Any data frame with one row per paper works — point a builder at the column that holds the entity and tell it the delimiter: ```{r quick-df} papers <- data.frame( `Author Names` = c("Smith J, Doe A, Lee K", "Smith J, Lee K", "Doe A, Lee K", "Smith J, Doe A"), check.names = FALSE ) author_network(papers, authors = "Author Names", sep = ",") ``` If your data is a scholarly export instead, read it first — the format is detected from the file content — then build with the defaults: ```{r quick-reader, eval = FALSE} data <- read_biblio("scopus.csv") authors <- author_network(data, type = "collaboration") ``` Either way the result is the same four-column edge list, ready to inspect, prune, or export. 
## Reading your own data ### Scholarly exports `read_biblio()` accepts a file, a folder, or several files, and detects Scopus, Web of Science, OpenAlex, BibTeX, RIS, Lens.org, and Dimensions from the content: ```{r read-files, eval = FALSE} data <- read_biblio("export.csv") data <- read_biblio("folder_with_exports/") data <- read_biblio(c("part_1.csv", "part_2.csv")) ``` The format-specific readers can also be called directly (`read_scopus()`, `read_wos()`, `read_openalex_csv()`, `read_dimensions()`, `read_lens()`, `read_bibtex()`, `read_ris()`). ### A custom CSV For a CSV that matches no known export, map each source column onto a standard field **by name** — `authors`, `keywords`, `references`, `countries`, `affiliations`, or `journal`. Naming any of them reads the file as a generic CSV, so you do not pass `format` yourself: ```{r read-generic, eval = FALSE} data <- read_biblio( "custom.csv", id = "paper_id", authors = "Author Names", keywords = "Tags", sep = "," ) ``` Each mapped column is split on `sep` into the standard list-column, so afterwards every builder works with its defaults. ### A plain data frame, directly As the quick start showed, you can skip the reader entirely and let the builder split a column for you. The same column arguments are available on every builder: ```{r read-direct, eval = FALSE} author_network(my_df, authors = "Author Names", sep = ",") keyword_network(my_df, keywords = "Tags", sep = ",") ``` The work identifier is the `id` column. You need not supply one: when no `id` column is present each row is treated as one document; pass `id = "paper_id"` to use a differently-named column. Surrounding quotes are stripped by default (`strip_quotes = TRUE`), and in a coupling network the references column takes its own `references_sep`. The companion `vignette("reading-data")` covers every reader and these options in full. 
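The splitting step described above can be pictured with a few lines of base R. This is only a sketch of the idea, not the package's parser: `split_field()` is a hypothetical helper that splits on a delimiter, trims whitespace, strips surrounding quotes, and drops empty entries.

```r
# Hypothetical helper: turn a delimited character column into a list-column.
split_field <- function(x, sep = ",", strip_quotes = TRUE) {
  parts <- strsplit(x, sep, fixed = TRUE)
  lapply(parts, function(p) {
    p <- trimws(p)
    # Remove quotes that wrap an entry, then trim again.
    if (strip_quotes) p <- gsub("^[\"']+|[\"']+$", "", p)
    p <- trimws(p)
    p[nzchar(p)]
  })
}

# Illustrative data frame with one row per paper.
my_df <- data.frame(
  paper_id = c("p1", "p2"),
  `Author Names` = c("\"Smith J\", Doe A", "Lee K"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One character vector per row: a list-column.
my_df$authors <- split_field(my_df[["Author Names"]], sep = ",")
str(my_df$authors)
```

Each row ends up holding a character vector of its own length, which is the shape the builders expect for multi-valued fields.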
### The standard schema Readers return a common set of columns: ```{r schema} data(scopus_quantum_cloud) sc <- scopus_quantum_cloud names(sc)[1:12] ``` The columns that matter for network construction are `id`, the list-columns `authors` / `references` / `keywords`, and `year` (used by `temporal_network()`). Source-specific extras such as `countries`, `affiliations`, and `keywords_plus` are kept when available. ## Datasets used here ```{r data} data(biblio_data) data(learning_analytics) small <- biblio_data # tiny, synthetic oa <- learning_analytics # 1,508 OpenAlex records on learning analytics c(small = nrow(small), scopus = nrow(sc), openalex = nrow(oa)) ``` ## Author collaboration Two authors are linked when they appear on the same paper: ```{r author-basic} authors <- author_network(oa, type = "collaboration") head(authors, 5) summary(authors) ``` Use `min_occur` to drop rare authors before projection: ```{r author-minoccur} nrow(author_network(oa, type = "collaboration")) nrow(author_network(oa, type = "collaboration", min_occur = 2)) ``` ### Counting methods `counting` controls how much each paper contributes to an edge: ```{r counting} head(author_network(small, type = "collaboration", counting = "full"), 3) head(author_network(small, type = "collaboration", counting = "fractional"), 3) head(author_network(small, type = "collaboration", counting = "harmonic"), 3) head(author_network(small, type = "collaboration", counting = "first_last"), 3) ``` The available methods differ in how they weight the rows or positions before projection: | Method | What it does | Trade-off | When to use | |---|---|---|---| | `"full"` | Leaves the binary incidence matrix unchanged; for positional author weights, every listed entity receives weight 1. | Large teams or long lists create many full-strength pairs. | Use for raw event counts where every observed co-occurrence should count equally. 
| | `"fractional"` | For symmetric networks, each row contributes `1 / (n - 1)` to pairs when `n > 1`; for coupling it uses `1 / n`; positional use gives each entity `1 / n`. | Reduces large-list dominance but treats all positions equally. | Use when each paper or reference list should have limited influence and position is not meaningful. | | `"paper"` | For symmetric networks, each paper's pair budget is scaled by `2 / (n * (n - 1))`; for coupling it uses `1 / n`. | Normalizes at the paper level, so very large and very small papers can contribute comparable total pair mass. | Use when publications, rather than individual author/entity pairs, should be the main unit of contribution. | | `"strength"` | Multiplies entity columns by the square root of inverse document frequency, `sqrt(log(n_works / entity_frequency))`; row-size scaling for coupling is deferred to projection. | Downweights ubiquitous entities and emphasizes rarer shared entities; values are less like direct counts. | Use for coupling or profile similarity where common references, keywords, or entities should carry less evidence. | | `"harmonic"` | Uses positional weights proportional to `1 / position`, normalized to sum to one. | Strongly favors early positions while still giving every later position some credit. | Use when author order matters and early authorship should dominate without excluding later authors. | | `"arithmetic"` | Uses a linear decline from first to last, proportional to `n - position + 1`, normalized. | Gives a gentler first-author advantage than geometric methods. | Use when byline order matters but credit should decrease steadily rather than sharply. | | `"geometric"` | Uses weights proportional to `0.5^(position - 1)`, normalized. | Concentrates credit heavily at the front of the byline. | Use when the first few positions are expected to carry most of the contribution. 
| | `"adaptive_geometric"` | Uses a geometric sequence normalized so the first-to-last weight ratio equals `n` (`2/3`, `1/3` for two authors). | Adapts the steepness to team size, making long bylines more front-loaded. | Use when first-author emphasis should increase with the number of authors. | | `"golden"` | Uses golden-ratio decay, proportional to `phi^(n - position)`, normalized. | More front-loaded than arithmetic but less abrupt than fixed halving. | Use as a moderate positional decay when author order matters but geometric halving is too strong. | | `"first"` | Gives weight 1 to the first position and 0 to all others. | Ignores all non-first contributors. | Use for strict first-author analyses. | | `"last"` | Gives weight 1 to the last position and 0 to all others. | Ignores all non-last contributors. | Use where last authorship represents the analytical role of interest, such as senior or PI credit. | | `"first_last"` | With two authors, assigns `0.5` and `0.5`; otherwise gives first and last authors an elevated weight and middle authors a baseline weight, all normalized. | Highlights both endpoints while still retaining middle-author credit. | Use in fields where first and last positions have distinct credit or leadership meanings. | | `"position_weighted"` | Uses the supplied `position_weights` vector, extending the last value to longer bylines, then normalizes. | Puts the burden of choosing defensible weights on the analyst. | Use when you have field-specific or study-specific positional weights. | ### Attention weights Standard bibliometric co-authorship networks treat every byline position as equivalent: a first author who conceived and drove the work is weighted identically to a fifteenth contributor who provided a single instrument reading. On hyper-authored papers this produces dense, low-meaning co-authorship ties that drown out the focused two- or three-author collaborations that often signal the sharpest intellectual kinship. 
The attention weighting feature in `bibnets` is designed to correct this. The name is an analogy to the attention mechanism in large language models: just as a transformer assigns a normalized probability distribution across the tokens in a sequence, concentrating weight on what matters and spreading it thin over the rest, `bibnets` assigns the authors of a paper positional weights that sum to one across the full byline. A fifty-author paper therefore contributes exactly one unit of connection budget, the same as a two-author paper. The distribution of that budget reflects authorship conventions: `"lead"` concentrates weight on the first author, `"last"` on the senior or PI position, `"proximity"` rewards the central authors, and `"circular"` rewards both ends jointly. The weights are a fixed positional prior, not learned content-based attention, but they encode common authorship conventions. To activate them, pass `attention = "lead"` (or any of the three alternatives) to any of the author, keyword, country, or institution network functions.

| `attention` | Weight vector | Scholarly assumption | When it fits |
|---|---|---|---|
| `"lead"` | Quadratic drop from the first position: the first position has raw weight `n^2`, inner positions decline as the byline advances, and the last position has raw weight `1`, then all weights are normalized. | The lead author is the main intellectual driver. | Use in first-author-oriented fields or questions about lead contribution. |
| `"last"` | Quadratic rise to the last position: the first position has raw weight `1`, inner positions rise across the byline, and the last position has raw weight `n^2`, then all weights are normalized. | The last author represents senior, supervisory, or PI contribution. | Use in disciplines where last authorship marks lab leadership or supervision. 
| | `"proximity"` | Pyramid profile using `min(position, n + 1 - position)`: first and last positions have raw weight `1`, inner positions increase toward the middle, and central positions are highest. | Central byline positions deserve the most attention. | Use when the question treats middle-position contributors as the focal group. | | `"circular"` | Edge profile using `max(position, n + 1 - position)`: first and last positions have the largest raw weight, while inner positions decline toward the center. | Both ends of the byline are prominent. | Use where lead and senior positions jointly matter more than middle positions. | `attention` applies a smooth positional profile instead of a named counting scheme (available for author, keyword, country, and institution networks): ```{r attention} head(author_network(small, attention = "lead"), 3) ``` ## Reference co-citation Two references are linked when a paper cites both: ```{r cocitation} refs <- reference_network(sc, min_occur = 2) head(refs, 5) ``` A similarity measure offsets the advantage of very frequently cited works: ```{r cocitation-cosine} head(reference_network(sc, min_occur = 2, similarity = "cosine"), 3) ``` ## Document coupling and citation Coupling links two documents that share cited references: ```{r coupling} head(document_network(sc, type = "coupling", similarity = "cosine"), 5) ``` Direct citation is directed — `from` cites `to` — and only within the corpus (the cited work must also be a row in the data): ```{r citation} head(document_network(sc, type = "citation"), 5) ``` ## Keyword co-occurrence ```{r keywords} kw <- keyword_network(sc, min_occur = 2) head(kw, 5) ``` Labels are trimmed and upper-cased during construction, so `machine learning`, `Machine Learning`, and ` MACHINE LEARNING ` are one node. 
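The label normalization just described amounts to trimming and upper-casing before nodes are matched. A minimal base-R illustration follows; the vector is made up, and the package may apply further cleaning that is not shown here.

```r
# Three spellings of the same keyword (illustrative input).
labels <- c("machine learning", "Machine Learning", " MACHINE LEARNING ")

# Trim surrounding whitespace, then upper-case.
normalized <- toupper(trimws(labels))
unique(normalized)
```

After normalization the three spellings collapse to a single label, so they map to one node.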
Association strength is a common choice for co-occurrence maps: ```{r keywords-assoc} head(keyword_network(sc, min_occur = 2, similarity = "association"), 3) ``` ## Countries, institutions, and sources ```{r geo} head(country_network(oa, counting = "fractional"), 5) head(institution_network(oa, counting = "fractional", min_occur = 2), 5) head(source_network(sc, type = "coupling", min_occur = 2), 5) ``` For coupling networks, `min_occur` is applied to the aggregated entity before the network is built. ## Generic co-networks `conetwork()` covers projections without a dedicated wrapper. One field links entities that co-occur; a second field (`by`) links them through a shared value: ```{r conetwork} head(conetwork(sc, "keywords", min_occur = 2), 3) head(conetwork(sc, "authors", by = "keywords", min_occur = 2), 3) ``` The second result links authors through shared keywords — a thematic similarity network, not a co-authorship one. ## Normalization The same raw counts support different similarity scores; only `weight` changes, `count` does not: ```{r normalize} none <- keyword_network(sc, min_occur = 2, similarity = "none") cos <- keyword_network(sc, min_occur = 2, similarity = "cosine") head(none[, c("from", "to", "weight", "count")], 3) head(cos[, c("from", "to", "weight", "count")], 3) ``` `normalize()` uses the diagonal of the projected matrix as each node's total occurrence count: | Similarity | Denominator | Meaning | When to use | |---|---|---|---| | `"none"` | No denominator; the projected matrix is returned as raw weighted co-occurrence, with the diagonal removed by the network builder unless self-loops are requested. | `weight` stays on the same scale as the counted projection. | Use when absolute co-occurrence or counted edge strength is the quantity of interest. | | `"cosine"` | Square root of the product of the two node totals. | Symmetric size correction; pairs are high when their overlap is large relative to both nodes' frequencies. 
| Use as a general-purpose correction for very frequent nodes while preserving a familiar similarity scale. |
| `"association"` | Product of the two node totals. | Symmetric association-strength normalization; strongly penalizes pairs involving very frequent nodes. | Use for co-occurrence maps where you want rare, unexpectedly tight pairings to stand out. |
| `"jaccard"` | Sum of the two node totals minus their observed edge value. | Symmetric overlap over a union-like total. | Use when the edge should represent shared occurrence as a share of the two nodes' combined footprint. |
| `"inclusion"` | The smaller of the two node totals. | Symmetric containment-oriented score; it reaches high values when the smaller node mostly appears with the larger one. | Use when subset or specialization relationships are more important than balanced overlap. |
| `"equivalence"` | Product of the two node totals, with the edge value squared before division. | The square of the cosine score, so weak or occasional overlap is penalized more strongly. | Use when following equivalence-index conventions or when only consistently paired nodes should remain strong. |

## Reducing large networks

```{r reduce}
edges <- author_network(oa, type = "collaboration")
c(all       = nrow(edges),
  threshold = nrow(prune(edges, threshold = 2)),
  top_n     = nrow(prune(edges, top_n = 5)),
  top_nodes = nrow(filter_top(edges, n = 50)))
```

- `prune(threshold = x)`: absolute edge-weight cutoff.
- `prune(top_n = k)`: keep each node's `k` strongest edges.
- `filter_top(n = k)`: keep edges among the `k` most-connected nodes.

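The difference between a global cutoff and a per-node rule is easiest to see on a toy edge list. The helpers below are hypothetical stand-ins written in base R, not the package's `prune()`. They assume that the threshold is inclusive and that an edge survives the per-node rule if it ranks among the `k` strongest edges of at least one endpoint.

```r
# Illustrative edge list (made-up weights).
edges <- data.frame(
  from   = c("A", "A", "A", "B", "C"),
  to     = c("B", "C", "D", "C", "D"),
  weight = c(5, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Global rule (assumed inclusive): keep edges whose weight reaches the cutoff.
keep_threshold <- function(e, threshold) {
  e[e$weight >= threshold, , drop = FALSE]
}

# Per-node rule (assumed "union" semantics): an edge is kept if it is among
# the k strongest edges of at least one of its endpoints.
keep_top_n <- function(e, k) {
  nodes <- unique(c(e$from, e$to))
  keep <- logical(nrow(e))
  for (v in nodes) {
    incident <- which(e$from == v | e$to == v)
    strongest <- incident[order(e$weight[incident], decreasing = TRUE)]
    keep[head(strongest, k)] <- TRUE
  }
  e[keep, , drop = FALSE]
}

keep_threshold(edges, threshold = 3)
keep_top_n(edges, k = 1)
```

With these toy weights the global cutoff keeps three edges, while the per-node rule with `k = 1` keeps only the A–B and C–D edges, because each node's single strongest tie is one of those two.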
`backbone()` applies the disparity filter, which keeps edges that are strong relative to a node's local strength distribution — not a global cutoff: ```{r backbone} bb <- backbone(edges, alpha = 0.05) nrow(bb) ``` ## Temporal networks `temporal_network()` runs any builder over time windows (fixed, sliding, or cumulative): ```{r temporal} tn <- temporal_network(oa, author_network, "collaboration", window = 3) names(tn) ``` Each window's edge list carries a `window` column. Windows with fewer than two records, or no surviving edges, are dropped; a builder error inside a window becomes a warning labelled with that window. ## Local citations and historiographs `local_citations()` counts how often each document is cited by others in the same corpus; `historiograph()` builds the directed citation graph among the top-cited documents: ```{r historiograph} head(local_citations(sc), 5) h <- historiograph(sc, n = 10) h$nodes head(h$edges, 5) ``` Both require reference strings or IDs to match document IDs in the data; if the cited works are external, local counts stay low. ## Author-name normalization `parse_names()` reorders and splits author names (it recognizes `Last, First`, `SURNAME Initials`, and `First Last`). Because node identity is fixed when a network is built, normalize *before* building so that two spellings of one author merge: ```{r parse-names} parse_names(c("Saqr, Mohammed", "WANG Y", "Mohammed Saqr")) ``` See `vignette("parsing-author-names")` for the full treatment. ## Exporting The edge list is already usable; converters cover the common targets: ```{r export} edges <- keyword_network(sc, min_occur = 2) m <- to_matrix(edges) # sparse adjacency matrix m[1:4, 1:4] gephi <- to_gephi(edges) # Gephi node/edge tables head(gephi$edges, 3) cat(substr(to_graphml(edges), 1, 200)) # GraphML, no XML dependency ``` `to_igraph()`, `to_tbl_graph()`, and `to_cograph()` are available when their (suggested) packages are installed. 
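If none of the suggested packages is installed, an adjacency matrix can also be assembled by hand from the `from`, `to`, and `weight` columns. The sketch below uses base R and a made-up edge list. It builds a dense matrix for readability, whereas `to_matrix()` is described above as returning a sparse one.

```r
# Illustrative undirected edge list (made-up values).
edges <- data.frame(
  from   = c("A", "A", "B"),
  to     = c("B", "C", "C"),
  weight = c(2, 1, 3),
  stringsAsFactors = FALSE
)

nodes <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))

# A two-column character matrix indexes by row and column names.
# Fill both triangles so the matrix is symmetric.
adj[cbind(edges$from, edges$to)] <- edges$weight
adj[cbind(edges$to, edges$from)] <- edges$weight
adj
```

For a directed network such as direct citation, only the first assignment would apply, since `from` cites `to` and the reverse entry should stay zero.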
## Reading a `bibnets_network` The object records how it was built, as attributes: ```{r attrs} edges <- author_network(oa, type = "collaboration", counting = "harmonic") c(type = attr(edges, "network_type"), counting = attr(edges, "counting"), sim = attr(edges, "similarity")) summary(edges) ``` `print()` reports the network type, node and edge counts, and the counting and similarity methods — so a saved edge list always says how it was made. ## References The methodology implemented in `bibnets` is described in: López-Pernas, S., Saqr, M., & Apiola, M. (2023). Scientometrics: A Concise Introduction and a Detailed Methodology for Mapping the Scientific Field of Computing Education Research. In M. Apiola, S. López-Pernas, & M. Saqr (Eds.), *Past, Present and Future of Computing Education Research: A Global Perspective* (pp. 79–99). Springer Nature Switzerland AG. Saqr, M., López-Pernas, S., Conde, M. Á., & Hernández-García, Á. (2024). Social Network Analysis: A primer, a guide and a tutorial in R. In M. Saqr & S. López-Pernas (Eds.), *Learning Analytics Methods and Tutorials: A Practical Guide Using R* (pp. 491–518). Springer, Cham.