Drupal Migration Taxonomy Audit, Part 2: How to Analyze Your Data and Build a Migration-Ready Taxonomy Plan

A practical guide to exporting terms, spotting duplicates, and enforcing structure with SQL and spreadsheets
Drupal Migration Taxonomy Audit, Part 2:  How to Analyze Your Data and Build a Migration-Ready Taxonomy Plan

In Part 1 of this series, we established why planning your taxonomy cleanup before a Drupal 7 to Drupal 10 or 11 migration significantly reduces QA and rework. This second part of the series explains how to perform the taxonomy audit, interpret the findings, and express your decisions in the migration layer of the new Drupal system. Although these examples use Drupal 7 for illustration, the analysis process applies to any CMS that uses taxonomy or categorical metadata.

1. Exporting Your Taxonomy Data

The first step is to export a complete set of taxonomy data from Drupal 7. This includes vocabulary names, term IDs, term names, and usage counts. These exports feed the analysis workbook described later in this article.

Export all taxonomy terms with usage count


SELECT 
  v.name AS vocabulary,
  t.tid,
  t.name AS term,
  COUNT(r.nid) AS usage_count
FROM taxonomy_term_data t
LEFT JOIN taxonomy_index r ON t.tid = r.tid
JOIN taxonomy_vocabulary v ON v.vid = t.vid
GROUP BY t.tid
ORDER BY v.name, t.name;

DDEV and Drush version:


ddev drush sqlq "
SELECT 
  v.name AS vocabulary,
  t.tid,
  t.name AS term,
  COUNT(r.nid) AS usage_count
FROM taxonomy_term_data t
LEFT JOIN taxonomy_index r ON t.tid = r.tid
JOIN taxonomy_vocabulary v ON v.vid = t.vid
GROUP BY t.tid
ORDER BY v.name, t.name;
" > taxonomy-export.csv

Load this CSV into your taxonomy audit workbook. The next section explains that workbook and how it supports the full analysis workflow.

2. The Taxonomy Audit Workbook

The workbook provides a structured way to evaluate inconsistencies and plan your canonical taxonomy. It typically includes the following worksheets:

Vocabulary Summary

  • Vocabulary name
  • Number of terms
  • Terms with zero usage
  • Whether the vocabulary is flat or hierarchical

Normalization Worksheet

  • The original term
  • The problem identified, such as duplication or inconsistent naming
  • The recommended canonical term
  • The planned action, such as merge or retire

Mapping Worksheet

This sheet defines how old terms and vocabularies map to new canonical values. It later becomes a lookup table used in the Drupal 10 or 11 migration process.

Usage Heatmap

This worksheet highlights high-value and low-value terms. High usage indicates a term that should not change significantly. Low or zero usage indicates that a term may no longer be relevant.

3. SQL Toolkit for Taxonomy Analysis

The queries below form a complete taxonomy audit toolkit. Each query includes a description and guidance on how to interpret the results.

3.1 Vocabulary Overview


SELECT 
  v.vid,
  v.name AS vocabulary,
  COUNT(t.tid) AS term_count
FROM taxonomy_vocabulary v
LEFT JOIN taxonomy_term_data t ON t.vid = v.vid
GROUP BY v.vid
ORDER BY v.name;

3.2 Terms by Usage Count


SELECT 
  t.tid,
  t.name AS term,
  v.name AS vocabulary,
  COUNT(r.nid) AS usage_count
FROM taxonomy_term_data t
LEFT JOIN taxonomy_index r ON r.tid = t.tid
JOIN taxonomy_vocabulary v ON v.vid = t.vid
GROUP BY t.tid
ORDER BY usage_count DESC, term;

3.3 Unused Terms


SELECT 
  t.tid,
  t.name AS term,
  v.name AS vocabulary
FROM taxonomy_term_data t
JOIN taxonomy_vocabulary v ON v.vid = t.vid
LEFT JOIN taxonomy_index r ON t.tid = r.tid
WHERE r.tid IS NULL
ORDER BY v.name, term;

3.4 Similar Terms


SELECT
  t1.tid AS tid_1,
  t1.name AS term_1,
  t2.tid AS tid_2,
  t2.name AS term_2,
  v.name AS vocabulary
FROM taxonomy_term_data t1
JOIN taxonomy_term_data t2
  ON t1.vid = t2.vid
  AND t1.tid < t2.tid
JOIN taxonomy_vocabulary v ON v.vid = t1.vid
WHERE LEFT(t1.name, LENGTH(t1.name) - LOCATE(' ', REVERSE(t1.name))) =
      LEFT(t2.name, LENGTH(t2.name) - LOCATE(' ', REVERSE(t2.name)))
ORDER BY vocabulary, term_1, term_2;

3.5 Parent and Child Hierarchies


SELECT
  parent.tid AS parent_tid,
  parent.name AS parent_term,
  child.tid AS child_tid,
  child.name AS child_term,
  v.name AS vocabulary
FROM taxonomy_term_hierarchy h
JOIN taxonomy_term_data child ON child.tid = h.tid
LEFT JOIN taxonomy_term_data parent ON parent.tid = h.parent
JOIN taxonomy_vocabulary v ON v.vid = child.vid
ORDER BY vocabulary, parent_term, child_term;

3.6 Orphan Terms


SELECT
  h.tid AS child_tid,
  child.name AS child_term,
  h.parent AS missing_parent_tid,
  v.name AS vocabulary
FROM taxonomy_term_hierarchy h
JOIN taxonomy_term_data child ON child.tid = h.tid
JOIN taxonomy_vocabulary v ON v.vid = child.vid
LEFT JOIN taxonomy_term_data parent ON parent.tid = h.parent
WHERE h.parent IS NOT NULL
  AND parent.tid IS NULL
ORDER BY vocabulary, child_term;

3.7 Terms Used Only on Unpublished Content


SELECT
  t.tid,
  t.name AS term,
  v.name AS vocabulary,
  COUNT(n.nid) AS unpublished_usage
FROM taxonomy_term_data t
JOIN field_data_field_tags f
  ON f.field_tags_tid = t.tid
JOIN node n
  ON n.nid = f.entity_id
JOIN taxonomy_vocabulary v
  ON v.vid = t.vid
WHERE n.status = 0
GROUP BY t.tid, t.name, v.name
HAVING COUNT(n.nid) > 0
ORDER BY unpublished_usage DESC, term;

4. How to Analyze the Results

With your data exported and the SQL queries complete, the next step is to interpret the results and record decisions in your audit workbook.

4.1 Identify Duplicate Concepts

Use the similar terms query to find duplicate or inconsistent concepts. Assign one canonical term and record the mapping in the normalization worksheet.

4.2 Decide Which Terms to Retain or Retire

Use the usage count and unused terms worksheets. Terms with little or no usage may not be needed in the new site. Rarely used terms should be validated with content owners.

4.3 Normalize Naming Conventions

Apply naming rules such as singular nouns, consistent capitalization, and clear terminology. The normalization worksheet provides a controlled way to document this.

4.4 Evaluate Vocabulary Structure

Use the vocabulary overview and hierarchy queries to determine whether vocabularies should remain separate or be consolidated. Document your planned structure in the mapping worksheet.

5. When Cleanup Occurs and Why It Happens in the Migration Layer

This is a key point that often causes confusion. The taxonomy audit identifies what should change, but the changes themselves are not applied in Drupal 7. They are executed in the Drupal 10 or 11 migration layer.

  • It would require new Drupal 7 code that is immediately discarded after migration.
  • Directly modifying the Drupal 7 database is unsafe and risks corrupting the legacy system.

Instead:

  • All cleanup decisions are recorded in the audit workbook.
  • The Drupal 10 or 11 migration process applies those decisions while importing terms and nodes.
  • Migration mapping enforces canonical names, merges duplicate terms, retires unused terms, and consolidates vocabularies.

6. Creating the Canonical Manifest for Migration

Once your decisions are complete, represent them in a canonical manifest file. This file is used by the migration process to interpret old terms and apply the new taxonomy structure.


vocabularies:
  topic:
    canonical: "Topic"
    merge:
      - "Topics"
      - "Subject"
    retire:
      - "Deprecated Topic"

terms:
  "policy update":
    canonical: "Policy Update"
  "policy updates":
    canonical: "Policy Update"
  "test term 1":
    canonical: "Test Term"
  "test term 2":
    canonical: "Test Term"

7. Final Summary

The taxonomy audit consists of analysis, decision making, and documentation. Cleanup is possible on the Drupal 7 side, but it is not recommended because it requires new code in an obsolete environment and introduces risk to the legacy database. The correct and modern approach is to document all cleanup decisions, and then execute those changes in the migration process itself.

With the workbook complete and the manifest in place, your migration will produce a consistent and maintainable taxonomy structure while sharply reducing the amount of QA needed on the destination system.

Note: The vision of this web portal is to help promote news and stories around the Drupal community and promote and celebrate the people and organizations in the community. We strive to create and distribute our content based on these content policy. If you see any omission/variation on this please reach out to us at #thedroptimes channel on Drupal Slack and we will try to address the issue as best we can.

Related Organizations

Upcoming Events

Latest Opportunities