Module 3: Analyzing text content with natural language processing

AI-aided content analysis of sustainability communication

Lesson 3.3: Interpreting the results of NLP analysis

Lecture text

Quantitative content analysis

Quantitative content analysis systematically evaluates content features in texts, such as the frequency of themes, sentiment, or specific terms, to derive measurable insights. These features can be measured at the text level (e.g., overall tone or narrative structure) or at the word level (e.g., keyword counts or lexical diversity). Traditionally, human coders offer flexibility in interpreting nuanced meanings but face limits on scalability due to time and resource constraints. Conversely, AI-aided coding, leveraging natural language processing (NLP), sacrifices some interpretive flexibility for enhanced scalability, enabling rapid analysis of large datasets. This approach excels at identifying patterns, such as sustainability-related terms across corporate reports. Quantitative content analysis complements qualitative content analysis by providing numerical rigor to validate or challenge qualitative interpretations. For instance, while qualitative analysis might explore the context of sustainability claims, quantitative methods count their occurrences or measure their prominence, offering a robust foundation for comparing communication strategies across organizations. Integrating both approaches ensures a comprehensive understanding of sustainability communication, balancing depth with breadth in analyzing textual data.
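As a minimal, hypothetical sketch of word-level counting in Python (the category keywords and the example sentence are illustrative assumptions, not part of the course corpus):

from collections import Counter
import re

# Hypothetical content categories and keywords; a real study would take these from a codebook.
CATEGORIES = {
    "sustainability": {"sustainable", "renewable", "emissions", "climate"},
    "innovation": {"innovation", "technology", "efficiency"},
}

text = "Our renewable portfolio grew while emissions fell, driven by efficiency and new technology."
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

# Aggregate word-level counts into content-category counts.
category_counts = {cat: sum(counts[w] for w in words) for cat, words in CATEGORIES.items()}
print(category_counts)  # {'sustainability': 2, 'innovation': 2}

Counting against an explicit keyword dictionary keeps the measurement transparent and reproducible, which is what gives the quantitative results their rigor.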

Operationalizing sustainability

Operationalizing sustainability in NLP analysis involves defining metrics to distinguish authentic sustainability communication from greenwashing, addressing the research question of communication integrity. Authentic sustainability communication is characterized by specific, measurable indicators, such as detailed environmental impact metrics or commitments to renewable energy. In contrast, greenwashing may feature vague terms, exaggerated claims, or lack of verifiable data. NLP techniques like named entity recognition (NER) identify key entities (e.g., organizations, policies) and their relationships, revealing networks of accountability or obfuscation. Part-of-speech (POS) analysis further dissects texts by examining nouns (e.g., “emissions”), verbs (e.g., “reduce”), and adjectives (e.g., “sustainable”), which signal intent or emphasis. For example, authentic communication might use precise verbs like “implemented” rather than ambiguous ones like “aimed.” By quantifying these linguistic elements, researchers can systematically evaluate the credibility of sustainability claims. This approach enables a structured comparison of corporate narratives, ensuring that sustainability communication is not only rhetorically compelling but also substantively grounded in actionable and transparent practices.
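The sketch below illustrates how NER and POS tagging could be run with spaCy, assuming the small English pipeline en_core_web_sm is installed; the sentence is an invented example, not a quote from any company report:

import spacy

# Assumes the model has been installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("In 2023, Vattenfall implemented a 40% reduction in CO2 emissions from its operations.")

# Named entities: organizations, dates, percentages, and similar.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Nouns, verbs, and adjectives carry most of the signal discussed above.
for token in doc:
    if token.pos_ in {"NOUN", "VERB", "ADJ"}:
        print(token.text, token.pos_, token.lemma_)

From output like this, one could count precise verbs such as “implemented” against vaguer ones such as “aimed” to score the specificity of a claim.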

Comparison across organizations

Comparing sustainability communication across organizations, such as Preem (fossil fuel) and Vattenfall (renewable energy), reveals distinct rhetorical strategies shaped by their operational contexts. Fossil fuel companies like Preem may emphasize mitigation efforts, such as carbon capture, to counter environmental criticism, while renewable energy firms like Vattenfall might highlight innovation and clean energy achievements. These differences manifest in word choice, thematic focus, and narrative tone, detectable through NLP analysis. For instance, Preem’s texts might feature terms like “efficiency” or “transition,” whereas Vattenfall’s could prioritize “renewable” or “zero-emission.” Quantitative analysis of these features, such as term frequency or sentiment scores, allows researchers to map organizational priorities and assess alignment with sustainability goals. Expected differences also stem from public perception pressures: fossil fuel companies face greater scrutiny, potentially leading to defensive or compensatory messaging. By systematically comparing these patterns, NLP analysis provides evidence-based insights into how organizational type influences sustainability communication, enabling stakeholders to evaluate authenticity and strategic intent across diverse energy sectors.
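A minimal sketch of such a term-frequency comparison, with counts normalized per 1,000 tokens so that reports of different lengths remain comparable (the texts and keyword list below are placeholders, not the actual corpora):

import re

# Placeholder snippets standing in for the full report corpora of each organization.
texts = {
    "Preem": "We improve efficiency and support the transition through carbon capture.",
    "Vattenfall": "Our renewable, zero-emission portfolio keeps growing every year.",
}
keywords = ["efficiency", "transition", "renewable", "zero-emission"]

for org, text in texts.items():
    tokens = re.findall(r"[a-z-]+", text.lower())
    # Frequency per 1,000 tokens controls for document length.
    rates = {kw: round(1000 * tokens.count(kw) / len(tokens), 1) for kw in keywords}
    print(org, rates)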

Summarizing results of text analysis

Summarizing NLP results transforms complex token-level analysis dataframes, which detail individual words or entities, into accessible insights. These raw dataframes, often dense with columns like token frequency or entity type, are challenging to interpret directly. Generating a new dataframe with summary statistics simplifies this by aggregating key metrics, such as content category counts (e.g., “sustainability” vs. “innovation” mentions) or sentiment measurements. The dependent variable, such as the frequency of a content category, is analyzed against independent variables like word type (e.g., nouns vs. adjectives) or organizational type (e.g., fossil fuel vs. renewable). For example, a summary dataframe might reveal that renewable energy firms use more positive adjectives than fossil fuel companies. This aggregated view highlights trends and differences without overwhelming stakeholders with granular data. By focusing on high-level patterns, summarizing ensures that findings are actionable, facilitating communication of results to diverse audiences, from researchers to corporate decision-makers, while retaining the analytical rigor of the original NLP process.
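One way to build such a summary dataframe with pandas is sketched below on an invented token-level table; the column names and values are assumptions chosen for illustration:

import pandas as pd

# Hypothetical token-level output of the NLP pipeline (one row per tagged token).
tokens = pd.DataFrame({
    "organization": ["Preem", "Preem", "Vattenfall", "Vattenfall", "Vattenfall"],
    "token": ["efficiency", "emissions", "renewable", "sustainable", "renewable"],
    "pos": ["NOUN", "NOUN", "ADJ", "ADJ", "ADJ"],
    "category": ["innovation", "sustainability", "sustainability", "sustainability", "sustainability"],
    "sentiment": [0.1, -0.2, 0.6, 0.5, 0.6],
})

# Dependent variable: category frequency; independent variable: organization.
summary = (
    tokens.groupby(["organization", "category"])
          .agg(count=("token", "size"), mean_sentiment=("sentiment", "mean"))
          .reset_index()
)
print(summary)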

Select, filter, aggregate

Selecting, filtering, and aggregating data are critical steps in refining NLP datasets for meaningful interpretation. Selecting relevant columns, such as the token or entity text (e.g., “Preem” or “carbon”) or part-of-speech tags (e.g., “adjective”), focuses the analysis on variables pertinent to sustainability communication. Filtering removes irrelevant or incomplete data, such as rows with null values or tokens below a minimum frequency threshold, ensuring data quality. For instance, excluding low-count tokens reduces noise from rare or insignificant terms. Aggregation then computes summary metrics, such as category counts (e.g., frequency of “renewable” mentions) or measurement means (e.g., average sentiment score) per organization. This process might reveal, for example, that Vattenfall’s texts contain higher counts of “sustainability” than Preem’s. By systematically narrowing and synthesizing the dataset, these steps transform raw NLP outputs into structured insights. This enables researchers to address specific research questions, such as comparing authentic sustainability claims, with clarity and precision, supporting robust conclusions about organizational communication strategies.
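A pandas sketch of the three steps, again on a small invented dataframe; the minimum-frequency threshold of 2 is an arbitrary assumption for illustration:

import pandas as pd

# Hypothetical token-level dataframe with one row per token occurrence.
tokens = pd.DataFrame({
    "organization": ["Preem", "Preem", "Vattenfall", "Vattenfall", "Vattenfall", "Vattenfall"],
    "token": ["efficiency", None, "renewable", "renewable", "sustainability", "wind"],
    "pos": ["NOUN", "NOUN", "ADJ", "ADJ", "NOUN", "NOUN"],
})

# 1. Select the columns relevant to the research question.
selected = tokens[["organization", "token", "pos"]]

# 2. Filter: drop incomplete rows, then drop tokens below the minimum frequency threshold.
filtered = selected.dropna(subset=["token"])
freq = filtered["token"].value_counts()
filtered = filtered[filtered["token"].isin(freq[freq >= 2].index)]

# 3. Aggregate: count the remaining tokens per organization and word type.
aggregated = (
    filtered.groupby(["organization", "pos"])["token"]
            .count()
            .rename("count")
            .reset_index()
)
print(aggregated)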

Visualizing results of text analysis

Data visualization enhances the interpretability of NLP results, offering a more intuitive alternative to summary tables. By representing complex data—such as term frequencies or sentiment scores—visually, charts and graphs make patterns immediately apparent. Options include bar plots, word clouds, or heatmaps, each suited to different insights (e.g., word clouds for keyword prominence, heatmaps for correlations). Simple visualizations, like bar plots comparing sustainability term counts across organizations, are often more effective than complex ones, as they avoid overwhelming viewers. AI-aided data analysis streamlines visualization by automating data processing and integrating tools like Python’s Matplotlib or Seaborn, reducing manual effort. For instance, a bar plot might vividly contrast Preem’s focus on “efficiency” with Vattenfall’s on “renewable energy,” making differences accessible to non-technical stakeholders. Visualizations thus bridge the gap between raw NLP outputs and actionable insights, enabling researchers to communicate findings effectively while highlighting key trends in sustainability communication with clarity and impact.
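A minimal matplotlib sketch of such a bar plot; the counts are invented placeholders, not results from the actual reports:

import matplotlib.pyplot as plt

# Hypothetical counts of sustainability-related terms per organization.
organizations = ["Preem", "Vattenfall"]
term_counts = [34, 78]

fig, ax = plt.subplots()
ax.bar(organizations, term_counts, color=["grey", "seagreen"])
ax.set_ylabel("Sustainability-related term count")
ax.set_title("Sustainability terms per organization (illustrative data)")
plt.tight_layout()
plt.show()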

Stacked bar plots

Stacked bar plots are an effective tool for visualizing NLP results, particularly for comparing sustainability communication across organizations. For a single organization, simple bar plots can display metrics like the frequency of content categories (e.g., “sustainability” vs. “innovation”). Stacked bar plots extend this to bivariate or multivariate analysis by showing multiple variables within each bar, such as the proportion of nouns, verbs, and adjectives in sustainability-related texts across organizations like Preem and Vattenfall. Each segment of the bar represents a variable (e.g., word type), with the total bar height indicating overall frequency. This approach highlights differences, such as Vattenfall’s higher use of positive adjectives compared to Preem’s noun-heavy focus on “emissions.” Stacked bar plots thus provide a clear, comparative view of complex data, making it easier to identify patterns and trends. By visualizing multivariate relationships intuitively, they support researchers in communicating nuanced findings about organizational sustainability narratives to diverse audiences, from academics to industry stakeholders.
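A short sketch of a stacked bar plot with pandas and matplotlib, using invented word-type counts to stand in for real results:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts of word types in sustainability-related passages.
data = pd.DataFrame(
    {"NOUN": [120, 95], "VERB": [60, 70], "ADJ": [25, 55]},
    index=["Preem", "Vattenfall"],
)

# Each bar is an organization; the segments are word types; total height is overall frequency.
ax = data.plot(kind="bar", stacked=True)
ax.set_ylabel("Token count")
ax.set_title("Word types in sustainability passages (illustrative data)")
plt.tight_layout()
plt.show()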