Module 3: Analyzing text content with natural language processing

AI-aided content analysis of sustainability communication

Lesson 3.1: Reading text into dataframes

lab text

Convert PDF to plain text

!pip install -q pdfminer.six
import os
from pdfminer.high_level import extract_text

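# Directories containing the PDFs.
# Placeholder: replace with your own directory names.
directories = ['organization1', 'organization2']
# Paths used in this lab; this assignment overwrites the placeholder above.
directories = ['/content/osm-cca-nlp/res/pdf/preem', '/content/osm-cca-nlp/res/pdf/vattenfall']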

for directory in directories:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                text_path = os.path.splitext(pdf_path)[0] + '.txt'

                try:
                    text = extract_text(pdf_path)
                    with open(text_path, 'w', encoding='utf-8') as f:
                        f.write(text)
                    print(f"Converted {pdf_path} to {text_path}")
                except Exception as e:
                    print(f"Failed to convert {pdf_path}: {e}")

Importing Necessary Libraries

The cell first installs the pdfminer.six package with pip. It then imports os, for working with the file system, and extract_text from pdfminer.high_level, for extracting the text content of PDF files.


Defining the Directories Containing PDFs

The directories variable is assigned twice. The first assignment, ['organization1', 'organization2'], is a placeholder that shows the expected shape of the list. The second assignment immediately overwrites it with the actual paths to the directories containing the PDF files:

  • /content/osm-cca-nlp/res/pdf/preem
  • /content/osm-cca-nlp/res/pdf/vattenfall


Iterating Over Each Directory

The code uses a for loop to iterate through each directory specified in the directories list. This allows the program to process multiple directories sequentially.


Walking Through Directory Trees

Within each directory, os.walk(directory) traverses the whole directory tree, including subdirectories. For every folder it visits, it yields a 3-tuple: root (the path of the current folder), dirs (the names of its subdirectories), and files (the names of the files directly inside it).
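
To see what os.walk produces, the following minimal sketch builds a small throwaway tree and records each tuple. The folder and file names (reports, 2023, a.pdf, b.pdf) are made up for illustration and are not part of the lab data.

```python
import os
import tempfile

# Build a temporary tree so the example does not depend on the lab's folders.
# All names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'reports', '2023'))
    open(os.path.join(tmp, 'reports', 'a.pdf'), 'w').close()
    open(os.path.join(tmp, 'reports', '2023', 'b.pdf'), 'w').close()

    walked = []
    for root, dirs, files in os.walk(tmp):
        # Store paths relative to the temporary folder to keep the output short.
        rel_root = os.path.relpath(root, tmp)
        walked.append((rel_root, sorted(dirs), sorted(files)))

    # One tuple per visited folder: the top level, reports, and reports/2023.
    for entry in sorted(walked):
        print(entry)
```

Each folder appears exactly once as root, so a file nested several levels deep is still found without writing any recursion by hand.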


Identifying PDF Files

For every file in the files list, the code checks if the file name ends with .pdf (case-insensitive) using file.lower().endswith('.pdf'). This ensures that only PDF files are processed.
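
The check can be tried on a few file names in isolation. This is a small sketch; the helper name is_pdf and the sample names are assumptions introduced here, not part of the lab code.

```python
def is_pdf(filename: str) -> bool:
    """Return True if the file name ends with .pdf, ignoring case."""
    return filename.lower().endswith('.pdf')

# Hypothetical file names covering the main cases.
names = ['Report.PDF', 'summary.pdf', 'notes.txt', 'archive.pdf.bak']
pdf_names = [name for name in names if is_pdf(name)]
print(pdf_names)  # ['Report.PDF', 'summary.pdf']
```

The test looks only at the file name, so a file that carries a .pdf extension but is not a valid PDF still passes this filter and is caught later by the exception handling.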


Constructing File Paths

The full path to the PDF file is constructed using os.path.join(root, file). The corresponding text file path is created by replacing the .pdf extension with .txt using os.path.splitext(pdf_path)[0] + '.txt'.
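
A short sketch of the two path operations, using a made-up folder and file name (both hypothetical). It also shows that os.path.splitext splits only at the last dot, so dots inside the file name are preserved.

```python
import os

# Hypothetical values standing in for what os.walk would yield.
root = os.path.join('res', 'pdf', 'preem')
file = 'Sustainability.Report.2023.pdf'

# Join folder and file name with the separator of the current operating system.
pdf_path = os.path.join(root, file)

# splitext returns (everything before the last dot, the extension including the dot).
stem, ext = os.path.splitext(pdf_path)
text_path = stem + '.txt'

print(ext)                          # .pdf
print(os.path.basename(text_path))  # Sustainability.Report.2023.txt
```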


Extracting Text from PDFs

Text extraction runs inside a try block so that a problematic file does not stop the whole run. The extract_text(pdf_path) function reads the PDF file and returns its content as a single string, which is stored in the variable text.


Writing Extracted Text to Files

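If text extraction succeeds, the code opens a new text file at text_path in write mode with UTF-8 encoding and writes the extracted text into it. The with statement closes the file automatically when the block ends, even if an error occurs while writing. Because text_path is derived from pdf_path, the text file is saved next to the original PDF; an existing file with the same name is overwritten.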
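
The writing step can be checked on its own with a temporary folder. This is a minimal sketch under assumed names: the file name example.txt and the sample string are invented for illustration.

```python
import os
import tempfile

# Sample text with a non-ASCII character, to show why the encoding is set explicitly.
text = 'Hållbarhetsredovisning'

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'example.txt')  # hypothetical file name

    with open(text_path, 'w', encoding='utf-8') as f:
        f.write(text)

    # The with block has ended, so the file is already closed here.
    closed_after_with = f.closed

    # Reading back with the same encoding returns the original string.
    with open(text_path, 'r', encoding='utf-8') as f:
        round_trip = f.read()

print(closed_after_with)   # True
print(round_trip == text)  # True
```

Setting encoding='utf-8' on both the write and the later read avoids depending on the platform's default encoding, which matters for reports containing characters such as å, ä, and ö.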


Logging Successful Conversions

After successfully writing the text file, the code prints a message indicating the PDF file has been converted, using:

print(f"Converted {pdf_path} to {text_path}")

Handling Exceptions

An except block catches any exceptions that occur during the extraction or writing process. If an error occurs, it prints a failure message with the path of the PDF file and the exception details:

print(f"Failed to convert {pdf_path}: {e}")
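
The effect of placing the try/except inside the loop can be illustrated without any real PDFs. In this sketch, fake_extract is a hypothetical stand-in for extract_text that fails for one file, and the converted and failed lists are additions for illustration; the lab code only prints messages.

```python
def fake_extract(path: str) -> str:
    """Hypothetical stand-in for extract_text: fails for paths containing 'broken'."""
    if 'broken' in path:
        raise ValueError('not a valid PDF')
    return f'text of {path}'

converted = []
failed = []

for pdf_path in ['a.pdf', 'broken.pdf', 'c.pdf']:
    try:
        text = fake_extract(pdf_path)
        converted.append(pdf_path)
        print(f"Converted {pdf_path}")
    except Exception as e:
        # The error is reported and the loop moves on to the next file.
        failed.append((pdf_path, str(e)))
        print(f"Failed to convert {pdf_path}: {e}")

print(converted)  # ['a.pdf', 'c.pdf']
print(failed)     # [('broken.pdf', 'not a valid PDF')]
```

Collecting the failures in a list, as done here, is one possible way to review problem files after a long run instead of scrolling through the printed log.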

Read plain text into a pandas DataFrame

import os
import pandas as pd
import re
import string

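# Directories containing the text files (written next to the PDFs in the previous step).
# Placeholder: replace with your own directory names.
directories = ['organization1', 'organization2']
# Paths used in this lab; this assignment overwrites the placeholder above.
directories = ['/content/osm-cca-nlp/res/pdf/preem', '/content/osm-cca-nlp/res/pdf/vattenfall']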

data = []
text_index = 1

# Allowed characters: ASCII letters, ASCII punctuation, and whitespace.
# Digits and non-ASCII letters (for example å, ä, ö) are dropped.
allowed_chars = set(string.ascii_letters + string.punctuation + string.whitespace)

for directory in directories:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith('.txt'):
                file_path = os.path.join(root, file)
                folder_name = os.path.basename(root)

                with open(file_path, 'r', encoding='utf-8') as f:
                    raw_text = f.read()

                # Keep only allowed characters
                clean_text = ''.join(c for c in raw_text if c in allowed_chars)

                # Replace sequences of whitespace with a single space
                clean_text = re.sub(r'\s+', ' ', clean_text)

                # Trim leading and trailing whitespace
                clean_text = clean_text.strip()

                data.append({
                    'text_index': text_index,
                    'file_path': file_path,
                    'folder_name': folder_name,
                    'raw_text': raw_text,
                    'clean_text': clean_text
                })

                text_index += 1

# Create DataFrame
df_texts = pd.DataFrame(data, columns=['text_index', 'file_path', 'folder_name', 'raw_text', 'clean_text'])

# Save DataFrame to TSV file
df_texts.to_csv('df_texts.tsv', sep='\t', index=False)
df_texts.head()
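
The three cleaning steps in the loop above can be tried on a single string to see exactly what is kept. This is a minimal sketch: the function name clean and the sample string are assumptions made for illustration, while allowed_chars and the regular expression are the same as in the lab code.

```python
import re
import string

# Same character set as in the lab code: ASCII letters, punctuation, whitespace.
allowed_chars = set(string.ascii_letters + string.punctuation + string.whitespace)

def clean(raw_text: str) -> str:
    """Apply the lab's cleaning steps to one string."""
    # 1. Keep only allowed characters (digits and non-ASCII letters are removed).
    kept = ''.join(c for c in raw_text if c in allowed_chars)
    # 2. Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    collapsed = re.sub(r'\s+', ' ', kept)
    # 3. Trim leading and trailing whitespace.
    return collapsed.strip()

# Hypothetical sample with a non-ASCII letter, digits, a tab, and a newline.
sample = "Hållbarhet  2023:\n\tlower emissions, 25% less CO2!"
print(clean(sample))  # Hllbarhet : lower emissions, % less CO!
```

The output shows the trade-off of this filter: layout noise from the PDF disappears, but so do numbers and Swedish letters. The raw_text column keeps the unmodified text, so a different cleaning rule can be applied later without re-reading the files.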