Module 3: Analyzing text content with natural language processing
AI-aided content analysis of sustainability communication
Lesson 3.1: Reading text into dataframes
lab notebook
lab video
lab text
Convert PDF to plain text
!pip install -q pdfminer.six
import os
from pdfminer.high_level import extract_text
# Directories containing the PDFs
directories = ['organization1', 'organization2']  # placeholder folder names
directories = ['/content/osm-cca-nlp/res/pdf/preem', '/content/osm-cca-nlp/res/pdf/vattenfall']

for directory in directories:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                text_path = os.path.splitext(pdf_path)[0] + '.txt'
                try:
                    text = extract_text(pdf_path)
                    with open(text_path, 'w', encoding='utf-8') as f:
                        f.write(text)
                    print(f"Converted {pdf_path} to {text_path}")
                except Exception as e:
                    print(f"Failed to convert {pdf_path}: {e}")
Importing Necessary Libraries
The code begins by importing essential modules: os for interacting with the operating system's file system, and extract_text from pdfminer.high_level for extracting text content from PDF files.
Defining the Directories Containing PDFs
Two lists named directories are defined. The first is a placeholder with ['organization1', 'organization2'], and the second specifies the actual paths to the directories containing the PDF files (a quick existence check is sketched after this list):
- /content/osm-cca-nlp/res/pdf/preem
- /content/osm-cca-nlp/res/pdf/vattenfall
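These are Colab paths that only exist after the course repository has been cloned, so it can help to confirm they are present before running the conversion loop. A minimal sketch, using only the paths listed above (this check is not part of the original lab code):

import os

directories = ['/content/osm-cca-nlp/res/pdf/preem', '/content/osm-cca-nlp/res/pdf/vattenfall']

# Warn about any directory that is missing before the conversion loop runs
for directory in directories:
    if not os.path.isdir(directory):
        print(f"Warning: {directory} not found -- clone the repository first.")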
Iterating Over Each Directory
The code uses a for loop to iterate through each directory specified in the directories list. This allows the program to process multiple directories sequentially.
Walking Through Directory Trees
Within each directory, the os.walk(directory) function traverses the directory tree. For each directory it visits, it yields a tuple containing the root path, a list of dirs (subdirectories), and a list of files.
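For intuition, a small illustration of what os.walk yields; the folder layout here is invented for the example:

import os

# Hypothetical layout: reports/2022/report.pdf and reports/2023/report.pdf
for root, dirs, files in os.walk('reports'):
    print(root, dirs, files)

# Possible output:
# reports ['2022', '2023'] []
# reports/2022 [] ['report.pdf']
# reports/2023 [] ['report.pdf']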
Identifying PDF Files
For every file in the files list, the code checks whether the file name ends with .pdf (case-insensitive) using file.lower().endswith('.pdf'). This ensures that only PDF files are processed.
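Lower-casing the name first means files such as Report.PDF are matched as well; a short illustration:

for name in ['report.pdf', 'Report.PDF', 'notes.txt']:
    print(name, name.lower().endswith('.pdf'))

# report.pdf True
# Report.PDF True
# notes.txt False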
Constructing File Paths
The full path to the PDF file is constructed using os.path.join(root, file). The corresponding text file path is created by replacing the .pdf extension with .txt using os.path.splitext(pdf_path)[0] + '.txt'.
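The same two calls can be tried on a made-up file name to see the transformation:

import os

root = '/content/osm-cca-nlp/res/pdf/preem'
file = 'annual_report.pdf'  # hypothetical file name
pdf_path = os.path.join(root, file)
text_path = os.path.splitext(pdf_path)[0] + '.txt'
print(pdf_path)   # /content/osm-cca-nlp/res/pdf/preem/annual_report.pdf
print(text_path)  # /content/osm-cca-nlp/res/pdf/preem/annual_report.txt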
Extracting Text from PDFs
A try block is initiated to attempt text extraction. The extract_text(pdf_path) function reads the content of the PDF file and stores it in the variable text.
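extract_text can also be called on a single file to inspect the output interactively. A sketch, assuming a hypothetical file name under one of the directories above:

from pdfminer.high_level import extract_text

sample_pdf = '/content/osm-cca-nlp/res/pdf/preem/annual_report.pdf'  # hypothetical
text = extract_text(sample_pdf)
print(len(text), 'characters extracted')
print(text[:500])  # preview the first 500 characters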
Writing Extracted Text to Files
If text extraction is successful, the code opens a new text file at text_path in write mode with UTF-8 encoding. It writes the extracted text into this file and then closes it, ensuring the text is saved next to the original PDF.
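The with statement closes the file handle automatically, even if the write raises an error. The same pattern in isolation, continuing the hypothetical sample above, with a quick read-back to confirm the file was written:

text_path = '/content/osm-cca-nlp/res/pdf/preem/annual_report.txt'  # hypothetical

with open(text_path, 'w', encoding='utf-8') as f:
    f.write(text)  # the string returned by extract_text

with open(text_path, 'r', encoding='utf-8') as f:
    print(f.read()[:200])  # confirm the content round-trips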
Logging Successful Conversions
After successfully writing the text file, the code prints a message indicating the PDF file has been converted, using:
print(f"Converted {pdf_path} to {text_path}")
Handling Exceptions
An except block catches any exceptions that occur during the extraction or writing process. If an error occurs, it prints a failure message with the path of the PDF file and the exception details:
print(f"Failed to convert {pdf_path}: {e}")
Read plain text into a pandas DataFrame
import os
import pandas as pd
import re
import string
# Directories containing the text files
directories = ['organization1', 'organization2']  # placeholder folder names
directories = ['/content/osm-cca-nlp/res/pdf/preem', '/content/osm-cca-nlp/res/pdf/vattenfall']

data = []
text_index = 1

# Allowed characters: alphabetic, punctuation, and whitespace
allowed_chars = set(string.ascii_letters + string.punctuation + string.whitespace)

for directory in directories:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith('.txt'):
                file_path = os.path.join(root, file)
                folder_name = os.path.basename(root)

                with open(file_path, 'r', encoding='utf-8') as f:
                    raw_text = f.read()

                # Keep only allowed characters
                clean_text = ''.join(c for c in raw_text if c in allowed_chars)

                # Replace sequences of whitespace with a single space
                clean_text = re.sub(r'\s+', ' ', clean_text)

                # Trim leading and trailing whitespace
                clean_text = clean_text.strip()

                data.append({'text_index': text_index,
                             'file_path': file_path,
                             'folder_name': folder_name,
                             'raw_text': raw_text,
                             'clean_text': clean_text})
                text_index += 1

# Create DataFrame
df_texts = pd.DataFrame(data, columns=['text_index', 'file_path', 'folder_name', 'raw_text', 'clean_text'])

# Save DataFrame to TSV file
df_texts.to_csv('df_texts.tsv', sep='\t', index=False)
df_texts.head()
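To confirm the export, the TSV can be read back into a fresh DataFrame and compared with the one in memory; df_check is just an illustrative name:

import pandas as pd

df_check = pd.read_csv('df_texts.tsv', sep='\t')
print(df_check.shape)  # should match df_texts.shape
print(df_check[['text_index', 'folder_name', 'file_path']].head())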