This page contains supplementary materials for the article “Remediation and the Development of Modernist Forms in The Western Home Monthly” in Reading Modernism with Machines, edited by Shawna Ross and James O’Sullivan (Palgrave Macmillan, 2016).

 

Appendix A

The following shell command was used to concatenate files at various points in the project.

cat *.[extension of files to concatenate] >> [name of concatenated file].[extension of concatenated file]
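As a minimal sketch of the template above, the following shows the command with its placeholders filled in. The directory name `pages`, the file names, and the output name `issue.txt` are all hypothetical and are not taken from the project. The output is written outside the source directory so that it cannot be matched by the glob on a later run.

```shell
# Hypothetical demo: every file and directory name here is made up for illustration.
set -eu
workdir=$(mktemp -d)
mkdir "$workdir/pages"
printf 'first page\n'  > "$workdir/pages/0001.txt"
printf 'second page\n' > "$workdir/pages/0002.txt"

# The shell expands the glob in sorted order, so pages are appended in sequence.
# ">>" appends, so running the command twice would duplicate the content.
cat "$workdir"/pages/*.txt >> "$workdir/issue.txt"

cat "$workdir/issue.txt"
```

Because `>>` appends rather than overwrites, it may be worth deleting or renaming an existing output file before re-running the command.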

 

The following PHP script was used to extract the text content from the ALTO files.

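<?php

// Process every file named on the command line.
// $argv[0] is this script's own name, so skip it.
foreach (array_slice($argv, 1) as $value) {

    $xmlDoc = new DOMDocument();
    $xmlDoc->load($value);

    // ALTO stores each recognised word in the CONTENT attribute
    // of a <String> element.
    $searchNodes = $xmlDoc->getElementsByTagName("String");

    // Note: the extracted text is appended to the same file
    // that was read ($value).
    $file = fopen($value, "a");
    foreach ($searchNodes as $searchNode) {
        $value_string = $searchNode->getAttribute('CONTENT');
        fwrite($file, $value_string . ' ');
    }
    fclose($file);
}

?>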

 

The following command was used to run the above PHP script across all 24,170 ALTO files. Many thanks to Matt Bouchard for his help developing this command.

find . -xdev -name "*.xml" -exec php [name of above php script] {} \;
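The following sketch illustrates how `find ... -exec ... {} \;` visits each matching file in turn. It is only an illustration: `basename` stands in for `php [name of above php script]`, and the directory and file names are hypothetical.

```shell
# Hypothetical demo tree; "basename" is a stand-in for the PHP script.
set -eu
workdir=$(mktemp -d)
mkdir -p "$workdir/1905-01" "$workdir/1905-02"
: > "$workdir/1905-01/0001.xml"
: > "$workdir/1905-02/0001.xml"
: > "$workdir/1905-02/notes.txt"   # not matched by the pattern

# Quote the pattern so that find, not the shell, expands it.
# {} is replaced by each matching path; \; ends the -exec command,
# which runs once per file.
found=$(find "$workdir" -xdev -name '*.xml' -exec basename {} \; | sort)
echo "$found"
```

Running the command once per file is slow for tens of thousands of files but keeps each invocation simple, which suits a script that handles its arguments one at a time.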

 

The following Perl script was used to combine the individual page text documents into their issues (appropriately named based on their respective directories). Many thanks again to Matt Bouchard for this.

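#!/usr/bin/perl

use strict;
use warnings;

use File::Find;

# Walk the directory tree and call dirs_to_combine for every entry.
find (\&dirs_to_combine, "[path to files]");

sub dirs_to_combine {
    # Only act on directories whose names match the issue pattern.
    if ((-d $_) && ($_ =~ [file name pattern])) {
        # Append every page text file in the directory to one file
        # named after the directory.
        my $conquer = "cat $File::Find::dir/".$_."/*.txt >> [path to files]".$_.".txt";
        `$conquer`;
    }
}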
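For comparison, the same idea can be sketched as a plain shell loop. This is not the script used in the project: it handles only one level of directories, whereas `File::Find` recurses, and the `pages` and `issues` directories and the issue names are hypothetical.

```shell
# Hypothetical layout: one directory per issue, one .txt file per page.
set -eu
workdir=$(mktemp -d)
mkdir -p "$workdir/pages/1905-01" "$workdir/pages/1905-02" "$workdir/issues"
printf 'jan p1\n' > "$workdir/pages/1905-01/0001.txt"
printf 'jan p2\n' > "$workdir/pages/1905-01/0002.txt"
printf 'feb p1\n' > "$workdir/pages/1905-02/0001.txt"

# For each issue directory, append its pages (in sorted order)
# to a single file named after the directory.
for dir in "$workdir"/pages/*/; do
    issue=$(basename "$dir")
    cat "$dir"*.txt >> "$workdir/issues/$issue.txt"
done

cat "$workdir/issues/1905-01.txt"
```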

 

 

Appendix B

Dr. Harvey Quamen wrote the following PHP script to extract material from our RapidMiner spreadsheets and convert it into a format suitable for processing in R.

<?php

/*****************************************************************
*
*          Harvey Quamen
*          hquamen@ualberta.ca
*
*          This script parses a CSV word frequency file and extracts
*          frequencies for selected words. It produces a CSV file suitable
*          for import in R for data visualization.
*
*****************************************************************/

/*****************************************************************
*          This is the word frequency file to read; if you use a
*          different file later, change the name here:
*****************************************************************/
$file = '[name of spreadsheet containing RapidMiner data]';

/*****************************************************************
*          A couple of settings: labels for R axes and such.
*****************************************************************/
$legend_label = 'Word';
$x_axis = 'Year';
$y_axis = 'Frequency';

//          If the user requested no words, then output a usage message and exit.

if (empty($argv[1])) {
    echo "\nThis script parses a word frequency file and extracts\n";
    echo "each word's frequency to produce a CSV file for import into R.\n\n";
    echo "Usage:\n\n";
    echo "\tphp  " . $argv[0] . "  word_list  [> result_file]\n\n";
    echo "...where word_list is a space-separated list of words to fetch\n";
    echo "frequencies for.\n\n";
    echo "The results will get printed to the screen, but you can choose to\n";
    echo "route the results into a result file with the usual Unix `>` character\n";
    echo "followed by a filename.\n\n";
    exit;
}

//          if we're still here, then we have at least one word
$words = array();
array_shift($argv);      //          element 0 is this script's name; axe it
$words = $argv;
//          clean up stray spaces
$words = array_map('trim', $words);

//          open the word frequency CSV file
$handle = fopen($file, 'r');

//          if we couldn't open it, exit with an error
if (! $handle) {
    echo "Could not open '{$file}' for reading.\nExiting.\n";
    exit(1);
}

//          open the output stream as a file to write to;
//          we'll use PHP's inherent ability to understand
//          the CSV output format.
$csv_output = fopen('php://output', 'w');

//          read the header row; we'll parse dates out of this. Looks like:
//          Word,Total Occurences,Document Occurrences,1901,1903,1904,1905, ...

$header = fgetcsv($handle, 1000);
//          remove the first three columns; we really want only the dates
$header = array_slice($header, 3);

//          output a new header row for us; variables are defined up top.
//          If any of them contain spaces, e.g., PHP will quote them for us.
fputcsv($csv_output, array($legend_label, $x_axis, $y_axis));

//          Read the data file as a comma-separated file; iterate through
//          the data, extracting yearly frequency counts for the years we
//          want. Each column in the original file generates one point in R,
//          so we need to output one new row for each data point.
//          We'll output this as CSV to the terminal window so that PHP
//          can handle any quoted strings by itself.

while (($row = fgetcsv($handle, 2000)) !== false) {
    $word = trim(array_shift($row));
    $total_count = array_shift($row);       //          ignore this
    $doc_count = array_shift($row);         //          ignore this too
    if (in_array($word, $words)) {
        for ($index = 0; $index < count($row); $index++) {
            $date = (int) $header[$index];
            //          use PHP's inherent ability to write to CSV format
            fputcsv($csv_output, array($word, $date, $row[$index]));
        }
    }
}

fclose($handle);

//                      END OF SCRIPT

?>
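The reshaping performed by the script above, from one row per word with a column per year to one row per word and year, can be sketched in a few lines of `awk`. This is an illustration only, not a replacement for the PHP script: the sample words and counts are invented, and the plain comma split assumes no field contains a quoted comma, which PHP's CSV functions would handle properly.

```shell
# Hypothetical sample data in the same column layout as the header row
# described in the script's comments.
set -eu
workdir=$(mktemp -d)
cat > "$workdir/freq.csv" <<'EOF'
Word,Total Occurences,Document Occurrences,1901,1903
home,7,2,3,4
farm,5,2,1,4
EOF

# Remember the year for each column from the header row, then emit
# one "word,year,frequency" line per year column for the requested word.
long=$(awk -F, -v want="home" '
    NR == 1 { for (i = 4; i <= NF; i++) year[i] = $i
              print "Word,Year,Frequency"; next }
    $1 == want { for (i = 4; i <= NF; i++) print $1 "," year[i] "," $i }
' "$workdir/freq.csv")
echo "$long"
```

The long format produced here is the shape that plotting functions in R typically expect, with one observation per row.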

 

 

The following PHP script, also from Harvey Quamen, extracts data from Mallet’s topic modeling output and converts it into a format suitable for processing in R.

<?php

/*****************************************************************
*
*          Harvey Quamen
*          hquamen@ualberta.ca
*
*          This script parses a Mallet composition file and extracts
*          composition percentages for selected topics. It produces
*          a CSV file suitable for import in R for data visualization.
*
*****************************************************************/

/*****************************************************************
*          This is the Mallet composition file to read; if you use a
*          different file later, change the name here:
*****************************************************************/
$file = '[name of file]';

/*****************************************************************
*          A couple of settings: labels for R axes and such.
*****************************************************************/
$legend_label = 'Topic';
$x_axis = 'Date';
$y_axis = 'Weight';

/*****************************************************************
*          More settings; you might never need to change these.
*          The idea here is to output topic numbers and dates as
*          strings rather than as numbers so R displays them more
*          correctly.
*****************************************************************/
define('TOPICS_AS_STRINGS', true);
define('DATES_AS_STRINGS', true);

//          If the user requested no topics, then output a usage message and exit.

if (empty($argv[1])) {
    echo "\nThis script parses a Mallet topic composition text file and extracts\n";
    echo "one topic's composition percentages and sorts them chronologically.\n\n";
    echo "Usage:\n\n";
    echo "\tphp  " . $argv[0] . "  topic_number  [> result_file]\n\n";
    echo "...where topic_number is the numeral corresponding to a topic that has been\n";
    echo "generated by Mallet.\n\n";
    echo "The results will get printed to the screen, but you can choose to\n";
    echo "route the results into a result file with the usual Unix `>` character\n";
    echo "followed by a filename.\n\n";
    exit;
}

//          if we're still here, then we have at least one topic number
$topic_numbers = array();
array_shift($argv);      //          element 0 is this script's name; axe it
$topic_numbers = $argv;

//          open the Mallet composition file
$handle = fopen($file, 'r');

//          if we couldn't open it, exit with an error
if (! $handle) {
    echo "Could not open '{$file}' for reading.\nExiting.\n";
    exit(1);
}

//          open the output stream as a file to write to;
//          we'll use PHP's inherent ability to understand
//          the CSV output format.
$csv_output = fopen('php://output', 'w');

//          read (and abandon) the useless header row
$header = fgetcsv($handle, 1000, "\t");

//          output a new header row for us; variables are defined up top.
//          If any of them contain spaces, e.g., PHP will quote them for us.
fputcsv($csv_output, array($legend_label, $x_axis, $y_axis));

//          Read the data file as a tab-delimited file; iterate through
//          the data, extracting the parts we want. Remember that Mallet
//          produces data in pairs: topic number followed by tab followed
//          by composition percentage. So when we encounter a topic number
//          we're interested in, also grab the following element which
//          contains the composition percentage.
//
//          NOTE: parseDate() is called below but is not defined in this
//          listing; it must be supplied before the script will run.

while (($row = fgetcsv($handle, 2000, "\t")) !== false) {
    $row_num = array_shift($row);
    //          the document's file name (kept separate from $file above)
    $source_file = array_shift($row);
    //          ignore results for .DS_Store; Mallet should have ignored it.
    if (stripos($source_file, '.DS_Store') !== false) continue;
    for ($index = 0; $index < count($row); $index += 2) {
        if (in_array($row[$index], $topic_numbers)) {
            $topic = TOPICS_AS_STRINGS ? "'" . $row[$index] . "'" : $row[$index];
            $date = DATES_AS_STRINGS ? "'" . parseDate($source_file) . "'" : parseDate($source_file);
            echo $topic . ',' . $date . ',' . (float) $row[$index + 1] . "\n";
        }
    }
}

fclose($handle);

//                      END OF SCRIPT

?>
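The pair-wise layout that the script's comments describe (topic number, then weight, repeated across each row) can be illustrated with a short `awk` sketch. The sample rows, file names, and weights below are invented, and the layout is assumed to match the one described in the comments above. Since the date-parsing step is not shown in the listing, this sketch prints the document's file name in place of a date.

```shell
# Hypothetical composition file: row number, file name, then
# (topic, weight) pairs separated by tabs.
set -eu
workdir=$(mktemp -d)
printf '#doc\tname\ttopic\tproportion\n'                  > "$workdir/composition.txt"
printf '0\tfile:/issues/1905-01.txt\t3\t0.6\t1\t0.4\n'   >> "$workdir/composition.txt"
printf '1\tfile:/issues/1905-02.txt\t1\t0.7\t3\t0.3\n'   >> "$workdir/composition.txt"

# Skip the header, then step through the pairs two fields at a time;
# when the topic number matches, print it with the file name and weight.
weights=$(awk -F'\t' -v topic=3 '
    NR == 1 { next }
    { for (i = 3; i < NF; i += 2)
          if ($i == topic) print $i "," $2 "," $(i + 1) }
' "$workdir/composition.txt")
echo "$weights"
```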
