Creating dynamic digitization progress charts

The Content, Context, & Capacity (CCC) grant project used open source tools and the data-export capabilities of large-scale digitization scanning equipment and software to produce dynamic charts that track the project's progress.

Figure 1. Screenshot example of a live progress update chart

Figure 2. Screenshot example of a live progress update table

Tools used to produce charts

  • PHP: used to process logfile data and write the results out to webpages.
  • HighCharts: an interactive JavaScript charting library used to create the visualizations. As of 2012, HighCharts is free for non-commercial use. Read about HighCharts on the product webpage.
  • Google Chart Tools (or Google Visualization API): a free tool from Google for creating dynamic tables and visual displays of data. We use it to create the dynamic table. Read about Google Chart Tools on the product webpage.
  • Technical metadata files: our scanners are set up to automatically write metadata to text files as they scan. These text files are the source of all data for charts and tables; they are read by the PHP scripts.

Detailed description of dynamic chart production

Popular library scanning equipment uses software (such as Zeutschel's Omniscan or Phase One's Capture One) that can automatically write technical metadata to logfiles as scans are created. Such metadata exports typically must be set up manually. See our detailed guidelines for setting up the Zeutschel metadata export.

Figure 3. Screenshot example of the Zeutschel metadata export file

The metadata export creates a structured, machine-readable data file in text format, with one line of metadata per scan. While different software can create different metadata, technical information such as the timestamp for a scan and its filename will almost certainly be available. With as little information as a timestamp and a well-structured filename, scripts can process these logfiles to produce useful statistics.
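As a minimal sketch of that idea, the PHP below counts scans per day from log lines. The three-field "filename;date;time" layout, the sample values, and the function name count_scans_per_day are assumptions made for illustration; a real export contains many more fields, so the field positions would need adjusting. It assumes PHP 7.1 or later.

```php
<?php
// Hypothetical, simplified log format: "filename;date;time", one line per scan.
// Real exports contain many more fields; adjust the indexes to match yours.
function count_scans_per_day(array $lines): array
{
    $per_day = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $fields = explode(';', $line);
        if (count($fields) < 3) {
            continue; // skip malformed lines
        }
        $date = $fields[1];
        $per_day[$date] = ($per_day[$date] ?? 0) + 1;
    }
    ksort($per_day); // order the result by date string
    return $per_day;
}

// Made-up sample lines, for illustration only
$sample = [
    "00001_0001_0001.tif;2012-03-05;09:14:02",
    "00001_0001_0002.tif;2012-03-05;09:14:40",
    "00001_0002_0001.tif;2012-03-06;10:02:11",
];
print_r(count_scans_per_day($sample));
```

The same loop could just as easily group by week or by scanner if those fields are present in the export.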

In the case of the CCC project, the main statistics of interest are:

  1. What percentage of each collection has been digitized?
  2. How many folders and how many unique scans have been created for each collection?
  3. How many total scans have been produced and what percentage of our project goal is that?
  4. How many total scans have been produced for each of the project partners?

These are all fairly simple statistics to produce from the Zeutschel metadata files that we output, though the first requires a second source of data to tell us the total quantity of material in each collection. Because our metadata output includes user-created fields that list the institution that owns the material being scanned and the institution that is performing the scanning, we are able to answer question 4. Because the file naming schema for CCC institutions includes information about the collection, the box number, the folder number, and the scan number, these filenames can be parsed to provide the information necessary for answering questions 1 and 2.
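The filename parsing described above might look something like the following sketch. It assumes a hypothetical underscore-delimited layout of collection_folder_scan.ext; the sample filenames and the function names parse_scan_filename and tally_by_collection are illustrative and are not taken from the CCC script. It assumes PHP 7.1 or later.

```php
<?php
// Assumed (hypothetical) filename layout: collection_folder_scan.ext
function parse_scan_filename(string $filename): ?array
{
    $base = pathinfo($filename, PATHINFO_FILENAME); // drop the extension
    $pieces = explode('_', $base);
    if (count($pieces) < 3) {
        return null; // not a scan filename in the assumed layout
    }
    return [
        'collection' => $pieces[0],
        'folder'     => $pieces[1],
        'scan'       => $pieces[2],
    ];
}

// Tally unique folders and total scans for each collection.
function tally_by_collection(array $filenames): array
{
    $stats = [];
    foreach ($filenames as $filename) {
        $parts = parse_scan_filename($filename);
        if ($parts === null) {
            continue;
        }
        $id = $parts['collection'];
        $folder = $parts['folder'];
        if (!isset($stats[$id])) {
            $stats[$id] = ['folders' => [], 'scan_count' => 0];
        }
        // One entry per unique folder, holding that folder's scan count
        $stats[$id]['folders'][$folder] = ($stats[$id]['folders'][$folder] ?? 0) + 1;
        $stats[$id]['scan_count']++;
    }
    return $stats;
}

// Made-up filenames, for illustration only
$stats = tally_by_collection([
    '04007_0012_0001.tif',
    '04007_0012_0002.tif',
    '04007_0013_0001.tif',
    'notes.txt',
]);
print_r($stats);
```

Counting the entries in each 'folders' array gives the number of folders scanned (question 2), and dividing that by the collection's total folder count gives the percentage digitized (question 1).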

Each Digital Production Center transfers technical metadata files to the CCC project librarian on a weekly basis, via FTP or email. Metadata files are placed in a web directory from which the PHP script reads them.

A text file containing supplementary information, called "collectionnames.txt", is placed in the same directory as the metadata files. It contains the collection ID, a human-readable version of the collection name, and the total number of containers slated for digitization from that collection. A "container" could be a folder, an oversized folder, an audiocassette, etc. Because the majority of CCC material is manuscript, most "containers" are folders for our purposes. Without the total number of containers to be digitized, the PHP script could not calculate what percentage of a collection has been completed to date. The PHP script that processes logfiles matches log data against the collection IDs listed in collectionnames.txt to get a human-readable name for each collection and to determine the percentage of the collection that has been digitized.

Figure 4. Screenshot of the supplemental text file "collectionnames.txt"
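A rough sketch of how such a supplemental file can feed the percentage calculation is shown below. It assumes three tab-delimited fields per line, in the order described above. The sample collection IDs, names, and totals are invented, and load_collection_names and percent_complete are hypothetical helper names. It assumes PHP 7.1 or later.

```php
<?php
// Parse tab-delimited lines: collection ID, display name, total containers.
function load_collection_names(string $text): array
{
    $collections = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $fields = explode("\t", $line);
        if (count($fields) < 3 || $fields[0] === '') {
            continue; // skip blank or incomplete lines
        }
        $collections[$fields[0]] = [
            'name'         => $fields[1],
            'totalfolders' => (int) $fields[2],
        ];
    }
    return $collections;
}

// Percentage of containers digitized, or null when the total is unknown or zero.
function percent_complete(int $folders_done, int $total_folders): ?float
{
    if ($total_folders <= 0) {
        return null;
    }
    return round($folders_done / $total_folders * 100, 1);
}

// Invented sample content standing in for collectionnames.txt
$text = "04007\tSample Papers\t200\n05000\tExample Records\t80\n";
$collections = load_collection_names($text);

echo $collections['04007']['name'], ': ',
    percent_complete(50, $collections['04007']['totalfolders']), "% complete\n";
```

Returning null for a missing or zero total keeps a typo in the supplemental file from causing a division by zero.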

Project staff wrote a PHP script that reads data from the logfiles, processes it, and outputs the JavaScript and HTML code required to produce the dynamic charts and tables. This script was written solely to process the data needed for CCC, but project staff have received requests to share it as a sample. The script is available below; please note that the exact setup in use for TRLN would not work at another institution. Another institution would need someone who can program in PHP to write a script suited to its own technical metadata files and data visualization needs.

PHP is not required to produce either a HighCharts or Google chart; it is only used by CCC to process very large data files and produce summary statistics. If you have statistics that you want to manually enter in HTML to make a visualization, you can create a chart with no programming skills at all. See examples of how to plug and play with the HTML and JavaScript code either on the Google Charts examples page or on the HighCharts demo page.

CCC's PHP script to pull production stats from logfiles


<?php
    $INPUT_LOGFILE_PATH = "unc_zeutschel.txt";
	$INPUT_LOGFILE_PATH3 = "collectionnames.txt";
	$INPUT_LOGFILE_ERRORS = "ZeutschelErrors.txt";
	$INPUT_LOGFILE_DUKE = "duke_metadata.txt";


    // Get input logfiles ready for processing
    $input_logfile = @fopen($INPUT_LOGFILE_PATH, "r");
	$input_logfile3 = @fopen($INPUT_LOGFILE_PATH3, "r");
	$input_logfile_errors = @fopen($INPUT_LOGFILE_ERRORS, "r");
	$input_logfile_duke = @fopen($INPUT_LOGFILE_DUKE, "r");
	
	error_reporting(0);

    $collections = array();
    $referencenames = array();
    $project_totalscans = 0; // Total scans completed to date for project
    $project_totalfolders_complete = 0; // Use to find average scans per folder
    // Used to calculate the average number of items in a folder by institution
    // (requested by Digital Production Center staff for estimating how much
    // material to request in a delivery)
    $mean_scans_per_folder = 0;
    $total = 0;
    
   // Put abbreviated collection names and ID numbers in an array
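	  if ($input_logfile3) {
		// fgets() returns false at end of file, so no empty final line is processed
		while (($log_input_line3 = fgets($input_logfile3)) !== false) {
			$log_input_line3 = rtrim($log_input_line3, "\r\n");

			// Skip blank or incomplete lines (three tab-delimited fields are expected)
			if (substr_count($log_input_line3, "\t") < 2) {
				continue;
			}

			// list() assigns the array elements returned by explode() to variables
			list($collnum, $shortname, $totalfolders) = explode("\t", $log_input_line3);

			$referencenames[$collnum]['name'] = $shortname;
			$referencenames[$collnum]['totalfolders'] = (int) $totalfolders;
			$referencenames[$collnum]['folder_count'] = 0;
			// Start the counters empty so later loops can increment them safely
			$referencenames[$collnum]['scan_count'] = 0;
			$referencenames[$collnum]['folders'] = array();
		}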
			
		// Close input and output logfiles 
		fclose($input_logfile3);
		
	} else {
		print "Unable to open logfiles\n";
		return;//this ends the script
	}   
	        
	
    // PROCESS LOGFILE FROM DUKE
    if ($input_logfile_duke) {
        do {
            $log_input_line = fgets($input_logfile_duke);

            // List takes an array and splits into variables 
            list($filename, $filepath, $clipNumber, $omniscanID, $scanNumber, 
            $barcode, $widthInPixels, $heightInPixels, $widthInMM, $heightInMM, 
            $widthInInch, $heightInInch, $resolution, $bitsPerPixel, $scannerID, 
            $scannerType, $scannerSerial, $ScannerScanCounter, $scannerInfo, 
            $dongle, $filter, $username, $computerName, $date, $time, $jobName, 
            $paramfile, $collection_title, $target_id, $home_institution, 
            $scanning_institution) = explode(";", $log_input_line);
           
            $collectionID = substr($filename, 0, 3);
            $category = substr($filename, 3, 2);
            $box_number = substr($filename, 5, 2);
            $folder_number = substr($filename, 7, 3);
                        
  	         // Only do anything if the collection is listed in the external file collectionnames.txt
			if (isset($referencenames[$collectionID])) {
				if ($category == 'ms') {
					// Make an array of collection numbers with internal arrays
					// of unique folders, each with a count of scans (Duke's folder
					// numbers repeat, so we concatenate them with the box number)
					$referencenames[$collectionID]['folders'][$box_number . '_' . $folder_number]++;
				
					// Total number of scans for each collection 
					$referencenames[$collectionID]['scan_count']++;
					}
			}
        } while (!feof($input_logfile_duke));
    
        // Close input and output logfiles 
        fclose($input_logfile_duke);
    } else {
        print "Unable to open logfiles\n";
        return;//this ends the script
    }  
    
    
    // PROCESS LOGFILE FROM UNC ZEUTSCHEL
    if ($input_logfile) {
        do {
            $log_input_line = fgets($input_logfile);

            // List takes an array and splits into variables 
            list($filename, $filepath, $clipNumber, $omniscanID, $scanNumber, 
            $barcode, $widthInPixels, $heightInPixels, $widthInMM, $heightInMM, 
            $widthInInch, $heightInInch, $resolution, $bitsPerPixel, $scannerID, 
            $scannerType, $scannerSerial, $ScannerScanCounter, $scannerInfo, 
            $dongle, $filter, $username, $computerName, $date, $time, $jobName, 
            $paramfile, $scanningTech, $home_institution, $scanning_institution)
            = explode(";", $log_input_line);
 
            // COLLECTION PRODUCTION STATS
            // Break the filename into the collection number, folder number, and scan
            if ($home_institution == 'unc' || $home_institution == 'nccu') {
				$pieces = explode("_", $filename);
			} else {
				$pieces = explode("-", $filename);
			}
			
			// Only do anything if the collection is listed in the external file collectionnames.txt
			if (isset($referencenames[$pieces[0]])) {
			
				if ($home_institution == 'unc' || $home_institution == 'nccu') {
					// Make array of collection numbers --> folders --> scan counts
					$referencenames[$pieces[0]]['folders'][$pieces[1]]++;
					// Total number of scans for each collection
					$referencenames[$pieces[0]]['scan_count']++;
				} else {
					$referencenames[$pieces[0]]['folders'][$pieces[2] . '_' . $pieces[3]]++;
					$referencenames[$pieces[0]]['scan_count']++;
				}
				
				
			}
  	         
        } while (!feof($input_logfile));
    
        // Close input and output logfiles 
        fclose($input_logfile);
    } else {
        print "Unable to open logfiles\n";
        return;//this ends the script
    }    
      
    // Get scan count data from the UNC Zeutschel Error file, 
    // where they manually list the number of scans lost from Zeutschel logs due to errors
if ($input_logfile_errors) {
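        // fgets() returns false at end of file, so no empty final line is processed
        while (($log_input_line3 = fgets($input_logfile_errors)) !== false) {
            // Pad short lines so list() always receives eight fields
            $fields = array_pad(explode("\t", rtrim($log_input_line3, "\r\n")), 8, '');

            // list() assigns the array elements to variables
            list($date, $scantech, $scancount, $foldercount, $scanner, $collectionID,
            $collectionName, $reason) = $fields;

            // Only adjust collections listed in the external file collectionnames.txt
            if (isset($referencenames[$collectionID])) {
                $referencenames[$collectionID]['folder_count'] += (int) $foldercount;
                $referencenames[$collectionID]['scan_count'] += (int) $scancount;
            }
        }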
			
		// Close input and output logfiles 
		fclose($input_logfile_errors);
	} else {
		print "Unable to open logfiles\n";
		return;//this ends the script
	}  
	
	// HIGHCHARTS
    $highchartsData = "data.csv";
	$fh = fopen($highchartsData, 'w') or die("can't open file");
	$stringData = "Collection\n";
	fwrite($fh, $stringData);
	
	// Add the unique folders found in the logfiles to each collection's folder count
	foreach ($referencenames as $collnum => $collection_array) {
		if (isset($collection_array['folders'])) {
			$referencenames[$collnum]['folder_count'] += count($collection_array['folders']);
		}
	}

	// Build one "name,percentage" line for each collection that has scans.
	// $total counts those collections; it also sets the number of table rows below.
	$csv_lines = array();
	foreach ($referencenames as $collnum => $collection_array) {
		if (empty($collection_array['scan_count'])) {
			continue;
		}
		$total++;
		$csv_lines[] = $collection_array['name'] . ',' .
			round(($collection_array['folder_count'] / $collection_array['totalfolders']) * 100, 1);
	}

	// Joining with "\n" ensures there is no newline after the last entry
	fwrite($fh, implode("\n", $csv_lines));
	
	fclose($fh);

    foreach ($referencenames as $collnum => $collection_array) {
    	$project_totalscans += $collection_array['scan_count'];
    	$project_totalfolders_complete += $collection_array['folder_count'];
    }
    
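    // Avoid dividing by zero when no folders have been completed yet
    $mean_scans_per_folder = ($project_totalfolders_complete > 0)
        ? round($project_totalscans / $project_totalfolders_complete, 1)
        : 0;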

?>


    <script type="text/javascript">
    function drawVisualization() {
      
      // Create and populate the data table.
      var data = new google.visualization.DataTable();
      data.addColumn('string', 'Collection');
      data.addColumn('number', 'No. folders scanned');
      data.addColumn('number', 'No. scans');
      data.addColumn('number', '% complete');

data.addRows(<?php print $total ?>);
		
	  <?php 
		$j = 0;
		foreach ($referencenames as $collection_array) {
		if ($collection_array['scan_count'] >=1) {
			$i = 0;
			   print "data.setCell("
			   . $j
			   . ","
			   . $i
			   . ", '"
			   . $collection_array['name']
			   . "');\n";
			   $i++;
			   
			    print "data.setCell("
			   . $j
			   . ","
			   . $i
			   . ", "
			   . $collection_array['folder_count']
			   . ");\n";
			   $i++;
			   
			     print "data.setCell("
			   . $j
			   . ","
			   . $i
			   . ", "
			   . $collection_array['scan_count']
			   . ");\n";
			   $i++;
			   
			   print "data.setCell("
			   . $j
			   . ","
			   . $i
			   . ", "
			   . round($collection_array['folder_count']/$collection_array['totalfolders']*100, 1)
			   . ");\n";
			   
			   $j++;
			   }
		   }
	?>	   	

      
      // Create and draw the visualization.
      visualization = new google.visualization.Table(document.getElementById('table'));
      visualization.draw(data, null);
    }
    

    google.setOnLoadCallback(drawVisualization);
    </script>
    
    <script type="text/javascript"> 
      google.load('visualization', '1', {packages: ['table'], "callback": drawVisualization});
    </script>
    
    <?php
 print '<div style="text-align: center;"><p><b>Total project production to date:</b> <br/>' . number_format($project_totalscans) . 
    	  " scans (out of an estimated 400,000 scans)</p><p>
    	  Or approximately <b>" . round( ($project_totalscans/400000)*100, 1) . '%</b> of the estimated project total</p></div>';

?>


<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js" type="text/javascript"></script>
<script src="Highcharts-2.1.6/js/highcharts.js" type="text/javascript"></script>
<script type="text/javascript" src="Highcharts-2.1.6/js/themes/gray.js"></script>

 
		<script type="text/javascript"> 
		$(document).ready(function() {
 
			var options = {
				chart: {
					renderTo: 'highcharts',
					defaultSeriesType: 'column'
				},

				title: {
					text: 'Digitization progress'
				},

				xAxis: {
					categories: [],
				 title: {
					text: 'Collection'
				 }
				},
				yAxis: {
				min: 0,
				max: 100,
				title: {
					text: 'Percentage complete'
					}
				},
				 tooltip: {
						 formatter: function() {
							return ''+
								this.series.name +': appr. '+ this.y +'% complete';
						 }
					  },
				 legend: {
					layout: 'vertical',
					align: 'right',
					verticalAlign: 'top',
					x: -10,
					y: 20,
					borderWidth: 0
				},
				series: []
			};

			$.get('logs/data.csv', function(data) {
				// Split the lines
				var lines = data.split('\n');
				$.each(lines, function(lineNo, line) {
					var items = line.split(',');
					
					// header line contains categories
					if (lineNo == 0) {
						$.each(items, function(itemNo, item) {
							if (itemNo > 0) options.xAxis.categories.push(item);
						});
					}
					
					// the rest of the lines contain data with their name in the first position
					else {
						var series = { 
							data: []
						};
						$.each(items, function(itemNo, item) {
							if (itemNo == 0) {
								series.name = item;
							} else {
								series.data.push(parseFloat(item));
							}
						});
						
						options.series.push(series);
 
					}
					
				});
				
				var chart = new Highcharts.Chart(options);
			});

			

			
		});
		</script> 
		

<div id="highcharts" style="width: 100%; height: 400px"></div>
 
    
 

Triangle Research Libraries Network  CB#3940 Wilson Library, Suite 712 Chapel Hill, NC 27514-8890
Phone: (919) 962-8022  Fax: (919) 962-4452

Page maintained by Joyce Chapman
last updated 01/30/14 03:20