Processors - Directory Scanner

Introduction

The purpose of the Directory Scanner processor is to analyse a directory structure in a filesystem and to produce an XML document containing metadata about the files, such as name and size. It is possible to specify which files and directories to include and exclude in the scanning process. The Directory Scanner is also able to optionally retrieve image metadata.

Inputs and outputs

Type | Name | Purpose | Mandatory
Input | config | Configuration | Yes
Output | data | Result XML data | Yes

The Directory Scanner is typically called this way from XPL pipelines:

<p:processor name="oxf:directory-scanner">
<!-- The configuration can often be inline -->
<p:input name="config">
...
</p:input>
<p:output name="data" id="directory-scan"/>
</p:processor>
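
For context, a complete pipeline wrapping this call might look like the following sketch. It assumes the usual XPL namespace declarations and exposes the scan result as the pipeline's own data output. The base directory and include pattern are placeholder values chosen for illustration only.

```xml
<p:config xmlns:p="http://www.orbeon.com/oxf/pipeline"
          xmlns:oxf="http://www.orbeon.com/oxf/processors">

    <!-- Expose the scan result as the pipeline's "data" output -->
    <p:param type="output" name="data"/>

    <p:processor name="oxf:directory-scanner">
        <p:input name="config">
            <config>
                <!-- Placeholder values: scan from the directory containing this XPL file -->
                <base-directory>.</base-directory>
                <include>**/*.xml</include>
            </config>
        </p:input>
        <!-- ref (rather than id) connects the processor output to the pipeline output -->
        <p:output name="data" ref="data"/>
    </p:processor>

</p:config>
```

Using id instead of ref on the output, as in the snippet above, would make the result available to later processors in the same pipeline rather than returning it to the caller.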

Configuration

The config input configuration has the following format:

<config>
<base-directory>file:/</base-directory>
<include>**/*.x?l</include>
<include>**/*.xhtml</include>
<include>**/*.java</include>
<exclude>example-descriptor.xml</exclude>
<case-sensitive>false</case-sensitive>
</config>
Element | Purpose | Format | Default
base-directory | Directory under which files and directories are scanned, referred to below as the search directory. | A file: or oxf: URL. The URL may be relative to the location of the containing XPL file. | None.
include | Specifies which files are included. | Apache Ant pattern. | None.
exclude | Specifies which files are excluded. | Apache Ant pattern. | None.
case-sensitive | Whether include and exclude patterns are case-sensitive. | true or false. | true
default-excludes | Whether a set of default exclusion rules must be automatically loaded (see the list below). | true or false. | false
image-metadata/basic-info | Whether basic image metadata must be extracted. | true or false. | false
image-metadata/exif-info | Whether Exif image metadata must be extracted. | true or false. | false
image-metadata/iptc-info | Whether IPTC image metadata must be extracted. | true or false. | false

NOTE: For base-directory, the oxf: protocol works only with resource managers that allow access to the actual path of the file. These include the Filesystem and WebApp resource managers.

The default exclusion rules loaded by default-excludes are as follows:

  • Miscellaneous typical temporary files
    • **/*~
    • **/#*#
    • **/.#*
    • **/%*%
    • **/._*
  • CVS
    • **/CVS
    • **/CVS/**
    • **/.cvsignore
  • SCCS
    • **/SCCS
    • **/SCCS/**
  • Visual SourceSafe
    • **/vssver.scc
  • Subversion
    • **/.svn
    • **/.svn/**
  • Mac
    • **/.DS_Store
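
As an illustration of the less common options, the following configuration is a sketch for scanning a photo directory. It assumes that the image-metadata/basic-info notation denotes a basic-info child of an image-metadata element, and the base directory is a hypothetical path.

```xml
<config>
    <!-- Hypothetical search directory -->
    <base-directory>file:/home/jdoe/photos</base-directory>
    <include>**/*.jpg</include>
    <include>**/*.png</include>
    <!-- Also match IMAGE.JPG, Image.Png, etc. -->
    <case-sensitive>false</case-sensitive>
    <!-- Skip temporary files, CVS and .svn directories, .DS_Store, etc. -->
    <default-excludes>true</default-excludes>
    <!-- Assumed nesting for the image-metadata/* options -->
    <image-metadata>
        <basic-info>true</basic-info>
        <exif-info>true</exif-info>
    </image-metadata>
</config>
```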

Output format

Basic output

The output starts with a root directory element with name and path attributes. The name attribute specifies the name of the search directory, e.g. web. The path attribute specifies an absolute path to that directory.

The root element then contains a hierarchical structure of directory and file elements found. For example:

<directory name="address-book" path="c:\Documents and Settings\John Doe\OPS\src\examples\web\examples\address-book">
<directory name="initialization" path="initialization">
<file last-modified-ms="1101487772375" last-modified-date="2004-11-26T17:49:32.375" size="1250" path="initialization\init-database.xpl" name="init-database.xpl"/>
<file last-modified-ms="1101512191718" last-modified-date="2004-11-27T00:36:31.718" size="2410" path="initialization\init-script.xpl" name="init-script.xpl"/>
</directory>
<file last-modified-ms="1101488200406" last-modified-date="2004-11-26T17:56:40.406" size="5618" path="model.xpl" name="model.xpl"/>
<file last-modified-ms="1101484041437" last-modified-date="2004-11-26T16:47:21.437" size="941" path="page-flow.xml" name="page-flow.xml"/>
<file last-modified-ms="1121104181591" last-modified-date="2005-07-11T19:49:41.591" size="3165" path="view.xsl" name="view.xsl"/>
<file last-modified-ms="1093118707000" last-modified-date="2004-08-21T22:05:07.000" size="934" path="xforms-model.xml" name="xforms-model.xml"/>
</directory>

directory elements contain basic information about a matched directory:

Name Value
path Path to the directory, relative to the parent directory. Includes the current directory name.
name Local directory name.

NOTE: The path attribute on the root element is an absolute path from a filesystem root. The path attributes on child directory elements are relative to their parent directory element.

file elements contain basic information about a matched file:

Name Value
last-modified-ms Timestamp of last modification in milliseconds.
last-modified-date Timestamp of last modification in XML xs:dateTime format.
size Size of the file in bytes.
path Path to the file, relative to the parent directory. Includes the file name.
name Local file name.
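
Because the output is ordinary XML, it can be post-processed with any XML tool. The following XSLT 1.0 stylesheet is a minimal sketch, not part of the processor, that summarizes a scan by counting the matched files and adding up their sizes. The summary, file-count and total-bytes element names are made up for this example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <!-- Match the root directory element produced by the Directory Scanner -->
    <xsl:template match="/directory">
        <summary directory="{@name}">
            <!-- Count file elements at any depth below the root -->
            <file-count><xsl:value-of select="count(.//file)"/></file-count>
            <!-- Add up the size attributes (in bytes) of all matched files -->
            <total-bytes><xsl:value-of select="sum(.//file/@size)"/></total-bytes>
        </summary>
    </xsl:template>
</xsl:stylesheet>
```

Applied to the address-book example above, this reports 6 files and a total of 14318 bytes.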

Image metadata

When the configuration's image-metadata element is specified, metadata about images is extracted.

NOTE: Images are identified by reading the beginning of the files. This means that extracting image metadata is usually more expensive in time than just producing regular file metadata.

When an image is identified, an image-metadata element is available under the corresponding file element.

When image-metadata/basic-info is true in the configuration, a basic-info element is created under image-metadata:

Element Name Element Value
content-type Media type of the file: image/jpeg, image/gif, image/png. Other image/* values may be produced for other image formats.
width Image width, if found.
height Image height, if found.
comment Image comment, if found (JPEG only).

When image-metadata/exif-info is true in the configuration, zero or more exif-info elements are created under image-metadata. Each element has a name attribute containing the name of the category of Exif information. Basic Exif information has the name Exif. Other names may include Canon Makernote for a Canon camera, Interoperability, etc. Each exif-info element contains zero or more param elements, with the following sub-elements:

Element Name Element Value
id The Exif parameter id. For example, 271 denotes the make of the camera.
name A default English name for the given parameter id, when known, for example Make.
value The value of the parameter, for example Canon.

This is an example of a file element with image metadata:

<file last-modified-ms="1120343217984" last-modified-date="2005-07-03T00:26:57.984" size="961130" path="image0001.jpg" name="image0001.jpg">
<image-metadata>
<basic-info>
<content-type>image/jpeg</content-type>
<width>2272</width>
<height>1704</height>
</basic-info>
<exif-info name="Exif">
<param>
<id>271</id>
<name>Make</name>
<value>Canon</value>
</param>
<param>
<id>272</id>
<name>Model</name>
<value>Canon PowerShot S40</value>
</param>
...
</exif-info>
...
</image-metadata>
</file>

When image-metadata/iptc-info is true in the configuration, zero or more iptc-info elements are created under image-metadata. Each element has an attribute containing the name of the category of IPTC information. The child elements of iptc-info are the same as for exif-info.
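
To show how this structure can be queried, here is a small XSLT 1.0 sketch, offered as an illustration rather than as part of the processor. It prints one text line per image with its dimensions and camera make. It assumes that both basic-info and exif-info were enabled in the configuration; where a value is missing, the corresponding part of the line is simply empty.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- Visit every file element that carries image metadata -->
        <xsl:for-each select="//file[image-metadata]">
            <xsl:value-of select="@name"/>
            <xsl:text>: </xsl:text>
            <xsl:value-of select="image-metadata/basic-info/width"/>
            <xsl:text>x</xsl:text>
            <xsl:value-of select="image-metadata/basic-info/height"/>
            <xsl:text>, make: </xsl:text>
            <!-- Look up the Exif parameter by its default English name -->
            <xsl:value-of select="image-metadata/exif-info[@name = 'Exif']/param[name = 'Make']/value"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

For the image0001.jpg example above, this produces the line "image0001.jpg: 2272x1704, make: Canon". Selecting by id (param[id = '271']) instead of by name would avoid relying on the English parameter names.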

Other metadata

The Directory Scanner does not currently provide metadata about other kinds of files. The processor could be extended to support more metadata, both for additional image formats and for other file formats such as sound files.

Ant patterns

NOTE: This section of the documentation is reproduced from a section of the Apache Ant Manual, with minor adjustments.

Patterns are used for the inclusion and exclusion of files. These patterns look very much like the patterns used in DOS and UNIX:

'*' matches zero or more characters, '?' matches one character.

In general, patterns are considered relative paths, relative to a task-dependent base directory (the dir attribute in the case of Ant's <fileset>; for the Directory Scanner, the search directory given by base-directory). Only files found below that base directory are considered. So while a pattern like ../foo.java is possible, it will not match anything when applied, since the base directory's parent is never scanned for files.

Examples:

*.java matches .java, x.java and FooBar.java, but not FooBar.xml (does not end with .java).

?.java matches x.java and A.java, but not .java or xyz.java (neither has exactly one character before .java).

Combinations of *'s and ?'s are allowed.

Matching is done per directory. The first directory in the pattern is matched against the first directory in the path to match, then the second directory is matched, and so on. For example, when we have the pattern /?abc/*/*.java and the path /xabc/foobar/test.java, the first ?abc is matched with xabc, then * is matched with foobar, and finally *.java is matched with test.java. They all match, so the path matches the pattern.

To make things a bit more flexible, we add one extra feature, which makes it possible to match multiple directory levels. This can be used to match a complete directory tree, or a file anywhere in the directory tree. To do this, ** must be used as the name of a directory. When ** is used as the name of a directory in the pattern, it matches zero or more directories. For example: /test/** matches all files/directories under /test/, such as /test/x.java, or /test/foo/bar/xyz.html, but not /xyz.xml.

There is one "shorthand" - if a pattern ends with / or \, then ** is appended. For example, mypackage/test/ is interpreted as if it were mypackage/test/**.

Example patterns:

**/CVS/*
Matches all files in CVS directories that can be located anywhere in the directory tree.
Matches:
    CVS/Repository
    org/apache/CVS/Entries
    org/apache/jakarta/tools/ant/CVS/Entries
But not:
    org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match)

org/apache/jakarta/**
Matches all files in the org/apache/jakarta directory tree.
Matches:
    org/apache/jakarta/tools/ant/docs/index.html
    org/apache/jakarta/test.xml
But not:
    org/apache/xyz.java (jakarta/ part is missing)

org/apache/**/CVS/*
Matches all files in CVS directories that are located anywhere in the directory tree under org/apache.
Matches:
    org/apache/CVS/Entries
    org/apache/jakarta/tools/ant/CVS/Entries
But not:
    org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match)

**/test/**
Matches all files that have a test element in their path, including test as a filename.

When these patterns are used in inclusion and exclusion, you have a powerful way to select just the files you want.
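
As a sketch of how inclusion and exclusion combine in a Directory Scanner configuration, the following example selects Java sources under org/apache while skipping test code and CVS directories. The base directory is a hypothetical path, and the example assumes the usual Ant behaviour in which a file matching both an include and an exclude pattern is excluded.

```xml
<config>
    <!-- Hypothetical search directory -->
    <base-directory>file:/home/jdoe/project/src</base-directory>
    <!-- All .java files under org/apache, at any depth (** also matches zero directories) -->
    <include>org/apache/**/*.java</include>
    <!-- Skip anything with a test element in its path -->
    <exclude>**/test/**</exclude>
    <!-- Skip CVS directories and their contents -->
    <exclude>**/CVS</exclude>
    <exclude>**/CVS/**</exclude>
</config>
```

Under these assumptions, org/apache/Foo.java and org/apache/jakarta/tools/Bar.java would be reported, while org/apache/jakarta/test/BarTest.java and org/xyz/Baz.java would not.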
