Apache Tika

January 21, 2015 - Book notes / Directed Research

Introduction

The purpose of this post is to provide notes and useful tips on Apache Tika. Much of the content comes from Chris Mattmann and Jukka Zitting’s Tika in Action, so the credit really goes to them and their book.

Notes

Introduction to Tika

Tika strives to offer the necessary functionality required for dealing with the heterogeneity of modern information content. [1]

Tika uses a combination of methods to resolve the identity of unknown files, including:

  • File extension or filename globs (e.g. *.pdf)
  • Magic bytes (e.g. bytes in the file header that yield a unique media type signature)
  • XML root elements and namespaces (e.g. an *.xml file whose root element or namespace hints at RSS)
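
To make the first two methods concrete, here is a minimal sketch using the Tika facade. The object name DetectionDemo and the sample inputs are made up for illustration; it assumes tika-core is on the classpath.

```scala
import org.apache.tika.Tika
import java.nio.charset.StandardCharsets

// Hypothetical demo object; only the Tika facade calls come from the library.
object DetectionDemo {
  private val tika = new Tika()

  // Name only: the type is resolved through filename globs such as *.pdf.
  def byName(name: String): String = tika.detect(name)

  // Content only: the type is resolved through magic bytes at the start of the data.
  def byContent(prefix: Array[Byte]): String = tika.detect(prefix)

  def main(args: Array[String]): Unit = {
    println(byName("report.pdf"))
    println(byContent("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)))
  }
}
```

Both calls should report application/pdf, the first because of the extension and the second because of the %PDF- signature.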

For a full list of supported media types, check out: http://tika.apache.org/1.7/formats.html

A typical Tika workflow is as follows:

  1. Detect file media (MIME) types
  2. Use parsers to extract textual and metadata content
  3. Process content using tools like language identification

What are good example use cases for Tika? [2]

  • Search engines (e.g. crawl and extract information from many types of web content)
  • Document analysis (e.g. you already have an analysis program dealing with file type A, Tika helps extend that to file type B and beyond)
  • File catalogs (e.g. classifying and organizing a host of heterogeneous files)

Tika Architecture

The Tika facade (i.e. org.apache.tika.Tika) aims to hide the complexity of the underlying Tika library and to provide a single mechanism for accessing MIME detection, the parsing interfaces, and language detection. [3]

There are four main internal components of Tika:

  • Tika-Core: contains the main functionality of Tika, including the Tika facade, MIME detection, the parsing interfaces, language detection, etc.
  • Tika-Parsers: a collection of parser wrapper packages that wrap, but do not themselves implement, external parsing libraries
  • Tika-App: the component providing the Tika GUI and command-line tools
  • Tika-Bundle: a bundled export of Tika aimed at OSGi (a modular component framework for Java) deployments and integration. Useful for integrating Tika into other OSGi-compliant software as a versioned bundle.

Document Type Detection [4]

The generic type application/octet-stream is used as a fallback for any document whose exact type is unknown (the document can only be processed as a stream of bytes). [5]

The Internet Assigned Numbers Authority (IANA) maintains a list of officially registered media types: http://www.iana.org/assignments/media-types/media-types.xhtml

Good place to look for information on unknown media or MIME types: http://file-extension.net/seeker/

To add custom media types that Tika doesn’t already know about, append glob information to the tika-mimetypes.xml file within tika-core.

Magic byte patterns identify the file type in many cases; a pattern is usually specified as a short ASCII or hex string expected at the beginning of the file.
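
The idea is simple enough to sketch without Tika at all. The snippet below is only an illustration of prefix matching, not how Tika implements it; MagicSketch and sniff are made-up names, and the three signatures are the well-known ones for PDF, PNG, and ZIP.

```scala
// Illustration only: a tiny magic-byte matcher in plain Scala.
object MagicSketch {
  // A few well-known signatures: media type -> leading bytes.
  private val signatures: Seq[(String, Array[Byte])] = Seq(
    "application/pdf" -> "%PDF-".getBytes("US-ASCII"),
    "image/png"       -> Array(0x89, 0x50, 0x4E, 0x47).map(_.toByte),
    "application/zip" -> Array(0x50, 0x4B, 0x03, 0x04).map(_.toByte)
  )

  // Returns the first media type whose signature is a prefix of the data,
  // or the generic fallback type when nothing matches.
  def sniff(data: Array[Byte]): String =
    signatures
      .collectFirst { case (mediaType, magic) if data.startsWith(magic) => mediaType }
      .getOrElse("application/octet-stream")
}
```

Tika’s real patterns can also specify offsets and masks, which this sketch leaves out.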

Use Tika.detect to detect the media type of a file, then use MediaType.parse to extract useful information from the resulting string. [6]

import java.io.File
import org.apache.tika.Tika
import org.apache.tika.mime.MediaType

// Create the facade, detect the type as a string, then parse it for inspection
val tika = new Tika()
val typeStr = tika.detect(new File("..."))
val mediaType = MediaType.parse(typeStr)
println(mediaType.getSubtype)
println(mediaType.getParameters)

 

Content Extraction [7]

Use new Tika().parse for incremental parsing, which supports files of arbitrary size, and new Tika().parseToString for smaller files whose text fits in memory. [8]
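
A minimal sketch of the two styles side by side. ExtractionDemo and its method names are made up; it assumes tika-core plus tika-parsers on the classpath, since without the parsers the facade has nothing to extract text with.

```scala
import org.apache.tika.Tika
import java.io.{BufferedReader, File}

// Hypothetical helper object contrasting the two facade methods.
object ExtractionDemo {
  private val tika = new Tika()

  // Small file: the whole extracted text comes back as one String.
  def wholeText(file: File): String = tika.parseToString(file)

  // Large file: the text is pulled from a Reader line by line,
  // so only a small buffer is held in memory at any time.
  def countLines(file: File): Int = {
    val reader = new BufferedReader(tika.parse(file))
    try Iterator.continually(reader.readLine()).takeWhile(_ != null).size
    finally reader.close()
  }
}
```

Keep in mind that parseToString truncates its result at a configurable maximum length, so it is not a drop-in replacement for parse on big documents.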

Consider feeding the parsed content that Tika yields into an in-memory LuceneIndexer to make that data searchable.

For more fine-grained control, use CompositeParser to combine the parsing functionality of multiple Tika parsers; this is a good alternative to a chain of if/else statements that selects a parser for each exact file type. [9]

Use AutoDetectParser to automatically detect the type and pick the right parser from the set of all parsers known to Tika.

Use a try-finally block whenever you call parser.parse so that the input stream Tika uses for content extraction is always closed.
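
Putting the last two tips together, a minimal sketch might look like this. AutoDetectDemo and extract are made-up names; the Tika classes are real, and tika-parsers is assumed to be on the classpath.

```scala
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.apache.tika.sax.BodyContentHandler
import java.io.{FileInputStream, InputStream}

// Hypothetical helper: extract body text from any file AutoDetectParser understands.
object AutoDetectDemo {
  def extract(path: String): String = {
    val parser  = new AutoDetectParser()
    val handler = new BodyContentHandler() // collects the body text in memory
    val stream: InputStream = new FileInputStream(path)
    try {
      parser.parse(stream, handler, new Metadata(), new ParseContext())
      handler.toString
    } finally {
      stream.close() // runs even when parsing throws
    }
  }
}
```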

When some of your parsers need random access to a file while others handle a stream just fine, use TikaInputStream. This class extends Java’s InputStream to offer both file access (via getFile()) and ordinary stream access. The Tika facade uses TikaInputStream automatically.
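
Here is a small sketch of the file-access side. TikaStreamDemo and spooledLength are made-up names; the assumption is that wrapping a plain stream and calling getFile() spools the data to a temporary file, which is how I understand the class to work.

```scala
import org.apache.tika.io.TikaInputStream
import java.io.ByteArrayInputStream

// Hypothetical helper: wrap an in-memory stream and ask for a File view of it.
object TikaStreamDemo {
  def spooledLength(bytes: Array[Byte]): Long = {
    val stream = TikaInputStream.get(new ByteArrayInputStream(bytes))
    try stream.getFile.length() // the stream content is written to a temporary file on demand
    finally stream.close()      // closing should also clean up the temporary file
  }
}
```

A parser that needs random access can call getFile() the same way, while stream-friendly parsers just read from the object as a normal InputStream.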

Use Tika’s SAX XHTML event API to handle extracted text from a document step-by-step, specifically: BodyContentHandler, LinkContentHandler, TeeContentHandler
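
As a rough sketch of how these three handlers fit together: TeeContentHandler fans the same SAX events out to a BodyContentHandler (text) and a LinkContentHandler (links). HandlerDemo and textAndLinks are made-up names, tika-parsers is assumed to be available for the HTML parser, and the collection import assumes Scala 2.13 or later.

```scala
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.apache.tika.sax.{BodyContentHandler, LinkContentHandler, TeeContentHandler}
import java.io.ByteArrayInputStream
import scala.jdk.CollectionConverters._

// Hypothetical helper: one parse pass that yields both text and link targets.
object HandlerDemo {
  def textAndLinks(html: String): (String, Seq[String]) = {
    val body  = new BodyContentHandler()
    val links = new LinkContentHandler()
    val tee   = new TeeContentHandler(body, links) // forwards every event to both handlers
    val stream = new ByteArrayInputStream(html.getBytes("UTF-8"))
    try {
      new AutoDetectParser().parse(stream, tee, new Metadata(), new ParseContext())
    } finally {
      stream.close()
    }
    (body.toString, links.getLinks.asScala.map(_.getUri).toSeq)
  }
}
```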

When processing Microsoft Excel documents, consider passing specific Locale information to the parser so that dates and times within the file are interpreted properly.
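
One way to hand the Locale over is through the ParseContext. The sketch below only builds the context; LocaleContext and forLocale are made-up names, and it assumes the Excel parser looks the Locale up in the context it is given.

```scala
import org.apache.tika.parser.ParseContext
import java.util.Locale

// Hypothetical helper: a ParseContext carrying the Locale to use for dates and times.
object LocaleContext {
  def forLocale(locale: Locale): ParseContext = {
    val context = new ParseContext()
    context.set(classOf[Locale], locale) // parsers can look this up by class
    context
  }
}
```

The resulting context would then go in as the last argument of parser.parse, for example LocaleContext.forLocale(Locale.UK).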

Consider using ParserDecorator when you need customized low-level parsing of files

Understanding Metadata [10]

To get a quick sense of the metadata properties Tika extracts from different file types, try:

java -jar tika-app-*.jar --list-met-models

Metadata quality is of prime importance, especially when correlating metadata across files of different types, which most often follow different metadata models. [11]

To extract specific metadata from files, use the tika.parse(InputStream, Metadata) method and pass in a reference to your Metadata object.
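
A minimal sketch of that call. MetadataDemo and metadataFor are made-up names; the one subtlety I am assuming here is that the returned Reader should be drained before the Metadata object is inspected, since parsing happens as the reader is consumed.

```scala
import org.apache.tika.Tika
import org.apache.tika.metadata.Metadata
import java.io.{ByteArrayInputStream, Reader}

// Hypothetical helper: collect all metadata Tika finds into a Scala Map.
object MetadataDemo {
  def metadataFor(bytes: Array[Byte]): Map[String, String] = {
    val metadata = new Metadata()
    val reader: Reader = new Tika().parse(new ByteArrayInputStream(bytes), metadata)
    try {
      // Drain the reader so parsing runs to completion before metadata is read.
      while (reader.read() != -1) {}
    } finally {
      reader.close()
    }
    metadata.names().map(name => name -> metadata.get(name)).toMap
  }
}
```

For plain text input the map should at least contain a Content-Type entry; richer formats add properties such as author or title, depending on the parser.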

Scientific File Support

As of Tika 1.7, only metadata, not content, can be extracted from HDF5 and netCDF files. [12]

Tika can parse content from HDF5 files, but not through streaming or random-access extraction, owing to the nature of the HDF5 extraction libraries. [13]

Tika’s use of InputStream abstracts away the actual storage mechanism, so files on distributed file systems (e.g. HDFS), URLs, and local files are all handled the same way. [14]

Extending Tika [15]

To easily enable Tika to detect a new custom file type, create a new media type definition file in the tika-mimetypes.xml format and load it into the Tika facade as the detector, as follows: [16]

import org.apache.tika.Tika
import org.apache.tika.mime.MimeTypesFactory
val typeDatabase = MimeTypesFactory.create(new java.net.URL(CUSTOM_TIKA_MIME_TYPES_FILE_PATH))
val tika = new Tika(typeDatabase)

Use the Tika Detector interface to fully customize the detection of a completely new media type
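
For a feel of what implementing Detector involves, here is a minimal sketch. The format, its MYFMT marker, the application/x-myformat type, and the class name MyFormatDetector are all invented for illustration; only the Detector interface and MediaType come from Tika.

```scala
import org.apache.tika.detect.Detector
import org.apache.tika.metadata.Metadata
import org.apache.tika.mime.MediaType
import java.io.InputStream

// Hypothetical detector: flags data that starts with the made-up marker "MYFMT".
class MyFormatDetector extends Detector {
  private val magic = "MYFMT".getBytes("US-ASCII")

  override def detect(input: InputStream, metadata: Metadata): MediaType = {
    if (input == null) return MediaType.OCTET_STREAM
    input.mark(magic.length) // the stream must support mark/reset
    try {
      val buffer = new Array[Byte](magic.length)
      // Simplification: a single read may return fewer bytes on some streams.
      val read = input.read(buffer)
      if (read == magic.length && buffer.sameElements(magic)) MediaType.application("x-myformat")
      else MediaType.OCTET_STREAM
    } finally {
      input.reset() // leave the stream where we found it for later consumers
    }
  }
}
```

The detector can then be handed to the facade with new Tika(new MyFormatDetector()), the same constructor used for the custom type database.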

To customize the parsing of an existing Tika parser with slightly more functionality, consider extending the specific Tika parser

When creating a brand new Tika parser from scratch, extend the AbstractParser class of Tika and implement the parse and getSupportedTypes methods
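
A bare-bones sketch of such a parser follows. The application/x-myformat type, the class name MyFormatParser, and the fixed output text are invented for illustration; a real parser would read the stream and emit its actual content.

```scala
import org.apache.tika.metadata.Metadata
import org.apache.tika.mime.MediaType
import org.apache.tika.parser.{AbstractParser, ParseContext}
import org.apache.tika.sax.XHTMLContentHandler
import org.xml.sax.ContentHandler
import java.io.InputStream
import java.util.Collections

// Hypothetical parser for the made-up type application/x-myformat.
class MyFormatParser extends AbstractParser {
  private val supported = MediaType.application("x-myformat")

  // Tells Tika which media types this parser should be chosen for.
  override def getSupportedTypes(context: ParseContext): java.util.Set[MediaType] =
    Collections.singleton(supported)

  // Emits the document as XHTML SAX events; here just one fixed paragraph.
  override def parse(stream: InputStream, handler: ContentHandler,
                     metadata: Metadata, context: ParseContext): Unit = {
    metadata.set("Content-Type", supported.toString)
    val xhtml = new XHTMLContentHandler(handler, metadata)
    xhtml.startDocument()
    xhtml.element("p", "Hello from MyFormatParser")
    xhtml.endDocument()
  }
}
```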

To plug in a new or customized parser: [17]

  1. Place the parser within Tika’s JAR file
  2. List the fully qualified names of your parser classes within the file META-INF/services/org.apache.tika.parser.Parser

To ensure a new parser is called first, or in a specific order relative to other parsers, combine AutoDetectParser and ParserDecorator as follows: [18]

val defaultParser = new AutoDetectParser()
val parser = new AutoDetectParser(
  defaultParser.getDetector(),
  ParserDecorator.withTypes(new MyCustomParser(), ...))

 

 

References

1. Mattmann/Zitting, 1.1
2. Mattmann/Zitting, 1.2.3
3. Mattmann/Zitting, 2.3.1
4. Mattmann/Zitting, 4
5. Mattmann/Zitting, 4.1.2
6. Mattmann/Zitting, 4.4
7. Mattmann/Zitting, 5
8. Mattmann/Zitting, 5.2, 5.3
9. Mattmann/Zitting, 5.2.4
10. Mattmann/Zitting, 6
11. Mattmann/Zitting, 6.2
12. http://tika.apache.org/1.7/formats.html#Scientific_formats
13. Mattmann/Zitting, 8.1.1
14. Mattmann/Zitting, 8.1.1
15. Mattmann/Zitting, 11
16. Mattmann/Zitting, 11.1
17. Mattmann/Zitting, 11.3.3
18. Mattmann/Zitting, 11.3.4
