Tika in Action
By Jukka L. Zitting and Chris Mattmann
About this ebook
Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
About the Technology
Tika is an Apache toolkit that has built into it everything you and your app need to know about file formats. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.
About this Book
Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.
This book is written for developers who are new to both Tika and text mining, and covers just enough background to get you started.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
What's Inside
- Crack MS Word, PDF, HTML, and ZIP
- Integrate with search engines, CMS, and other data sources
- Learn through experimentation
- Many examples
This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.
Table of Contents
PART 1 GETTING STARTED
- The case for the digital Babel fish
- Getting started with Tika
- The information landscape
PART 2 TIKA IN DETAIL
- Document type detection
- Content extraction
- Understanding metadata
- Language detection
- What's in a file?
PART 3 INTEGRATION AND ADVANCED USE
- The big picture
- Tika and the Lucene search stack
- Extending Tika
PART 4 CASE STUDIES
- Powering NASA science data systems
- Content management with Apache Jackrabbit
- Curating cancer research data with Tika
- The classic search engine example
Jukka L. Zitting
Jukka Zitting is a core Tika developer with over a decade of experience in open source content management. Jukka works as a Senior Developer for the Swiss content management company Day Software, and is a member of the JCP expert group for the Content Repository for Java Technology API. He is a member of the Apache Software Foundation and the chairman of the Apache Jackrabbit project.
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email:
©2012 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 16 15 14 13 12 11
Dedication
To my lovely wife Lisa and my son Christian
CM
To my lovely wife Kirsi-Marja and our happy cats
JZ
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. Getting started
Chapter 1. The case for the digital Babel fish
Chapter 2. Getting started with Tika
Chapter 3. The information landscape
2. Tika in detail
Chapter 4. Document type detection
Chapter 5. Content extraction
Chapter 6. Understanding metadata
Chapter 7. Language detection
Chapter 8. What’s in a file?
3. Integration and advanced use
Chapter 9. The big picture
Chapter 10. Tika and the Lucene search stack
Chapter 11. Extending Tika
4. Case studies
Chapter 12. Powering NASA science data systems
Chapter 13. Content management with Apache Jackrabbit
Chapter 14. Curating cancer research data with Tika
Chapter 15. The classic search engine example
Appendix A. Tika quick reference
Appendix B. Supported metadata keys
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. Getting started
Chapter 1. The case for the digital Babel fish
1.1. Understanding digital documents
1.1.1. A taxonomy of file formats
1.1.2. Parser libraries
1.1.3. Structured text as the universal language
1.1.4. Universal metadata
1.1.5. The program that understands everything
1.2. What is Apache Tika?
1.2.1. A bit of history
1.2.2. Key design goals
1.2.3. When and where to use Tika
1.3. Summary
Chapter 2. Getting started with Tika
2.1. Working with Tika source code
2.1.1. Getting the source code
2.1.2. The Maven build
2.1.3. Including Tika in Ant projects
2.2. The Tika application
2.2.1. Drag-and-drop text extraction: the Tika GUI
2.2.2. Tika on the command line
2.3. Tika as an embedded library
2.3.1. Using the Tika facade
2.3.2. Managing dependencies
2.4. Summary
Chapter 3. The information landscape
3.1. Measuring information overload
3.1.1. Scale and growth
3.1.2. Complexity
3.2. I’m feeling lucky—searching the information landscape
3.2.1. Just click it: the modern search engine
3.2.2. Tika’s role in search
3.3. Beyond lucky: machine learning
3.3.1. Your likes and dislikes
3.3.2. Real-world machine learning
3.4. Summary
2. Tika in detail
Chapter 4. Document type detection
4.1. Internet media types
4.1.1. The parlance of media type names
4.1.2. Categories of media types
4.1.3. IANA and other type registries
4.2. Media types in Tika
4.2.1. The shared MIME-info database
4.2.2. The MediaType class
4.2.3. The MediaTypeRegistry class
4.2.4. Type hierarchies
4.3. File format diagnostics
4.3.1. Filename globs
4.3.2. Content type hints
4.3.3. Magic bytes
4.3.4. Character encodings
4.3.5. Other mechanisms
4.4. Tika, the type inspector
4.5. Summary
Chapter 5. Content extraction
5.1. Full-text extraction
5.1.1. Abstracting the parsing process
5.1.2. Full-text indexing
5.1.3. Incremental parsing
5.2. The Parser interface
5.2.1. Who knew parsing could be so easy?
5.2.2. The parse() method
5.2.3. Parser implementations
5.2.4. Parser selection
5.3. Document input stream
5.3.1. Standardizing input to Tika
5.3.2. The TikaInputStream class
5.4. Structured XHTML output
5.4.1. Semantic structure of text
5.4.2. Structured output via SAX events
5.4.3. Marking up structure with XHTML
5.5. Context-sensitive parsing
5.5.1. Environment settings
5.5.2. Custom document handling
5.6. Summary
Chapter 6. Understanding metadata
6.1. The standards of metadata
6.1.1. Metadata models
6.1.2. General metadata standards
6.1.3. Content-specific metadata standards
6.2. Metadata quality
6.2.1. Challenges/Problems
6.2.2. Unifying heterogeneous standards
6.3. Metadata in Tika
6.3.1. Keys and multiple values
6.3.2. Transformations and views
6.4. Practical uses of metadata
6.4.1. Common metadata for the Lucene indexer
6.4.2. Give me my metadata in my schema!
6.5. Summary
Chapter 7. Language detection
7.1. The most translated document in the world
7.2. Sounds Greek to me—theory of language detection
7.2.1. Language profiles
7.2.2. Profiling algorithms
7.2.3. The N-gram algorithm
7.2.4. Advanced profiling algorithms
7.3. Language detection in Tika
7.3.1. Incremental language detection
7.3.2. Putting it all together
7.4. Summary
Chapter 8. What’s in a file?
8.1. Types of content
8.1.1. HDF: a format for scientific data
8.1.2. Really Simple Syndication: a format for rapidly changing content
8.2. How Tika extracts content
8.2.1. Organization of content
8.2.2. File header and naming conventions
8.2.3. Storage affects extraction
8.3. Summary
3. Integration and advanced use
Chapter 9. The big picture
9.1. Tika in search engines
9.1.1. The search use case
9.1.2. The anatomy of a search index
9.2. Managing and mining information
9.2.1. Document management systems
9.2.2. Text mining
9.3. Buzzword compliance
9.3.1. Modularity, Spring, and OSGi
9.3.2. Large-scale computing
9.4. Summary
Chapter 10. Tika and the Lucene search stack
10.1. Load-bearing walls
10.1.1. ManifoldCF
10.1.2. Open Relevance
10.2. The steel frame
10.2.1. Lucene Core
10.2.2. Solr
10.3. The finishing touches
10.3.1. Nutch
10.3.2. Droids
10.3.3. Mahout
10.4. Summary
Chapter 11. Extending Tika
11.1. Adding type information
11.1.1. Custom media type configuration
11.2. Custom type detection
11.2.1. The Detector interface
11.2.2. Building a custom type detector
11.2.3. Plugging in new detectors
11.3. Customized parsing
11.3.1. Customizing existing parsers
11.3.2. Writing a new parser
11.3.3. Plugging in new parsers
11.3.4. Overriding existing parsers
11.4. Summary
4. Case studies
Chapter 12. Powering NASA science data systems
12.1. NASA’s Planetary Data System
12.1.1. PDS data model
12.1.2. The PDS search redesign
12.2. NASA’s Earth Science Enterprise
12.2.1. Leveraging Tika in NASA Earth Science SIPS
12.2.2. Using Tika within the ground data systems
12.3. Summary
Chapter 13. Content management with Apache Jackrabbit
13.1. Introducing Apache Jackrabbit
13.2. The text extraction pool
13.3. Content-aware WebDAV
13.4. Summary
Chapter 14. Curating cancer research data with Tika
14.1. The NCI Early Detection Research Network
14.1.1. The EDRN data model
14.1.2. Scientific data curation
14.2. Integrating Tika
14.2.1. Metadata extraction
14.2.2. MIME type identification and classification
14.3. Summary
Chapter 15. The classic search engine example
15.1. The Public Terabyte Dataset Project
15.2. The Bixo web crawler
15.2.1. Parsing fetched documents
15.2.2. Validating Tika’s charset detection
15.3. Summary
Appendix A. Tika quick reference
A.1. Tika facade
A.2. Command-line options
A.3. ContentHandler utilities
Appendix B. Supported metadata keys
B.1. Climate Forecast
B.2. Creative Commons
B.3. Dublin Core
B.4. Geographic metadata
B.5. HTTP headers
B.6. Microsoft Office
B.7. Message (email)
B.8. TIFF (Image)
Index
List of Figures
List of Tables
List of Listings
Foreword
I’m a big fan of search engines and Java, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene.
With my new toy on my laptop, I tested and tried to evaluate it. Even if Nutch was in its early stages, it was a promising project—exactly what I was looking for. I proposed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I became a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and document analysis. Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout. Lucene provides its own analysis tools, as do Nutch and Solr, and each one employs some proprietary interfaces to deal with analysis engines.
So I consulted with Chris Mattmann, another Nutch committer with whom I had worked, about the potential for refactoring all these disparate tools in a common and standardized project. The concept of Tika was born.
Chris began to advocate for Tika as a standalone project in 2006. Then Jukka Zitting came into the picture and took the lead on the Tika project; after a lot of refactoring and enhancements, Tika became a Lucene top-level project.
At that point in time, Tika was being used in Nutch, Droids (an Incubator project that you’ll hear about in chapter 10), and many non-Lucene projects—the activity on Tika mailing lists was indicative of this. The next promising steps for the project involved plugging Tika into top-level Lucene projects, such as Lucene itself or Solr. That amounted to a big challenge, as it required Tika to provide a flexible and robust set of interfaces that could be used in any programming context where metadata analysis was needed.
Luckily, Tika got there. With this book, written by Tika’s two main creators and maintainers, Chris and Jukka, you’ll understand the problems of document analysis and document information extraction. They first explain to the reader why developers have such a need for Tika. Today, content handling and analysis are basic building blocks of all major modern services: search engines, content management systems, data mining, and other areas.
If you’re a software developer, you’ve no doubt needed, on many occasions, to guess the encoding, formatting, and language of a file, and then to extract its metadata (title, author, and so on) and content. And you’ve probably noticed that this is a pain. That’s what Tika does for you. It provides a robust toolkit to easily handle any data format and to simplify this painful process.
Chris and Jukka explain many details and examples of the Tika API and toolkit, including the Tika command-line interface and its graphical user interface (GUI) that you can use to extract information about any type of file handled by Tika. They show how you can use the Tika Application Programming Interface (API) to integrate Tika's capabilities directly into your own projects. You'll discover that Tika is both simple to use and powerful. Tika has been carefully designed by Chris and Jukka and, despite the internal complexity of this type of library, Tika's API and tools are simple and easy to understand and to use.
Finally, Chris and Jukka show many real-life use cases of Tika. The most notable real-life projects are Tika powering the NASA Science Data Systems, Tika curating cancer research data at the National Cancer Institute's Early Detection Research Network, and the use of Tika for content management within the Apache Jackrabbit project. Tika is already used in many projects.
I’m proud to have helped launch Tika. And I’m extremely grateful to Chris and Jukka for bringing Tika to this level and knowing that the long nights I spent writing code for automatic language identification for the MIME type repository weren’t in vain. To now make (even) a small contribution, for example, to assist in research in the fight against cancer, goes straight to my heart.
Thank you both for all your work, and thank you for this book.
JÉRÔME CHARRON
Chief Technical Officer
Webpulse
Preface
While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for the students in the course to get real-world experience during the final project phase of his CS599: Seminar on Search Engines course.
After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.[¹] The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval.
¹https://issues.apache.org/jira/browse/NUTCH-30
Seemingly innocuous, the class taught me a great deal about search engines, and helped pinpoint the area of search I was interested in—content detection and extraction.
Fast forward to 2007: after I eventually became a Nutch committer, and focused in on more parsing-related issues (updates to the Nutch parser factory, metadata representation updates, and so on), my Nutch mentor Jérôme Charron and I decided that there was enough critical mass of code in Nutch related to parsing (parsing, language identification, extraction, and representation) that it warranted its own project. Other projects were doing it—rumblings of what would eventually become Hadoop were afoot—which led us to believe that the time was ripe for our own project. Since naming projects after children’s stuffed animals was popular at the time, we felt we could do the same, and Tika was born (named after Jérôme’s daughter’s stuffed animal).
It wasn’t as simple as we thought. After getting little interest from the broader Lucene community (Nutch was a Lucene subproject and thus the project we were proposing had to go through the Lucene PMC), and with Jérôme and me both taking on further responsibility that took time away from direct Nutch development, what would eventually be known as Tika began to fizzle away.
That’s where the other author of this book comes in. Jukka Zitting, bless him, was keenly interested in a technology, separate from the behemoth Nutch codebase, that would perform the types of things that we had carved off as Tika core capabilities: parsing, text extraction, metadata extraction, MIME detection, and more. Jukka was a seasoned Apache veteran, so he knew what to do. Jukka became a real leader of the original Tika proposal, took it to the Apache Incubator, and helped turn Tika into a real Apache project.
After working with Jukka for a year or so in the Incubator community, we took our show on the road back to Lucene as a subproject when Tika graduated. Over a period of two years, we made seven Tika releases, infected several popular Apache projects (including Lucene, Solr, Nutch, and Jackrabbit), and gained enough critical mass to grow into a full-fledged Apache Top Level Project (TLP).
But we weren’t done there. I don’t remember the exact time during the Christmas season in 2009 when I decided it was time to write a book, but it matters little. When I get an idea in my head, it’s hard to get it out. This book was happening. Tika in Action was happening. I approached Jukka and asked him how he felt. In characteristic fashion, he was up for the challenge.
We sure didn’t know what we were getting ourselves into! We didn’t know that the rabbit hole went this deep. That said, I can safely say I don’t think we could’ve taken any other path that would’ve been as fulfilling, exciting, and rewarding. We really put our hearts and souls into creating this book. We sincerely hope you enjoy it. I think I speak for both of us in saying, I know we did!
CHRIS MATTMANN
Acknowledgments
No book is born without great sacrifice by many people. The team who worked on this book means a lot to both of us. We’ll enumerate them here.
Together, we’d like to thank our development editor at Manning, Cynthia Kane, for spending tireless hours working with us to make this book the best possible, and the clearest book to date on Apache Tika. Furthermore, her help with simplifying difficult concepts, creating direct and meaningful illustrations, and with conveying complex information to the reader is something that both of us will leverage and use well beyond this book and into the future.
Of course, the entire team at Manning, from Marjan Bace on down, was a tremendous help in the book’s development and publication. We’d like to thank Nicholas Chase specifically for his help navigating the infrastructure and tools to put this book together. Christina Rudloff was a tremendous help in getting the initial book deal set up and we are very appreciative. The production team of Benjamin Berg, Katie Tennant, Dottie Marsico, and Mary Piergies worked hard to turn our manuscript into the book you are now reading, and Alex Ott did a thorough technical review of the final manuscript during production and helped clarify numerous code issues and details.
We’d also like to thank the following reviewers who went through three time-crunched review cycles and significantly improved the quality of this book with their thoughtful comments: Deepak Vohra, John Griffin, Dean Farrell, Ken Krugler, John Guthrie, Richard Johannesson, Andreas Kemkes, Julien Nioche, Rick Wagner, Andrew F. Hart, Nick Burch, and Sean Kelly.
Finally, we’d like to acknowledge and thank Ken Krugler and Chris Schneider of Bixo Labs, for contributing the bulk of chapter 15 and for showing us a real-world example of where Tika shines. Thanks, guys!
CHRIS—I would like to thank my wife Lisa for her tremendous support. I originally promised her that my PhD dissertation would be the last book that I wrote, and after four years of sleepless nights (and many sleepless nights before that trying to make ends meet), that I would make time to enjoy life and slow down. That worked for about two years, until this opportunity came along. Thanks for the support again, honey: I couldn’t have made it here without you. I can promise a few more years of slowdown now that the book is done!
JUKKA—I would like to thank my wife Kirsi-Marja for the encouragement to take on new challenges and for understanding the long evenings that meeting these challenges sometimes requires. Our two cats, Juuso and Nöpö, also deserve special thanks for their insistence on taking over the keyboard whenever a break from writing was needed.
About this Book
We wrote Tika in Action to be a hands-on guide for developers working with search engines, content management systems, and other similar applications who want to exploit the information locked in digital documents. The book introduces you to the world of mining text and binary documents and other information sources like internet media types and Dublin Core metadata. Then it shows where Tika fits within this landscape and how you can use Tika to build and extend applications. Case studies present real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
In addition to the architectural overviews, you will find more detailed information in the later chapters that focus on advanced features like XMP metadata processing, automatic language detection, and custom parser extensions. The book also describes common file formats like MS Word, PDF, HTML, and Zip, and open source libraries used to process files in these formats. The included code examples are designed to support hands-on experimentation.
No previous knowledge of Tika or text mining techniques is required. The book will be most valuable to readers with a working knowledge of Java.
Roadmap
Chapter 1 gives the reader a contextual overview of Tika, including its history, its core capabilities, and some basic use cases where Tika is most helpful. Tika includes abilities for file type identification, text extraction, integration of existing parsing libraries, and language identification.
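To give a concrete feel for those capabilities, here is a minimal sketch of the kind of usage the early chapters build toward, using the Tika facade class; the file name is only a placeholder.

import java.io.File;
import org.apache.tika.Tika;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("example.pdf");        // any document at hand
        System.out.println(tika.detect(file));      // detected type, e.g. application/pdf
        System.out.println(tika.parseToText(file)); // extracted plain text
    }
}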
Chapter 2 jumps right into using Tika, including instructions for downloading it, building it as a software library, and using Tika in a downstream Maven or Ant project. Quick tips for getting Tika up and running rapidly are present throughout the chapter.
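For readers who want to get going immediately, a typical Maven dependency declaration looks like the following; the version shown is only an example, so substitute the current Tika release. The tika-core artifact alone provides the interfaces and the facade, while tika-parsers pulls in the parser implementations and their dependencies.

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.0</version>
</dependency>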
Chapter 3 introduces the reader to the information landscape and identifies where and how information is fed into the Tika framework. The reader will be introduced to the principles of the World Wide Web (WWW), its architecture, and how the web and Tika synergistically complement one another.
Chapter 4 takes the reader on a deep dive into MIME type identification, covering topics ranging from the MIME hierarchy of the web, to identifying the unique byte pattern signatures present in many file formats, to other means (such as regular expressions and file extensions) of identifying files.
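As a preview of that material, the sketch below runs Tika's default detector against a file, giving it both the raw bytes and a file name hint; the file name is purely illustrative.

import java.io.File;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectionExample {
    public static void main(String[] args) throws Exception {
        Detector detector = TikaConfig.getDefaultConfig().getDetector();

        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, "report.pdf");  // file name hint

        InputStream stream = TikaInputStream.get(new File("report.pdf"));
        try {
            MediaType type = detector.detect(stream, metadata);  // magic bytes + hints
            System.out.println(type);                            // e.g. application/pdf
        } finally {
            stream.close();
        }
    }
}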
Chapter 5 introduces the reader to content extraction with Tika. It starts with a simple full-text extraction and indexing example using the Tika facade, and continues with a tour of the core Parser interface and how Tika uses it for content extraction. The reader will learn useful techniques for things such as extracting all links from a document or processing Zip archives and other composite documents.
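The sketch below previews the pattern the chapter develops: an AutoDetectParser feeding SAX events into a BodyContentHandler that collects plain text (a LinkContentHandler could be substituted to gather links instead). The file name is a placeholder.

import java.io.File;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractionExample {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();        // picks a parser by detected type
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        InputStream stream = TikaInputStream.get(new File("document.doc"));
        try {
            parser.parse(stream, handler, metadata, new ParseContext());
        } finally {
            stream.close();
        }
        System.out.println(handler.toString());        // the extracted plain text
        System.out.println(metadata);                  // metadata gathered during parsing
    }
}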
Chapter 6 covers metadata. The chapter begins with a discussion of what metadata means in the context of Tika, along with a short classification of the existing metadata models that Tika supports. Tika’s metadata API is discussed in detail, including how it helps to normalize and validate metadata instances. The chapter describes how to supercharge the LuceneIndexer from chapter 5 and turn it into an RSS-based file notification service in a few simple lines of code.
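As a small illustration of the metadata API the chapter covers, the snippet below shows single-valued and multi-valued entries; the dc:creator key name is just an example of the kind of key Tika exposes.

import org.apache.tika.metadata.Metadata;

public class MetadataExample {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "application/pdf");  // single-valued entry
        metadata.add("dc:creator", "First Author");              // multi-valued entry
        metadata.add("dc:creator", "Second Author");

        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                System.out.println(name + " = " + value);
            }
        }
    }
}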
Chapter 7 introduces the topic of language identification. The language a document is written in is a highly useful piece of metadata, and the chapter describes mechanisms for automatically identifying written languages. The reader will encounter the most translated document in the world and see how Tika can correctly identify the language used in many of the translations.
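As a preview, Tika's LanguageIdentifier can be used roughly like this; the sample sentence is arbitrary.

import org.apache.tika.language.LanguageIdentifier;

public class LanguageExample {
    public static void main(String[] args) {
        String text = "La plume de ma tante est sur la table.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println(identifier.getLanguage());          // ISO 639 code, e.g. fr
        System.out.println(identifier.isReasonablyCertain());  // rough confidence flag
    }
}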
Chapter 8 gives the reader an in-depth overview of how files represent information, in terms of their content organization, their storage representation, and the way that metadata is codified, all the while showing how Tika hides this complexity and pulls information from these files. The reader takes an in-depth look at Tika’s RSS and HDF5 parser classes, and learns how Tika’s parsers codify the heterogeneity of files, and how you can develop your own parsers using similar methodologies.
Chapter 9 reviews the best places to leverage Tika in your information management software, including pointing out key use cases where Tika can solely (or with a little glue code) implement many of the high-end features of the system. Document record archives, text mining, and search engines are all topics covered.
Chapter 10 educates the reader in the vocabulary of the Lucene ecosystem. Mahout, ManifoldCF, Lucene, Solr, Nutch, Droids—all of these will roll off the tongue by the time you’re done surveying Lucene’s rich and vibrant community. Lucene was the birthplace of Tika, specifically within the Apache Nutch project, and this chapter takes the opportunity to show you how Tika has grown up over the years into the load-bearing walls of the entire Lucene ecosystem.
Chapter 11 explains what to do when stock Tika out of the box doesn’t handle your file type identification, extraction, and representation needs. Read: you don’t have to pick another whiz-bang technology—you simply extend Tika. We show you how in this chapter, taking you start-to-end through an example of a prescription file type that you may exchange with a doctor.
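To give a flavor of what extending Tika involves, here is a minimal sketch of a custom Parser for a hypothetical prescription media type, written against a recent Tika release; the type name and the emitted content are illustrative only. A parser like this is typically registered through a META-INF/services/org.apache.tika.parser.Parser file so that AutoDetectParser can discover it.

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class PrescriptionParser implements Parser {
    private static final MediaType TYPE =
            MediaType.application("x-prescription+xml");   // hypothetical media type

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return Collections.singleton(TYPE);
    }

    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        // A real parser would read the stream here; this sketch emits a fixed body.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", "extracted prescription text goes here");
        xhtml.endDocument();
    }
}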
Chapter 12 is the first case study of the book, and it’s high-visibility. We show you how NASA and its planetary and Earth science communities are using Tika to search planetary images, to extract data and metadata from Earth science files, and to identify content for dissemination and acquisition.
Chapter 13 shows you how the Apache Jackrabbit content repository, a key component in many content and document management systems, uses Tika to implement full-text search and WebDAV integration.
Chapter 14 presents how Tika is used at the National Cancer Institute, helping to power data systems for the Early Detection Research Network (EDRN). We show you how Tika is an integral component of another Apache technology, OODT, the data system infrastructure used to power many national-scale data systems. Tika helps to detect file types, and helps to organize cancer information as it’s catalogued, archived, and made available to the broader scientific community.
For chapter 15, we interviewed Ken Krugler and Chris Schneider of Bixo Labs about how they used Tika to classify and identify content from the Public Terabyte Dataset project, an ambitious endeavor to make available a traditional web-scale dataset for public use. Using Tika, Ken and his team demonstrate a classic search engine example, and identify several areas of improvement and future work in Tika including language identification and charset detection.
The book contains two appendixes. The first is a Tika quick reference. Think of it as a cheat-sheet for using Tika, its commands, and a compact form of some of Tika’s documentation. The second appendix is a description of Tika’s relevant metadata keys, giving the reader an idea of how and when to use them in a custom parser, in any of the existing Parser classes that ship with Tika, or in any downstream program or analysis desired.
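For orientation, the command-line options cataloged in the quick reference follow this general pattern; the jar file name depends on the release you download, and running the jar with --help prints the authoritative list of options.

java -jar tika-app-1.0.jar --detect   report.pdf     (print the detected media type)
java -jar tika-app-1.0.jar --text     report.pdf     (extract plain text)
java -jar tika-app-1.0.jar --metadata report.pdf     (print the extracted metadata)
java -jar tika-app-1.0.jar --language report.pdf     (guess the document language)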