Tika in Action
Ebook, 478 pages, 4 hours

About this ebook

Summary

Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
About the Technology
Tika is an Apache toolkit that has built into it everything you and your app need to know about file formats. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.
About this Book
Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
What's Inside
  • Crack MS Word, PDF, HTML, and ZIP
  • Integrate with search engines, CMS, and other data sources
  • Learn through experimentation
  • Many examples

This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.

==========================================
Table of Contents
    PART 1 GETTING STARTED
  1. The case for the digital Babel fish
  2. Getting started with Tika
  3. The information landscape
    PART 2 TIKA IN DETAIL
  4. Document type detection
  5. Content extraction
  6. Understanding metadata
  7. Language detection
  8. What's in a file?
    PART 3 INTEGRATION AND ADVANCED USE
  9. The big picture
  10. Tika and the Lucene search stack
  11. Extending Tika
    PART 4 CASE STUDIES
  12. Powering NASA science data systems
  13. Content management with Apache Jackrabbit
  14. Curating cancer research data with Tika
  15. The classic search engine example
Language: English
Publisher: Manning
Release date: Nov 30, 2011
ISBN: 9781638352631
Author

Jukka L. Zitting

Jukka Zitting is a core Tika developer with over a decade of experience in open source content management. Jukka works as a Senior Developer for the Swiss content management company Day Software, and is a member of the JCP expert group for the Content Repository for Java Technology API. He is a member of the Apache Software Foundation and the chairman of the Apache Jackrabbit project.


    Book preview

    Tika in Action - Jukka L. Zitting

    Copyright

    For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

             Special Sales Department

             Manning Publications Co.

             20 Baldwin Road

             PO Box 261

             Shelter Island, NY 11964

             Email: [email protected]

    ©2012 by Manning Publications Co. All rights reserved.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

    Printed in the United States of America

    1 2 3 4 5 6 7 8 9 10 – MAL – 16 15 14 13 12 11

    Dedication

    To my lovely wife Lisa and my son Christian

    CM

    To my lovely wife Kirsi-Marja and our happy cats

    JZ

    Brief Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Foreword

    Preface

    Acknowledgments

    About this Book

    About the Authors

    About the Cover Illustration

    1. Getting started

    Chapter 1. The case for the digital Babel fish

    Chapter 2. Getting started with Tika

    Chapter 3. The information landscape

    2. Tika in detail

    Chapter 4. Document type detection

    Chapter 5. Content extraction

    Chapter 6. Understanding metadata

    Chapter 7. Language detection

    Chapter 8. What’s in a file?

    3. Integration and advanced use

    Chapter 9. The big picture

    Chapter 10. Tika and the Lucene search stack

    Chapter 11. Extending Tika

    4. Case studies

    Chapter 12. Powering NASA science data systems

    Chapter 13. Content management with Apache Jackrabbit

    Chapter 14. Curating cancer research data with Tika

    Chapter 15. The classic search engine example

    Appendix A. Tika quick reference

    Appendix B. Supported metadata keys

    Index

    List of Figures

    List of Tables

    List of Listings

    Table of Contents

    Copyright

    Brief Table of Contents

    Table of Contents

    Foreword

    Preface

    Acknowledgments

    About this Book

    About the Authors

    About the Cover Illustration

    1. Getting started

    Chapter 1. The case for the digital Babel fish

    1.1. Understanding digital documents

    1.1.1. A taxonomy of file formats

    1.1.2. Parser libraries

    1.1.3. Structured text as the universal language

    1.1.4. Universal metadata

    1.1.5. The program that understands everything

    1.2. What is Apache Tika?

    1.2.1. A bit of history

    1.2.2. Key design goals

    1.2.3. When and where to use Tika

    1.3. Summary

    Chapter 2. Getting started with Tika

    2.1. Working with Tika source code

    2.1.1. Getting the source code

    2.1.2. The Maven build

    2.1.3. Including Tika in Ant projects

    2.2. The Tika application

    2.2.1. Drag-and-drop text extraction: the Tika GUI

    2.2.2. Tika on the command line

    2.3. Tika as an embedded library

    2.3.1. Using the Tika facade

    2.3.2. Managing dependencies

    2.4. Summary

    Chapter 3. The information landscape

    3.1. Measuring information overload

    3.1.1. Scale and growth

    3.1.2. Complexity

    3.2. I’m feeling lucky—searching the information landscape

    3.2.1. Just click it: the modern search engine

    3.2.2. Tika’s role in search

    3.3. Beyond lucky: machine learning

    3.3.1. Your likes and dislikes

    3.3.2. Real-world machine learning

    3.4. Summary

    2. Tika in detail

    Chapter 4. Document type detection

    4.1. Internet media types

    4.1.1. The parlance of media type names

    4.1.2. Categories of media types

    4.1.3. IANA and other type registries

    4.2. Media types in Tika

    4.2.1. The shared MIME-info database

    4.2.2. The MediaType class

    4.2.3. The MediaTypeRegistry class

    4.2.4. Type hierarchies

    4.3. File format diagnostics

    4.3.1. Filename globs

    4.3.2. Content type hints

    4.3.3. Magic bytes

    4.3.4. Character encodings

    4.3.5. Other mechanisms

    4.4. Tika, the type inspector

    4.5. Summary

    Chapter 5. Content extraction

    5.1. Full-text extraction

    5.1.1. Abstracting the parsing process

    5.1.2. Full-text indexing

    5.1.3. Incremental parsing

    5.2. The Parser interface

    5.2.1. Who knew parsing could be so easy?

    5.2.2. The parse() method

    5.2.3. Parser implementations

    5.2.4. Parser selection

    5.3. Document input stream

    5.3.1. Standardizing input to Tika

    5.3.2. The TikaInputStream class

    5.4. Structured XHTML output

    5.4.1. Semantic structure of text

    5.4.2. Structured output via SAX events

    5.4.3. Marking up structure with XHTML

    5.5. Context-sensitive parsing

    5.5.1. Environment settings

    5.5.2. Custom document handling

    5.6. Summary

    Chapter 6. Understanding metadata

    6.1. The standards of metadata

    6.1.1. Metadata models

    6.1.2. General metadata standards

    6.1.3. Content-specific metadata standards

    6.2. Metadata quality

    6.2.1. Challenges/Problems

    6.2.2. Unifying heterogeneous standards

    6.3. Metadata in Tika

    6.3.1. Keys and multiple values

    6.3.2. Transformations and views

    6.4. Practical uses of metadata

    6.4.1. Common metadata for the Lucene indexer

    6.4.2. Give me my metadata in my schema!

    6.5. Summary

    Chapter 7. Language detection

    7.1. The most translated document in the world

    7.2. Sounds Greek to me—theory of language detection

    7.2.1. Language profiles

    7.2.2. Profiling algorithms

    7.2.3. The N-gram algorithm

    7.2.4. Advanced profiling algorithms

    7.3. Language detection in Tika

    7.3.1. Incremental language detection

    7.3.2. Putting it all together

    7.4. Summary

    Chapter 8. What’s in a file?

    8.1. Types of content

    8.1.1. HDF: a format for scientific data

    8.1.2. Really Simple Syndication: a format for rapidly changing content

    8.2. How Tika extracts content

    8.2.1. Organization of content

    8.2.2. File header and naming conventions

    8.2.3. Storage affects extraction

    8.3. Summary

    3. Integration and advanced use

    Chapter 9. The big picture

    9.1. Tika in search engines

    9.1.1. The search use case

    9.1.2. The anatomy of a search index

    9.2. Managing and mining information

    9.2.1. Document management systems

    9.2.2. Text mining

    9.3. Buzzword compliance

    9.3.1. Modularity, Spring, and OSGi

    9.3.2. Large-scale computing

    9.4. Summary

    Chapter 10. Tika and the Lucene search stack

    10.1. Load-bearing walls

    10.1.1. ManifoldCF

    10.1.2. Open Relevance

    10.2. The steel frame

    10.2.1. Lucene Core

    10.2.2. Solr

    10.3. The finishing touches

    10.3.1. Nutch

    10.3.2. Droids

    10.3.3. Mahout

    10.4. Summary

    Chapter 11. Extending Tika

    11.1. Adding type information

    11.1.1. Custom media type configuration

    11.2. Custom type detection

    11.2.1. The Detector interface

    11.2.2. Building a custom type detector

    11.2.3. Plugging in new detectors

    11.3. Customized parsing

    11.3.1. Customizing existing parsers

    11.3.2. Writing a new parser

    11.3.3. Plugging in new parsers

    11.3.4. Overriding existing parsers

    11.4. Summary

    4. Case studies

    Chapter 12. Powering NASA science data systems

    12.1. NASA’s Planetary Data System

    12.1.1. PDS data model

    12.1.2. The PDS search redesign

    12.2. NASA’s Earth Science Enterprise

    12.2.1. Leveraging Tika in NASA Earth Science SIPS

    12.2.2. Using Tika within the ground data systems

    12.3. Summary

    Chapter 13. Content management with Apache Jackrabbit

    13.1. Introducing Apache Jackrabbit

    13.2. The text extraction pool

    13.3. Content-aware WebDAV

    13.4. Summary

    Chapter 14. Curating cancer research data with Tika

    14.1. The NCI Early Detection Research Network

    14.1.1. The EDRN data model

    14.1.2. Scientific data curation

    14.2. Integrating Tika

    14.2.1. Metadata extraction

    14.2.2. MIME type identification and classification

    14.3. Summary

    Chapter 15. The classic search engine example

    15.1. The Public Terabyte Dataset Project

    15.2. The Bixo web crawler

    15.2.1. Parsing fetched documents

    15.2.2. Validating Tika’s charset detection

    15.3. Summary

    Appendix A. Tika quick reference

    A.1. Tika facade

    A.2. Command-line options

    A.3. ContentHandler utilities

    Appendix B. Supported metadata keys

    B.1. Climate Forecast

    B.2. Creative Commons

    B.3. Dublin Core

    B.4. Geographic metadata

    B.5. HTTP headers

    B.6. Microsoft Office

    B.7. Message (email)

    B.8. TIFF (Image)

    Index

    List of Figures

    List of Tables

    List of Listings

    Foreword

    I’m a big fan of search engines and Java, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene.

    With my new toy on my laptop, I tested and tried to evaluate it. Even though Nutch was in its early stages, it was a promising project, exactly what I was looking for. I proposed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I became a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and document analysis. Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout. Lucene provides its own analysis tools, as do Nutch and Solr, and each one employs some proprietary interfaces to deal with analysis engines.

    So I consulted with Chris Mattmann, another Nutch committer with whom I had worked, about the potential for refactoring all these disparate tools in a common and standardized project. The concept of Tika was born.

    Chris began to advocate for Tika as a standalone project in 2006. Then Jukka Zitting came into the picture and took the lead on the Tika project; after a lot of refactoring and enhancements, Tika became a Lucene top-level project.

    At that point in time, Tika was being used in Nutch, Droids (an Incubator project that you’ll hear about in chapter 10), and many non-Lucene projects—the activity on Tika mailing lists was indicative of this. The next promising steps for the project involved plugging Tika into top-level Lucene projects, such as Lucene itself or Solr. That amounted to a big challenge, as it required Tika to provide a flexible and robust set of interfaces that could be used in any programming context where metadata analysis was needed.

    Luckily, Tika got there. With this book, written by Tika’s two main creators and maintainers, Chris and Jukka, you’ll understand the problems of document analysis and document information extraction. They first explain to the reader why developers have such a need for Tika. Today, content handling and analysis are basic building blocks of all major modern services: search engines, content management systems, data mining, and other areas.

    If you’re a software developer, you’ve no doubt needed, on many occasions, to guess the encoding, formatting, and language of a file, and then to extract its metadata (title, author, and so on) and content. And you’ve probably noticed that this is a pain. That’s what Tika does for you. It provides a robust toolkit to easily handle any data format and to simplify this painful process.

    Chris and Jukka explain many details and examples of the Tika API and toolkit, including the Tika command-line interface and its graphical user interface (GUI) that you can use to extract information about any type of file handled by Tika. They show how you can use the Tika Application Programming Interface (API) to integrate Tika commodities directly with your own projects. You’ll discover that Tika is both simple to use and powerful. Tika has been carefully designed by Chris and Jukka and, despite the internal complexity of this type of library, Tika’s API and tools are simple and easy to understand and to use.

    Finally, Chris and Jukka show many real-life use cases of Tika. The most noticeable real-life projects are Tika powering the NASA Science Data Systems, Tika curating cancer research data at the National Cancer Institute’s Early Detection Research Network, and the use of Tika for content management within the Apache Jackrabbit project. Tika is already used in many projects.

    I’m proud to have helped launch Tika. And I’m extremely grateful to Chris and Jukka for bringing Tika to this level and knowing that the long nights I spent writing code for automatic language identification for the MIME type repository weren’t in vain. To now make (even) a small contribution, for example, to assist in research in the fight against cancer, goes straight to my heart.

    Thank you both for all your work, and thank you for this book.

    JÉRÔME CHARRON

    CHIEF TECHNICAL OFFICER

    WEBPULSE

    Preface

    While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for the students in the course to get real-world experience during the final project phase of his CS599: Seminar on Search Engines course.

    After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.[¹] The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval.

    ¹https://issues.apache.org/jira/browse/NUTCH-30

    Seemingly innocuous, the class taught me a great deal about search engines, and helped pinpoint the area of search I was interested in: content detection and extraction.

    Fast forward to 2007: after I eventually became a Nutch committer, and focused in on more parsing-related issues (updates to the Nutch parser factory, metadata representation updates, and so on), my Nutch mentor Jérôme Charron and I decided that there was enough critical mass of code in Nutch related to parsing (parsing, language identification, extraction, and representation) that it warranted its own project. Other projects were doing it—rumblings of what would eventually become Hadoop were afoot—which led us to believe that the time was ripe for our own project. Since naming projects after children’s stuffed animals was popular at the time, we felt we could do the same, and Tika was born (named after Jérôme’s daughter’s stuffed animal).

    It wasn’t as simple as we thought. After getting little interest from the broader Lucene community (Nutch was a Lucene subproject and thus the project we were proposing had to go through the Lucene PMC), and with Jérôme and I both taking on further responsibility that took time away from direct Nutch development, what would eventually be known as Tika began to fizzle away.

    That’s where the other author of this book comes in. Jukka Zitting, bless him, was keenly interested in a technology, separate from the behemoth Nutch codebase, that would perform the types of things that we had carved off as Tika core capabilities: parsing, text extraction, metadata extraction, MIME detection, and more. Jukka was a seasoned Apache veteran, so he knew what to do. Jukka became a real leader of the original Tika proposal, took it to the Apache Incubator, and helped turn Tika into a real Apache project.

    After working with Jukka for a year or so in the Incubator community, we took our show on the road back to Lucene as a subproject when Tika graduated. Over a period of two years, we made seven Tika releases, infected several popular Apache projects (including Lucene, Solr, Nutch, and Jackrabbit), and gained enough critical mass to grow into a full-fledged Apache Top Level Project (TLP).

    But we weren’t done there. I don’t remember the exact time during the Christmas season in 2009 when I decided it was time to write a book, but it matters little. When I get an idea in my head, it’s hard to get it out. This book was happening. Tika in Action was happening. I approached Jukka and asked him how he felt. In characteristic fashion, he was up for the challenge.

    We sure didn’t know what we were getting ourselves into! We didn’t know that the rabbit hole went this deep. That said, I can safely say I don’t think we could’ve taken any other path that would’ve been as fulfilling, exciting, and rewarding. We really put our hearts and souls into creating this book. We sincerely hope you enjoy it. I think I speak for both of us in saying, I know we did!

    CHRIS MATTMANN

    Acknowledgments

    No book is born without great sacrifice by many people. The team who worked on this book means a lot to both of us. We’ll enumerate them here.

    Together, we’d like to thank our development editor at Manning, Cynthia Kane, for spending tireless hours working with us to make this book the best possible, and the clearest book to date on Apache Tika. Furthermore, her help with simplifying difficult concepts, creating direct and meaningful illustrations, and with conveying complex information to the reader is something that both of us will leverage and use well beyond this book and into the future.

    Of course, the entire team at Manning, from Marjan Bace on down, was a tremendous help in the book’s development and publication. We’d like to thank Nicholas Chase specifically for his help navigating the infrastructure and tools to put this book together. Christina Rudloff was a tremendous help in getting the initial book deal set up and we are very appreciative. The production team of Benjamin Berg, Katie Tennant, Dottie Marsico, and Mary Piergies worked hard to turn our manuscript into the book you are now reading, and Alex Ott did a thorough technical review of the final manuscript during production and helped clarify numerous code issues and details.

    We’d also like to thank the following reviewers who went through three time-crunched review cycles and significantly improved the quality of this book with their thoughtful comments: Deepak Vohra, John Griffin, Dean Farrell, Ken Krugler, John Guthrie, Richard Johannesson, Andreas Kemkes, Julien Nioche, Rick Wagner, Andrew F. Hart, Nick Burch, and Sean Kelly.

    Finally, we’d like to acknowledge and thank Ken Krugler and Chris Schneider of Bixo Labs, for contributing the bulk of chapter 15 and for showing us a real-world example of where Tika shines. Thanks, guys!

    CHRIS—I would like to thank my wife Lisa for her tremendous support. I originally promised her that my PhD dissertation would be the last book that I wrote, and after four years of sleepless nights (and many sleepless nights before that trying to make ends meet), that I would make time to enjoy life and slow down. That worked for about two years, until this opportunity came along. Thanks for the support again, honey: I couldn’t have made it here without you. I can promise a few more years of slowdown now that the book is done!

    JUKKA—I would like to thank my wife Kirsi-Marja for the encouragement to take on new challenges and for understanding the long evenings that meeting these challenges sometimes requires. Our two cats, Juuso and Nöpö, also deserve special thanks for their insistence on taking over the keyboard whenever a break from writing was needed.

    About this Book

    We wrote Tika in Action to be a hands-on guide for developers working with search engines, content management systems, and other similar applications who want to exploit the information locked in digital documents. The book introduces you to the world of mining text and binary documents and other information sources like internet media types and Dublin Core metadata. Then it shows where Tika fits within this landscape and how you can use Tika to build and extend applications. Case studies present real-world experience from domains ranging from search engines to digital asset management and scientific data processing.

    In addition to the architectural overviews, you will find more detailed information in the later chapters that focus on advanced features like XMP metadata processing, automatic language detection, and custom parser extensions. The book also describes common file formats like MS Word, PDF, HTML, and Zip, and open source libraries used to process files in these formats. The included code examples are designed to support hands-on experimentation.

    No previous knowledge of Tika or text mining techniques is required. The book will be most valuable to readers with a working knowledge of Java.

    Roadmap

    Chapter 1 gives the reader a contextual overview of Tika, including its history, its core capabilities, and some basic use cases where Tika is most helpful. Tika includes abilities for file type identification, text extraction, integration of existing parsing libraries, and language identification.

    Chapter 2 jumps right into using Tika, including instructions for downloading it, building it as a software library, and using Tika in a downstream Maven or Ant project. Quick tips for getting Tika up and running rapidly are present throughout the chapter.

    Chapter 3 introduces the reader to the information landscape and identifies where and how information is fed into the Tika framework. The reader will be introduced to the principles of the World Wide Web (WWW), its architecture, and how the web and Tika synergistically complement one another.

    Chapter 4 takes the reader on a deep dive into MIME type identification, covering topics ranging from the MIME hierarchy of the web, to identifying the unique byte pattern signatures present in files, to other means (such as regular expressions and file extensions) of identifying files.
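
    To make the two detection mechanisms mentioned for chapter 4 concrete, here is a minimal sketch in plain Java. It is not Tika's API or its implementation: the class name TypeSniffer, its detect method, and the tiny signature and extension tables are all hypothetical and exist only for illustration. The sketch checks the leading bytes of a file against a few well-known signatures and falls back to the file extension when no signature matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Tika: guesses a media type from leading bytes, then from the extension. */
final class TypeSniffer {
    // Byte signatures expected at offset 0 (illustrative subset only).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Fallback mapping from lowercase file extension to media type (illustrative subset only).
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        EXTENSIONS.put("html", "text/html");
        EXTENSIONS.put("txt", "text/plain");
        EXTENSIONS.put("pdf", "application/pdf");
    }

    private TypeSniffer() {
    }

    /** Returns a media type name, or application/octet-stream when nothing matches. */
    static String detect(byte[] prefix, String filename) {
        // 1. Magic bytes: the content itself is the strongest hint.
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        // 2. File extension: a weaker hint, used only as a fallback.
        int dot = filename == null ? -1 : filename.lastIndexOf('.');
        if (dot >= 0) {
            String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
            String type = EXTENSIONS.get(ext);
            if (type != null) {
                return type;
            }
        }
        // 3. Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

    In this sketch the content signature wins over the file name, so a PDF saved with a misleading extension is still reported as a PDF. That ordering is a design choice of the example, not a statement about how Tika weighs its hints.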

    Chapter 5 introduces the reader to content extraction with Tika. It starts with a simple full-text extraction and indexing example using the Tika facade, and continues with a tour of the core Parser interface and how Tika uses it for content extraction. The reader will learn useful techniques for things such as extracting all links from a document or processing Zip archives and other composite documents.

    Chapter 6 covers metadata. The chapter begins with a discussion of what metadata means in the context of Tika, along with a short classification of the existing metadata models that Tika supports. Tika’s metadata API is discussed in detail, including how it helps to normalize and validate metadata instances. The chapter describes how to supercharge the LuceneIndexer from chapter 5 and turn it into an RSS-based file notification service in a few simple lines of code.

    Chapter 7 introduces the topic of language identification. The language a document is written in is a highly useful piece of metadata, and the chapter describes mechanisms for automatically identifying written languages. The reader will encounter the most translated document in the world and see how Tika can correctly identify the language used in many of the translations.
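
    As a rough illustration of the N-gram idea covered in chapter 7, the following plain-Java sketch guesses a language by checking how many of a text's character trigrams appear in a per-language profile. Everything here is an assumption made for the example: the class NgramLanguageGuesser, its methods, the two tiny training sentences, and the simple overlap score are hypothetical and much cruder than a real language detector, and none of it is Tika's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of trigram-based language guessing; not Tika's implementation. */
final class NgramLanguageGuesser {
    // One profile per language: the set of trigrams seen in its sample text.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Builds a profile from a sample text written in the given language. */
    void train(String language, String sample) {
        profiles.put(language, new HashSet<>(trigrams(sample)));
    }

    /** Returns the language whose profile covers the largest share of the text's trigrams. */
    String guess(String text) {
        List<String> grams = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            int hits = 0;
            for (String gram : grams) {
                if (entry.getValue().contains(gram)) {
                    hits++;
                }
            }
            double score = grams.isEmpty() ? 0.0 : (double) hits / grams.size();
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Splits lowercased, whitespace-normalized text into overlapping three-character slices. */
    static List<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    /** Toy profiles built from one made-up sentence per language (far too small for real use). */
    static NgramLanguageGuesser withSampleProfiles() {
        NgramLanguageGuesser guesser = new NgramLanguageGuesser();
        guesser.train("en",
            "the quick brown fox jumps over the lazy dog and the cat sat on the mat with all of them");
        guesser.train("fr",
            "le renard brun saute par dessus le chien paresseux et le chat est sur le tapis avec eux");
        return guesser;
    }

    public static void main(String[] args) {
        NgramLanguageGuesser guesser = withSampleProfiles();
        System.out.println(guesser.guess("the dog and the cat"));
        System.out.println(guesser.guess("le chat et le chien"));
    }
}
```

    A practical detector would train on far larger samples and compare trigram frequencies rather than mere presence, but the sketch shows why short character sequences carry enough signal to separate languages.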

    Chapter 8 gives the reader an in-depth overview of how files represent information, in terms of their content organization, their storage representation, and the way that metadata is codified, all the while showing how Tika hides this complexity and pulls information from these files. The reader takes an in-depth look at Tika’s RSS and HDF5 parser classes, and learns how Tika’s parsers codify the heterogeneity of files, and how you can develop your own parsers using similar methodologies.

    Chapter 9 reviews the best places to leverage Tika in your information management software, including pointing out key use cases where Tika can solely (or with a little glue code) implement many of the high-end features of the system. Document record archives, text mining, and search engines are all topics covered.

    Chapter 10 educates the reader in the vocabulary of the Lucene ecosystem. Mahout, ManifoldCF, Lucene, Solr, Nutch, Droids—all of these will roll off the tongue by the time you’re done surveying Lucene’s rich and vibrant community. Lucene was the birthplace of Tika, specifically within the Apache Nutch project, and this chapter takes the opportunity to show you how Tika has grown up over the years into the load-bearing walls of the entire Lucene ecosystem.

    Chapter 11 explains what to do when stock Tika out of the box doesn’t handle your file type identification, extraction, and representation needs. Read: you don’t have to pick another whiz-bang technology—you simply extend Tika. We show you how in this chapter, taking you start-to-end through an example of a prescription file type that you may exchange with a doctor.

    Chapter 12 is the first case study of the book, and it’s high-visibility. We show you how NASA and its planetary and Earth science communities are using Tika to search planetary images, to extract data and metadata from Earth science files, and to identify content for dissemination and acquisition.

    Chapter 13 shows you how the Apache Jackrabbit content repository, a key component in many content and document management systems, uses Tika to implement full-text search and WebDAV integration.

    Chapter 14 presents how Tika is used at the National Cancer Institute, helping to power data systems for the Early Detection Research Network (EDRN). We show you how Tika is an integral component of another Apache technology, OODT, the data system infrastructure used to power many national-scale data systems. Tika helps to detect file types, and helps to organize cancer information as it’s catalogued, archived, and made available to the broader scientific community.

    For chapter 15, we interviewed Ken Krugler and Chris Schneider of Bixo Labs about how they used Tika to classify and identify content from the Public Terabyte Dataset project, an ambitious endeavor to make available a traditional web-scale dataset for public use. Using Tika, Ken and his team demonstrate a classic search engine example, and identify several areas of improvement and future work in Tika including language identification and charset detection.

    The book contains two appendixes. The first is a Tika quick reference. Think of it as a cheat-sheet for using Tika, its commands, and a compact form of some of Tika’s documentation. The second appendix is a description of Tika’s relevant metadata keys, giving the reader an idea of how and when to use them in a custom parser, in any of the existing Parser classes that ship with Tika, or in any downstream program or analysis desired.

    Code conventions and downloads
