Tika in Action
By Jukka L. Zitting and Chris Mattmann
About this ebook
Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
About the Technology
Tika is an Apache toolkit that has built into it everything you and your app need to know about file formats. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.
About this Book
Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.
This book is written for developers who are new to both Tika and text mining, and covers just enough background to get you started.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
What's Inside
- Crack MS Word, PDF, HTML, and ZIP
- Integrate with search engines, CMS, and other data sources
- Learn through experimentation
- Many examples
This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.
Table of Contents
PART 1 GETTING STARTED
- The case for the digital Babel fish
- Getting started with Tika
- The information landscape
PART 2 TIKA IN DETAIL
- Document type detection
- Content extraction
- Understanding metadata
- Language detection
- What's in a file?
PART 3 INTEGRATION AND ADVANCED USE
- The big picture
- Tika and the Lucene search stack
- Extending Tika
PART 4 CASE STUDIES
- Powering NASA science data systems
- Content management with Apache Jackrabbit
- Curating cancer research data with Tika
- The classic search engine example
Jukka L. Zitting
Jukka Zitting is a core Tika developer with over a decade of experience in open source content management. Jukka works as a Senior Developer for the Swiss content management company Day Software, and is a member of the JCP expert group for the Content Repository for Java Technology API. He is a member of the Apache Software Foundation and the chairman of the Apache Jackrabbit project.
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email:
©2012 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 16 15 14 13 12 11
Dedication
To my lovely wife Lisa and my son Christian
CM
To my lovely wife Kirsi-Marja and our happy cats
JZ
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. Getting started
Chapter 1. The case for the digital Babel fish
Chapter 2. Getting started with Tika
Chapter 3. The information landscape
2. Tika in detail
Chapter 4. Document type detection
Chapter 5. Content extraction
Chapter 6. Understanding metadata
Chapter 7. Language detection
Chapter 8. What’s in a file?
3. Integration and advanced use
Chapter 9. The big picture
Chapter 10. Tika and the Lucene search stack
Chapter 11. Extending Tika
4. Case studies
Chapter 12. Powering NASA science data systems
Chapter 13. Content management with Apache Jackrabbit
Chapter 14. Curating cancer research data with Tika
Chapter 15. The classic search engine example
Appendix A. Tika quick reference
Appendix B. Supported metadata keys
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. Getting started
Chapter 1. The case for the digital Babel fish
1.1. Understanding digital documents
1.1.1. A taxonomy of file formats
1.1.2. Parser libraries
1.1.3. Structured text as the universal language
1.1.4. Universal metadata
1.1.5. The program that understands everything
1.2. What is Apache Tika?
1.2.1. A bit of history
1.2.2. Key design goals
1.2.3. When and where to use Tika
1.3. Summary
Chapter 2. Getting started with Tika
2.1. Working with Tika source code
2.1.1. Getting the source code
2.1.2. The Maven build
2.1.3. Including Tika in Ant projects
2.2. The Tika application
2.2.1. Drag-and-drop text extraction: the Tika GUI
2.2.2. Tika on the command line
2.3. Tika as an embedded library
2.3.1. Using the Tika facade
2.3.2. Managing dependencies
2.4. Summary
Chapter 3. The information landscape
3.1. Measuring information overload
3.1.1. Scale and growth
3.1.2. Complexity
3.2. I’m feeling lucky—searching the information landscape
3.2.1. Just click it: the modern search engine
3.2.2. Tika’s role in search
3.3. Beyond lucky: machine learning
3.3.1. Your likes and dislikes
3.3.2. Real-world machine learning
3.4. Summary
2. Tika in detail
Chapter 4. Document type detection
4.1. Internet media types
4.1.1. The parlance of media type names
4.1.2. Categories of media types
4.1.3. IANA and other type registries
4.2. Media types in Tika
4.2.1. The shared MIME-info database
4.2.2. The MediaType class
4.2.3. The MediaTypeRegistry class
4.2.4. Type hierarchies
4.3. File format diagnostics
4.3.1. Filename globs
4.3.2. Content type hints
4.3.3. Magic bytes
4.3.4. Character encodings
4.3.5. Other mechanisms
4.4. Tika, the type inspector
4.5. Summary
Chapter 5. Content extraction
5.1. Full-text extraction
5.1.1. Abstracting the parsing process
5.1.2. Full-text indexing
5.1.3. Incremental parsing
5.2. The Parser interface
5.2.1. Who knew parsing could be so easy?
5.2.2. The parse() method
5.2.3. Parser implementations
5.2.4. Parser selection
5.3. Document input stream
5.3.1. Standardizing input to Tika
5.3.2. The TikaInputStream class
5.4. Structured XHTML output
5.4.1. Semantic structure of text
5.4.2. Structured output via SAX events
5.4.3. Marking up structure with XHTML
5.5. Context-sensitive parsing
5.5.1. Environment settings
5.5.2. Custom document handling
5.6. Summary
Chapter 6. Understanding metadata
6.1. The standards of metadata
6.1.1. Metadata models
6.1.2. General metadata standards
6.1.3. Content-specific metadata standards
6.2. Metadata quality
6.2.1. Challenges/Problems
6.2.2. Unifying heterogeneous standards
6.3. Metadata in Tika
6.3.1. Keys and multiple values
6.3.2. Transformations and views
6.4. Practical uses of metadata
6.4.1. Common metadata for the Lucene indexer
6.4.2. Give me my metadata in my schema!
6.5. Summary
Chapter 7. Language detection
7.1. The most translated document in the world
7.2. Sounds Greek to me—theory of language detection
7.2.1. Language profiles
7.2.2. Profiling algorithms
7.2.3. The N-gram algorithm
7.2.4. Advanced profiling algorithms
7.3. Language detection in Tika
7.3.1. Incremental language detection
7.3.2. Putting it all together
7.4. Summary
Chapter 8. What’s in a file?
8.1. Types of content
8.1.1. HDF: a format for scientific data
8.1.2. Really Simple Syndication: a format for rapidly changing content
8.2. How Tika extracts content
8.2.1. Organization of content
8.2.2. File header and naming conventions
8.2.3. Storage affects extraction
8.3. Summary
3. Integration and advanced use
Chapter 9. The big picture
9.1. Tika in search engines
9.1.1. The search use case
9.1.2. The anatomy of a search index
9.2. Managing and mining information
9.2.1. Document management systems
9.2.2. Text mining
9.3. Buzzword compliance
9.3.1. Modularity, Spring, and OSGi
9.3.2. Large-scale computing
9.4. Summary
Chapter 10. Tika and the Lucene search stack
10.1. Load-bearing walls
10.1.1. ManifoldCF
10.1.2. Open Relevance
10.2. The steel frame
10.2.1. Lucene Core
10.2.2. Solr
10.3. The finishing touches
10.3.1. Nutch
10.3.2. Droids
10.3.3. Mahout
10.4. Summary
Chapter 11. Extending Tika
11.1. Adding type information
11.1.1. Custom media type configuration
11.2. Custom type detection
11.2.1. The Detector interface
11.2.2. Building a custom type detector
11.2.3. Plugging in new detectors
11.3. Customized parsing
11.3.1. Customizing existing parsers
11.3.2. Writing a new parser
11.3.3. Plugging in new parsers
11.3.4. Overriding existing parsers
11.4. Summary
4. Case studies
Chapter 12. Powering NASA science data systems
12.1. NASA’s Planetary Data System
12.1.1. PDS data model
12.1.2. The PDS search redesign
12.2. NASA’s Earth Science Enterprise
12.2.1. Leveraging Tika in NASA Earth Science SIPS
12.2.2. Using Tika within the ground data systems
12.3. Summary
Chapter 13. Content management with Apache Jackrabbit
13.1. Introducing Apache Jackrabbit
13.2. The text extraction pool
13.3. Content-aware WebDAV
13.4. Summary
Chapter 14. Curating cancer research data with Tika
14.1. The NCI Early Detection Research Network
14.1.1. The EDRN data model
14.1.2. Scientific data curation
14.2. Integrating Tika
14.2.1. Metadata extraction
14.2.2. MIME type identification and classification
14.3. Summary
Chapter 15. The classic search engine example
15.1. The Public Terabyte Dataset Project
15.2. The Bixo web crawler
15.2.1. Parsing fetched documents
15.2.2. Validating Tika’s charset detection
15.3. Summary
Appendix A. Tika quick reference
A.1. Tika facade
A.2. Command-line options
A.3. ContentHandler utilities
Appendix B. Supported metadata keys
B.1. Climate Forecast
B.2. Creative Commons
B.3. Dublin Core
B.4. Geographic metadata
B.5. HTTP headers
B.6. Microsoft Office
B.7. Message (email)
B.8. TIFF (Image)
Index
List of Figures
List of Tables
List of Listings
Foreword
I’m a big fan of search engines and Java, so early in the year 2004 I was looking for a good Java-based open source project on search engines. I quickly discovered Nutch. Nutch is an open source search engine project from the Apache Software Foundation. It was initiated by Doug Cutting, the well-known father of Lucene.
With my new toy on my laptop, I tested and tried to evaluate it. Even if Nutch was in its early stages, it was a promising project—exactly what I was looking for. I proposed my first patches to Nutch relating to language identification in early 2005. Then, in the middle of 2005 I became a Nutch committer and increased my number of contributions relating to language identification, content-type guessing, and document analysis. Looking more deeply at Lucene, I discovered a wide set of projects around it: Nutch, Solr, and what would eventually become Mahout. Lucene provides its own analysis tools, as do Nutch and Solr, and each one employs some proprietary interfaces to deal with analysis engines.
So I consulted with Chris Mattmann, another Nutch committer with whom I had worked, about the potential for refactoring all these disparate tools in a common and standardized project. The concept of Tika was born.
Chris began to advocate for Tika as a standalone project in 2006. Then Jukka Zitting came into the picture and took the lead on the Tika project; after a lot of refactoring and enhancements, Tika became a Lucene top-level project.
At that point in time, Tika was being used in Nutch, Droids (an Incubator project that you’ll hear about in chapter 10), and many non-Lucene projects—the activity on Tika mailing lists was indicative of this. The next promising steps for the project involved plugging Tika into top-level Lucene projects, such as Lucene itself or Solr. That amounted to a big challenge, as it required Tika to provide a flexible and robust set of interfaces that could be used in any programming context where metadata analysis was needed.
Luckily, Tika got there. With this book, written by Tika’s two main creators and maintainers, Chris and Jukka, you’ll understand the problems of document analysis and document information extraction. They first explain to the reader why developers have such a need for Tika. Today, content handling and analysis are basic building blocks of all major modern services: search engines, content management systems, data mining, and other areas.
If you’re a software developer, you’ve no doubt needed, on many occasions, to guess the encoding, formatting, and language of a file, and then to extract its metadata (title, author, and so on) and content. And you’ve probably noticed that this is a pain. That’s what Tika does for you. It provides a robust toolkit to easily handle any data format and to simplify this painful process.
Chris and Jukka explain many details and examples of the Tika API and toolkit, including the Tika command-line interface and its graphical user interface (GUI) that you can use to extract information about any type of file handled by Tika. They show how you can use the Tika Application Programming Interface (API) to integrate Tika's capabilities directly into your own projects. You'll discover that Tika is both simple to use and powerful. Tika has been carefully designed by Chris and Jukka and, despite the internal complexity of this type of library, Tika's API and tools are simple and easy to understand and to use.
Finally, Chris and Jukka show many real-life use cases of Tika. The most notable real-life projects are Tika powering the NASA Science Data Systems, Tika curating cancer research data at the National Cancer Institute's Early Detection Research Network, and the use of Tika for content management within the Apache Jackrabbit project. Tika is already used in many projects.
I’m proud to have helped launch Tika. And I’m extremely grateful to Chris and Jukka for bringing Tika to this level and knowing that the long nights I spent writing code for automatic language identification for the MIME type repository weren’t in vain. To now make (even) a small contribution, for example, to assist in research in the fight against cancer, goes straight to my heart.
Thank you both for all your work, and thank you for this book.
JÉRÔME CHARRON
Chief Technical Officer
Webpulse
Preface
While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for the students in the course to get real-world experience during the final project phase of his CS599: Seminar on Search Engines course.
After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.[¹] The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval.
¹https://issues.apache.org/jira/browse/NUTCH-30
Seemingly innocuous, the class taught me a great deal about search engines, and helped pinpoint the area of search I was interested in—content detection and extraction.
Fast forward to 2007: after I eventually became a Nutch committer, and focused in on more parsing-related issues (updates to the Nutch parser factory, metadata representation updates, and so on), my Nutch mentor Jérôme Charron and I decided that there was enough critical mass of code in Nutch related to parsing (parsing, language identification, extraction, and representation) that it warranted its own project. Other projects were doing it—rumblings of what would eventually become Hadoop were afoot—which led us to believe that the time was ripe for our own project. Since naming projects after children’s stuffed animals was popular at the time, we felt we could do the same, and Tika was born (named after Jérôme’s daughter’s stuffed animal).
It wasn’t as simple as we thought. After getting little interest from the broader Lucene community (Nutch was a Lucene subproject and thus the project we were proposing had to go through the Lucene PMC), and with Jérôme and me both taking on further responsibility that took time away from direct Nutch development, what would eventually be known as Tika began to fizzle away.
That’s where the other author of this book comes in. Jukka Zitting, bless him, was keenly interested in a technology, separate from the behemoth Nutch codebase, that would perform the types of things that we had carved off as Tika core capabilities: parsing, text extraction, metadata extraction, MIME detection, and more. Jukka was a seasoned Apache veteran, so he knew what to do. Jukka became a real leader of the original Tika proposal, took it to the Apache Incubator, and helped turn Tika into a real Apache project.
After working with Jukka for a year or so in the Incubator community, we took our show on the road back to Lucene as a subproject when Tika graduated. Over a period of two years, we made seven Tika releases, infected several popular Apache projects (including Lucene, Solr, Nutch, and Jackrabbit), and gained enough critical mass to grow into a full-fledged Apache Top Level Project (TLP).
But we weren’t done there. I don’t remember the exact time during the Christmas season in 2009 when I decided it was time to write a book, but it matters little. When I get an idea in my head, it’s hard to get it out. This book was happening. Tika in Action was happening. I approached Jukka and asked him how he felt. In characteristic fashion, he was up for the challenge.
We sure didn’t know what we were getting ourselves into! We didn’t know that the rabbit hole went this deep. That said, I can safely say I don’t think we could’ve taken any other path that would’ve been as fulfilling, exciting, and rewarding. We really put our hearts and souls into creating this book. We sincerely hope you enjoy it. I think I speak for both of us in saying, I know we did!
CHRIS MATTMANN
Acknowledgments
No book is born without great sacrifice by many people. The team who worked on this book means a lot to both of us. We’ll enumerate them here.
Together, we’d like to thank our development editor at Manning, Cynthia Kane, for spending tireless hours working with us to make this book the best possible, and the clearest book to date on Apache Tika. Furthermore, her help with simplifying difficult concepts, creating direct and meaningful illustrations, and with conveying complex information to the reader is something that both of us will leverage and use well beyond this book and into the future.
Of course, the entire team at Manning, from Marjan Bace on down, was a tremendous help in the book’s development and publication. We’d like to thank Nicholas Chase specifically for his help navigating the infrastructure and tools to put this book together. Christina Rudloff was a tremendous help in getting the initial book deal set up and we are very appreciative. The production team of Benjamin Berg, Katie Tennant, Dottie Marsico, and Mary Piergies worked hard to turn our manuscript into the book you are now reading, and Alex Ott did a thorough technical review of the final manuscript during production and helped clarify numerous code issues and details.
We’d also like to thank the following reviewers who went through three time-crunched review cycles and significantly improved the quality of this book with their thoughtful comments: Deepak Vohra, John Griffin, Dean Farrell, Ken Krugler, John Guthrie, Richard Johannesson, Andreas Kemkes, Julien Nioche, Rick Wagner, Andrew F. Hart, Nick Burch, and Sean Kelly.
Finally, we’d like to acknowledge and thank Ken Krugler and Chris Schneider of Bixo Labs, for contributing the bulk of chapter 15 and for showing us a real-world example of where Tika shines. Thanks, guys!
CHRIS—I would like to thank my wife Lisa for her tremendous support. I originally promised her that my PhD dissertation would be the last book that I wrote, and after four years of sleepless nights (and many sleepless nights before that trying to make ends meet), that I would make time to enjoy life and slow down. That worked for about two years, until this opportunity came along. Thanks for the support again, honey: I couldn’t have made it here without you. I can promise a few more years of slowdown now that the book is done!
JUKKA—I would like to thank my wife Kirsi-Marja for the encouragement to take on new challenges and for understanding the long evenings that meeting these challenges sometimes requires. Our two cats, Juuso and Nöpö, also deserve special thanks for their insistence on taking over the keyboard whenever a break from writing was needed.
About this Book
We wrote Tika in Action to be a hands-on guide for developers working with search engines, content management systems, and other similar applications who want to exploit the information locked in digital documents. The book introduces you to the world of mining text and binary documents and other information sources like internet media types and Dublin Core metadata. Then it shows where Tika fits within this landscape and how you can use Tika to build and extend applications. Case studies present real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
In addition to the architectural overviews, you will find more detailed information in the later chapters that focus on advanced features like XMP metadata processing, automatic language detection, and custom parser extensions. The book also describes common file formats like MS Word, PDF, HTML, and Zip, and open source libraries used to process files in these formats. The included code examples are designed to support hands-on experimentation.
No previous knowledge of Tika or text mining techniques is required. The book will be most valuable to readers with a working knowledge of Java.
Roadmap
Chapter 1 gives the reader a contextual overview of Tika, including its history, its core capabilities, and some basic use cases where Tika is most helpful. Tika includes abilities for file type identification, text extraction, integration of existing parsing libraries, and language identification.
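To give a concrete feel for those capabilities, here is a minimal sketch of the kind of usage the early chapters build toward, using the Tika facade class; the file name is only a placeholder.

import java.io.File;
import org.apache.tika.Tika;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("example.pdf");        // any document at hand
        System.out.println(tika.detect(file));      // detected type, e.g. application/pdf
        System.out.println(tika.parseToText(file)); // extracted plain text
    }
}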
Chapter 2 jumps right into using Tika, including instructions for downloading it, building it as a software library, and using Tika in a downstream Maven or Ant project. Quick tips for getting Tika up and running rapidly are present throughout the chapter.
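For readers who want to get going immediately, a typical Maven dependency declaration looks like the following; the version shown is only an example, so substitute the current Tika release. The tika-core artifact alone provides the interfaces and the facade, while tika-parsers pulls in the parser implementations and their dependencies.

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.0</version>
</dependency>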
Chapter 3 introduces the reader to the information landscape and identifies where and how information is fed into the Tika framework. The reader will be introduced to the principles of the World Wide Web (WWW), its architecture, and how the web and Tika synergistically complement one another.
Chapter 4 takes the reader on a deep dive into MIME type identification, covering topics ranging from the MIME hierarchy of the web, to identifying the unique byte pattern signatures present in many file formats, to other means (such as regular expressions and file extensions) of identifying files.
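As a preview of that material, the sketch below runs Tika's default detector against a file, giving it both the raw bytes and a file name hint; the file name is purely illustrative.

import java.io.File;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectionExample {
    public static void main(String[] args) throws Exception {
        Detector detector = TikaConfig.getDefaultConfig().getDetector();

        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, "report.pdf");  // file name hint

        InputStream stream = TikaInputStream.get(new File("report.pdf"));
        try {
            MediaType type = detector.detect(stream, metadata);  // magic bytes + hints
            System.out.println(type);                            // e.g. application/pdf
        } finally {
            stream.close();
        }
    }
}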
Chapter 5 introduces the reader to content extraction with Tika. It starts with a simple full-text extraction and indexing example using the Tika facade, and continues with a tour of the core Parser interface and how Tika uses it for content extraction. The reader will learn useful techniques for things such as extracting all links from a document or processing Zip archives and other composite documents.
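The sketch below previews the pattern the chapter develops: an AutoDetectParser feeding SAX events into a BodyContentHandler that collects plain text (a LinkContentHandler could be substituted to gather links instead). The file name is a placeholder.

import java.io.File;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractionExample {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();        // picks a parser by detected type
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        InputStream stream = TikaInputStream.get(new File("document.doc"));
        try {
            parser.parse(stream, handler, metadata, new ParseContext());
        } finally {
            stream.close();
        }
        System.out.println(handler.toString());        // the extracted plain text
        System.out.println(metadata);                  // metadata gathered during parsing
    }
}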
Chapter 6 covers metadata. The chapter begins with a discussion of what metadata means in the context of Tika, along with a short classification of the existing metadata models that Tika supports. Tika’s metadata API is discussed in detail, including how it helps to normalize and validate metadata instances. The chapter describes how to supercharge the LuceneIndexer from chapter 5 and turn it into an RSS-based file notification service in a few simple lines of code.
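As a small illustration of the metadata API the chapter covers, the snippet below shows single-valued and multi-valued entries; the dc:creator key name is just an example of the kind of key Tika exposes.

import org.apache.tika.metadata.Metadata;

public class MetadataExample {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "application/pdf");  // single-valued entry
        metadata.add("dc:creator", "First Author");              // multi-valued entry
        metadata.add("dc:creator", "Second Author");

        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                System.out.println(name + " = " + value);
            }
        }
    }
}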
Chapter 7 introduces the topic of language identification. The language a document is written in is a highly useful piece of metadata, and the chapter describes mechanisms for automatically identifying written languages. The reader will encounter the most translated document in the world and see how Tika can correctly identify the language used in many of the translations.
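As a preview, Tika's LanguageIdentifier can be used roughly like this; the sample sentence is arbitrary.

import org.apache.tika.language.LanguageIdentifier;

public class LanguageExample {
    public static void main(String[] args) {
        String text = "La plume de ma tante est sur la table.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println(identifier.getLanguage());          // ISO 639 code, e.g. fr
        System.out.println(identifier.isReasonablyCertain());  // rough confidence flag
    }
}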
Chapter 8 gives the reader an in-depth overview of how files represent information, in terms of their content organization, their storage representation, and the way that metadata is codified, all the while showing how Tika hides this complexity and pulls information from these files. The reader takes an in-depth look at Tika’s RSS and HDF5 parser classes, and learns how Tika’s parsers codify the heterogeneity of files, and how you can develop your own parsers using similar methodologies.
Chapter 9 reviews the best places to leverage Tika in your information management software, including pointing out key use cases where Tika can solely (or with a little glue code) implement many of the high-end features of the system. Document record archives, text mining, and search engines are all topics covered.
Chapter 10 educates the reader in the vocabulary of the Lucene ecosystem. Mahout, ManifoldCF, Lucene, Solr, Nutch, Droids—all of these will roll off the tongue by the time you’re done surveying Lucene’s rich and vibrant community. Lucene was the birthplace of Tika, specifically within the Apache Nutch project, and this chapter takes the opportunity to show you how Tika has grown up over the years into the load-bearing walls of the entire Lucene ecosystem.
Chapter 11 explains what to do when stock Tika out of the box doesn’t handle your file type identification, extraction, and representation needs. Read: you don’t have to pick another whiz-bang technology—you simply extend Tika. We show you how in this chapter, taking you start-to-end through an example of a prescription file type that you may exchange with a doctor.
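To give a flavor of what extending Tika involves, here is a minimal sketch of a custom Parser for a hypothetical prescription media type, written against a recent Tika release; the type name and the emitted content are illustrative only. A parser like this is typically registered through a META-INF/services/org.apache.tika.parser.Parser file so that AutoDetectParser can discover it.

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class PrescriptionParser implements Parser {
    private static final MediaType TYPE =
            MediaType.application("x-prescription+xml");   // hypothetical media type

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return Collections.singleton(TYPE);
    }

    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
        // A real parser would read the stream here; this sketch emits a fixed body.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", "extracted prescription text goes here");
        xhtml.endDocument();
    }
}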
Chapter 12 is the first case study of the book, and it’s high-visibility. We show you how NASA and its planetary and Earth science communities are using Tika to search planetary images, to extract data and metadata from Earth science files, and to identify content for dissemination and acquisition.
Chapter 13 shows you how the Apache Jackrabbit content repository, a key component in many content and document management systems, uses Tika to implement full-text search and WebDAV integration.
Chapter 14 presents how Tika is used at the National Cancer Institute, helping to power data systems for the Early Detection Research Network (EDRN). We show you how Tika is an integral component of another Apache technology, OODT, the data system infrastructure used to power many national-scale data systems. Tika helps to detect file types, and helps to organize cancer information as it’s catalogued, archived, and made available to the broader scientific community.
For chapter 15, we interviewed Ken Krugler and Chris Schneider of Bixo Labs about how they used Tika to classify and identify content from the Public Terabyte Dataset project, an ambitious endeavor to make available a traditional web-scale dataset for public use. Using Tika, Ken and his team demonstrate a classic search engine example, and identify several areas of improvement and future work in Tika including language identification and charset detection.
The book contains two appendixes. The first is a Tika quick reference. Think of it as a cheat-sheet for using Tika, its commands, and a compact form of some of Tika’s documentation. The second appendix is a description of Tika’s relevant metadata keys, giving the reader an idea of how and when to use them in a custom parser, in any of the existing Parser classes that ship with Tika, or in any downstream program or analysis desired.
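For orientation, the command-line options cataloged in the quick reference follow this general pattern; the jar file name depends on the release you download, and running the jar with --help prints the authoritative list of options.

java -jar tika-app-1.0.jar --detect   report.pdf     (print the detected media type)
java -jar tika-app-1.0.jar --text     report.pdf     (extract plain text)
java -jar tika-app-1.0.jar --metadata report.pdf     (print the extracted metadata)
java -jar tika-app-1.0.jar --language report.pdf     (guess the document language)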