Problems with Tika: a collection of common programming errors


  • Rahul Kulhari
    java apache parsing rtf tika
    I am parsing a document that contains RTF content using Apache Tika, but it throws an exception and does not return the contents of the document. Here is a piece of the code: public String contentEx(File f) throws IOException, SAXException, TikaException { System.out.println(f.getName()); InputStream is = new FileInputStream(f); Parser ps = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(); Metadata metadata = new Metadata(); ps.parse(is, bch, metadata, new ParseContext()); return bch.to

  • user2041057
    java dependencies tika
    I want to use Tika to extract the text of file formats such as .doc, .ppt and so on. Currently I depend on tika-app-1.2.jar, but I think depending on this jar is not a good idea because it is a runnable jar. Moreover, parsing .ppt files gives me this runtime exception: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5de82b72 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.ap

  • AKIWEB
    java parsing tika
    What’s wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs I am able to extract the text, but for others it throws the error: Invalid dictionary, found: ” but expected: ‘/’ org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878 I also don’t get any metadata values in the md variable for some PDFs, while for others I do. This is my code. Is there some problem with the byte array? priva

  • Anish
    java pdf tika
    I am using Apache Tika 0.9 to extract plain text from different file formats. Everything works fine, and for PDF extraction my code extracts text from almost all PDFs (unless the PDF is around 200 MB and scanned). But once I move my application to another machine (from Windows 7 to Windows Server 2008), it works for a few PDFs and throws a runtime exception for other PDFs which worked on the previous machine. All other formats are running superbl

  • user2041057
    java tika text-extraction
    I have extracted the text of a .pdf file with Tika using the AutoDetectParser class, but when I use the same code to extract the text of a .ppt file, it throws an exception. How can I do it? Thanks. EDIT: The code that I used is: File file = new File("1.ppt"); InputStream input = new FileInputStream(file); Parser autoDetectParser = new AutoDetectParser(); Metadata metadata = new Metadata(); StringWriter writer = new StringWriter(); ContentHandler handler = new WriteOutContentHandler(writer); autoD

  • javanna
    solr full-text-search tika solr-cell
    I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf" On submitting the request, I get this error: Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color

  • Ravish Bhagdev
    solr full-text-indexing pdfbox tika document-conversion
    It seems that Solr is not parsing my PDF files correctly. Is there any alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I am getting random spaces in my content when using it. I have isolated the problem by running the PDF through PDFBox directly (latest version), which has the same problem. Some commercial OCR software such as Omnifind works fine on PDFs, but we are not able to integrate it with Solr in the same way

  • dfj
    solr tika
    I’m having difficulty with a Solr import using Tika: my import keeps crashing when indexing web pages. I am removing the content of the Tika documents and restarting the import, but this is very tedious, and I obviously lose the content of these documents. Here is the crash log: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerExceptio

  • anand khatri
    solr tika data-import solr4 apache-tika
    I am trying to extract the meta tags of HTML files and index them into Solr with the Tika integration. I am not able to extract those meta tags with Tika or display them in Solr. My HTML file looks like this: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta name="product_id" content="11"/> <meta name="assetid" content="10001"/> <meta name="title" content="title of the article"/> <meta name="type" content="0xyzb"/> <meta name="categor

  • Dhaval950
    solr solrnet tika
    I have used Solr 3.1, Apache Tika 0.9 and SolrNet 0.3.1 to index documents such as .doc and .pdf files. I have successfully indexed and extracted a document locally using this code: Startup.Init<Article>("http://k9server:8080/solr"); ISolrOperations<Article> solr = ServiceLocator.Current.GetInstance<ISolrOperations<Article>>(); string filecontent = null; using (var file = File.OpenRead(@"D:\\solr.doc")) { var response = solr.Extract(new ExtractParameters(file, "abcd1") { ExtractOnly = tru

  • TechGeeky
    java parsing tika
    I am crawling a web page, extracting all the links from it, and then trying to parse every URL with Apache Tika and BoilerPipe using the code below. For some URLs it parses very well, but for a few XML documents I get the following error. I am not sure what this error means. Is it a problem with my code or with the XML file? This is line number 100 in HTML Parser.java: String parsedText = tika.parseToString(htmlStream, md); Error that I am havin

  • Jarrod Roberson
    java mime tika
    I am trying to add a custom MIME type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime: <?xml version="1.0" encoding="UTF-8"?> <mime-info><mime-type type="text/stringtemplategroup"><glob pattern="*.stg"/></mime-type><mime-type type="text/stringtemplate"><glob pattern="*.st"/></mime-type> </mime-info> I am getting an error about a Conflicting extension pattern .st: Caused by: org.apache.tika.mime.MimeType

  • Wivani
    apache-poi tika
    I’m using POI to extract data from an Excel file (the 5th column of the sheet contains names of files that exist in my filesystem). I loop over the table’s rows, extracting each cell’s content with POI, and for each row I create an instance of Tika and parse the file named in the 5th column with Tika’s parseToString(file). When the file is an Office document (Excel, PPT, Word) I get this error: Exception in thread "AWT-EventQueue-0" java.lang.NoSuchFieldError: filesystem at org.apache.poi.hwp

  • RNJ
    java parsing namespaces confluence tika
    I’m using Tika 1.0 to remove the HTML content from some Confluence 4.3 pages; however, it fails when trying to parse pages like: <ac:macro ac:name="column"><ac:rich-text-body><p><ac:image ac:height="64" ac:width="70"><ri:attachment ri:filename="plugins_icon.png"><ri:page ri:content-title="_Images" /></ri:attachment></ac:image></p></ac:rich-text-body> </ac:macro> It throws the following error: Exception in thread "main" org.apache.tik
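
A note on the java.lang.NoSuchFieldError reported in Wivani's POI question above: in general, this error means a class was compiled against one version of a library and is running against another. One possible cause, not confirmed by the truncated excerpt, is that two different POI versions are on the classpath (one added directly, one pulled in with Tika). The sketch below uses only the JDK to print where a class was loaded from, which may help narrow this down. ClasspathProbe and its method names are made up for illustration; they are not part of Tika or POI.

```java
import java.security.CodeSource;

/** Reports where a class was loaded from, to help spot duplicate library versions. */
final class ClasspathProbe {

    /** Describes the origin (jar or directory URL) of an already-loaded class. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes loaded by the bootstrap loader (for example java.lang.String) have no code source.
            return "(no code source, e.g. a JDK class)";
        }
        return source.getLocation().toString();
    }

    /** Looks a class up by name, without initializing it, and describes where it came from. */
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return locationOf(type);
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    /** Prints the origin of every class name passed on the command line. */
    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

To use it, run the class with the same classpath as the failing application and pass the fully qualified class names that appear in your own stack trace. If the class that raises the error and the class that declares the missing field resolve to jars of different versions, aligning those versions is a reasonable next step.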
