{"id":3680,"date":"2014-03-29T07:50:33","date_gmt":"2014-03-29T07:50:33","guid":{"rendered":"https:\/\/unknownerror.org\/index.php\/2014\/03\/29\/problem-about-tika-collection-of-common-programming-errors\/"},"modified":"2014-03-29T07:50:33","modified_gmt":"2014-03-29T07:50:33","slug":"problem-about-tika-collection-of-common-programming-errors","status":"publish","type":"post","link":"https:\/\/unknownerror.org\/index.php\/2014\/03\/29\/problem-about-tika-collection-of-common-programming-errors\/","title":{"rendered":"problem about tika-Collection of common programming errors"},"content":{"rendered":"<ul>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/13e4371b8ada8a37a7e955bad18bb989?s=32&amp;d=identicon&amp;r=PG&amp;f=1\" \/><br \/>\nRahul Kulhari<br \/>\njava apache parsing rtf tika<br \/>\nI am parsing a document that contains RTF content using Apache Tika, but it throws an exception and does not return the contents of the document. Here is a piece of code: public String contentEx(File f) throws IOException, SAXException, TikaException { System.out.println(f.getName()); InputStream is = new FileInputStream(f); Parser ps = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(); Metadata metadata = new Metadata(); ps.parse(is, bch, metadata, new ParseContext()); return bch.to<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/0c5f99268cf92dfdccfc14c5a22cefde?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nuser2041057<br \/>\njava dependencies tika<br \/>\nI want to use Tika to extract the text of some file formats like .doc, .ppt and so on. Currently I&#8217;m depending on tika-app-1.2.jar, but I think depending on this jar is not a good idea because it is a runnable jar.
Moreover, when parsing .ppt files it gives me this runtime exception: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5de82b72 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.ap<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/1d5036972c6c5d712b9d7a558eaa4c2d?s=32&amp;d=identicon&amp;r=PG&amp;f=1\" \/><br \/>\nAKIWEB<br \/>\njava parsing tika<br \/>\nWhat&#8217;s wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs I am able to extract the text, but for others it throws the error: Invalid dictionary, found: &#39;&#39; but expected: &#39;\/&#39; org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878 Also, I don&#8217;t get any metadata values in the md variable for some PDFs, though I do for others. This is my code. Is there some problem with the ByteArray? priva<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/a3f585de7058de6d658aa65ceb51fe63?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nAnish<br \/>\njava pdf tika<br \/>\nI am using Apache Tika 0.9 to extract plain text from different file formats. Everything works fine, and my code extracts text from almost all PDFs (unless the PDF is around 200 MB and scanned). But once I move my application to another machine (from Windows 7 to Windows Server 2008), it works for a few PDFs and throws a runtime exception for other PDFs that worked on the previous machine.
All other formats are running superbl<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/0c5f99268cf92dfdccfc14c5a22cefde?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nuser2041057<br \/>\njava tika text-extraction<br \/>\nI have extracted the text of a .pdf file with Tika using the AutoDetectParser class, but when I use the same code to extract the text of a .ppt file, it throws an exception. How can I do it? Thanks. EDIT: The code that I used is: File file = new File(&quot;1.ppt&quot;); InputStream input = new FileInputStream(file); Parser autoDetectParser = new AutoDetectParser(); Metadata metadata = new Metadata(); StringWriter writer = new StringWriter(); ContentHandler handler = new WriteOutContentHandler(writer); autoD<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/i.stack.imgur.com\/qNUpW.jpg?s=32&amp;g=1\" \/><br \/>\njavanna<br \/>\nsolr full-text-search tika solr-cell<br \/>\nI am trying to index using a curl-based request. The request is: curl &quot;http:\/\/localhost:8080\/solr1\/update\/extract?literal.id=who.pdf&amp;uprefix=attr_&amp;fmap.content=attr_content&amp;commit=true&quot; -F &quot;myfile=@\/root\/apache-solr-3.1.0\/docs\/who.pdf&quot; On submitting the request, I am getting this error: Error report&lt;\/title&gt;&lt;style&gt;&lt;!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/9931c23eef3760b8c2fe85828d3ef24f?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nRavish Bhagdev<br \/>\nsolr full-text-indexing pdfbox tika document-conversion<br \/>\nIt seems that Solr is not parsing my PDF files correctly. Is there any alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I am getting random spaces in my content when using it.
I have isolated the problem by running the PDF through PDFBox directly (latest version), which has the same problem. Some commercial OCR software such as Omnifind works fine on PDFs, but we are not able to integrate it with Solr in the same way<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/40ec25f3a781912beea4454bb64c91bb?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\ndfj<br \/>\nsolr tika<br \/>\nI&#8217;m having difficulty with a Solr import using Tika: the import keeps crashing on some documents when indexing web pages. I am removing the content of the Tika documents and restarting the import, but this is very tedious, and I obviously lose the content of these documents. Here is the crash log: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerExceptio<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/aa4842fd2fcbbc0b067de0f8d8719a35?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nanand khatri<br \/>\nsolr tika data-import solr4 apache-tika<br \/>\nI am trying to extract the meta tags of HTML files and index them into Solr with Tika integration.
I am not able to extract those meta tags with Tika or display them in Solr. My HTML file looks like this: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text\/html; charset=UTF-8&quot;&gt; &lt;meta name=&quot;product_id&quot; content=&quot;11&quot;\/&gt; &lt;meta name=&quot;assetid&quot; content=&quot;10001&quot;\/&gt; &lt;meta name=&quot;title&quot; content=&quot;title of the article&quot;\/&gt; &lt;meta name=&quot;type&quot; content=&quot;0xyzb&quot;\/&gt; &lt;meta name=&quot;categor<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/37fbfa7c33b17ed81bba3fabb861e576?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nDhaval950<br \/>\nsolr solrnet tika<br \/>\nI have used Solr 3.1, Apache Tika 0.9 and Solrnet 0.3.1 to index documents such as .doc and .pdf files. I have successfully indexed and extracted documents locally using this code: Startup.Init&lt;Article&gt;(&quot;http:\/\/k9server:8080\/solr&quot;); ISolrOperations&lt;Article&gt; solr = ServiceLocator.Current.GetInstance&lt;ISolrOperations&lt;Article&gt;&gt;(); string filecontent = null; using (var file = File.OpenRead(@&quot;D:\\\\solr.doc&quot;)) { var response = solr.Extract(new ExtractParameters(file, &quot;abcd1&quot;) { ExtractOnly = tru<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/411b669b6877cdf3c69f47490df330cf?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nTechGeeky<br \/>\njava parsing tika<br \/>\nI am crawling a webpage, extracting all the links from it, and then trying to parse all the URLs using Apache Tika and BoilerPipe with the code below. For some URLs it parses very well, but for a few XML files I got the following error. I am not sure what this error means. Is it a problem with my code or with the XML file?
This is line number 100 in HTML Parser.java: String parsedText = tika.parseToString(htmlStream, md); Error that I am havin<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/12356ae4540e9286f4984eb7beef801e?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nJarrod Roberson<br \/>\njava mime tika<br \/>\nI am trying to add a custom mime type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime: &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt; &lt;mime-info&gt;&lt;mime-type type=&quot;text\/stringtemplategroup&quot;&gt;&lt;glob pattern=&quot;*.stg&quot;\/&gt;&lt;\/mime-type&gt;&lt;mime-type type=&quot;text\/stringtemplate&quot;&gt;&lt;glob pattern=&quot;*.st&quot;\/&gt;&lt;\/mime-type&gt; &lt;\/mime-info&gt; I am getting an error about a conflicting extension pattern .st: Caused by: org.apache.tika.mime.MimeType<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/3f91180e9a74235a4378b01a9e3e72ae?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nWivani<br \/>\napache-poi tika<br \/>\nI&#8217;m using POI to extract data from an Excel file.
(The 5th column in the Excel sheet contains names of files that exist in my filesystem.) I loop over the table&#8217;s rows (extracting the cell&#8217;s content with POI), create an instance of Tika for each row, and parse the file named in the 5th column with Tika&#8217;s parseToString(file). When the file is an Office document (Excel, PowerPoint, Word) I get this error: Exception in thread &quot;AWT-EventQueue-0&quot; java.lang.NoSuchFieldError: filesystem at org.apache.poi.hwp<\/li>\n<li><img decoding=\"async\" src=\"http:\/\/www.gravatar.com\/avatar\/18ca22c4b3efc91e5e498c7ca4125026?s=32&amp;d=identicon&amp;r=PG\" \/><br \/>\nRNJ<br \/>\njava parsing namespaces confluence tika<br \/>\nI&#8217;m using Tika 1.0 to remove the HTML content from some Confluence 4.3 pages; however, it fails when trying to parse pages like: &lt;ac:macro ac:name=&quot;column&quot;&gt;&lt;ac:rich-text-body&gt;&lt;p&gt;&lt;ac:image ac:height=&quot;64&quot; ac:width=&quot;70&quot;&gt;&lt;ri:attachment ri:filename=&quot;plugins_icon.png&quot;&gt;&lt;ri:page ri:content-title=&quot;_Images&quot; \/&gt;&lt;\/ri:attachment&gt;&lt;\/ac:image&gt;&lt;\/p&gt;&lt;\/ac:rich-text-body&gt; &lt;\/ac:macro&gt; It throws the following error: Exception in thread &quot;main&quot; org.apache.tik<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Rahul Kulhari java apache parsing rtf tika I am parsing one Document that contains RTF Content using Apache tika but it is giving some exception.
it is not giving contents of document.Here is a piece of code : public String contentEx(File f) throws IOException, SAXException,TikaException {System.out.println(f.getName());InputStream is = new FileInputStream(f);Parser ps = new AutoDetectParser();BodyContentHandler bch [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3680","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/3680","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/comments?post=3680"}],"version-history":[{"count":0,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/posts\/3680\/revisions"}],"wp:attachment":[{"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/media?parent=3680"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/categories?post=3680"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/unknownerror.org\/index.php\/wp-json\/wp\/v2\/tags?post=3680"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}