Problems about Tika: Collection of common programming errors
Rahul Kulhari
java apache parsing rtf tika
I am parsing a document that contains RTF content using Apache Tika, but it throws an exception and does not return the contents of the document. Here is a piece of code: public String contentEx(File f) throws IOException, SAXException, TikaException { System.out.println(f.getName()); InputStream is = new FileInputStream(f); Parser ps = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(); Metadata metadata = new Metadata(); ps.parse(is, bch, metadata, new ParseContext()); return bch.to
user2041057
java dependencies tika
I want to use Tika to extract the text of file formats such as .doc, .ppt and so on. Currently I depend on tika-app-1.2.jar, but I think depending on this jar is not a good idea because it is a runnable jar. Moreover, when parsing .ppt files it gives me this RuntimeException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5de82b72 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.ap
AKIWEB
java parsing tika
What’s wrong with this code? I am trying to parse PDF files and extract their text. For some PDFs I am able to extract the text, but for others it throws this error: Invalid dictionary, found: ” but expected: ‘/’ org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@67fb878 Also, for some PDFs I don’t get any metadata values in the md variable, although for others I do. This is my code. Is there some problem with the ByteArray? priva
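Since the question suspects the byte array, one cheap diagnostic is to confirm that the bytes handed to the parser still begin with the PDF marker `%PDF-`. The sketch below uses only the JDK and does not call Tika; the class and method names are made up for illustration. It cannot prove a file is a valid PDF, and a file that passes can still trigger the "Invalid dictionary" error if its internal structure is damaged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or PDFBox.
class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the buffer begins with the ASCII marker "%PDF-". */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the first few bytes of the file and applies the same check. */
    static boolean fileStartsWithPdfHeader(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in.readNBytes(MAGIC.length));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the check fails on the byte array but succeeds on the file it was read from, the bytes were probably altered or cut short before parsing, for example by a text-mode read or a partial copy.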
Anish
java pdf tika
I am using Apache Tika 0.9 to extract plain text from different file formats. Everything works fine except PDF extraction. My code extracts text from almost all PDFs (unless the PDF is around 200 MB and scanned). But once I move my application to another machine (from Windows 7 to Windows Server 2008), it works for a few PDFs and throws a RuntimeException for other PDFs which were working on the previous machine. All other formats are running superbl
user2041057
java tika text-extraction
I have extracted the text of a .pdf file with Tika using the AutoDetectParser class, but when I use the same code to extract the text of a .ppt file, it throws an exception. How can I do it? Thanks. EDIT: The code that I used is: File file = new File("1.ppt"); InputStream input = new FileInputStream(file); Parser autoDetectParser = new AutoDetectParser(); Metadata metadata = new Metadata(); StringWriter writer = new StringWriter(); ContentHandler handler = new WriteOutContentHandler(writer); autoD
javanna
solr full-text-search tika solr-cell
I am trying to index using a curl-based request. The request is: curl "http://localhost:8080/solr1/update/extract?literal.id=who.pdf&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/root/apache-solr-3.1.0/docs/who.pdf" On submitting the request, I am getting this error: Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color
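The request above passes its options (literal.id, uprefix, fmap.content, commit) as URL query parameters. As a sketch only, the same URL can be assembled in Java with each name and value percent-encoded, which matters once an id contains spaces or other reserved characters. The class name is hypothetical, only the JDK is used, and nothing here explains the truncated error page.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for building the extract URL; it does not contact Solr.
class SolrExtractUrl {
    /** Joins the parameters into a query string, form-encoding names and values. */
    static String build(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in the order they were added.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "who.pdf");
        params.put("uprefix", "attr_");
        params.put("fmap.content", "attr_content");
        params.put("commit", "true");
        System.out.println(build("http://localhost:8080/solr1/update/extract", params));
    }
}
```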
Ravish Bhagdev
solr full-text-indexing pdfbox tika document-conversion
It seems that Solr is not parsing my PDF files correctly. Is there any alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I am getting random spaces within my content when using it. I have isolated the problem by running the PDF through PDFBox directly (latest version), which has the same problem. Some commercial OCR software such as Omnifind works fine on PDFs, but we are not able to integrate it with Solr in the same way
dfj
solr tika
I’m having difficulty with a Solr import that uses Tika: the import keeps crashing when indexing web pages. I am removing the content of the Tika documents and restarting the import, but this is very tedious, and I obviously lose the content of these documents. Here is the crash log: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerExceptio
anand khatri
solr tika data-import solr4 apache-tika
I am trying to extract the meta tags of HTML files and index them into Solr with the Tika integration. I am not able to extract those meta tags with Tika or display them in Solr. My HTML file looks like this: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta name="product_id" content="11"/> <meta name="assetid" content="10001"/> <meta name="title" content="title of the article"/> <meta name="type" content="0xyzb"/> <meta name="categor
Dhaval950
solr solrnet tika
I have used Solr 3.1, Apache Tika 0.9 and SolrNet 0.3.1 to index documents such as .doc and .pdf files. I have successfully indexed and extracted a document locally using this code: Startup.Init<Article>("http://k9server:8080/solr"); ISolrOperations<Article> solr = ServiceLocator.Current.GetInstance<ISolrOperations<Article>>(); string filecontent = null; using (var file = File.OpenRead(@"D:\\solr.doc")) { var response = solr.Extract(new ExtractParameters(file, "abcd1") { ExtractOnly = tru
TechGeeky
java parsing tika
I am crawling a web page, extracting all the links from it, and then trying to parse each URL using Apache Tika and BoilerPipe with the code below. For some URLs it parses very well, but for a few XML files I get the following error. I am not sure what this error means. Is it a problem with my code or with the XML file? This is line 100 in HTML Parser.java: String parsedText = tika.parseToString(htmlStream, md); Error that I am havin
Jarrod Roberson
java mime tika
I am trying to add a custom MIME type to Apache Tika. I have the following custom-mimetypes.xml document in org.apache.tika.mime: <?xml version="1.0" encoding="UTF-8"?> <mime-info> <mime-type type="text/stringtemplategroup"> <glob pattern="*.stg"/> </mime-type> <mime-type type="text/stringtemplate"> <glob pattern="*.st"/> </mime-type> </mime-info> I am getting an error about a Conflicting extension pattern .st: Caused by: org.apache.tika.mime.MimeType
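The error message suggests that the *.st pattern is already claimed by another type definition that Tika has loaded, although the excerpt is cut off before it says which one. As a rough sketch under that assumption, the following JDK-only code reads one or more mime-info documents and reports every glob pattern claimed by more than one type. It does not use Tika, and the class and method names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika API.
class GlobConflicts {
    /** Maps each glob pattern to the MIME types that declare it, across all documents. */
    static Map<String, List<String>> globOwners(String... mimeInfoXml) {
        Map<String, List<String>> owners = new TreeMap<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            for (String xml : mimeInfoXml) {
                Document doc = builder.parse(new InputSource(new StringReader(xml)));
                NodeList globs = doc.getElementsByTagName("glob");
                for (int i = 0; i < globs.getLength(); i++) {
                    Element glob = (Element) globs.item(i);
                    // In the layout shown in the question, <glob> sits directly inside <mime-type>.
                    Element type = (Element) glob.getParentNode();
                    owners.computeIfAbsent(glob.getAttribute("pattern"), k -> new ArrayList<>())
                          .add(type.getAttribute("type"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not read mime-info XML", e);
        }
        return owners;
    }

    /** Keeps only the patterns declared by two or more types. */
    static Map<String, List<String>> conflicts(String... mimeInfoXml) {
        Map<String, List<String>> result = globOwners(mimeInfoXml);
        result.values().removeIf(types -> types.size() < 2);
        return result;
    }
}
```

Passing the custom document together with the other definitions in use would show which existing type, if any, shares the pattern.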
Wivani
apache-poi tika
I’m using POI to extract data from an Excel file (the 5th column in the sheet contains names of files that exist in my filesystem). I loop over the table’s rows (extracting each cell’s content with POI), and for each row I create an instance of Tika and parse the file named in the 5th column with Tika’s parseToString(file). When the file is an Office document (Excel, PPT, Word) I get this error: Exception in thread "AWT-EventQueue-0" java.lang.NoSuchFieldError: filesystem at org.apache.poi.hwp
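A NoSuchFieldError at run time generally means that a class was compiled against a different version of another class than the one actually loaded, which often happens when two versions of a library sit on the classpath. Assuming that is the situation here (the excerpt does not confirm it), a small JDK-only sketch like the one below can show which jar each class was loaded from. The helper names are hypothetical.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper; it only reports where a class came from.
class ClassOrigin {
    /** Describes the jar or directory a class was loaded from. */
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return type.getName() + " <- "
                + (location == null ? "(JDK runtime / unknown)" : location.toString());
    }

    /** Same as describe, but looks the class up by name without initializing it. */
    static String describeByName(String className) {
        try {
            return describe(Class.forName(className, false, ClassOrigin.class.getClassLoader()));
        } catch (ClassNotFoundException e) {
            return className + " <- (not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Pass the class names to inspect, for example classes from the stack trace.
        for (String name : args) {
            System.out.println(describeByName(name));
        }
    }
}
```

Running it with a POI class name and a Tika class name as arguments would show whether the POI classes come from a standalone POI jar, from a bundled Tika jar, or from both.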
RNJ
java parsing namespaces confluence tika
I’m using Tika 1.0 to remove the HTML content from some Confluence 4.3 pages; however, it fails when trying to parse pages like: <ac:macro ac:name="column"><ac:rich-text-body><p><ac:image ac:height="64" ac:width="70"><ri:attachment ri:filename="plugins_icon.png"><ri:page ri:content-title="_Images" /></ri:attachment></ac:image></p></ac:rich-text-body> </ac:macro> It throws the following error: Exception in thread "main" org.apache.tik
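The error text is cut off, so the cause is not certain. One plausible reading, given the "namespaces" tag, is that the fragment uses the ac: and ri: prefixes without declaring them, which a namespace-aware XML parser rejects. The sketch below demonstrates that behaviour with the JDK parser only: the bare fragment fails, and the same fragment parses once it is wrapped in a root element that declares both prefixes. It does not call Tika, and the namespace URIs are placeholders invented for this example, not Confluence's real ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for experimenting with storage-format fragments.
class StorageFormatFragment {
    // Placeholder URIs (assumption): any URI binds the prefix for parsing purposes.
    static final String AC_NS = "urn:example:confluence:ac";
    static final String RI_NS = "urn:example:confluence:ri";

    /** Wraps a fragment in a root element that declares the ac: and ri: prefixes. */
    static String wrap(String fragment) {
        return "<root xmlns:ac=\"" + AC_NS + "\" xmlns:ri=\"" + RI_NS + "\">"
                + fragment + "</root>";
    }

    /** Returns true if a namespace-aware parser accepts the text as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Covers SAXException for malformed input as well as I/O or configuration problems.
            return false;
        }
    }
}
```

If wrapping makes the page parse, pre-processing the page content this way before handing it to the parser is one option to try.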