com.java4less.pdf
Class PDFToTextConverter

java.lang.Object
  extended by org.apache.pdfbox.util.PDFStreamEngine
      extended by org.apache.pdfbox.util.PDFTextStripper
          extended by com.java4less.pdf.PDFToTextConverter

public class PDFToTextConverter
extends org.apache.pdfbox.util.PDFTextStripper

This class converts a PDF file to text. Conversion works only if the PDF file contains actual text and not just an image. For example, some scanners and fax machines create PDF files that contain only an image of the scanned page. Such files cannot be converted to text.
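Because image-only PDFs contain no text to extract, a caller may want to check the result before using it. The following is a minimal sketch, not part of the library. It assumes that conversion of an image-only PDF yields an empty or whitespace-only string, which this page does not confirm. The class and method names are hypothetical.

```java
// Hypothetical helper; not part of com.java4less.pdf.
final class ExtractedTextCheck {

    /**
     * Returns true if the extracted text contains at least one
     * non-whitespace character.
     * Assumption: an image-only (scanned) PDF produces null, empty,
     * or whitespace-only output.
     */
    static boolean hasExtractableText(String extracted) {
        return extracted != null && !extracted.trim().isEmpty();
    }
}
```

A caller could pass the result of convertToString to hasExtractableText and fall back to OCR or report an error when it returns false.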


Constructor Summary
PDFToTextConverter()
           
 
Method Summary
 java.lang.String convertToString(java.io.InputStream is)
          Converts a PDF input stream to text.
 int getPageColumns()
          Returns the number of characters per line in the text output.
 boolean isAddEmptyLines()
          Returns whether empty lines are kept; if false, empty lines will be removed.
 boolean isPreserveSpaces()
          Returns whether spaces are preserved; if false, spaces will be removed.
 void setAddEmptyLines(boolean addEmptyLines)
          Sets whether empty lines are kept; if false, empty lines will be removed.
 void setPageColumns(int pageColumns)
          Sets the number of characters per line in the text output.
 void setPreserveSpaces(boolean preserveSpaces)
          Sets whether spaces are preserved; if false, spaces will be removed.
 
Methods inherited from class org.apache.pdfbox.util.PDFTextStripper
getAverageCharTolerance, getEndBookmark, getEndPage, getLineSeparator, getPageSeparator, getSpacingTolerance, getStartBookmark, getStartPage, getText, getText, getWordSeparator, inspectFontEncoding, resetEngine, setAverageCharTolerance, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, writeText, writeText
 
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, processEncodedText, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFToTextConverter

public PDFToTextConverter()
                   throws java.io.IOException
Throws:
java.io.IOException

Method Detail

convertToString

public java.lang.String convertToString(java.io.InputStream is)
                                 throws java.io.IOException
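Converts a PDF input stream to text.

Parameters:
is - the input stream containing the PDF document
Returns:
the text extracted from the PDF
Throws:
java.io.IOException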
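A typical call hands convertToString an open stream and makes sure the stream is closed afterwards. The sketch below shows that pattern. So that it compiles without the library, it calls the converter through StreamToText, a hypothetical one-method interface that mirrors the documented signature. ConvertExample and convert are also illustrative names, not part of the API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative wrapper; not part of com.java4less.pdf.
final class ConvertExample {

    /** Hypothetical stand-in with the same shape as convertToString(InputStream). */
    interface StreamToText {
        String convertToString(InputStream is) throws IOException;
    }

    /**
     * Runs the conversion, always closes the source stream, and rethrows
     * any IOException as an unchecked exception.
     */
    static String convert(StreamToText converter, InputStream source) {
        try (InputStream in = source) {
            return converter.convertToString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the library on the classpath, the first argument could presumably be the method reference new PDFToTextConverter()::convertToString. Note that the constructor itself declares java.io.IOException, so the caller still has to handle that exception.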

getPageColumns

public int getPageColumns()
Returns the number of characters per line in the text output.

Returns:
the number of characters per line

setPageColumns

public void setPageColumns(int pageColumns)
Sets the number of characters per line in the text output.

Parameters:
pageColumns - the number of characters per line
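This page does not say how the converter uses pageColumns internally. One plausible reading, common in layout-preserving text extraction, is that horizontal positions on the page are scaled onto a fixed-width line of pageColumns characters. The sketch below illustrates only that general idea. All names in it are hypothetical, and it should not be taken as the library's actual algorithm.

```java
import java.util.Arrays;

// Illustration of a fixed-width character grid; not the library's implementation.
final class ColumnGridSketch {

    /** Scales a horizontal position x (0..pageWidth) to a 0-based character column. */
    static int columnFor(double x, double pageWidth, int pageColumns) {
        int col = (int) (x / pageWidth * pageColumns);
        // Clamp so that positions on or past the page edges stay inside the line.
        return Math.max(0, Math.min(pageColumns - 1, col));
    }

    /** Creates a line of pageColumns space characters. */
    static char[] blankLine(int pageColumns) {
        char[] line = new char[pageColumns];
        Arrays.fill(line, ' ');
        return line;
    }

    /** Writes a text fragment into the line at the column for x; overflow is cut off. */
    static void place(char[] line, String fragment, double x, double pageWidth) {
        int start = columnFor(x, pageWidth, line.length);
        for (int i = 0; i < fragment.length() && start + i < line.length; i++) {
            line[start + i] = fragment.charAt(i);
        }
    }
}
```

Under this reading, a larger pageColumns value gives finer horizontal resolution and wider output lines, while a smaller value makes it more likely that neighbouring fragments overwrite each other.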

isPreserveSpaces

public boolean isPreserveSpaces()
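Returns whether spaces are preserved in the text output. If false, spaces will be removed.

Returns:
true if spaces are preserved, false if they are removed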

setPreserveSpaces

public void setPreserveSpaces(boolean preserveSpaces)
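Sets whether spaces are preserved in the text output. If false, spaces will be removed.

Parameters:
preserveSpaces - true to preserve spaces, false to remove them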

isAddEmptyLines

public boolean isAddEmptyLines()
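Returns whether empty lines are kept in the text output. If false, empty lines will be removed.

Returns:
true if empty lines are kept, false if they are removed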

setAddEmptyLines

public void setAddEmptyLines(boolean addEmptyLines)
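Sets whether empty lines are kept in the text output. If false, empty lines will be removed.

Parameters:
addEmptyLines - true to keep empty lines, false to remove them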
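To make the effect of addEmptyLines = false concrete, the sketch below drops empty lines from a block of text. It is an illustration of the documented behaviour, not the library's code. Treating whitespace-only lines as empty is an assumption, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Illustration only; not part of com.java4less.pdf.
final class EmptyLineSketch {

    /**
     * Removes empty lines from the text and joins the rest with '\n'.
     * Assumption: lines containing only whitespace count as empty.
     */
    static String removeEmptyLines(String text) {
        return Arrays.stream(text.split("\\R"))          // split on any line break
                     .filter(line -> !line.trim().isEmpty())
                     .collect(Collectors.joining("\n"));
    }
}
```

For the input "Name\n\n  \nAddress", removeEmptyLines returns "Name\nAddress".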