public class Html2Txt extends java.lang.Object
One important restriction is that the HTML document must be "well-formed", i.e. all opening tags must be exactly matched by closing tags, for example:
Let's <i>emphasize</i>. <ul> <li>List items</li> <li>must be terminated with "<tt></li></tt>". </ul> <br /> <hr />
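The restriction above can be checked before conversion. The following is a minimal sketch, assuming that "well-formed" here means XML well-formedness (the SAXException declared by the html2txt methods suggests an XML parser is involved, but that is an inference). The class name WellFormedCheck and its method are hypothetical and not part of Html2Txt.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Html2Txt.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Unmatched or unclosed tags end up here. The JDK's default
            // parser may additionally log the error to stderr.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<ul><li>terminated</li></ul>")); // true
        System.out.println(isWellFormed("<ul><li>not terminated</ul>"));  // false
    }
}
```

Note that empty elements written as "<br />" and "<hr />" pass this check, while "<br>" without a closing slash does not.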
Modifier and Type | Class and Description |
---|---|
static interface | Html2Txt.BlockElementFormatter: Formats an HTML block element. |
static interface | Html2Txt.HtmlErrorHandler: Handles Html2Txt.HtmlExceptions. |
static class | Html2Txt.HtmlException: Representation of an exceptional condition that occurred during HTML processing. |
static class | Html2Txt.IndentingBlockElementFormatter |
static interface | Html2Txt.InlineElementFormatter: Formats an HTML inline element. |
Modifier and Type | Field and Description |
---|---|
protected static java.util.Map<java.lang.String,Html2Txt.BlockElementFormatter> | ALL_BLOCK_ELEMENTS: Defines the strategies for formatting HTML block elements. |
protected static java.util.Map<java.lang.String,Html2Txt.InlineElementFormatter> | ALL_INLINE_ELEMENTS: Defines the strategies for formatting HTML inline elements. |
protected static Html2Txt.BlockElementFormatter | OL_FORMATTER: Formatter for the "<ol>" ("ordered list") HTML block element. |
static Html2Txt.HtmlErrorHandler | SIMPLE_HTML_ERROR_HANDLER: All methods of this Html2Txt.HtmlErrorHandler throw the Html2Txt.HtmlException they receive. |
static org.xml.sax.ErrorHandler | SIMPLE_SAX_ERROR_HANDLER: All methods of this ErrorHandler throw the SAXException they receive. |
protected static Html2Txt.BlockElementFormatter | TABLE_FORMATTER: Formatter for the "<table>" HTML block element. |
Constructor and Description |
---|
Html2Txt() |
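SIMPLE_SAX_ERROR_HANDLER in the field summary is described as an ErrorHandler whose methods all throw the SAXException they receive. A minimal sketch of that pattern with the standard org.xml.sax.ErrorHandler interface might look like the following. The class RethrowingHandlerDemo and the constant RETHROWING are hypothetical stand-ins, not the library's actual source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Html2Txt.
class RethrowingHandlerDemo {

    /** Every callback rethrows the exception it was given (stand-in for SIMPLE_SAX_ERROR_HANDLER). */
    static final ErrorHandler RETHROWING = new ErrorHandler() {
        @Override public void warning(SAXParseException e) throws SAXException { throw e; }
        @Override public void error(SAXParseException e) throws SAXException { throw e; }
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
    };

    /** Parses the markup with the rethrowing handler; returns the error message, or null on success. */
    static String parseError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(RETHROWING);
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (Exception e) {
            // ParserConfigurationException or IOException: not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseError("<p>fine</p>"));
        System.out.println(parseError("<p>unterminated"));
    }
}
```

With such a handler even warnings abort parsing, which is a strict but predictable policy. Presumably SIMPLE_HTML_ERROR_HANDLER applies the same idea to Html2Txt.HtmlException.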
Modifier and Type | Method and Description |
---|---|
void | html2txt(org.w3c.dom.Document document, java.io.Writer output): Generates a plain text document from the given HTML document, and writes it to the output. |
void | html2txt(java.io.File inputFile, java.io.File outputFile): Reads, scans and parses the HTML document in the inputFile, generates a plain text document, and writes it to the outputFile. |
void | html2txt(java.io.File inputFile, java.io.Writer output): Reads, scans and parses the HTML document in the inputFile, generates a plain text document, and writes it to the output. |
void | html2txt(java.io.Reader input, java.io.Writer output): Reads, scans and parses the HTML document from the input, generates a plain text document, and writes it to the output. |
static int | maxLength(java.lang.Iterable<? extends java.lang.CharSequence> css) |
static de.unkrig.commons.lang.protocol.Producer<? extends java.lang.String> | rightPad(de.unkrig.commons.lang.protocol.Producer<? extends java.lang.CharSequence> delegate, int width, char c): Wraps the given delegate such that it right-pads the products with c to the given width. |
static de.unkrig.commons.lang.protocol.Consumer<java.lang.CharSequence> | rightTrim(de.unkrig.commons.lang.protocol.Consumer<? super java.lang.String> delegate): Creates and returns a Consumer that forwards its subjects to the delegate, with trailing spaces (' ') removed. |
Html2Txt | setErrorHandler(Html2Txt.HtmlErrorHandler htmlErrorHandler): Sets a custom Html2Txt.HtmlErrorHandler on this object. |
void | setInputCharset(java.nio.charset.Charset cs): Sets the charset to use when reading HTML input files. |
void | setOutputCharset(java.nio.charset.Charset cs): Sets the charset to use when writing text output files. |
Html2Txt | setPageLeftMarginWidth(int pageLeftMarginWidth): The number of spaces that precede each line of output; defaults to zero. |
Html2Txt | setPageRightMarginWidth(int pageRightMarginWidth): The maximum length of output lines is "pageWidth - rightMarginWidth". |
Html2Txt | setPageWidth(int pageWidth): The maximum length of output lines is "pageWidth - rightMarginWidth". |
public static final org.xml.sax.ErrorHandler SIMPLE_SAX_ERROR_HANDLER
All methods of this ErrorHandler throw the SAXException they receive.
public static final Html2Txt.HtmlErrorHandler SIMPLE_HTML_ERROR_HANDLER
All methods of this Html2Txt.HtmlErrorHandler throw the Html2Txt.HtmlException they receive.
protected static final Html2Txt.BlockElementFormatter OL_FORMATTER
Formatter for the "<ol>" ("ordered list") HTML block element.
protected static final Html2Txt.BlockElementFormatter TABLE_FORMATTER
Formatter for the "<table>" HTML block element.
protected static final java.util.Map<java.lang.String,Html2Txt.BlockElementFormatter> ALL_BLOCK_ELEMENTS
To see the HTML block elements and how they are formatted, click the word "ALL_BLOCK_ELEMENTS" (right above). The right hand side of the mapping means:
NYI_BLOCK_ELEMENT_FORMATTER
IGNORE_BLOCK_ELEMENT_FORMATTER
new IndentingBlockElementFormatter(N)
OL_FORMATTER
protected static final java.util.Map<java.lang.String,Html2Txt.InlineElementFormatter> ALL_INLINE_ELEMENTS
To see the HTML inline elements and how they are formatted, click the word "ALL_INLINE_ELEMENTS" (right above). The right hand side of the mapping means:
NYI_INLINE_ELEMENT_FORMATTER
IGNORE_INLINE_ELEMENT_FORMATTER
new SimpleInlineElementFormatter("foo", "bar"): "foo", the element content, and "bar".
A_FORMATTER
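The idea of a map from element names to formatting strategies, with a simple formatter that surrounds the element content with a fixed prefix and suffix, can be sketched with standard library types. Everything in this block is hypothetical: the class name, the use of UnaryOperator in place of Html2Txt.InlineElementFormatter, and the concrete prefix and suffix strings chosen for "i" and "b" are made up for illustration and do not reflect the library's actual mappings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration, not part of Html2Txt.
class InlineFormatterSketch {

    /** Analogue of a prefix/suffix formatter: yields prefix, the element content, then suffix. */
    static UnaryOperator<String> simple(String prefix, String suffix) {
        return content -> prefix + content + suffix;
    }

    /** Element name to formatter; the entries are invented examples. */
    static final Map<String, UnaryOperator<String>> FORMATTERS = new HashMap<>();
    static {
        FORMATTERS.put("i", simple("<", ">"));
        FORMATTERS.put("b", simple("*", "*"));
    }

    /** Formats the content of an inline element; unknown elements pass through unchanged. */
    static String format(String elementName, String content) {
        UnaryOperator<String> formatter = FORMATTERS.get(elementName);
        return formatter == null ? content : formatter.apply(content);
    }

    public static void main(String[] args) {
        System.out.println(format("b", "bold"));                 // *bold*
        System.out.println(simple("foo", "bar").apply("-"));    // foo-bar
    }
}
```

Keeping the strategies in a map makes the set of supported elements easy to inspect and extend, which matches the role the documentation assigns to ALL_INLINE_ELEMENTS.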
public Html2Txt()
public Html2Txt setErrorHandler(Html2Txt.HtmlErrorHandler htmlErrorHandler)
Sets a custom Html2Txt.HtmlErrorHandler on this object. The default handler is SIMPLE_HTML_ERROR_HANDLER.
public Html2Txt setPageLeftMarginWidth(int pageLeftMarginWidth)
The number of spaces that precede each line of output; defaults to zero.
public Html2Txt setPageRightMarginWidth(int pageRightMarginWidth)
The maximum length of output lines is "pageWidth - rightMarginWidth". Defaults to "1", to avoid extra line wraps on certain terminals.
See also: setPageWidth(int)
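public void setInputCharset(java.nio.charset.Charset cs)
Sets the charset to use when reading HTML input files. Defaults to the JVM default charset.
public void setOutputCharset(java.nio.charset.Charset cs)
Sets the charset to use when writing text output files. Defaults to the JVM default charset.
public Html2Txt setPageWidth(int pageWidth)
The maximum length of output lines is "pageWidth - rightMarginWidth". Defaults to the value of the environment variable "$COLUMNS", or, if that is not set, to 80.
See also: setPageRightMarginWidth(int)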
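The interplay of the page width default and the right margin can be made concrete with a small calculation. This is a sketch under assumptions: the class PageGeometry is hypothetical, and falling back to 80 when "$COLUMNS" holds a non-numeric value is a guess, since the documentation only covers the case where the variable is not set.

```java
// Hypothetical helper, not part of Html2Txt.
class PageGeometry {

    /** Page width from the value of $COLUMNS, or 80 if it is unset (or, by assumption, not a number). */
    static int defaultPageWidth(String columnsEnv) {
        if (columnsEnv != null) {
            try {
                return Integer.parseInt(columnsEnv.trim());
            } catch (NumberFormatException e) {
                // Assumption: treat an unparseable value like an unset variable.
            }
        }
        return 80;
    }

    /** Maximum length of an output line: pageWidth - rightMarginWidth. */
    static int maxLineLength(int pageWidth, int rightMarginWidth) {
        return pageWidth - rightMarginWidth;
    }

    public static void main(String[] args) {
        int pageWidth = defaultPageWidth(System.getenv("COLUMNS"));
        // With the documented default right margin of 1:
        System.out.println(maxLineLength(pageWidth, 1));
    }
}
```

With both defaults in effect and "$COLUMNS" unset, lines are at most 79 characters long, which leaves the last terminal column free.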
public void html2txt(java.io.File inputFile, java.io.Writer output) throws javax.xml.parsers.ParserConfigurationException, org.xml.sax.SAXException, javax.xml.transform.TransformerException, Html2Txt.HtmlException
Reads, scans and parses the HTML document in the inputFile, generates a plain text document, and writes it to the output.
Throws: javax.xml.parsers.ParserConfigurationException, org.xml.sax.SAXException, javax.xml.transform.TransformerException, Html2Txt.HtmlException
public void html2txt(java.io.Reader input, java.io.Writer output) throws javax.xml.parsers.ParserConfigurationException, org.xml.sax.SAXException, javax.xml.transform.TransformerException, Html2Txt.HtmlException
Reads, scans and parses the HTML document from the input, generates a plain text document, and writes it to the output.
Throws: javax.xml.parsers.ParserConfigurationException, org.xml.sax.SAXException, javax.xml.transform.TransformerException, Html2Txt.HtmlException
public void html2txt(org.w3c.dom.Document document, java.io.Writer output) throws Html2Txt.HtmlException
Generates a plain text document from the given HTML document, and writes it to the output.
Throws: Html2Txt.HtmlException
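To illustrate what generating plain text from an org.w3c.dom.Document involves, here is a deliberately simplified walker built only on the standard DOM API. It is not Html2Txt's algorithm: the class DomToTextSketch is hypothetical, it knows only the "li" and "p" elements, and it ignores page width, margins and every other formatter described on this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical, heavily simplified illustration; not the Html2Txt implementation.
class DomToTextSketch {

    /** Writes all text below the node; "li" gets a "* " prefix, "li" and "p" end with a newline. */
    static void write(Node node, Writer out) throws IOException {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.write(node.getNodeValue());
            return;
        }
        String name = node.getNodeName();
        if ("li".equals(name)) out.write("* ");
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            write(child, out);
        }
        if ("li".equals(name) || "p".equals(name)) out.write("\n");
    }

    /** Parses well-formed markup and returns the text produced by write(). */
    static String toText(String markup) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            StringWriter sw = new StringWriter();
            write(document.getDocumentElement(), sw);
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(toText("<ul><li>one</li><li>two <i>2</i></li></ul>"));
    }
}
```

The example prints two lines, "* one" and "* two 2". Inline elements such as "i" contribute only their text here, whereas block elements decide where lines start and end.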
public void html2txt(java.io.File inputFile, java.io.File outputFile) throws java.lang.Exception
Reads, scans and parses the HTML document in the inputFile, generates a plain text document, and writes it to the outputFile.
Throws: java.lang.Exception
public static int maxLength(java.lang.Iterable<? extends java.lang.CharSequence> css)
Returns 0 iff css is empty.
public static de.unkrig.commons.lang.protocol.Producer<? extends java.lang.String> rightPad(de.unkrig.commons.lang.protocol.Producer<? extends java.lang.CharSequence> delegate, int width, char c)
Wraps the given delegate such that it right-pads the products with c to the given width.
public static de.unkrig.commons.lang.protocol.Consumer<java.lang.CharSequence> rightTrim(de.unkrig.commons.lang.protocol.Consumer<? super java.lang.String> delegate)
Creates and returns a Consumer that forwards its subjects to the delegate, with trailing spaces (' ') removed.
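The three static helpers can be approximated with standard library types to show the wrapping pattern they describe. This is a sketch, not the library code: java.util.function.Supplier and java.util.function.Consumer stand in for the de.unkrig.commons Producer and Consumer interfaces, the class PaddingSketch is hypothetical, and two behaviors are assumptions, namely that maxLength returns the length of the longest element and that a null product is passed through unpadded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical stand-ins built on java.util.function; not part of Html2Txt.
class PaddingSketch {

    /** Assumed semantics: length of the longest element, or 0 if css is empty. */
    static int maxLength(Iterable<? extends CharSequence> css) {
        int max = 0;
        for (CharSequence cs : css) max = Math.max(max, cs.length());
        return max;
    }

    /** Wraps the delegate so that each product is right-padded with c up to the given width. */
    static Supplier<String> rightPad(Supplier<? extends CharSequence> delegate, int width, char c) {
        return () -> {
            CharSequence product = delegate.get();
            if (product == null) return null; // Assumption: null passes through unchanged.
            StringBuilder sb = new StringBuilder(product);
            while (sb.length() < width) sb.append(c);
            return sb.toString();
        };
    }

    /** Wraps the delegate so that trailing spaces (' ') are removed before forwarding. */
    static Consumer<CharSequence> rightTrim(Consumer<? super String> delegate) {
        return subject -> {
            int end = subject.length();
            while (end > 0 && subject.charAt(end - 1) == ' ') end--;
            delegate.accept(subject.subSequence(0, end).toString());
        };
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList("a", "abc", "");
        int width = maxLength(cells);                                  // 3
        System.out.println(rightPad(() -> "a", width, '.').get());    // a..
        List<String> sink = new ArrayList<>();
        rightTrim(sink::add).accept("text   ");
        System.out.println("[" + sink.get(0) + "]");                  // [text]
    }
}
```

Padding every cell to the maximum length and trimming the finished line is a common way to lay out table columns in plain text without leaving trailing whitespace.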