public class ExtractorPattern extends java.lang.Object implements ScriptInstancesContainer
Modifier and Type | Field and Description |
---|---|
static int |
INT_APPEND |
static int |
INT_OVERWRITE |
static org.apache.log4j.Logger |
log
Used for logging.
|
static java.lang.String |
STR_APPEND
Constants used with the "ifSavedInSessionVariable" field.
|
static java.lang.String |
STR_OVERWRITE |
Constructor and Description |
---|
ExtractorPattern()
Generates a new empty pattern.
|
ExtractorPattern(ExtractorPattern extractorPattern)
Creates an ExtractorPattern by copying an existing one. |
ExtractorPattern(int extractorPatternID,
int scrapeableFileID,
java.lang.String identifier,
java.lang.String patternText,
int sequence,
boolean automaticallySaveInSessionVariable,
boolean applyWhitespaceTidyingBeforeExtraction,
int ifSavedInSessionVariable,
boolean filterDuplicates,
boolean cacheDataSet,
java.lang.String fileToSaveTo,
java.lang.String fileToSaveToDelimiter,
boolean willBeInvokedManually)
Generates a new extractor pattern with all of its information.
|
ExtractorPattern(java.lang.String identifier,
java.lang.String patternText,
int sequence)
Generates a new pattern based on the given values.
|
Modifier and Type | Method and Description |
---|---|
void |
addExtractorPatternToken(ExtractorPatternToken extractorPatternToken)
Adds an extractor pattern token to the collection.
|
void |
addSubExtractorPattern(ExtractorPattern subExtractorPattern)
Adds a child extractor pattern.
|
void |
addSubExtractorPatternToBeDeleted(ExtractorPattern subExtractorPatternToBeDeleted)
Adds a child extractor pattern to be deleted.
|
DataSet |
applyPattern(java.lang.String sessionName,
java.lang.String content,
Notifiable notifiable,
ScrapingSessionState scrapingSessionState,
ExtractorPattern subExtractorPattern)
Applies the extractor pattern to a string.
|
void |
bumpSubExtractorPatternSequences(int bottomSequence,
int topSequence,
int bumpModifier)
Bumps sequences up or down between the bottom and top sequences.
|
boolean |
containsNullableTokens()
Indicates whether or not the pattern contains any tokens that will null session variables.
|
void |
delete()
Deletes the extractor pattern from the database.
|
void |
dumpToLog()
Dumps the extractor pattern to the log.
|
boolean |
getApplyWhitespaceTidyingBeforeExtraction()
Indicates whether whitespace tidying (stripping out tabs and new lines) should run before extraction.
|
boolean |
getAutomaticallySaveInSessionVariable()
Indicates whether or not the data set generated by this extractor pattern should be automatically saved in a session variable.
|
boolean |
getCacheDataSet()
Indicates whether or not the data set is to be cached.
|
ExtractorPatternToken |
getDeepExtractorPatternToken(java.lang.String identifier)
Gets an extractor pattern token from the extractor pattern based on identifier.
|
java.util.Iterator<ExtractorPatternToken> |
getDeepExtractorPatternTokensIterator()
Allows iterating over all extractor pattern tokens, including those in sub-extractor patterns.
|
ExtractorPatternToken |
getExtractorPatternToken(int sequence)
Gets an extractor pattern token from the extractor pattern based on sequence.
|
ExtractorPatternToken |
getExtractorPatternToken(java.lang.String identifier)
Gets an extractor pattern token from the extractor pattern based on identifier.
|
java.util.HashSet |
getExtractorPatternTokens()
Gets all extractor pattern tokens held by this pattern.
|
java.util.HashSet |
getExtractorPatternTokensCopy()
Gets a copy of all extractor pattern tokens held by this pattern.
|
java.util.Iterator |
getExtractorPatternTokensIterator()
Gets an Iterator used to cycle through the ExtractorPatternToken objects held by this object. |
java.lang.String |
getFileToSaveTo()
Indicates the file to save the data to.
|
java.lang.String |
getFileToSaveToDelimiter()
Indicates the delimiter to use for the file to save the data to.
|
boolean |
getFilterDuplicates()
Indicates whether or not duplicates should be filtered.
|
int |
getID()
Gets the ID of the extractor pattern.
|
java.lang.String |
getIdentifier()
Gets the identifier for the pattern.
|
int |
getIfSavedInSessionVariable()
If automatically saved in a session variable, indicates whether the data set should be appended or overwritten.
|
static int |
getIntSaveInSessionVariableForString(java.lang.String saveInSessionVariable)
Gets the int constant for the "saveInSessionVariable" field, given the String version. |
java.lang.String |
getName()
Gets the name for the pattern.
|
int |
getNumExtractorPatternTokens()
Gets the total number of extractor pattern tokens held by this object.
|
int |
getNumSubExtractorPatterns()
Gets the number of sub-extractor patterns.
|
int |
getNumSubExtractorPatternsToBeDeleted()
Gets the number of sub-extractor patterns to be deleted.
|
ExtractorPattern |
getParentExtractorPattern()
Gets the parent extractor pattern, if this one is a child.
|
java.lang.String |
getPatternText()
Gets the text of the pattern.
|
int |
getScrapeableFileID()
Gets the ID of the scrapeable file associated with this extractor pattern.
|
ScriptInstances |
getScriptInstances()
Returns a
ScriptInstances object containing all script instances associated with this scrapeable file. |
int |
getSequence()
Gets the sequence of the pattern.
|
static java.lang.String |
getStringSaveInSessionVariableForInt(int saveInSessionVariable)
Gets the String constant for the "saveInSessionVariable" field, given the int version. |
ExtractorPattern |
getSubExtractorPattern(int sequence)
Gets a sub-extractor pattern corresponding to the sequence.
|
java.util.HashSet |
getSubExtractorPatterns()
Gets a HashSet holding the sub-extractor patterns. |
java.util.Iterator |
getSubExtractorPatternsIterator()
Gets an iterator of ExtractorPattern objects so that an external class can iterate through them. |
java.util.Iterator |
getSubExtractorPatternsToBeDeletedIterator()
Gets an iterator of ExtractorPattern objects to be deleted so that an external class can iterate through them. |
java.util.HashSet |
getTokenIdentifiersToBeUsedInFilteringDuplicates()
Gets a HashSet of String objects containing token identifiers that should be used in filtering duplicates. |
AbstractView |
getView()
Gets the view representing this extractor pattern.
|
boolean |
getWillBeInvokedManually()
Indicates whether or not the pattern will be invoked manually from a script.
|
void |
logTokensWithNoRegEx(Notifiable notifiable)
Logs extractor pattern tokens with no regex.
|
void |
removeAllExtractorPatternTokens()
Removes all extractor pattern tokens from the pattern.
|
void |
removeExtractorPatternToken(int sequence)
Removes the extractor pattern token with the given sequence.
|
void |
removeSubExtractorPattern(ExtractorPattern subExtractorPattern)
Removes a child extractor pattern.
|
void |
resetContainsNullableTokens()
Resets the containsNullableTokens value.
|
void |
save()
Saves the extractor pattern to the database.
|
void |
sequenceSubExtractorPatterns()
This method adds a sequence number to the sub-extractor patterns.
|
void |
setApplyWhitespaceTidyingBeforeExtraction(boolean applyWhitespaceTidyingBeforeExtraction)
Sets whether whitespace tidying (stripping out tabs and new lines) should run before extraction.
|
void |
setAutomaticallySaveInSessionVariable(boolean automaticallySaveInSessionVariable)
Determines whether or not the data set generated by this extractor pattern should be automatically saved in a session variable.
|
void |
setCacheDataSet(boolean cacheDataSet)
Determines whether or not the data set is to be cached.
|
void |
setFileToSaveTo(java.lang.String fileToSaveTo)
Determines the file to save the data to.
|
void |
setFileToSaveToDelimiter(java.lang.String fileToSaveToDelimiter)
Determines the delimiter to use for the file to save the data to.
|
void |
setFilterDuplicates(boolean filterDuplicates)
Determines whether or not duplicates should be filtered.
|
void |
setIdentifier(java.lang.String identifier)
Sets the identifier.
|
void |
setIfSavedInSessionVariable(int ifSavedInSessionVariable)
If automatically saved in a session variable, determines whether the data set should be appended or overwritten.
|
void |
setPatternText(java.lang.String patternText)
Sets the text of the pattern.
|
protected void |
setScrapeableFileID(int scrapeableFileID)
Sets the scrapeable file ID of the scrapeable file associated with this extractor pattern.
|
void |
setScriptInstances(ScriptInstances scriptInstances)
Sets the script instances to be associated with this scrapeable file.
|
void |
setSequence(int sequence)
Sets the sequence of the pattern.
|
void |
setView(AbstractView view)
Sets the view representing this extractor pattern.
|
void |
setWillBeInvokedManually(boolean willBeInvokedManually)
Determines whether or not the pattern will be invoked manually from a script.
|
void |
stripNonSubExtractorPatternStuff()
Strips off any objects that wouldn't apply to a sub-extractor pattern.
|
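The summary for bumpSubExtractorPatternSequences says only that sequences between the bottom and top values are bumped up or down. The following is a minimal sketch of that idea on a plain list of integers. It assumes that both bounds are inclusive and that the modifier is simply added to each sequence in range; neither detail is confirmed by this page, and SequenceBumpSketch is a hypothetical name, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in; not the real ExtractorPattern implementation. */
final class SequenceBumpSketch {

    /**
     * Returns a copy of the sequences in which every value between bottom and
     * top (assumed inclusive) has the modifier added to it.
     */
    static List<Integer> bump(List<Integer> sequences, int bottom, int top, int modifier) {
        List<Integer> result = new ArrayList<>(sequences.size());
        for (int sequence : sequences) {
            boolean inRange = sequence >= bottom && sequence <= top;
            result.add(inRange ? sequence + modifier : sequence);
        }
        return result;
    }
}
```

A negative modifier moves the affected sequences down instead of up.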
public static org.apache.log4j.Logger log
public static final java.lang.String STR_APPEND
public static final java.lang.String STR_OVERWRITE
public static final int INT_APPEND
public static final int INT_OVERWRITE
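The String and int constants above appear to come in pairs, with the two static conversion methods translating between them. The sketch below shows one way such a pairing could work. The literal values and the names SaveModeSketch, toInt and toText are assumptions for illustration only, since this page does not list the real constant values.

```java
/** Illustrative stand-in; not the real ExtractorPattern class. */
final class SaveModeSketch {
    // Hypothetical values: the actual constants are not shown on this page.
    static final int INT_APPEND = 0;
    static final int INT_OVERWRITE = 1;
    static final String STR_APPEND = "Append";
    static final String STR_OVERWRITE = "Overwrite";

    /** Maps a String constant to its int counterpart. */
    static int toInt(String mode) {
        if (STR_APPEND.equals(mode)) {
            return INT_APPEND;
        }
        if (STR_OVERWRITE.equals(mode)) {
            return INT_OVERWRITE;
        }
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }

    /** Maps an int constant to its String counterpart. */
    static String toText(int mode) {
        switch (mode) {
            case INT_APPEND:
                return STR_APPEND;
            case INT_OVERWRITE:
                return STR_OVERWRITE;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

How the real class handles an unrecognized value is not documented here; throwing an exception is a choice made for this sketch.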
public ExtractorPattern()
public ExtractorPattern(java.lang.String identifier, java.lang.String patternText, int sequence)
identifier - Identifies the pattern.
patternText - The text of the pattern.
sequence - The sequence of the pattern.
public ExtractorPattern(int extractorPatternID, int scrapeableFileID, java.lang.String identifier, java.lang.String patternText, int sequence, boolean automaticallySaveInSessionVariable, boolean applyWhitespaceTidyingBeforeExtraction, int ifSavedInSessionVariable, boolean filterDuplicates, boolean cacheDataSet, java.lang.String fileToSaveTo, java.lang.String fileToSaveToDelimiter, boolean willBeInvokedManually)
fileToSaveTo - The file to save the extracted data to.
fileToSaveToDelimiter - The delimiter to use when saving to a file.
willBeInvokedManually - Indicates whether or not the extractor pattern will be invoked manually.
extractorPatternID - The ID of the extractor pattern.
scrapeableFileID - The ID of the scrapeable file associated with this extractor pattern.
identifier - The identifier.
patternText - The text of the pattern.
sequence - The sequence of the pattern.
automaticallySaveInSessionVariable - Whether to save automatically in a session variable.
applyWhitespaceTidyingBeforeExtraction - True to apply whitespace tidying.
ifSavedInSessionVariable - How to save in a session variable.
filterDuplicates - Whether to filter duplicates.
cacheDataSet - Whether to cache the data set.
public ExtractorPattern(@NotNull ExtractorPattern extractorPattern)
Creates an ExtractorPattern by copying an existing one.
extractorPattern - The ExtractorPattern to copy.
public int getID()
getID in interface ScriptInstancesContainer
protected void setScrapeableFileID(int scrapeableFileID)
public int getScrapeableFileID()
@Nullable public ExtractorPattern getParentExtractorPattern()
@Nullable public java.lang.String getIdentifier()
public void setIdentifier(@Nullable java.lang.String identifier)
identifier - The identifier.
@Nullable public java.lang.String getName()
getName in interface ScriptInstancesContainer
@Nullable public java.lang.String getPatternText()
public void setPatternText(java.lang.String patternText)
patternText - The text to be set.
public int getSequence()
public void setSequence(int sequence)
sequence - The sequence.
@Nullable public AbstractView getView()
public void setView(AbstractView view)
view - The view.
@Nullable public ScriptInstances getScriptInstances()
Returns a ScriptInstances object containing all script instances associated with this scrapeable file.
getScriptInstances in interface ScriptInstancesContainer
public void setScriptInstances(@Nullable ScriptInstances scriptInstances)
setScriptInstances in interface ScriptInstancesContainer
scriptInstances - The script instances.
public void save()
public void stripNonSubExtractorPatternStuff()
public void delete()
@Nullable public java.util.HashSet getSubExtractorPatterns()
Gets a HashSet holding the sub-extractor patterns. This is only public because of Castor.
Returns a HashSet.
@NotNull public java.util.Iterator getSubExtractorPatternsIterator()
Gets an iterator of ExtractorPattern objects so that an external class can iterate through them.
@NotNull public java.util.Iterator getSubExtractorPatternsToBeDeletedIterator()
Gets an iterator of ExtractorPattern objects to be deleted so that an external class can iterate through them.
public void addSubExtractorPatternToBeDeleted(@Nullable ExtractorPattern subExtractorPatternToBeDeleted)
subExtractorPatternToBeDeleted - An ExtractorPattern.
public int getNumSubExtractorPatternsToBeDeleted()
public void addExtractorPatternToken(@Nullable ExtractorPatternToken extractorPatternToken)
extractorPatternToken - An ExtractorPatternToken.
public void removeExtractorPatternToken(int sequence)
sequence - The sequence.
@Nullable public ExtractorPatternToken getExtractorPatternToken(int sequence)
sequence - The sequence of the token to be retrieved.
@Nullable public ExtractorPatternToken getExtractorPatternToken(java.lang.String identifier)
identifier - The identifier of the token to be retrieved.
@NotNull public java.util.Iterator<ExtractorPatternToken> getDeepExtractorPatternTokensIterator()
Returns an Iterator<ExtractorPatternToken>.
@Nullable public ExtractorPatternToken getDeepExtractorPatternToken(java.lang.String identifier)
identifier - The identifier of the token to be retrieved.
@NotNull public java.util.Iterator getExtractorPatternTokensIterator()
Gets an Iterator used to cycle through the ExtractorPatternToken objects held by this object.
Returns an Iterator.
@NotNull public java.util.HashSet getExtractorPatternTokens()
Returns a HashSet of ExtractorPatternTokens.
@NotNull public java.util.HashSet getExtractorPatternTokensCopy()
Returns a HashSet of ExtractorPatternTokens.
public void removeAllExtractorPatternTokens()
public int getNumExtractorPatternTokens()
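getDeepExtractorPatternTokensIterator is described as covering tokens held in sub-extractor patterns as well as the pattern's own. One plausible way to picture that is a depth-first walk over the pattern tree, sketched below with stand-in types. The Node class, its fields and the traversal order are all assumptions for illustration; the library's actual ordering and data structures are not documented on this page.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative stand-in; not the real ExtractorPattern implementation. */
final class DeepTokenSketch {

    /** Hypothetical pattern node: its own token identifiers plus child patterns. */
    static final class Node {
        final List<String> tokens = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
    }

    /** Returns an iterator over the node's tokens followed by those of all descendants. */
    static Iterator<String> deepIterator(Node root) {
        List<String> all = new ArrayList<>();
        collect(root, all);
        return all.iterator();
    }

    // Depth-first: a node's own tokens first, then each child in turn.
    private static void collect(Node node, List<String> out) {
        out.addAll(node.tokens);
        for (Node child : node.children) {
            collect(child, out);
        }
    }
}
```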
public boolean getAutomaticallySaveInSessionVariable()
public boolean getApplyWhitespaceTidyingBeforeExtraction()
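Whitespace tidying is described as stripping out tabs and new lines before extraction. A minimal sketch of that kind of preprocessing follows. It assumes the characters are removed outright rather than replaced with spaces, which this page does not specify, and WhitespaceTidySketch is a hypothetical name.

```java
/** Illustrative stand-in; not the real ExtractorPattern implementation. */
final class WhitespaceTidySketch {

    /** Removes tab, carriage-return and line-feed characters from the content. */
    static String tidy(String content) {
        return content.replaceAll("[\\t\\r\\n]", "");
    }
}
```

Tidying of this kind lets a pattern written on a single line match markup that spans several lines in the source document.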
public void setApplyWhitespaceTidyingBeforeExtraction(boolean applyWhitespaceTidyingBeforeExtraction)
applyWhitespaceTidyingBeforeExtraction - Whether whitespace tidying (stripping out tabs and new lines) should run before extraction.
public void setAutomaticallySaveInSessionVariable(boolean automaticallySaveInSessionVariable)
automaticallySaveInSessionVariable - A boolean.
public int getIfSavedInSessionVariable()
public void setIfSavedInSessionVariable(int ifSavedInSessionVariable)
ifSavedInSessionVariable - An int corresponding to one of the related constants in this class.
public static int getIntSaveInSessionVariableForString(@NotNull java.lang.String saveInSessionVariable)
Gets the int constant for the "saveInSessionVariable" field, given the String version.
saveInSessionVariable - A String corresponding to one of the related constants in this class.
Returns an int corresponding to one of the constants in this class.
@NotNull public static java.lang.String getStringSaveInSessionVariableForInt(int saveInSessionVariable)
Gets the String constant for the "saveInSessionVariable" field, given the int version.
saveInSessionVariable - An int corresponding to one of the related constants in this class.
Returns a String corresponding to one of the constants in this class.
public int getNumSubExtractorPatterns()
public void addSubExtractorPattern(@Nullable ExtractorPattern subExtractorPattern)
subExtractorPattern - An ExtractorPattern.
public void removeSubExtractorPattern(@Nullable ExtractorPattern subExtractorPattern)
subExtractorPattern - An ExtractorPattern.
public void sequenceSubExtractorPatterns()
@Nullable public ExtractorPattern getSubExtractorPattern(int sequence)
sequence - The sequence of the pattern to be retrieved.
public void bumpSubExtractorPatternSequences(int bottomSequence, int topSequence, int bumpModifier)
bottomSequence - The bottom sequence.
topSequence - The top sequence.
bumpModifier - Modifies each sequence.
public boolean getFilterDuplicates()
public void setFilterDuplicates(boolean filterDuplicates)
filterDuplicates - A boolean.
public boolean getCacheDataSet()
public void setCacheDataSet(boolean cacheDataSet)
cacheDataSet - A boolean.
@Nullable public java.lang.String getFileToSaveTo()
Returns a String representing the file to save to.
public void setFileToSaveTo(java.lang.String fileToSaveTo)
fileToSaveTo - A String representing the file to save to.
@Nullable public java.lang.String getFileToSaveToDelimiter()
public void setFileToSaveToDelimiter(java.lang.String fileToSaveToDelimiter)
fileToSaveToDelimiter - The delimiter.
public boolean getWillBeInvokedManually()
public void setWillBeInvokedManually(boolean willBeInvokedManually)
willBeInvokedManually - A boolean.
public void resetContainsNullableTokens()
public boolean containsNullableTokens()
@Nullable public java.util.HashSet getTokenIdentifiersToBeUsedInFilteringDuplicates()
Gets a HashSet of String objects containing token identifiers that should be used in filtering duplicates.
public void logTokensWithNoRegEx(@NotNull Notifiable notifiable)
notifiable - The Notifiable object that should be used to log.
public DataSet applyPattern(java.lang.String sessionName, java.lang.String content, Notifiable notifiable, ScrapingSessionState scrapingSessionState, ExtractorPattern subExtractorPattern) throws java.lang.Exception
sessionName - The name of the scraping session (used to help with debugging).
content - The content to which the extractor pattern is to be applied.
notifiable - A Notifiable to which messages will be logged.
scrapingSessionState - A ScrapingSessionState used to resolve variables.
subExtractorPattern - This will be passed in if the extractor pattern should only return a data set containing data matched by this particular sub-extractor pattern.
Returns a DataSet containing the extracted data.
Throws java.lang.Exception if the pattern either takes too long or another problem occurs in the process.
public void dumpToLog()