API Reference¶
py_pdf_term package¶
- class py_pdf_term.DomainPDFList(domain: str, pdf_paths: list[str])¶
Bases:
object
Domain name and PDF file paths of the domain
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
pdf_paths (list[str]) – PDF file paths of the domain.
- classmethod validate(domain_pdfs: DomainPDFList) None ¶
- domain: str¶
- pdf_paths: list[str]¶
- class py_pdf_term.PDFTechnicalTermList(pdf_path: str, pages: list[PageTechnicalTermList])¶
Bases:
object
Path of a PDF file and technical terms of the PDF file.
- Parameters:
pdf_path (str) – Path of a PDF file.
pages (list[PageTechnicalTermList]) – Technical terms of each page of the PDF file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- pages: list[PageTechnicalTermList]¶
- class py_pdf_term.PyPDFTermMultiDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: MultiDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: MultiDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶
Bases:
object
Top level class of py-pdf-term. This class extracts technical terms from a PDF file with cross-domain information.
- Parameters:
xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.bin_opener to a class that opens an input PDF file in binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, its sub-terms can also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling, such as color and font size. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The XML cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string of the location where cache files are stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.
- extract(domain: str, pdf_path: str, multi_domain_pdfs: list[DomainPDFList]) PDFTechnicalTermList ¶
Extract technical terms from a PDF file.
- Parameters:
domain – Name of the domain which the input PDF file belongs to. This may be the name of a course, a technical field, or similar.
pdf_path – Path-like string to the input PDF file from which terminology is extracted. The file MUST belong to domain.
multi_domain_pdfs – List of PDF file paths grouped by domain. There MUST be an element in multi_domain_pdfs whose domain equals domain.
- Returns:
Terminology list per page from the input PDF file.
- Return type:
PDFTechnicalTermList
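The two MUST conditions on the arguments of extract can be checked before calling it. The following is a minimal sketch, not library code: DomainPDFList here is a stand-in dataclass that only mirrors the documented fields, check_multi_domain_inputs is a hypothetical helper, and reading "belongs to domain" as "is listed in that domain's pdf_paths" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DomainPDFList:
    """Stand-in mirroring the documented fields; not the library class."""

    domain: str
    pdf_paths: list[str]


def check_multi_domain_inputs(domain, pdf_path, multi_domain_pdfs):
    """Hypothetical helper: raise ValueError if a documented MUST is violated."""
    matches = [d for d in multi_domain_pdfs if d.domain == domain]
    if not matches:
        raise ValueError(f"no element of multi_domain_pdfs has domain {domain!r}")
    # Assumption: "belongs to domain" means the path is listed under that domain.
    if not any(pdf_path in d.pdf_paths for d in matches):
        raise ValueError(f"{pdf_path!r} is not listed under domain {domain!r}")


pdfs = [
    DomainPDFList("nlp", ["nlp/lecture1.pdf", "nlp/lecture2.pdf"]),
    DomainPDFList("databases", ["db/lecture1.pdf"]),
]
check_multi_domain_inputs("nlp", "nlp/lecture1.pdf", pdfs)  # passes silently
```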
- class py_pdf_term.PyPDFTermSingleDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: SingleDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: SingleDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶
Bases:
object
Top level class of py-pdf-term. This class extracts technical terms from a PDF file without cross-domain information.
- Parameters:
xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.bin_opener to a class that opens an input PDF file in binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, its sub-terms can also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling, such as color and font size. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The XML cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string of the location where cache files are stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.
- extract(pdf_path: str, domain_pdfs: DomainPDFList) PDFTechnicalTermList ¶
Extract technical terms from a PDF file.
- Parameters:
pdf_path – Path-like string to the input PDF file from which terminology is extracted. The file MUST belong to the domain of domain_pdfs.
domain_pdfs – Domain name and paths of the PDF files that belong to that domain.
- Returns:
Terminology list per page from the input PDF file.
- Return type:
PDFTechnicalTermList
py_pdf_term.configs subpackage¶
- class py_pdf_term.configs.CandidateLayerConfig(lang_tokenizers: list[str] = <factory>, token_classifiers: list[str] = <factory>, token_filters: list[str] = <factory>, term_filters: list[str] = <factory>, splitters: list[str] = <factory>, augmenters: list[str] = <factory>, cache: str = 'py_pdf_term.CandidateLayerFileCache')¶
Bases:
BaseLayerConfig
Configuration for candidate layer.
- Parameters:
lang_tokenizers (list[str]) – List of language tokenizer class names. The default tokenizers are “py_pdf_term.JapaneseTokenizer” and “py_pdf_term.EnglishTokenizer”.
token_classifiers (list[str]) – List of token classifier class names. The default classifiers are “py_pdf_term.JapaneseTokenClassifier” and “py_pdf_term.EnglishTokenClassifier”.
token_filters (list[str]) – List of token filter class names. The default filters are “py_pdf_term.JapaneseTokenFilter” and “py_pdf_term.EnglishTokenFilter”.
term_filters (list[str]) – List of term filter class names. The default filters are “py_pdf_term.JapaneseConcatenationFilter”, “py_pdf_term.EnglishConcatenationFilter”, “py_pdf_term.JapaneseSymbolLikeFilter”, “py_pdf_term.EnglishSymbolLikeFilter”, “py_pdf_term.JapaneseProperNounFilter”, “py_pdf_term.EnglishProperNounFilter”, “py_pdf_term.JapaneseNumericFilter”, and “py_pdf_term.EnglishNumericFilter”.
splitters (list[str]) – List of splitter class names. The default splitters are “py_pdf_term.SymbolNameSplitter” and “py_pdf_term.RepeatSplitter”.
augmenters (list[str]) – List of augmenter class names. The default augmenters are “py_pdf_term.JapaneseAugmenter” and “py_pdf_term.EnglishAugmenter”.
cache (str) – Cache class name. The default cache is “py_pdf_term.CandidateLayerFileCache”.
- cache: str = 'py_pdf_term.CandidateLayerFileCache'¶
- lang_tokenizers: list[str]¶
- token_classifiers: list[str]¶
- token_filters: list[str]¶
- term_filters: list[str]¶
- splitters: list[str]¶
- augmenters: list[str]¶
- class py_pdf_term.configs.MultiDomainMethodLayerConfig(method: str = 'py_pdf_term.TFIDFMethod', hyper_params: dict[str, ~typing.Any] = <factory>, ranking_cache: str = 'py_pdf_term.MethodLayerRankingFileCache', data_cache: str = 'py_pdf_term.MethodLayerDataFileCache')¶
Bases:
BaseMethodLayerConfig
Configuration for a multi-domain method layer.
- Parameters:
method (str) – Multi-domain method class name. The default method is “py_pdf_term.TFIDFMethod”.
hyper_params (dict[str, Any]) – Hyper parameters for the method. The default hyper parameters are empty.
ranking_cache (str) – Ranking cache class name. The default cache is “py_pdf_term.MethodLayerRankingFileCache”.
data_cache (str) – Data cache class name. The default cache is “py_pdf_term.MethodLayerDataFileCache”.
- method: str = 'py_pdf_term.TFIDFMethod'¶
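To illustrate why a multi-domain method needs PDFs from several domains, here is a textbook TF-IDF sketch over candidate terms. It is not necessarily the formula used by py_pdf_term.TFIDFMethod; the function name tfidf, the corpus, and the exact weighting are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(domain_terms, target):
    """Score terms of the target domain; domain_terms maps domain -> list of terms."""
    tf = Counter(domain_terms[target])
    num_domains = len(domain_terms)
    scores = {}
    for term, freq in tf.items():
        # Document frequency: in how many domains the term occurs at all.
        df = sum(1 for terms in domain_terms.values() if term in terms)
        scores[term] = freq * math.log(num_domains / df)
    return scores


corpus = {
    "nlp": ["parse tree", "parse tree", "example", "tokenizer"],
    "databases": ["b-tree", "example", "index"],
}
scores = tfidf(corpus, "nlp")
```

In this sketch a term that occurs in every domain, such as "example", scores zero, while terms specific to the target domain are ranked higher.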
- class py_pdf_term.configs.SingleDomainMethodLayerConfig(method: str = 'py_pdf_term.FLRHMethod', hyper_params: dict[str, ~typing.Any] = <factory>, ranking_cache: str = 'py_pdf_term.MethodLayerRankingFileCache', data_cache: str = 'py_pdf_term.MethodLayerDataFileCache')¶
Bases:
BaseMethodLayerConfig
Configuration for a single-domain method layer.
- Parameters:
method – Single-domain method class name. The default method is “py_pdf_term.FLRHMethod”.
hyper_params – Hyper parameters for the method. The default hyper parameters are empty.
ranking_cache – Ranking cache class name. The default cache is “py_pdf_term.MethodLayerRankingFileCache”.
data_cache – Data cache class name. The default cache is “py_pdf_term.MethodLayerDataFileCache”.
- method: str = 'py_pdf_term.FLRHMethod'¶
- class py_pdf_term.configs.StylingLayerConfig(styling_scores: list[str] = <factory>, cache: str = 'py_pdf_term.StylingLayerFileCache')¶
Bases:
BaseLayerConfig
Configuration for a styling layer.
- Parameters:
styling_scores (list[str]) – List of styling score class names. The default scores are “py_pdf_term.FontsizeScore” and “py_pdf_term.ColorScore”.
cache (str) – Cache class name. The default cache is “py_pdf_term.StylingLayerFileCache”.
- cache: str = 'py_pdf_term.StylingLayerFileCache'¶
- styling_scores: list[str]¶
- class py_pdf_term.configs.TechnicalTermLayerConfig(max_num_terms: int = 10, acceptance_rate: float = 0.75)¶
Bases:
BaseLayerConfig
Configuration for a technical term layer.
- Parameters:
max_num_terms (int) – Maximum number of terms in a page of a PDF file to be extracted. The N-best candidates are extracted as technical terms. The default value is 10.
acceptance_rate (float) – Acceptance rate of the ranking method scores. The candidates whose ranking method scores are lower than the acceptance rate are filtered out even if they are in the N-best candidates. The default value is 0.75.
- acceptance_rate: float = 0.75¶
- max_num_terms: int = 10¶
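The interplay of max_num_terms and acceptance_rate can be sketched as follows. This is an illustration under assumptions, not the library's implementation: select_terms is a hypothetical function, and scores are assumed to be normalized to [0, 1] so that they can be compared with the rate directly.

```python
def select_terms(scored, max_num_terms=10, acceptance_rate=0.75):
    """Keep the N-best terms, then drop those scoring below the acceptance rate.

    scored: list of (term, score) pairs; scores are assumed to lie in [0, 1].
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    n_best = ranked[:max_num_terms]
    return [term for term, score in n_best if score >= acceptance_rate]


page = [("parse tree", 0.92), ("lexer", 0.81), ("the", 0.10), ("token", 0.77)]
selected = select_terms(page, max_num_terms=3)
```

With the default rate, "the" would be rejected even if max_num_terms were large enough to include it.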
- class py_pdf_term.configs.XMLLayerConfig(bin_opener: str = 'py_pdf_term.StandardBinaryOpener', include_pattern: str | None = None, exclude_pattern: str | None = None, nfc_norm: bool = True, cache: str = 'py_pdf_term.XMLLayerFileCache')¶
Bases:
BaseLayerConfig
Configuration for an XML layer.
- Parameters:
bin_opener (str) – Binary opener class name. The default opener is “py_pdf_term.StandardBinaryOpener”.
include_pattern (str | None) – Regular expression pattern of text to include in the output.
exclude_pattern (str | None) – Regular expression pattern of text to exclude from the output (overrides include_pattern).
nfc_norm (bool) – If True, normalize text to NFC, otherwise keep original.
cache (str) – Cache class name. The default cache is “py_pdf_term.XMLLayerFileCache”.
- bin_opener: str = 'py_pdf_term.StandardBinaryOpener'¶
- cache: str = 'py_pdf_term.XMLLayerFileCache'¶
- exclude_pattern: str | None = None¶
- include_pattern: str | None = None¶
- nfc_norm: bool = True¶
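The following sketch shows one way the three text options could interact, with exclude_pattern taking precedence over include_pattern as documented. The helper keep_text is hypothetical, and applying the patterns with re.search to each piece of text is an assumption; the library may match differently.

```python
import re
import unicodedata


def keep_text(text, include_pattern=None, exclude_pattern=None, nfc_norm=True):
    """Return the (optionally normalized) text, or None if it is filtered out."""
    if nfc_norm:
        text = unicodedata.normalize("NFC", text)
    # The exclusion is checked first so that it overrides include_pattern.
    if exclude_pattern is not None and re.search(exclude_pattern, text):
        return None
    if include_pattern is not None and not re.search(include_pattern, text):
        return None
    return text
```

For instance, keep_text("Chapter 1", include_pattern=r"Chapter", exclude_pattern=r"\d") is dropped because the exclusion matches, even though the inclusion also matches.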
py_pdf_term.mappers subpackage¶
- class py_pdf_term.mappers.AugmenterMapper¶
Bases:
BaseMapper[type[BaseAugmenter]]
Mapper to find augmenter classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.BinaryOpenerMapper¶
Bases:
BaseMapper[type[BaseBinaryOpener]]
Mapper to find binary opener classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.CandidateLayerCacheMapper¶
Bases:
BaseMapper[type[BaseCandidateLayerCache]]
Mapper to find candidate layer cache classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.CandidateTermFilterMapper¶
Bases:
BaseMapper[type[BaseCandidateTermFilter]]
Mapper to find candidate term filter classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.CandidateTokenFilterMapper¶
Bases:
BaseMapper[type[BaseCandidateTokenFilter]]
Mapper to find candidate token filter classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.LanguageTokenizerMapper¶
Bases:
BaseMapper[type[BaseLanguageTokenizer]]
Mapper to find language tokenizer classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.MethodLayerDataCacheMapper¶
Bases:
BaseMapper[type[BaseMethodLayerDataCache[Any]]]
Mapper to find method layer data cache classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.MethodLayerRankingCacheMapper¶
Bases:
BaseMapper[type[BaseMethodLayerRankingCache]]
Mapper to find method layer ranking cache classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.MultiDomainRankingMethodMapper¶
Bases:
BaseMapper[type[BaseMultiDomainRankingMethod[Any]]]
Mapper to find multi-domain ranking method classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.SingleDomainRankingMethodMapper¶
Bases:
BaseMapper[type[BaseSingleDomainRankingMethod[Any]]]
Mapper to find single-domain ranking method classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.SplitterMapper¶
Bases:
BaseMapper[type[BaseSplitter]]
Mapper to find splitter classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.StylingLayerCacheMapper¶
Bases:
BaseMapper[type[BaseStylingLayerCache]]
Mapper to find styling layer cache classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.StylingScoreMapper¶
Bases:
BaseMapper[type[BaseStylingScore]]
Mapper to find styling score classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.TokenClassifierMapper¶
Bases:
BaseMapper[type[BaseTokenClassifier]]
Mapper to find token classifier classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
- class py_pdf_term.mappers.XMLLayerCacheMapper¶
Bases:
BaseMapper[type[BaseXMLLayerCache]]
Mapper to find XML layer cache classes.
- classmethod default_mapper() Self ¶
Return a default mapper for this class.
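Conceptually, each mapper resolves the class-name strings stored in a config to the classes themselves. The sketch below shows that idea only. NameToClassMapper, register, and resolve are hypothetical names; the actual methods of BaseMapper are not listed in this reference.

```python
class NameToClassMapper:
    """Hypothetical stand-in for the name-to-class lookup a mapper performs."""

    def __init__(self):
        self._classes = {}

    def register(self, name, cls):
        """Associate a config string with a class."""
        self._classes[name] = cls

    def resolve(self, name):
        """Return the class registered under name, or raise KeyError."""
        try:
            return self._classes[name]
        except KeyError:
            raise KeyError(f"no class registered under {name!r}") from None


class MyAugmenter:
    """Placeholder for a user-defined augmenter class."""


mapper = NameToClassMapper()
mapper.register("my_package.MyAugmenter", MyAugmenter)

# A config holds names; the mapper turns them into classes.
configured_names = ["my_package.MyAugmenter"]
augmenter_clses = [mapper.resolve(name) for name in configured_names]
```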
py_pdf_term.pdftoxml package¶
- class py_pdf_term.pdftoxml.PDFnXMLElement(pdf_path: str, xml_root: Element)¶
Bases:
object
Pair of a path to a PDF file and an XML element tree.
- Parameters:
pdf_path (str) – Path to a PDF file.
xml_root (xml.etree.ElementTree.Element) – Root element of an XML element tree.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- xml_root: Element¶
- class py_pdf_term.pdftoxml.PDFnXMLPath(pdf_path: str, xml_path: str)¶
Bases:
object
Pair of paths to a PDF file and an XML file.
- Parameters:
pdf_path (str) – Path to a PDF file.
xml_path (str) – Path to an XML file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- xml_path: str¶
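The from_dict and to_dict pair lets such objects be serialized, for example to JSON. The round trip below uses a stand-in dataclass with the documented field names; the key names produced by the real to_dict are an assumption here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PDFnXMLPath:
    """Stand-in with the documented fields; not the library class."""

    pdf_path: str
    xml_path: str

    def to_dict(self):
        # Assumption: the dict keys equal the field names.
        return asdict(self)

    @classmethod
    def from_dict(cls, obj):
        return cls(**obj)


pair = PDFnXMLPath("slides.pdf", "slides.xml")
restored = PDFnXMLPath.from_dict(json.loads(json.dumps(pair.to_dict())))
```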
- class py_pdf_term.pdftoxml.PDFtoXMLConverter(bin_opener: BaseBinaryOpener | None = None)¶
Bases:
object
Converter from PDF to textful XML format.
- Parameters:
bin_opener – Binary opener to open PDF and XML files. If None, StandardBinaryOpener is used, which opens files with the standard open function in Python.
- convert_as_element(pdf_path: str, nfc_norm: bool = True, include_pattern: str | None = None, exclude_pattern: str | None = None) PDFnXMLElement ¶
Convert a PDF file to a textful XML element.
- Parameters:
pdf_path – Path to a PDF file.
nfc_norm – If True, normalize text to NFC, otherwise keep original.
include_pattern – Regular expression pattern of text to include in the output.
exclude_pattern – Regular expression pattern of text to exclude from the output (overrides include_pattern).
- Returns:
Pair of path to the PDF file and XML element tree of the output.
- Return type:
PDFnXMLElement
- convert_as_file(pdf_path: str, xml_path: str, nfc_norm: bool = True, include_pattern: str | None = None, exclude_pattern: str | None = None) PDFnXMLPath ¶
Convert a PDF file to a textful XML file.
- Parameters:
pdf_path – Path to a PDF file.
xml_path – Path to the output XML file.
nfc_norm – If True, normalize text to NFC, otherwise keep original.
include_pattern – Regular expression pattern of text to include in the output.
exclude_pattern – Regular expression pattern of text to exclude from the output (overrides include_pattern).
- Returns:
Pair of paths to the PDF file and the output XML file.
- Return type:
PDFnXMLPath
py_pdf_term.tokenizers package¶
- class py_pdf_term.tokenizers.BaseLanguageTokenizer¶
Bases:
object
Base class for language tokenizers. A language tokenizer is expected to tokenize a text into a list of tokens with spaCy.
- abstractmethod classmethod class_init() None ¶
Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.
- abstractmethod inscope(text: str) bool ¶
Test whether the text is in the scope of the language tokenizer.
- Parameters:
text – Text to test.
- Returns:
True if the text is in the scope of the language tokenizer, otherwise False.
- Return type:
bool
- class py_pdf_term.tokenizers.EnglishTokenizer¶
Bases:
BaseLanguageTokenizer
Tokenizer for English. This tokenizer uses spaCy’s en_core_web_sm model.
- classmethod class_init() None ¶
Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.
- inscope(text: str) bool ¶
Test whether the text is in the scope of the language tokenizer.
- Parameters:
text – Text to test.
- Returns:
True if the text is in the scope of the language tokenizer, otherwise False.
- Return type:
bool
- class py_pdf_term.tokenizers.JapaneseTokenizer¶
Bases:
BaseLanguageTokenizer
Tokenizer for Japanese. This tokenizer uses spaCy’s ja_core_news_sm model.
- classmethod class_init() None ¶
Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.
- inscope(text: str) bool ¶
Test whether the text is in the scope of the language tokenizer.
- Parameters:
text – Text to test.
- Returns:
True if the text is in the scope of the language tokenizer, otherwise False.
- Return type:
bool
- class py_pdf_term.tokenizers.Term(tokens: list[Token], fontsize: float = 0.0, ncolor: str = '', augmented: bool = False)¶
Bases:
object
- augmented: bool = False¶
- fontsize: float = 0.0¶
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- property lang: str | None¶
- lemma() str ¶
- ncolor: str = ''¶
- surface_form() str ¶
- to_dict() dict[str, Any] ¶
- class py_pdf_term.tokenizers.Token(lang: str, surface_form: str, pos: str, category: str, subcategory: str, lemma: str, is_meaningless: bool = False)¶
Bases:
object
Token in a text.
- Parameters:
lang (str) – Language of the token. (e.g., “en”, “ja”)
surface_form (str) – Surface form of the token.
pos (str) – Part-of-speech tag of the token.
category (str) – Category of the token.
subcategory (str) – Subcategory of the token.
lemma (str) – Lemmatized form of the token.
is_meaningless (bool) – Whether the token is meaningless or not. This is calculated by MeaninglessMarker.
- NUM_ATTR: ClassVar[int] = 6¶
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- is_meaningless: bool = False¶
- to_dict() dict[str, str] ¶
- lang: str¶
- surface_form: str¶
- pos: str¶
- category: str¶
- subcategory: str¶
- lemma: str¶
- class py_pdf_term.tokenizers.Tokenizer(lang_tokenizers: list[BaseLanguageTokenizer] | None = None)¶
Bases:
object
Tokenizer for multiple languages. This tokenizer uses spaCy.
- Parameters:
lang_tokenizers – List of language tokenizers. The order of the language tokenizers is important. The first language tokenizer that returns True in inscope() is used. If None, this tokenizer uses the default language tokenizers. The default language tokenizers are JapaneseTokenizer and EnglishTokenizer.
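Why the order of lang_tokenizers matters can be seen in a small dispatch sketch. JapaneseLike, EnglishLike, and pick_tokenizer are hypothetical stand-ins, and their inscope rules (a regular-expression check on the script) are assumptions rather than the library's actual tests.

```python
import re


class JapaneseLike:
    """Stand-in, not the library's JapaneseTokenizer."""

    def inscope(self, text):
        # Assumed rule: any hiragana, katakana, or common CJK ideograph.
        return re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", text) is not None


class EnglishLike:
    """Stand-in, not the library's EnglishTokenizer."""

    def inscope(self, text):
        # Assumed rule: any ASCII letter.
        return re.search(r"[A-Za-z]", text) is not None


def pick_tokenizer(text, lang_tokenizers):
    """Return the first tokenizer whose inscope() is True, or None."""
    for tokenizer in lang_tokenizers:
        if tokenizer.inscope(text):
            return tokenizer
    return None


tokenizers = [JapaneseLike(), EnglishLike()]
mixed = pick_tokenizer("形態素解析 with spaCy", tokenizers)
```

Mixed text matches both stand-ins, so the one listed first wins; reversing the list would send the same text to the English-like tokenizer.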
py_pdf_term.candidates package¶
- class py_pdf_term.candidates.CandidateTermExtractor(lang_tokenizer_clses: list[type[BaseLanguageTokenizer]] | None = None, token_classifier_clses: list[type[BaseTokenClassifier]] | None = None, token_filter_clses: list[type[BaseCandidateTokenFilter]] | None = None, term_filter_clses: list[type[BaseCandidateTermFilter]] | None = None, splitter_clses: list[type[BaseSplitter]] | None = None, augmenter_clses: list[type[BaseAugmenter]] | None = None)¶
Bases:
object
Term extractor which extracts candidate terms from an XML file.
- Parameters:
lang_tokenizer_clses – List of language tokenizer classes to tokenize texts. If None, the default language tokenizers are used.
token_classifier_clses – List of token classifier classes to classify tokens. If None, the default token classifiers are used.
token_filter_clses – List of token filter classes to filter tokens. If None, the default token filters are used.
term_filter_clses – List of term filter classes to filter candidate terms. If None, the default term filters are used.
splitter_clses – List of splitter classes to split candidate terms. If None, the default splitters are used.
augmenter_clses – List of augmenter classes to augment candidate terms. If None, the default augmenters are used.
- extract_from_domain_elements(domain: str, pdfnxmls: list[PDFnXMLElement]) DomainCandidateTermList ¶
Extract candidate terms from pairs of PDF and XML elements in a domain.
- Parameters:
domain – Domain name of PDF files.
pdfnxmls – List of pairs of a PDF path and an XML element in a domain.
- Returns:
List of candidate terms in a domain.
- Return type:
DomainCandidateTermList
- extract_from_domain_files(domain: str, pdfnxmls: list[PDFnXMLPath]) DomainCandidateTermList ¶
Extract candidate terms from pairs of PDF and XML files in a domain.
- Parameters:
domain – Domain name of PDF files.
pdfnxmls – List of pairs of paths to PDF and XML files in a domain.
- Returns:
List of candidate terms in a domain.
- Return type:
DomainCandidateTermList
- extract_from_text(text: str, fontsize: float = 0.0, ncolor: str = '') list[Term] ¶
Extract candidate terms from a text. This method is mainly used for testing.
- Parameters:
text – Text to extract candidate terms from.
fontsize – Font size of output terms.
ncolor – Color of output terms.
- Returns:
List of candidate terms in a text.
- Return type:
list[Term]
- extract_from_xml_element(pdfnxml: PDFnXMLElement) PDFCandidateTermList ¶
Extract candidate terms from a pair of a PDF path and an XML element.
- Parameters:
pdfnxml – Pair of a path to a PDF file and an XML element.
- Returns:
List of candidate terms in a PDF file.
- Return type:
PDFCandidateTermList
- extract_from_xml_file(pdfnxml: PDFnXMLPath) PDFCandidateTermList ¶
Extract candidate terms from a pair of PDF and XML files.
- Parameters:
pdfnxml – Pair of paths to a PDF file and an XML file.
- Returns:
List of candidate terms in a PDF file.
- Return type:
PDFCandidateTermList
- class py_pdf_term.candidates.DomainCandidateTermList(domain: str, pdfs: list[PDFCandidateTermList])¶
Bases:
object
Domain name of PDF files and candidate terms of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFCandidateTermList]) – Candidate terms of each PDF file of the domain.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- pdfs: list[PDFCandidateTermList]¶
- class py_pdf_term.candidates.PDFCandidateTermList(pdf_path: str, pages: list[PageCandidateTermList])¶
Bases:
object
Path of a PDF file and candidate terms of the PDF file.
- Parameters:
pdf_path (str) – Path of a PDF file.
pages (list[PageCandidateTermList]) – Candidate terms of each page of the PDF file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- pages: list[PageCandidateTermList]¶
- class py_pdf_term.candidates.PageCandidateTermList(page_num: int, candidates: list[Term])¶
Bases:
object
Page number and candidate terms of the page.
- Parameters:
page_num (int) – Page number of a PDF file.
candidates (list[Term]) – Candidate terms of the page.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- page_num: int¶
py_pdf_term.candidates.augmenters subpackage¶
- class py_pdf_term.candidates.augmenters.AugmenterCombiner(augmenters: list[BaseAugmenter] | None = None)¶
Bases:
object
Combiner of augmenters of a candidate term.
- Parameters:
augmenters – List of augmenters to be combined. The augmenters are applied in order. If None, the default augmenters are used. The default augmenters are JapaneseConnectorTermAugmenter and EnglishConnectorTermAugmenter.
- class py_pdf_term.candidates.augmenters.BaseAugmenter¶
Bases:
object
Base class for augmenters of a candidate term.
When a long term is a candidate, subterms of the long term may also be candidates. For example, if “semantic analysis of programming language” is a candidate, “semantic analysis” and “programming language” may also be candidates.
This class is used to augment a candidate term to its subterms.
- class py_pdf_term.candidates.augmenters.EnglishConnectorTermAugmenter¶
Bases:
BaseSeparationAugmenter
An augmenter of a candidate term by separating tokens based on English connector terms.
- class py_pdf_term.candidates.augmenters.JapaneseConnectorTermAugmenter¶
Bases:
BaseSeparationAugmenter
An augmenter of a candidate term by separating tokens based on Japanese connector terms.
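The connector-based separation can be sketched for English as follows. This is an illustration only: the CONNECTORS set and the augment function are assumptions, and the library works on Token objects rather than plain strings.

```python
CONNECTORS = {"of", "in", "for", "on"}  # assumed connector words, for illustration


def augment(tokens):
    """Return the subterms obtained by cutting a term at connector tokens."""
    subterms, current = [], []
    for token in tokens:
        if token.lower() in CONNECTORS:
            if current:
                subterms.append(current)
            current = []
        else:
            current.append(token)
    if current:
        subterms.append(current)
    # A term without connectors yields only itself, which adds nothing new.
    if len(subterms) < 2:
        return []
    return [" ".join(subterm) for subterm in subterms]


subterms = augment("semantic analysis of programming language".split())
```

Here subterms holds "semantic analysis" and "programming language", matching the example given for BaseAugmenter.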
py_pdf_term.candidates.splitters subpackage¶
- class py_pdf_term.candidates.splitters.BaseSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶
Bases:
object
Base class for splitters of a wrongly concatenated term.
Since text extraction from PDF is not perfect, especially in tables and figures, a term may be wrongly concatenated. For example, when a PDF file contains a table which shows the difference between quick sort, merge sort, and heap sort, the extracted text may be something like “quick sort merge sort heap sort”. In this case, “quick sort”, “merge sort”, and “heap sort” are wrongly concatenated.
This class is used to split a wrongly concatenated term into subterms.
- Parameters:
classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.
- class py_pdf_term.candidates.splitters.RepeatSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶
Bases:
BaseSplitter
Splitter to split a term by repeated tokens. For example, “quick sort merge sort heap sort” is split into “quick sort”, “merge sort”, and “heap sort”.
- Parameters:
classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.
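One simple heuristic that reproduces the example above is to cut after every token that occurs more than once. The function split_by_repeats is a hypothetical sketch on plain strings and is not claimed to be the algorithm RepeatSplitter uses; it assumes the repeated token ends each subterm, as the head noun does in English.

```python
from collections import Counter


def split_by_repeats(tokens):
    """Cut the token sequence after each token that appears more than once."""
    counts = Counter(tokens)
    pieces, current = [], []
    for token in tokens:
        current.append(token)
        if counts[token] > 1:
            pieces.append(" ".join(current))
            current = []
    if current:
        # Trailing tokens after the last repeated token form a final piece.
        pieces.append(" ".join(current))
    return pieces


pieces = split_by_repeats("quick sort merge sort heap sort".split())
```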
- class py_pdf_term.candidates.splitters.SplitterCombiner(splitters: list[BaseSplitter] | None = None)¶
Bases:
object
Combiner of splitters.
- Parameters:
splitters – List of splitters to split terms. The splitters are applied in order. If None, the default splitters are used. The default splitters are SymbolNameSplitter and RepeatSplitter.
- class py_pdf_term.candidates.splitters.SymbolNameSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶
Bases:
BaseSplitter
Splitter that splits off a symbol at the end of a term. For example, given “Programming Language 2”, this splitter splits it into “Programming Language” and “2”, and then ignores “2” as a meaningless term.
- Parameters:
classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.
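A minimal sketch of the trailing-symbol case, assuming a "symbol" is a trailing number or single letter. Both `SYMBOL_PATTERN` and `strip_trailing_symbol` are hypothetical; the library decides what counts as a symbol through its token classifiers.

```python
import re

# Assumption: a trailing number or a single Latin letter counts as a symbol.
SYMBOL_PATTERN = re.compile(r"[0-9]+|[A-Za-z]")


def strip_trailing_symbol(tokens: list[str]) -> list[str]:
    """Drop the last token when it looks like a symbol and a term would remain."""
    if len(tokens) > 1 and SYMBOL_PATTERN.fullmatch(tokens[-1]):
        return tokens[:-1]
    return tokens


print(" ".join(strip_trailing_symbol("Programming Language 2".split())))
```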
py_pdf_term.candidates.filters subpackage¶
- class py_pdf_term.candidates.filters.BaseCandidateTermFilter¶
Bases:
object
Base class for filters of candidate terms.
- class py_pdf_term.candidates.filters.BaseCandidateTokenFilter¶
Bases:
object
Base class for filters of tokens which can be part of a candidate term.
- class py_pdf_term.candidates.filters.BaseEnglishCandidateTermFilter¶
Bases:
BaseCandidateTermFilter
Base class for filters of English candidate terms.
- class py_pdf_term.candidates.filters.BaseJapaneseCandidateTermFilter¶
Bases:
BaseCandidateTermFilter
Base class for filters of Japanese candidate terms.
- class py_pdf_term.candidates.filters.EnglishConcatenationFilter¶
Bases:
BaseEnglishCandidateTermFilter
Candidate term filter to filter out invalidly concatenated English terms.
- class py_pdf_term.candidates.filters.EnglishNumericFilter¶
Bases:
BaseEnglishCandidateTermFilter
Term filter to remove English numeric phrases from candidate terms.
- class py_pdf_term.candidates.filters.EnglishProperNounFilter¶
Bases:
BaseEnglishCandidateTermFilter
Term filter to remove English proper nouns from candidate terms.
- class py_pdf_term.candidates.filters.EnglishSymbolLikeFilter¶
Bases:
BaseEnglishCandidateTermFilter
Candidate term filter to filter out symbol-like English terms.
- class py_pdf_term.candidates.filters.EnglishTokenFilter¶
Bases:
BaseCandidateTokenFilter
Candidate token filter to filter out English tokens which cannot be part of candidate terms.
- class py_pdf_term.candidates.filters.FilterCombiner(token_filters: list[BaseCandidateTokenFilter] | None = None, term_filters: list[BaseCandidateTermFilter] | None = None)¶
Bases:
object
Combiner of token filters and term filters.
- Parameters:
token_filters – List of token filters to filter tokens. If None, the default token filters are used. The default token filters are JapaneseTokenFilter and EnglishTokenFilter.
term_filters – List of term filters to filter candidate terms. If None, the default term filters are used. The default term filters are JapaneseConcatenationFilter, EnglishConcatenationFilter, JapaneseSymbolLikeFilter, EnglishSymbolLikeFilter, JapaneseProperNounFilter, EnglishProperNounFilter, JapaneseNumericFilter, and EnglishNumericFilter.
- class py_pdf_term.candidates.filters.JapaneseConcatenationFilter¶
Bases:
BaseJapaneseCandidateTermFilter
Candidate term filter to filter out invalidly concatenated Japanese terms.
- class py_pdf_term.candidates.filters.JapaneseNumericFilter¶
Bases:
BaseJapaneseCandidateTermFilter
Term filter to remove Japanese numeric phrases from candidate terms.
- class py_pdf_term.candidates.filters.JapaneseProperNounFilter¶
Bases:
BaseJapaneseCandidateTermFilter
Term filter to remove Japanese proper nouns from candidate terms.
- class py_pdf_term.candidates.filters.JapaneseSymbolLikeFilter¶
Bases:
BaseJapaneseCandidateTermFilter
Candidate term filter to filter out symbol-like Japanese terms.
- class py_pdf_term.candidates.filters.JapaneseTokenFilter¶
Bases:
BaseCandidateTokenFilter
Candidate token filter to filter out Japanese tokens which cannot be part of candidate terms.
py_pdf_term.candidates.classifiers subpackage¶
- class py_pdf_term.candidates.classifiers.BaseTokenClassifier¶
Bases:
object
Base class for token classifiers. A token classifier is used to classify a token into a specific category.
- abstractmethod inscope(token: Token) bool ¶
Test whether a token is in the scope of this classifier or not.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is in the scope of this classifier, False otherwise.
- Return type:
bool
- is_connector(token: Token) bool ¶
Test whether a token is a connector or not. A connector is a token that connects two terms, such as a connector symbol or a connector term.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector, False otherwise.
- Return type:
bool
- abstractmethod is_connector_symbol(token: Token) bool ¶
Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector symbol, False otherwise.
- Return type:
bool
- abstractmethod is_connector_term(token: Token) bool ¶
Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector term, False otherwise.
- Return type:
bool
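The shape of this interface can be sketched with a small abstract base class. Everything here is a stand-in: `SimpleToken` models only two fields of a token, and the assumption that `is_connector` combines the two abstract checks is inferred from the descriptions above rather than taken from the library's source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical stand-in for the library's Token."""

    lang: str
    surface: str


class SketchClassifier(ABC):
    @abstractmethod
    def inscope(self, token: SimpleToken) -> bool: ...

    @abstractmethod
    def is_connector_symbol(self, token: SimpleToken) -> bool: ...

    @abstractmethod
    def is_connector_term(self, token: SimpleToken) -> bool: ...

    def is_connector(self, token: SimpleToken) -> bool:
        # Assumed: a connector is either a connector symbol or a connector term.
        return self.is_connector_symbol(token) or self.is_connector_term(token)


class SketchEnglishClassifier(SketchClassifier):
    def inscope(self, token: SimpleToken) -> bool:
        return token.lang == "en"

    def is_connector_symbol(self, token: SimpleToken) -> bool:
        return token.surface == "-"

    def is_connector_term(self, token: SimpleToken) -> bool:
        return token.surface in {"of", "in"}


classifier = SketchEnglishClassifier()
print(classifier.is_connector(SimpleToken("en", "of")))
```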
- class py_pdf_term.candidates.classifiers.EnglishTokenClassifier¶
Bases:
BaseTokenClassifier
Token classifier for English tokens.
- inscope(token: Token) bool ¶
Test whether a token is in the scope of this classifier or not.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is in the scope of this classifier, False otherwise.
- Return type:
bool
- is_connector_symbol(token: Token) bool ¶
Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector symbol, False otherwise.
- Return type:
bool
- is_connector_term(token: Token) bool ¶
Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector term, False otherwise.
- Return type:
bool
- class py_pdf_term.candidates.classifiers.JapaneseTokenClassifier¶
Bases:
BaseTokenClassifier
Token classifier for Japanese tokens.
- inscope(token: Token) bool ¶
Test whether a token is in the scope of this classifier or not.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is in the scope of this classifier, False otherwise.
- Return type:
bool
- is_connector_symbol(token: Token) bool ¶
Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector symbol, False otherwise.
- Return type:
bool
- is_connector_term(token: Token) bool ¶
Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.
- Parameters:
token – Token to be tested.
- Returns:
True if the token is a connector term, False otherwise.
- Return type:
bool
- class py_pdf_term.candidates.classifiers.MeaninglessMarker(classifiers: list[BaseTokenClassifier] | None = None)¶
Bases:
object
Marker class to mark meaningless tokens in a term.
- Parameters:
classifiers – List of token classifiers to mark meaningless tokens. If None, JapaneseTokenClassifier and EnglishTokenClassifier are used.
py_pdf_term.analysis package¶
- class py_pdf_term.analysis.ContainerTermsAnalyzer(ignore_augmented: bool = True)¶
Bases:
object
Analyze container terms of the domain.
- Parameters:
ignore_augmented – If True, ignore augmented terms. The default is True.
- analyze(domain_candidates: DomainCandidateTermList) DomainContainerTerms ¶
Analyze container terms of the domain.
- Parameters:
domain_candidates – List of candidate terms in a domain. The target of analysis.
- Returns:
Domain name and container terms of candidate terms in the domain.
- Return type:
DomainContainerTerms
- class py_pdf_term.analysis.DomainContainerTerms(domain: str, container_terms: dict[str, set[str]])¶
Bases:
object
Domain name and container terms of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
container_terms (dict[str, set[str]]) – Set of lemmatized containers of the lemmatized term in the domain. (term, container) is valid if and only if the container contains the term as a proper subsequence.
- domain: str¶
- container_terms: dict[str, set[str]]¶
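The container relation can be illustrated as follows. The sketch assumes terms are space-separated strings and that "proper subsequence" means a contiguous run of tokens that is shorter than the container; both helper names are hypothetical.

```python
def is_proper_contiguous_subsequence(term: tuple[str, ...], container: tuple[str, ...]) -> bool:
    """True if `term` occurs as a contiguous token run inside a longer `container`."""
    n, m = len(term), len(container)
    if n == 0 or n >= m:
        return False
    return any(container[i:i + n] == term for i in range(m - n + 1))


def container_terms(terms: list[str]) -> dict[str, set[str]]:
    """Map each term to the set of other terms that contain it."""
    tokenized = {term: tuple(term.split()) for term in terms}
    return {
        term: {
            other
            for other, other_tokens in tokenized.items()
            if is_proper_contiguous_subsequence(tokens, other_tokens)
        }
        for term, tokens in tokenized.items()
    }


print(container_terms(["sort", "merge sort", "external merge sort"]))
```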
- class py_pdf_term.analysis.DomainLeftRightFrequency(domain: str, left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶
Bases:
object
Domain name and left/right frequency of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.
- domain: str¶
- left_freq: dict[str, dict[str, int]]¶
- right_freq: dict[str, dict[str, int]]¶
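The sketch below shows how such left/right counts could be accumulated from tokenized terms. The nesting order (`left_freq[token][left]` and `right_freq[token][right]`) is an assumption, and the handling of meaningless tokens is left out.

```python
from collections import defaultdict


def left_right_frequency(
    terms: list[list[str]],
) -> tuple[dict[str, dict[str, int]], dict[str, dict[str, int]]]:
    """Count adjacent token pairs over all terms."""
    left_freq: defaultdict[str, defaultdict[str, int]] = defaultdict(lambda: defaultdict(int))
    right_freq: defaultdict[str, defaultdict[str, int]] = defaultdict(lambda: defaultdict(int))
    for tokens in terms:
        for left, right in zip(tokens, tokens[1:]):
            # Assumed layout: outer key is the token, inner key is its neighbor.
            left_freq[right][left] += 1
            right_freq[left][right] += 1
    # Convert to plain dicts so the result compares and prints cleanly.
    return (
        {token: dict(counts) for token, counts in left_freq.items()},
        {token: dict(counts) for token, counts in right_freq.items()},
    )


left, right = left_right_frequency([["merge", "sort"], ["heap", "sort"], ["merge", "sort"]])
print(left)
print(right)
```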
- class py_pdf_term.analysis.DomainTermOccurrence(domain: str, term_freq: dict[str, int], doc_term_freq: dict[str, int])¶
Bases:
object
Domain name and term occurrence of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
doc_term_freq (dict[str, int]) – Number of documents in the domain that contain the lemmatized term. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
- domain: str¶
- term_freq: dict[str, int]¶
- doc_term_freq: dict[str, int]¶
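The two counts can be sketched as below, assuming each document is a list of space-separated candidate phrases and that a term also counts when it appears inside a longer phrase. `count_in` and `term_occurrence` are hypothetical helpers.

```python
def count_in(term_tokens: list[str], phrase_tokens: list[str]) -> int:
    """Count contiguous occurrences of a term inside a phrase."""
    n = len(term_tokens)
    return sum(
        phrase_tokens[i:i + n] == term_tokens
        for i in range(len(phrase_tokens) - n + 1)
    )


def term_occurrence(docs: list[list[str]]) -> tuple[dict[str, int], dict[str, int]]:
    """Return (term_freq, doc_term_freq) over all documents of a domain."""
    vocabulary = {term for doc in docs for term in doc}
    term_freq = {term: 0 for term in vocabulary}
    doc_term_freq = {term: 0 for term in vocabulary}
    for doc in docs:
        for term in vocabulary:
            hits = sum(count_in(term.split(), phrase.split()) for phrase in doc)
            term_freq[term] += hits
            if hits:
                doc_term_freq[term] += 1
    return term_freq, doc_term_freq


tf, df = term_occurrence([["merge sort", "sort"], ["heap sort"]])
print(tf["sort"], df["sort"])
```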
- class py_pdf_term.analysis.TermLeftRightFrequencyAnalyzer(ignore_augmented: bool = True)¶
Bases:
object
Analyze left/right frequency of terms in a domain.
- Parameters:
ignore_augmented – If True, ignore augmented terms. The default is True.
- analyze(domain_candidates: DomainCandidateTermList) DomainLeftRightFrequency ¶
Analyze left/right frequency of terms in a domain.
- Parameters:
domain_candidates – List of candidate terms in a domain. The target of analysis.
- Returns:
Domain name and left/right frequency of candidate terms in the domain.
- Return type:
DomainLeftRightFrequency
- class py_pdf_term.analysis.TermOccurrenceAnalyzer(ignore_augmented: bool = True)¶
Bases:
object
Analyze term occurrences in a domain.
- Parameters:
ignore_augmented – If True, ignore augmented terms. The default is True.
- analyze(domain_candidates: DomainCandidateTermList) DomainTermOccurrence ¶
Analyze term occurrences in a domain.
- Parameters:
domain_candidates – List of candidate terms in a domain. The target of analysis.
- Returns:
Domain name and term occurrence of candidate terms in the domain.
- Return type:
DomainTermOccurrence
py_pdf_term.methods package¶
- class py_pdf_term.methods.BaseMultiDomainRankingMethod(data_collector: BaseRankingDataCollector, ranker: BaseMultiDomainRanker)¶
Bases:
Generic
Base class for ranking methods with an algorithm which requires cross-domain information.
- Parameters:
data_collector – Collector of metadata to rank candidate terms in domain-specific PDF documents.
ranker – Ranker of candidate terms in PDF documents by an algorithm which requires cross-domain information.
- collect_data(domain_candidates: DomainCandidateTermList) RankingData ¶
Collect metadata to rank candidate terms in PDF documents. This method is used to collect metadata before ranking candidate terms in PDF documents. The following two code snippets are equivalent:
` ranking_data_list = list(map(method.collect_data, domain_candidates_list)) `
` term_rankings = method.rank_terms(domain_candidates_list, ranking_data_list) `
and
` term_rankings = method.rank_terms(domain_candidates_list) `
This method is useful when you want to utilize cached metadata to rank candidate terms in PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
RankingData
- abstractmethod classmethod collect_data_from_dict(obj: dict[str, Any]) RankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
RankingData
- rank_domain_terms(domain: str, domain_candidates_list: list[DomainCandidateTermList], ranking_data_list: list[RankingData] | None = None) MethodTermRanking ¶
Rank candidate terms in PDF documents in a domain.
- Parameters:
domain – Domain to rank candidate terms in PDF documents.
domain_candidates_list – List of candidate terms in domain-specific PDF documents.
ranking_data_list – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- rank_terms(domain_candidates_list: list[DomainCandidateTermList], ranking_data_list: list[RankingData] | None = None) Iterator[MethodTermRanking] ¶
Rank candidate terms in PDF documents in multiple domains.
- Parameters:
domain_candidates_list – List of candidate terms in domain-specific PDF documents.
ranking_data_list – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.
- Yields:
MethodTermRanking – Ranking result of candidate terms in PDF documents.
- class py_pdf_term.methods.BaseSingleDomainRankingMethod(data_collector: BaseRankingDataCollector, ranker: BaseSingleDomainRanker)¶
Bases:
Generic
Base class for ranking methods with an algorithm which does not require cross-domain information.
- Parameters:
data_collector – Collector of metadata to rank candidate terms in domain-specific PDF documents.
ranker – Ranker of candidate terms in PDF documents by an algorithm which does not require cross-domain information.
- collect_data(domain_candidates: DomainCandidateTermList) RankingData ¶
Collect metadata to rank candidate terms in PDF documents. This method is used to collect metadata before ranking candidate terms in PDF documents. The following two code snippets are equivalent:
` ranking_data = method.collect_data(domain_candidates) `
` term_ranking = method.rank_terms(domain_candidates, ranking_data) `
and
` term_ranking = method.rank_terms(domain_candidates) `
This method is useful when you want to utilize cached metadata to rank candidate terms in PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
RankingData
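The equivalence described for collect_data can be illustrated with a self-contained stand-in. `SketchSingleDomainMethod`, `count_terms`, and `rank_by_count` are hypothetical; the sketch only shows the control flow where precollected data, when given, replaces the collection step.

```python
from typing import Callable, Optional


class SketchSingleDomainMethod:
    """Hypothetical stand-in that mirrors the collect-then-rank flow."""

    def __init__(
        self,
        collect: Callable[[list[str]], dict[str, int]],
        rank: Callable[[list[str], dict[str, int]], list[str]],
    ) -> None:
        self._collect = collect
        self._rank = rank

    def collect_data(self, candidates: list[str]) -> dict[str, int]:
        return self._collect(candidates)

    def rank_terms(
        self, candidates: list[str], ranking_data: Optional[dict[str, int]] = None
    ) -> list[str]:
        # Collect only when no (possibly cached) data was passed in.
        if ranking_data is None:
            ranking_data = self.collect_data(candidates)
        return self._rank(candidates, ranking_data)


def count_terms(candidates: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for term in candidates:
        counts[term] = counts.get(term, 0) + 1
    return counts


def rank_by_count(candidates: list[str], counts: dict[str, int]) -> list[str]:
    return sorted(set(candidates), key=lambda term: (-counts[term], term))


method = SketchSingleDomainMethod(count_terms, rank_by_count)
candidates = ["parser", "token", "parser"]
cached = method.collect_data(candidates)
print(method.rank_terms(candidates, cached) == method.rank_terms(candidates))
```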
- abstractmethod classmethod collect_data_from_dict(obj: dict[str, Any]) RankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
RankingData
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: RankingData | None = None) MethodTermRanking ¶
Rank candidate terms in PDF documents in a domain.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.FLRHMethod(threshold: float = 1e-8, max_loop: int = 1000)¶
Bases:
BaseSingleDomainRankingMethod[FLRHRankingData]
Ranking method by FLRH algorithm. This algorithm is a combination of FLR and HITS.
- Parameters:
threshold – Threshold value for the HITS part of the FLRH algorithm. The default is 1e-8.
max_loop – Maximum number of loops for the HITS part of the FLRH algorithm. The default is 1000.
- classmethod collect_data_from_dict(obj: dict[str, Any]) FLRHRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
FLRHRankingData
- class py_pdf_term.methods.FLRMethod¶
Bases:
BaseSingleDomainRankingMethod[FLRRankingData]
Ranking method by FLR algorithm.
- classmethod collect_data_from_dict(obj: dict[str, Any]) FLRRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
FLRRankingData
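For orientation, a commonly cited formulation of the FLR score multiplies a term's frequency by the geometric mean of its tokens' left and right connection counts, each incremented by one. The sketch below follows that formulation; it is an assumption that the library computes the score the same way, and `flr_score` is a hypothetical helper operating on dictionaries shaped like `term_freq`, `left_freq`, and `right_freq`.

```python
import math


def flr_score(
    term: str,
    term_freq: dict[str, int],
    left_freq: dict[str, dict[str, int]],
    right_freq: dict[str, dict[str, int]],
) -> float:
    """Term frequency times the geometric mean of (left + 1) and (right + 1) counts."""
    tokens = term.split()
    if not tokens:
        return 0.0
    log_sum = 0.0
    for token in tokens:
        left_total = sum(left_freq.get(token, {}).values())
        right_total = sum(right_freq.get(token, {}).values())
        log_sum += math.log(left_total + 1) + math.log(right_total + 1)
    # Geometric mean over 2 * len(tokens) factors, computed in log space.
    lr = math.exp(log_sum / (2 * len(tokens)))
    return term_freq.get(term, 0) * lr


score = flr_score(
    "merge sort",
    term_freq={"merge sort": 2},
    left_freq={"sort": {"merge": 2, "heap": 1}},
    right_freq={"merge": {"sort": 2}},
)
print(round(score, 4))
```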
- class py_pdf_term.methods.HITSMethod(threshold: float = 1e-8, max_loop: int = 1000)¶
Bases:
BaseSingleDomainRankingMethod[HITSRankingData]
Ranking method by HITS algorithm.
- Parameters:
threshold – Threshold to determine convergence. If the difference between original auth/hub values and new auth/hub values is less than this threshold, the algorithm is considered to have converged. The default is 1e-8.
max_loop – Maximum number of loops to run the algorithm. If the algorithm does not converge within this number of loops, it is forced to stop. The default is 1000.
- classmethod collect_data_from_dict(obj: dict[str, Any]) HITSRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
HITSRankingData
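The roles of threshold and max_loop can be seen in a generic HITS iteration over a directed graph. This is a textbook-style sketch on (source, destination) edges, not the library's term graph; `hits` and `_normalize` are hypothetical names, and the largest per-node change is assumed as the convergence measure.

```python
import math


def _normalize(values: dict[str, float]) -> dict[str, float]:
    """Scale values to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in values.values()))
    return {k: v / norm for k, v in values.items()} if norm > 0 else dict(values)


def hits(
    edges: list[tuple[str, str]], threshold: float = 1e-8, max_loop: int = 1000
) -> tuple[dict[str, float], dict[str, float]]:
    """Iterate authority/hub updates until the change drops below `threshold`."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}, {}
    auth = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(max_loop):
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        new_auth, new_hub = _normalize(new_auth), _normalize(new_hub)
        diff = max(
            max(abs(new_auth[n] - auth[n]), abs(new_hub[n] - hub[n])) for n in nodes
        )
        auth, hub = new_auth, new_hub
        if diff < threshold:
            break  # converged before reaching max_loop
    return auth, hub


auth, hub = hits([("a", "c"), ("b", "c")])
print(auth["c"], hub["a"])
```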
- class py_pdf_term.methods.MCValueMethod¶
Bases:
BaseSingleDomainRankingMethod[MCValueRankingData]
Ranking method by MC-Value algorithm.
- classmethod collect_data_from_dict(obj: dict[str, Any]) MCValueRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
MCValueRankingData
- class py_pdf_term.methods.MDPMethod¶
Bases:
BaseMultiDomainRankingMethod[MDPRankingData]
Ranking method by MDP algorithm.
- classmethod collect_data_from_dict(obj: dict[str, Any]) MDPRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
MDPRankingData
- class py_pdf_term.methods.MethodTermRanking(domain: str, ranking: list[ScoredTerm])¶
Bases:
object
Domain name and ranking of technical terms of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
ranking (list[ScoredTerm]) – List of pairs of lemmatized term and method score. The list is sorted by the score in descending order.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- ranking: list[ScoredTerm]¶
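A round trip through to_dict and from_dict might look like the sketch below. The classes are hypothetical mirrors built with dataclasses; the field names `term` and `score` and the dictionary layout are assumptions, since the serialized format is not specified here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class SketchScoredTerm:
    term: str
    score: float


@dataclass(frozen=True)
class SketchTermRanking:
    domain: str
    ranking: list[SketchScoredTerm]

    def to_dict(self) -> dict[str, Any]:
        # asdict converts nested dataclasses inside lists as well.
        return asdict(self)

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> "SketchTermRanking":
        return cls(obj["domain"], [SketchScoredTerm(**item) for item in obj["ranking"]])


original = SketchTermRanking(
    "natural language processing",
    [SketchScoredTerm("parser", 0.9), SketchScoredTerm("token", 0.4)],
)
restored = SketchTermRanking.from_dict(json.loads(json.dumps(original.to_dict())))
print(restored == original)
```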
- class py_pdf_term.methods.TFIDFMethod¶
Bases:
BaseMultiDomainRankingMethod[TFIDFRankingData]
Ranking method by TF-IDF algorithm.
- classmethod collect_data_from_dict(obj: dict[str, Any]) TFIDFRankingData ¶
Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.
- Parameters:
obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
TFIDFRankingData
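Why TF-IDF needs cross-domain information can be seen in a small sketch that treats each domain as one "document" for the IDF part. This is one common TF-IDF variant chosen for illustration, not necessarily the weighting the library uses; `tfidf_ranking` and its input layout are hypothetical.

```python
import math


def tfidf_ranking(
    target: str, term_freq_by_domain: dict[str, dict[str, int]]
) -> list[tuple[str, float]]:
    """Rank terms of `target` by tf * log(N / df), with df counted over domains."""
    num_domains = len(term_freq_by_domain)
    scores: dict[str, float] = {}
    for term, tf in term_freq_by_domain[target].items():
        if tf <= 0:
            continue
        # Number of domains in which the term occurs at least once.
        df = sum(1 for freqs in term_freq_by_domain.values() if freqs.get(term, 0) > 0)
        scores[term] = tf * math.log(num_domains / df)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


data = {
    "nlp": {"parser": 4, "algorithm": 5},
    "db": {"index": 3, "algorithm": 2},
}
print(tfidf_ranking("nlp", data))
```

A term that occurs in every domain, such as "algorithm" here, scores zero regardless of how frequent it is in the target domain.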
py_pdf_term.methods.collectors subpackage¶
- class py_pdf_term.methods.collectors.BaseRankingDataCollector¶
Bases:
Generic
Base class for ranking data collectors. This class is used to collect metadata to rank candidate terms in domain-specific PDF documents.
- abstractmethod collect(domain_candidates: DomainCandidateTermList) RankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
RankingData
- class py_pdf_term.methods.collectors.FLRHRankingDataCollector¶
Bases:
BaseRankingDataCollector[FLRHRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by FLRH algorithm.
- collect(domain_candidates: DomainCandidateTermList) FLRHRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
FLRHRankingData
- class py_pdf_term.methods.collectors.FLRRankingDataCollector¶
Bases:
BaseRankingDataCollector[FLRRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by FLR algorithm.
- collect(domain_candidates: DomainCandidateTermList) FLRRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
FLRRankingData
- class py_pdf_term.methods.collectors.HITSRankingDataCollector¶
Bases:
BaseRankingDataCollector[HITSRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by HITS algorithm.
- collect(domain_candidates: DomainCandidateTermList) HITSRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
HITSRankingData
- class py_pdf_term.methods.collectors.MCValueRankingDataCollector¶
Bases:
BaseRankingDataCollector[MCValueRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by MC-Value algorithm.
- collect(domain_candidates: DomainCandidateTermList) MCValueRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
MCValueRankingData
- class py_pdf_term.methods.collectors.MDPRankingDataCollector¶
Bases:
BaseRankingDataCollector[MDPRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by MDP algorithm.
- collect(domain_candidates: DomainCandidateTermList) MDPRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
MDPRankingData
- class py_pdf_term.methods.collectors.TFIDFRankingDataCollector¶
Bases:
BaseRankingDataCollector[TFIDFRankingData]
Collector of metadata to rank candidate terms in domain-specific PDF documents by TF-IDF algorithm.
- collect(domain_candidates: DomainCandidateTermList) TFIDFRankingData ¶
Collect metadata to rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
- Returns:
Metadata to rank candidate terms in PDF documents.
- Return type:
TFIDFRankingData
py_pdf_term.methods.rankers subpackage¶
- class py_pdf_term.methods.rankers.BaseMultiDomainRanker¶
Bases:
Generic
Base class for term rankers with an algorithm which requires cross-domain information.
- abstractmethod rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[RankingData]) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.BaseSingleDomainRanker¶
Bases:
Generic
Base class for term rankers with an algorithm which does not require cross-domain information.
- abstractmethod rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: RankingData) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.FLRHRanker(threshold: float = 1e-8, max_loop: int = 1000)¶
Bases:
BaseSingleDomainRanker[FLRHRankingData]
Term ranker by FLRH algorithm. This algorithm is a combination of FLR and HITS.
- Parameters:
threshold – Threshold value for HITS algorithm. The default is 1e-8.
max_loop – Maximum number of loops for HITS algorithm. The default is 1000.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: FLRHRankingData) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.FLRRanker¶
Bases:
BaseSingleDomainRanker[FLRRankingData]
Term ranker by FLR algorithm.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: FLRRankingData) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.HITSRanker(threshold: float = 1e-8, max_loop: int = 1000)¶
Bases:
BaseSingleDomainRanker[HITSRankingData]
Term ranker by HITS algorithm.
- Parameters:
threshold – Threshold to determine convergence. If the difference between original auth/hub values and new auth/hub values is less than this threshold, the algorithm is considered to have converged. The default is 1e-8.
max_loop – Maximum number of loops to run the algorithm. If the algorithm does not converge within this number of loops, it is forced to stop. The default is 1000.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: HITSRankingData) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.MCValueRanker¶
Bases:
BaseSingleDomainRanker[MCValueRankingData]
Term ranker by MC-Value algorithm.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: MCValueRankingData) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.MDPRanker¶
Bases:
BaseMultiDomainRanker[MDPRankingData]
Term ranker by MDP algorithm.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[MDPRankingData]) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
- class py_pdf_term.methods.rankers.TFIDFRanker¶
Bases:
BaseMultiDomainRanker[TFIDFRankingData]
Term ranker by TF-IDF algorithm.
- rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[TFIDFRankingData]) MethodTermRanking ¶
Rank candidate terms in domain-specific PDF documents.
- Parameters:
domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.
- Returns:
Ranking result of candidate terms in PDF documents.
- Return type:
MethodTermRanking
py_pdf_term.methods.rankingdata subpackage¶
- class py_pdf_term.methods.rankingdata.BaseRankingData(domain: str)¶
Bases:
object
Base class for ranking data of technical terms of a domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- class py_pdf_term.methods.rankingdata.FLRHRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶
Bases:
BaseRankingData
Data of technical terms of a domain for FLRH algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.
- domain: str¶
- term_freq: dict[str, int]¶
- left_freq: dict[str, dict[str, int]]¶
- right_freq: dict[str, dict[str, int]]¶
- class py_pdf_term.methods.rankingdata.FLRRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶
Bases:
BaseRankingData
Data of technical terms of a domain for FLR algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.
- domain: str¶
- term_freq: dict[str, int]¶
- left_freq: dict[str, dict[str, int]]¶
- right_freq: dict[str, dict[str, int]]¶
- class py_pdf_term.methods.rankingdata.HITSRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶
Bases:
BaseRankingData
Data of technical terms of a domain for HITS algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless this is fixed at zero.
- domain: str¶
- term_freq: dict[str, int]¶
- left_freq: dict[str, dict[str, int]]¶
- right_freq: dict[str, dict[str, int]]¶
- class py_pdf_term.methods.rankingdata.MCValueRankingData(domain: str, term_freq: dict[str, int], container_terms: dict[str, set[str]])¶
Bases:
BaseRankingData
Data of technical terms of a domain for MC-Value algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
container_terms (dict[str, set[str]]) – Set of containers of the lemmatized term in the domain. (term, container) is valid iff the container contains the term as a proper subsequence.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- term_freq: dict[str, int]¶
- container_terms: dict[str, set[str]]¶
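The container_terms relation can be sketched as follows. This is a hedged illustration only: it assumes that "proper subsequence" means a contiguous run of tokens that is strictly shorter than the container, and it splits terms on whitespace. The helper names is_proper_subsequence and collect_containers are hypothetical.

```python
def is_proper_subsequence(term, container):
    """True if `term` occurs as a contiguous, strictly shorter run in `container`."""
    n, m = len(term), len(container)
    if n == 0 or n >= m:
        return False
    return any(container[i:i + n] == term for i in range(m - n + 1))


def collect_containers(terms):
    """Map each term to the set of terms that properly contain it."""
    phrases = [tuple(t.split()) for t in terms]
    containers = {t: set() for t in terms}
    for term, term_tokens in zip(terms, phrases):
        for other, other_tokens in zip(terms, phrases):
            if is_proper_subsequence(term_tokens, other_tokens):
                containers[term].add(other)
    return containers


containers = collect_containers(
    ["language", "language model", "large language model"]
)
```

A term is never its own container under this reading, because the container must be strictly longer.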
- class py_pdf_term.methods.rankingdata.MDPRankingData(domain: str, term_freq: dict[str, int])¶
Bases:
BaseRankingData
Data of technical terms of a domain for MDP algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
num_terms (int) – Brute force counting of all lemmatized terms occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- term_freq: dict[str, int]¶
- num_terms: int¶
- class py_pdf_term.methods.rankingdata.TFIDFRankingData(domain: str, term_freq: dict[str, int], doc_freq: dict[str, int], num_docs: int)¶
Bases:
BaseRankingData
Data of technical terms of a domain for TF-IDF algorithm.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
doc_freq (dict[str, int]) – Number of documents in the domain that contain the lemmatized term. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
num_docs (int) – Number of documents in the domain.
- domain: str¶
- term_freq: dict[str, int]¶
- doc_freq: dict[str, int]¶
- num_docs: int¶
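As a rough illustration of how term_freq, doc_freq and num_docs could feed a score, the sketch below uses one common TF-IDF variant, tf * log(num_docs / df). Whether the library uses this exact formula is not stated here, so treat the function tfidf_scores and its weighting as assumptions.

```python
import math


def tfidf_scores(term_freq, doc_freq, num_docs):
    """Score each term as tf * log(num_docs / df) (one common TF-IDF variant)."""
    scores = {}
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        # A term with no recorded document frequency gets a zero score.
        scores[term] = tf * math.log(num_docs / df) if df > 0 else 0.0
    return scores


scores = tfidf_scores(
    term_freq={"parser": 6, "page": 40},
    doc_freq={"parser": 1, "page": 4},
    num_docs=4,
)
```

With this variant, a term that occurs in every document ("page") scores zero however frequent it is, while a term concentrated in one document ("parser") scores higher.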
py_pdf_term.stylings package¶
- class py_pdf_term.stylings.DomainStylingScoreList(domain: str, pdfs: list[PDFStylingScoreList])¶
Bases:
object
Domain name of PDF files and styling scores of technical terms of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFStylingScoreList]) – Styling scores of each PDF file of the domain.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- pdfs: list[PDFStylingScoreList]¶
- class py_pdf_term.stylings.PDFStylingScoreList(pdf_path: str, pages: list[PageStylingScoreList])¶
Bases:
object
Path of a PDF file and styling scores of technical terms of the PDF file.
- Parameters:
pdf_path (str) – Path of a PDF file.
pages (list[PageStylingScoreList]) – Styling scores of each page of the PDF file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- pages: list[PageStylingScoreList]¶
- class py_pdf_term.stylings.PageStylingScoreList(page_num: int, ranking: list[ScoredTerm])¶
Bases:
object
Page number and styling scores of technical terms of the page.
- Parameters:
page_num (int) – Page number of a PDF file.
ranking (list[ScoredTerm]) – List of pairs of lemmatized term and styling score. The list is sorted by the score in descending order.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- page_num: int¶
- ranking: list[ScoredTerm]¶
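The to_dict / from_dict pair on these list classes suggests a plain-dictionary round trip. The sketch below shows that pattern with hypothetical stand-in dataclasses. ScoredTermSketch and PageScoresSketch are not the library's classes, and the term and score field names are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ScoredTermSketch:  # hypothetical stand-in for ScoredTerm
    term: str
    score: float


@dataclass(frozen=True)
class PageScoresSketch:  # hypothetical stand-in for PageStylingScoreList
    page_num: int
    ranking: list[ScoredTermSketch]

    def to_dict(self) -> dict[str, Any]:
        # asdict recurses into nested dataclasses and lists.
        return asdict(self)

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> "PageScoresSketch":
        ranking = [ScoredTermSketch(**item) for item in obj["ranking"]]
        # Keep the documented invariant: highest score first.
        ranking.sort(key=lambda item: item.score, reverse=True)
        return cls(page_num=obj["page_num"], ranking=ranking)


page = PageScoresSketch(
    1, [ScoredTermSketch("tokenizer", 2.5), ScoredTermSketch("page", 0.4)]
)
restored = PageScoresSketch.from_dict(page.to_dict())
```

Because the dictionary form contains only built-in types, it can be passed to json.dump without a custom encoder.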
- class py_pdf_term.stylings.StylingScorer(styling_score_clses: list[type[BaseStylingScore]] | None = None)¶
Bases:
object
Scorer for styling scores. The styling scores are combined by multiplication of each score.
- Parameters:
styling_score_clses – Styling scorers to be combined. If None, the default scorers are used. The default scorers are FontsizeScore and ColorScore.
- score_domain_candidates(domain_candidates: DomainCandidateTermList) DomainStylingScoreList ¶
Calculate styling scores for each candidate term in a domain.
- Parameters:
domain_candidates – List of candidate terms in a domain. The target of analysis.
- Returns:
List of styling scores for each candidate term in a domain. The scores are sorted in descending order.
- Return type:
DomainStylingScoreList
- score_pdf_candidates(pdf_candidates: PDFCandidateTermList) PDFStylingScoreList ¶
Calculate styling scores for each candidate term in a PDF file.
- Parameters:
pdf_candidates – List of candidate terms in a PDF file. The target of analysis.
- Returns:
List of styling scores for each candidate term in a PDF file. The scores are sorted in descending order.
- Return type:
PDFStylingScoreList
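The combination rule described for StylingScorer (multiplying the individual scores) can be sketched as below. The candidate representation (a dict) and the two scoring functions are hypothetical placeholders, not the library's FontsizeScore or ColorScore.

```python
import math


def combine_scores(candidate, scorers):
    """Multiply the outputs of several per-feature scoring functions."""
    return math.prod(scorer(candidate) for scorer in scorers)


# Hypothetical per-feature scorers operating on a dict-shaped candidate.
def fontsize_score(candidate):
    return candidate["fontsize"] / 10.0


def color_score(candidate):
    return 2.0 if candidate["color"] != "black" else 1.0


candidates = [
    {"term": "body term", "fontsize": 10.0, "color": "black"},
    {"term": "heading term", "fontsize": 20.0, "color": "blue"},
]
# Sort by the combined score, highest first.
ranking = sorted(
    ((c["term"], combine_scores(c, [fontsize_score, color_score])) for c in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
```

One consequence of multiplying is that a single score of zero removes a candidate entirely, so each per-feature score should stay positive if it is meant only to boost or dampen.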
py_pdf_term.stylings.scores subpackage¶
- class py_pdf_term.stylings.scores.BaseStylingScore(page_candidates: PageCandidateTermList)¶
Bases:
object
Base class for styling scores. A styling score is expected to focus on a single styling feature, such as font size, font family, or font color. The score is calculated per page of a PDF file, not per domain of PDF files.
- Parameters:
page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.
- abstractmethod calculate_score(candidate: Term) float ¶
Calculate the styling score of a candidate term.
- Parameters:
candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.
- Returns:
The styling score of the candidate term.
- Return type:
float
- class py_pdf_term.stylings.scores.ColorScore(page_candidates: PageCandidateTermList)¶
Bases:
BaseStylingScore
Styling score for font color. The more rarely the color appears in the page, the higher the score is.
- Parameters:
page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.
- calculate_score(candidate: Term) float ¶
Calculate the styling score of a candidate term.
- Parameters:
candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.
- Returns:
The styling score of the candidate term.
- Return type:
float
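A rarity-based color score could look like the following. This is an assumption-laden sketch: it uses inverse relative frequency as the rarity measure, which satisfies "rarer color, higher score" but is not necessarily the formula ColorScore uses. The function name and the color strings are hypothetical.

```python
from collections import Counter


def color_rarity_scores(colors):
    """Return the inverse relative frequency of each color on a page."""
    counts = Counter(colors)
    total = len(colors)
    return {color: total / count for color, count in counts.items()}


# One entry per candidate term on the page (hypothetical color values).
page_colors = ["black", "black", "black", "red"]
scores = color_rarity_scores(page_colors)
```

Here the single red term scores higher than each of the three black terms.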
- class py_pdf_term.stylings.scores.FontsizeScore(page_candidates: PageCandidateTermList)¶
Bases:
BaseStylingScore
Styling score for font size. The larger the font size is, the higher the score is. The score is normalized by the mean and the standard deviation of font sizes in the page.
- Parameters:
page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.
- calculate_score(candidate: Term) float ¶
Calculate the styling score of a candidate term.
- Parameters:
candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.
- Returns:
The styling score of the candidate term.
- Return type:
float
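Normalizing by the mean and standard deviation of a page can be sketched as a standard z-score. How FontsizeScore maps the normalized value to its final score is not described here, so the code below shows only the normalization step, with a hypothetical function name.

```python
from statistics import mean, pstdev


def fontsize_z_scores(fontsizes):
    """Standardize font sizes by the page mean and standard deviation."""
    mu = mean(fontsizes)
    sigma = pstdev(fontsizes)
    if sigma == 0:
        # All candidates share one size, so none stands out.
        return [0.0 for _ in fontsizes]
    return [(size - mu) / sigma for size in fontsizes]


# Three body-text candidates and one heading-sized candidate.
z = fontsize_z_scores([10.0, 10.0, 10.0, 18.0])
```

Normalizing per page means a 18 pt term stands out on a page of 10 pt text but not on a page set entirely in 18 pt.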
py_pdf_term.techterms package¶
- class py_pdf_term.techterms.DomainTechnicalTermList(domain: str, pdfs: list[PDFTechnicalTermList])¶
Bases:
object
Domain name of PDF files and technical terms of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFTechnicalTermList]) – Technical terms of each PDF file of the domain.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- domain: str¶
- pdfs: list[PDFTechnicalTermList]¶
- class py_pdf_term.techterms.PDFTechnicalTermList(pdf_path: str, pages: list[PageTechnicalTermList])¶
Bases:
object
Path of a PDF file and technical terms of the PDF file.
- Parameters:
pdf_path (str) – Path of a PDF file.
pages (list[PageTechnicalTermList]) – Technical terms of each page of the PDF file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- pages: list[PageTechnicalTermList]¶
- class py_pdf_term.techterms.PageTechnicalTermList(page_num: int, terms: list[ScoredTerm])¶
Bases:
object
Page number and technical terms of the page.
- Parameters:
page_num (int) – Page number of a PDF file.
terms (list[ScoredTerm]) – Technical terms of the page.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- page_num: int¶
- terms: list[ScoredTerm]¶
- class py_pdf_term.techterms.TechnicalTermExtractor(max_num_terms: int = 10, acceptance_rate: float = 0.75)¶
Bases:
object
Technical term extractor based on ranking method scores and styling scores.
- Parameters:
max_num_terms – Maximum number of terms in a page of a PDF file to be extracted. The N-best candidates are extracted as technical terms. The default value is 10.
acceptance_rate – Acceptance rate of the ranking method scores. The candidates whose ranking method scores are lower than the acceptance rate are filtered out even if they are in the N-best candidates. The default value is 0.75.
- extract_from_domain(domain_candidates: DomainCandidateTermList, term_ranking: MethodTermRanking, domain_styling_scores: DomainStylingScoreList) DomainTechnicalTermList ¶
Extract technical terms in PDF files in a domain. The terms are sorted in appearance order, not in score order.
- Parameters:
domain_candidates – List of candidate terms in a domain. The target of extraction.
term_ranking – Ranking method scores for each candidate term in a domain.
domain_styling_scores – Styling scores for each candidate term in a domain.
- Returns:
List of technical terms in PDF files in a domain. The terms are sorted in appearance order, not in score order.
- Return type:
DomainTechnicalTermList
- extract_from_pdf(pdf_candidates: PDFCandidateTermList, term_ranking: MethodTermRanking, pdf_styling_scores: PDFStylingScoreList) PDFTechnicalTermList ¶
Extract technical terms in a PDF file. The terms are sorted in appearance order, not in score order.
- Parameters:
pdf_candidates – List of candidate terms in a PDF file. The target of extraction.
term_ranking – Ranking method scores for each candidate term in a domain.
pdf_styling_scores – Styling scores for each candidate term in a PDF file.
- Returns:
List of technical terms in a PDF file. The terms are sorted in appearance order, not in score order.
- Return type:
PDFTechnicalTermList
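The interplay of max_num_terms, acceptance_rate and appearance order can be sketched as below. This is a literal reading of the parameter descriptions, not the library's code. It assumes method scores are normalized to the range 0 to 1, that acceptance_rate is applied directly as a threshold on the method score, and that the method and styling scores are combined by multiplication. The function select_terms is hypothetical.

```python
def select_terms(candidates, method_scores, styling_scores,
                 max_num_terms=10, acceptance_rate=0.75):
    """Pick the N best candidates of a page, then restore appearance order.

    `candidates` lists terms in the order they appear on the page.
    """
    scored = []
    for position, term in enumerate(candidates):
        method = method_scores.get(term, 0.0)
        if method < acceptance_rate:
            # Rejected regardless of how well the styling score ranks it.
            continue
        combined = method * styling_scores.get(term, 1.0)
        scored.append((position, term, combined))
    # N-best by combined score ...
    best = sorted(scored, key=lambda item: item[2], reverse=True)[:max_num_terms]
    # ... reported in appearance order, not score order.
    return [term for _, term, _ in sorted(best, key=lambda item: item[0])]


terms = select_terms(
    candidates=["slide", "parser", "token", "lemma"],
    method_scores={"slide": 0.2, "parser": 0.9, "token": 0.8, "lemma": 0.95},
    styling_scores={"slide": 3.0, "parser": 1.0, "token": 2.0, "lemma": 1.0},
    max_num_terms=2,
)
```

In this example "slide" is dropped despite its high styling score because its method score is below the threshold, and the two survivors are returned in page order.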
py_pdf_term.endtoend package¶
- class py_pdf_term.endtoend.DomainPDFList(domain: str, pdf_paths: list[str])¶
Bases:
object
Domain name and PDF file paths of the domain.
- Parameters:
domain (str) – Domain name. (e.g., “natural language processing”)
pdf_paths (list[str]) – PDF file paths of the domain.
- classmethod validate(domain_pdfs: DomainPDFList) None ¶
- domain: str¶
- pdf_paths: list[str]¶
- class py_pdf_term.endtoend.PDFTechnicalTermList(pdf_path: str, pages: list[PageTechnicalTermList])¶
Bases:
object
Path of a PDF file and technical terms of the PDF file.
- Parameters:
pdf_path (str) – Path of a PDF file.
pages (list[PageTechnicalTermList]) – Technical terms of each page of the PDF file.
- classmethod from_dict(obj: dict[str, Any]) Self ¶
- to_dict() dict[str, Any] ¶
- pdf_path: str¶
- pages: list[PageTechnicalTermList]¶
- class py_pdf_term.endtoend.PyPDFTermMultiDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: MultiDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: MultiDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶
Bases:
object
Top level class of py-pdf-term. This class extracts technical terms from a PDF file with cross-domain information.
- Parameters:
xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.open_bin to a function to open an input PDF file in the binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, sub-terms of it could also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling such as color, fontsize and so on. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The xml cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string naming where cache files are stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.
- extract(domain: str, pdf_path: str, multi_domain_pdfs: list[DomainPDFList]) PDFTechnicalTermList ¶
Extract technical terms from a PDF file.
- Parameters:
domain – Domain name which the input PDF file belongs to. This may be the name of a course, the name of a technical field, or similar.
pdf_path – Path-like string to the input PDF file from which terminology is to be extracted. The file MUST belong to domain.
multi_domain_pdfs – List of path-like strings to the PDF files, classified by domain. There MUST be an element in multi_domain_pdfs whose domain equals domain.
- Returns:
Terminology list per page from the input PDF file.
- Return type:
PDFTechnicalTermList
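The two MUST conditions on the extract arguments can be checked up front. The sketch below is a standalone illustration with a hypothetical stand-in class, DomainPDFListSketch, and a hypothetical helper, check_extract_arguments. Whether the library itself raises ValueError in these cases is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainPDFListSketch:  # hypothetical stand-in for DomainPDFList
    domain: str
    pdf_paths: list[str]


def check_extract_arguments(domain, pdf_path, multi_domain_pdfs):
    """Raise ValueError unless `pdf_path` is listed under `domain`."""
    matching = [d for d in multi_domain_pdfs if d.domain == domain]
    if not matching:
        raise ValueError(f"no entry for domain {domain!r} in multi_domain_pdfs")
    if not any(pdf_path in d.pdf_paths for d in matching):
        raise ValueError(f"{pdf_path!r} does not belong to domain {domain!r}")


corpus = [
    DomainPDFListSketch("nlp", ["nlp/lecture1.pdf", "nlp/lecture2.pdf"]),
    DomainPDFListSketch("databases", ["db/lecture1.pdf"]),
]
check_extract_arguments("nlp", "nlp/lecture2.pdf", corpus)  # passes silently
```

Running such a check before extraction surfaces a mislabeled domain early, before any PDF is parsed.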
- class py_pdf_term.endtoend.PyPDFTermSingleDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: SingleDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: SingleDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶
Bases:
object
Top level class of py-pdf-term. This class extracts technical terms from a PDF file without cross-domain information.
- Parameters:
xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.open_bin to a function to open an input PDF file in the binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, sub-terms of it could also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling such as color, fontsize and so on. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The xml cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string naming where cache files are stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.
- extract(pdf_path: str, domain_pdfs: DomainPDFList) PDFTechnicalTermList ¶
Extract technical terms from a PDF file.
- Parameters:
pdf_path – Path-like string to the input PDF file from which terminology is to be extracted. The file MUST belong to the domain of domain_pdfs.
domain_pdfs – Domain name and path-like strings to the PDF files that belong to that domain.
- Returns:
Terminology list per page from the input PDF file.
- Return type:
PDFTechnicalTermList
py_pdf_term.endtoend.caches subpackage¶
- class py_pdf_term.endtoend.caches.BaseCandidateLayerCache(cache_dir: str)¶
Bases:
object
Base class for candidate layer caches. A candidate layer cache is expected to store and load candidate terms per PDF file.
- Parameters:
cache_dir – Directory path to store cache files.
- abstractmethod load(pdf_path: str, config: CandidateLayerConfig) PDFCandidateTermList | None ¶
Load candidate terms from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load candidate terms.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- Returns:
Loaded candidate terms. If there is no cache file, this method returns None.
- Return type:
PDFCandidateTermList | None
- abstractmethod remove(pdf_path: str, config: CandidateLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- abstractmethod store(candidates: PDFCandidateTermList, config: CandidateLayerConfig) None ¶
Store candidate terms to a cache file.
- Parameters:
candidates – Candidate terms to store.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
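Since the configuration is used to determine the cache file path, a subclass needs some deterministic mapping from (PDF path, config) to a file. One possible scheme is sketched below with plain functions and a JSON payload. The hashing scheme, the file naming, the config keys, and the function signatures are all assumptions for illustration and differ from the abstract methods above.

```python
import hashlib
import json
import os
import tempfile


def cache_file_path(cache_dir, pdf_path, config):
    """Derive a deterministic cache file path from the PDF path and config."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps({"pdf_path": pdf_path, "config": config}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.json")


def store(cache_dir, pdf_path, config, candidates):
    path = cache_file_path(cache_dir, pdf_path, config)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(candidates, f)


def load(cache_dir, pdf_path, config):
    path = cache_file_path(cache_dir, pdf_path, config)
    if not os.path.isfile(path):
        return None  # cache miss
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def remove(cache_dir, pdf_path, config):
    path = cache_file_path(cache_dir, pdf_path, config)
    if os.path.isfile(path):
        os.remove(path)


with tempfile.TemporaryDirectory() as cache_dir:
    config = {"lang_tokenizers": ["en"], "max_term_length": 8}  # hypothetical keys
    store(cache_dir, "slides.pdf", config, {"pages": [["parser", "token"]]})
    hit = load(cache_dir, "slides.pdf", config)
    miss = load(cache_dir, "slides.pdf", {**config, "max_term_length": 4})
    remove(cache_dir, "slides.pdf", config)
    after_remove = load(cache_dir, "slides.pdf", config)
```

Including the config in the key means that changing any setting yields a cache miss instead of stale results.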
- class py_pdf_term.endtoend.caches.BaseMethodLayerDataCache(cache_dir: str)¶
Bases:
Generic
Base class for method layer data caches. A method layer data cache is expected to store and load metadata to generate term rankings per domain of PDF files.
- Parameters:
cache_dir – Directory path to store cache files.
- abstractmethod load(pdf_paths: list[str], config: BaseMethodLayerConfig, from_dict: Callable[[dict[str, Any]], RankingData]) RankingData | None ¶
Load metadata to generate term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load metadata. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
from_dict – Function to convert a dictionary to a RankingData object.
- Returns:
Loaded metadata to generate term rankings. If there is no cache file, this method returns None. The returned metadata is converted to a RankingData object by the from_dict function.
- Return type:
RankingData | None
- abstractmethod remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- abstractmethod store(pdf_paths: list[str], ranking_data: RankingData, config: BaseMethodLayerConfig) None ¶
Store metadata to generate term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store metadata. The order of the paths is important. The order should be the same as that when the load method is called.
ranking_data – Metadata to generate term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
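The requirement that pdf_paths keep the same order across store, load and remove is what one would expect if the cache key is derived from the ordered list. The sketch below demonstrates that effect with a hypothetical key function. The actual key derivation in the library may differ.

```python
import hashlib
import json


def domain_cache_key(pdf_paths, config):
    """Hash an ordered list of PDF paths together with the config."""
    payload = json.dumps(
        {"pdf_paths": list(pdf_paths), "config": config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = {"method": "tfidf"}  # hypothetical config content
stored_key = domain_cache_key(["a.pdf", "b.pdf"], config)
same_order = domain_cache_key(["a.pdf", "b.pdf"], config)
swapped = domain_cache_key(["b.pdf", "a.pdf"], config)
```

Under such a scheme, swapping two paths produces a different key and therefore a cache miss. Callers that cannot guarantee a stable order could sort the paths before every call.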
- class py_pdf_term.endtoend.caches.BaseMethodLayerRankingCache(cache_dir: str)¶
Bases:
object
Base class for method layer ranking caches. A method layer ranking cache is expected to store and load term rankings per domain of PDF files.
- Parameters:
cache_dir – Directory path to store cache files.
- abstractmethod load(pdf_paths: list[str], config: BaseMethodLayerConfig) MethodTermRanking | None ¶
Load term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load term rankings. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- Returns:
Loaded term rankings. If there is no cache file, this method returns None.
- Return type:
MethodTermRanking | None
- abstractmethod remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- abstractmethod store(pdf_paths: list[str], term_ranking: MethodTermRanking, config: BaseMethodLayerConfig) None ¶
Store term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store term rankings. The order of the paths is important. The order should be the same as that when the load method is called.
term_ranking – Term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.BaseStylingLayerCache(cache_dir: str)¶
Bases:
object
Base class for styling layer caches. A styling layer cache is expected to store and load styling scores per PDF file.
- Parameters:
cache_dir – Directory path to store cache files.
- abstractmethod load(pdf_path: str, config: StylingLayerConfig) PDFStylingScoreList | None ¶
Load styling scores from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load styling scores.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- Returns:
Loaded styling scores. If there is no cache file, this method returns None.
- Return type:
PDFStylingScoreList | None
- abstractmethod remove(pdf_path: str, config: StylingLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- abstractmethod store(styling_scores: PDFStylingScoreList, config: StylingLayerConfig) None ¶
Store styling scores to a cache file.
- Parameters:
styling_scores – Styling scores to store.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.BaseXMLLayerCache(cache_dir: str)¶
Bases:
object
Base class for XML layer caches. An XML layer cache is expected to store and load XML elements per PDF file.
- Parameters:
cache_dir – Directory path to store cache files.
- abstractmethod load(pdf_path: str, config: XMLLayerConfig) PDFnXMLElement | None ¶
Load XML elements from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load XML elements.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- Returns:
Loaded XML elements. If there is no cache file, this method returns None.
- Return type:
PDFnXMLElement | None
- abstractmethod remove(pdf_path: str, config: XMLLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- abstractmethod store(pdfnxml: PDFnXMLElement, config: XMLLayerConfig) None ¶
Store XML elements to a cache file.
- Parameters:
pdfnxml – The XML elements to store.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.CandidateLayerFileCache(cache_dir: str)¶
Bases:
BaseCandidateLayerCache
Candidate layer cache that stores and loads candidate terms to/from a file.
- Parameters:
cache_dir – Directory path to store cache files.
- load(pdf_path: str, config: CandidateLayerConfig) PDFCandidateTermList | None ¶
Load candidate terms from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load candidate terms.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- Returns:
Loaded candidate terms. If there is no cache file, this method returns None.
- Return type:
PDFCandidateTermList | None
- remove(pdf_path: str, config: CandidateLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- store(candidates: PDFCandidateTermList, config: CandidateLayerConfig) None ¶
Store candidate terms to a cache file.
- Parameters:
candidates – Candidate terms to store.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.CandidateLayerNoCache(cache_dir: str)¶
Bases:
BaseCandidateLayerCache
Candidate layer cache that neither stores nor loads candidate terms.
- Parameters:
cache_dir – This argument is ignored.
- load(pdf_path: str, config: CandidateLayerConfig) PDFCandidateTermList | None ¶
Load candidate terms from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load candidate terms.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- Returns:
Loaded candidate terms. If there is no cache file, this method returns None.
- Return type:
PDFCandidateTermList | None
- remove(pdf_path: str, config: CandidateLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
- store(candidates: PDFCandidateTermList, config: CandidateLayerConfig) None ¶
Store candidate terms to a cache file.
- Parameters:
candidates – Candidate terms to store.
config – Configuration for the candidate layer. The configuration is used to determine the cache file path.
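A cache that neither stores nor loads follows the null-object pattern: it keeps the same interface so calling code needs no special case. The sketch below uses hypothetical names (NoCacheSketch, get_candidates) and assumes, in line with the class description, that load always reports a miss.

```python
class NoCacheSketch:
    """Cache with the usual interface that never keeps anything."""

    def __init__(self, cache_dir):
        # Accepted for interface compatibility, then ignored.
        pass

    def load(self, pdf_path, config):
        return None  # always a cache miss

    def store(self, candidates, config):
        pass  # nothing is written

    def remove(self, pdf_path, config):
        pass  # nothing to delete


def get_candidates(cache, pdf_path, config, compute):
    """Return cached candidates, computing and storing them on a miss."""
    cached = cache.load(pdf_path, config)
    if cached is not None:
        return cached
    candidates = compute(pdf_path)
    cache.store(candidates, config)
    return candidates


calls = []
cache = NoCacheSketch("ignored")
for _ in range(2):
    # The lambda records each call, then returns a fixed candidate list.
    get_candidates(cache, "slides.pdf", {}, lambda p: calls.append(p) or ["parser"])
```

With the no-op cache, the computation runs on every request, which is useful for debugging or for read-only environments.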
- class py_pdf_term.endtoend.caches.MethodLayerDataFileCache(cache_dir: str)¶
Bases:
BaseMethodLayerDataCache, Generic
Method layer data cache that stores and loads metadata to generate term rankings to/from a file.
- Parameters:
cache_dir – Directory path to store cache files.
- load(pdf_paths: list[str], config: BaseMethodLayerConfig, from_dict: Callable[[dict[str, Any]], RankingData]) RankingData | None ¶
Load metadata to generate term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load metadata. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
from_dict – Function to convert a dictionary to a RankingData object.
- Returns:
Loaded metadata to generate term rankings. If there is no cache file, this method returns None. The returned metadata is converted to a RankingData object by the from_dict function.
- Return type:
RankingData | None
- remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method was called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- store(pdf_paths: list[str], ranking_data: RankingData, config: BaseMethodLayerConfig) None ¶
Store metadata to generate term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store metadata. The order of the paths is important. The order should be the same as that when the load method is called.
ranking_data – Metadata to generate term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
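The role of the `from_dict` argument can be illustrated with a self-contained round trip. This is a sketch under stated assumptions, not the library's implementation: `RankingDataSketch`, `load_sketch`, the file name `data.json`, and the use of JSON as the on-disk format are all hypothetical.

```python
from __future__ import annotations

import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class RankingDataSketch:
    """Hypothetical ranking metadata; not a py_pdf_term class."""

    domain: str
    term_freq: dict[str, int]

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> RankingDataSketch:
        return cls(domain=obj["domain"], term_freq=obj["term_freq"])


def load_sketch(
    path: str, from_dict: Callable[[dict[str, Any]], RankingDataSketch]
) -> Optional[RankingDataSketch]:
    # A missing file is a cache miss, reported as None.
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as f:
        # The caller-supplied function turns the raw dictionary into an object.
        return from_dict(json.load(f))


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "data.json")  # assumed file name
    miss = load_sketch(path, RankingDataSketch.from_dict)

    original = RankingDataSketch(domain="nlp", term_freq={"tokenizer": 3})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(original), f)

    hit = load_sketch(path, RankingDataSketch.from_dict)
```

Passing the converter as an argument lets one cache class serve several metadata types, since the cache itself only handles dictionaries.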
- class py_pdf_term.endtoend.caches.MethodLayerDataNoCache(cache_dir: str)¶
Bases:
BaseMethodLayerDataCache, Generic
Method layer data cache that neither stores nor loads metadata to generate term rankings.
- Parameters:
cache_dir – This argument is ignored.
- load(pdf_paths: list[str], config: BaseMethodLayerConfig, from_dict: Callable[[dict[str, Any]], RankingData]) RankingData | None ¶
Load metadata to generate term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load metadata. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
from_dict – Function to convert a dictionary to a RankingData object.
- Returns:
Loaded metadata to generate term rankings. If there is no cache file, this method returns None. The returned metadata is converted to a RankingData object by the from_dict function.
- Return type:
RankingData | None
- remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method was called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- store(pdf_paths: list[str], ranking_data: RankingData, config: BaseMethodLayerConfig) None ¶
Store metadata to generate term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store metadata. The order of the paths is important. The order should be the same as that when the load method is called.
ranking_data – Metadata to generate term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.MethodLayerRankingFileCache(cache_dir: str)¶
Bases:
BaseMethodLayerRankingCache
Method layer ranking cache that stores and loads term rankings to/from a file.
- Parameters:
cache_dir – Directory path to store cache files.
- load(pdf_paths: list[str], config: BaseMethodLayerConfig) MethodTermRanking | None ¶
Load term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load term rankings. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- Returns:
Loaded term rankings. If there is no cache file, this method returns None.
- Return type:
MethodTermRanking | None
- remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method was called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- store(pdf_paths: list[str], term_ranking: MethodTermRanking, config: BaseMethodLayerConfig) None ¶
Store term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store term rankings. The order of the paths is important. The order should be the same as that when the load method is called.
term_ranking – Term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
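The reference states that the order of `pdf_paths` matters and that the configuration determines the cache file path, but not how the path is derived. One way such order sensitivity can arise is when the file name is a hash of the ordered path list plus the configuration. The sketch below illustrates that idea only; `cache_key_sketch` and the dictionary-shaped `config` are hypothetical and are not the library's scheme.

```python
import hashlib
import json


def cache_key_sketch(pdf_paths: list[str], config: dict[str, str]) -> str:
    # Hypothetical key: hash the path list (order preserved) plus the config.
    # sort_keys only normalises dictionary keys; list order is kept as given.
    payload = json.dumps({"paths": pdf_paths, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = {"method": "tfidf"}  # assumed, illustrative configuration
key_stored = cache_key_sketch(["a.pdf", "b.pdf"], config)
key_same_order = cache_key_sketch(["a.pdf", "b.pdf"], config)
key_reordered = cache_key_sketch(["b.pdf", "a.pdf"], config)
```

Under this assumption, `key_stored` and `key_same_order` are equal while `key_reordered` differs, so a lookup with reordered paths would miss the entry written by `store`. Keeping the path order stable in calling code (for example, by sorting once up front) avoids that.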
- class py_pdf_term.endtoend.caches.MethodLayerRankingNoCache(cache_dir: str)¶
Bases:
BaseMethodLayerRankingCache
Method layer ranking cache that neither stores nor loads term rankings.
- Parameters:
cache_dir – This argument is ignored.
- load(pdf_paths: list[str], config: BaseMethodLayerConfig) MethodTermRanking | None ¶
Load term rankings from a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to load term rankings. The order of the paths is important. The order should be the same as that when the store method is called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- Returns:
Loaded term rankings. If there is no cache file, this method returns None.
- Return type:
MethodTermRanking | None
- remove(pdf_paths: list[str], config: BaseMethodLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to remove cache files. The order of the paths is important. The order should be the same as that when the store method was called.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- store(pdf_paths: list[str], term_ranking: MethodTermRanking, config: BaseMethodLayerConfig) None ¶
Store term rankings to a cache file.
- Parameters:
pdf_paths – Paths to PDF files in a domain to store term rankings. The order of the paths is important. The order should be the same as that when the load method is called.
term_ranking – Term rankings to store.
config – Configuration for the method layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.StylingLayerFileCache(cache_dir: str)¶
Bases:
BaseStylingLayerCache
Styling layer cache that stores and loads styling scores to/from a file.
- Parameters:
cache_dir – Directory path to store cache files.
- load(pdf_path: str, config: StylingLayerConfig) PDFStylingScoreList | None ¶
Load styling scores from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load styling scores.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- Returns:
Loaded styling scores. If there is no cache file, this method returns None.
- Return type:
PDFStylingScoreList | None
- remove(pdf_path: str, config: StylingLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- store(styling_scores: PDFStylingScoreList, config: StylingLayerConfig) None ¶
Store styling scores to a cache file.
- Parameters:
styling_scores – Styling scores to store.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
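Because `load` returns `None` on a miss, callers can follow a load-or-compute pattern: try the cache, compute on a miss, then store the result. The following is a minimal sketch of that pattern with an in-memory stand-in. `DictCacheSketch`, `load_or_compute`, and `fake_scores` are hypothetical names, and plain lists of floats stand in for styling scores.

```python
from __future__ import annotations

from typing import Callable


class DictCacheSketch:
    """Hypothetical in-memory cache with a load/store/remove shape."""

    def __init__(self) -> None:
        self._items: dict[str, list[float]] = {}

    def load(self, pdf_path: str) -> list[float] | None:
        # Returns None when nothing has been stored for this path.
        return self._items.get(pdf_path)

    def store(self, pdf_path: str, scores: list[float]) -> None:
        self._items[pdf_path] = scores

    def remove(self, pdf_path: str) -> None:
        # Removing a missing entry is harmless.
        self._items.pop(pdf_path, None)


def load_or_compute(
    cache: DictCacheSketch,
    pdf_path: str,
    compute: Callable[[str], list[float]],
) -> list[float]:
    cached = cache.load(pdf_path)
    if cached is not None:
        return cached
    # Cache miss: do the expensive work once, then remember the result.
    scores = compute(pdf_path)
    cache.store(pdf_path, scores)
    return scores


calls: list[str] = []


def fake_scores(pdf_path: str) -> list[float]:
    # Records each invocation so repeated work is visible.
    calls.append(pdf_path)
    return [0.5, 1.0]


cache = DictCacheSketch()
first = load_or_compute(cache, "slides.pdf", fake_scores)
second = load_or_compute(cache, "slides.pdf", fake_scores)  # served from cache
cache.remove("slides.pdf")
after_remove = cache.load("slides.pdf")
```

Here `fake_scores` runs only once for the two lookups, and after `remove` the next `load` reports a miss again.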
- class py_pdf_term.endtoend.caches.StylingLayerNoCache(cache_dir: str)¶
Bases:
BaseStylingLayerCache
Styling layer cache that neither stores nor loads styling scores.
- Parameters:
cache_dir – This argument is ignored.
- load(pdf_path: str, config: StylingLayerConfig) PDFStylingScoreList | None ¶
Load styling scores from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load styling scores.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- Returns:
Loaded styling scores. If there is no cache file, this method returns None.
- Return type:
PDFStylingScoreList | None
- remove(pdf_path: str, config: StylingLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- store(styling_scores: PDFStylingScoreList, config: StylingLayerConfig) None ¶
Store styling scores to a cache file.
- Parameters:
styling_scores – Styling scores to store.
config – Configuration for the styling layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.XMLLayerFileCache(cache_dir: str)¶
Bases:
BaseXMLLayerCache
An XML layer cache that stores and loads XML elements to/from a file.
- Parameters:
cache_dir – Directory path to store cache files.
- load(pdf_path: str, config: XMLLayerConfig) PDFnXMLElement | None ¶
Load XML elements from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load XML elements.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- Returns:
Loaded XML elements. If there is no cache file, this method returns None.
- Return type:
PDFnXMLElement | None
- remove(pdf_path: str, config: XMLLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- store(pdfnxml: PDFnXMLElement, config: XMLLayerConfig) None ¶
Store XML elements to a cache file.
- Parameters:
pdfnxml – The XML elements to store.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- class py_pdf_term.endtoend.caches.XMLLayerNoCache(cache_dir: str)¶
Bases:
BaseXMLLayerCache
An XML layer cache that neither stores nor loads XML elements.
- Parameters:
cache_dir – This argument is ignored.
- load(pdf_path: str, config: XMLLayerConfig) PDFnXMLElement | None ¶
Load XML elements from a cache file.
- Parameters:
pdf_path – Path to a PDF file to load XML elements.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- Returns:
Loaded XML elements. If there is no cache file, this method returns None.
- Return type:
PDFnXMLElement | None
- remove(pdf_path: str, config: XMLLayerConfig) None ¶
Remove a cache file.
- Parameters:
pdf_path – Path to a PDF file to remove a cache file.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.
- store(pdfnxml: PDFnXMLElement, config: XMLLayerConfig) None ¶
Store XML elements to a cache file.
- Parameters:
pdfnxml – The XML elements to store.
config – Configuration for the XML layer. The configuration is used to determine the cache file path.