04/16/2021
By Karen Volis

The Kennedy College of Sciences, Department of Computer Science, invites you to attend a doctoral dissertation defense by Hao Zhang on "Unsupervised Contextual Network Ranking of Sentences and Supervised Boilerplate Detection of Complex Webpages."

Ph.D. Candidate: Hao Zhang
Date: Monday, April 26, 2021
Time: 9 a.m. EDT
Location: This will be a virtual defense via Zoom.

Committee Chair (Advisor): Jie Wang, Professor, Computer Science Department, University of Massachusetts Lowell

Committee Members:

  • Steve Homer, Professor, Department of Computer Science, Boston University
  • Karen (Jingrong) Lin, Associate Professor, Accounting Department, University of Massachusetts Lowell
  • Benyuan Liu, Professor, Department of Computer Science, University of Massachusetts Lowell

Abstract:
Sentence ranking over a given document is an important task in text mining. It facilitates hierarchical reading, in which the reader reads a layer of the most important sentences first, then subsequent layers of the next most important sentences, until the entire document is read. Moreover, sentence ranking can be applied in various text mining tasks, including text retrieval and extractive summarization.

In this dissertation, we present new algorithms that outperform SummBank benchmarks. We first present Semantic Sentence Rank (SSR), which uses semantic features that can be computed readily for a given language. In particular, SSR extracts content words and phrases from a text document and uses semantic measures to construct, respectively, a semantic phrase graph over phrases and words, and a semantic sentence graph over sentences. It then scores the phrase graph and sentence graph separately based on the text document's article structure. SSR ranks sentences based on their scores and topic diversity through semantic subtopic clustering. To achieve higher accuracy, we devise Contextual Network Rank (CNR), which integrates word-level and sentence-level information into one contextual network and assigns edge weights by combining syntactic and semantic information. We score each node on the contextual network with respect to the underlying article structure, and then score each sentence by adding the scores of its corresponding nodes, normalized by a BM25 normalizer. Finally, we rank sentences based on topic analysis and a bi-objective 0-1 knapsack maximization problem. We implement CNR using the T5 neural network, dependency trees, PageRank, Affinity Propagation, and dynamic programming. We show via numerical analysis that CNR outperforms SSR, which in turn outperforms previous state-of-the-art unsupervised models. Moreover, CNR outperforms the combined ranking of all human judges on the SummBank benchmarks in all categories. CNR also achieves state-of-the-art results on the DUC-02 benchmarks.
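For readers unfamiliar with the building blocks named above, the following is a minimal illustrative sketch, not the dissertation's implementation. It assumes a much simpler setting: edge weights are plain word-overlap (Jaccard) scores rather than the syntactic and semantic weights of CNR, sentences are scored with a basic PageRank power iteration, and selection is a single-objective 0-1 knapsack solved by dynamic programming (the abstract describes a bi-objective version). The function names `overlap_graph`, `pagerank`, and `knapsack_select` are hypothetical; T5, dependency trees, BM25 normalization, and Affinity Propagation are omitted.

```python
from typing import List


def overlap_graph(sentences: List[str]) -> List[List[float]]:
    """Build a weighted sentence graph.

    Assumption: edge weight = Jaccard overlap of lowercase word sets,
    a stand-in for the semantic/syntactic weights named in the abstract.
    """
    bags = [set(s.lower().split()) for s in sentences]
    n = len(bags)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = bags[i] | bags[j]
            if i != j and union:
                w[i][j] = len(bags[i] & bags[j]) / len(union)
    return w


def pagerank(w: List[List[float]], damping: float = 0.85,
             iters: int = 100) -> List[float]:
    """Weighted PageRank by power iteration; scores sum to 1."""
    n = len(w)
    if n == 0:
        return []
    out = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out[i] == 0.0:
                # Dangling node: spread its mass evenly over all nodes.
                share = damping * scores[i] / n
                for j in range(n):
                    new[j] += share
            else:
                for j in range(n):
                    new[j] += damping * scores[i] * w[i][j] / out[i]
        scores = new
    return scores


def knapsack_select(scores: List[float], lengths: List[int],
                    budget: int) -> List[int]:
    """0-1 knapsack by dynamic programming.

    Maximizes the total score of the chosen sentences subject to the
    total length (e.g. in words) not exceeding `budget`. Returns the
    chosen sentence indices in document order.
    """
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                cand = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # Walk the table backwards to recover which items were taken.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)
```

In this sketch, one "layer" of a hierarchical reading could be obtained by calling `knapsack_select` with the PageRank scores, the sentence lengths in words, and a word budget for that layer.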

Applying layer reading to webpages calls for extracting the main text from webpages with complex layouts and irrelevant content. For this purpose, we devise SemText, a hierarchical neural network model for context extraction that does not use the handcrafted features required by existing methods. In particular, we present a universal representation of the text contained in an HTML file as a sequence of three-word strings. SemText consists of a depth-wise convolutional feature-extraction model to encode a 3-tuple of word strings into a feature map, and a neural sequence labeling model to classify text blocks by cultivating the semantic meanings registered inside feature maps and between them. We train SemText on three published datasets of news webpages and fine-tune it using the small amount of development data in CleanEval and GoogleTrends-2017. We show that SemText achieves state-of-the-art accuracy on these datasets. We further demonstrate the robustness of SemText by showing that it also detects boilerplate effectively on out-of-domain community-based Q&A webpages.

All interested students and faculty members are invited to attend the defense via remote access.