Classification of Linked Data Sources Using Semantic Scoring
Yumusak, Semih and Dogdu, Erdogan and Kodaz, Halife
Abstract
Linked data sets are created using Semantic Web technologies; they are
usually large, and their number is growing. Query execution is therefore
costly, and knowing the content of such data sets should help in
targeted querying. Our aim in this paper is to
classify linked data sets by their knowledge content. Earlier projects
such as LOD Cloud, LODStats, and SPARQLES analyze linked data sources in
terms of content, availability and infrastructure. In these projects,
linked data sets are classified and tagged principally using the VoID
vocabulary and analyzed according to their content, availability and
infrastructure. Although all linked data sources listed in these
projects appear to be classified or tagged, there are a limited number
of studies on automated tagging and classification of newly arriving
linked data sets. Here, we focus on automated classification of linked
data sets using semantic scoring methods. We have collected the SPARQL
endpoints of 1,328 unique linked datasets from Datahub, LOD Cloud,
LODStats, SPARQLES, and SpEnD projects. We then queried the textual
descriptions of resources in these data sets through their rdfs:comment
and rdfs:label property values. We analyzed these texts with document
analysis techniques, treating every SPARQL endpoint as a separate
document. To this end, we used the WordNet
semantic relations library combined with an adapted term
frequency-inverse document frequency (tf-idf) analysis on the words and
their semantic neighbours. From the WordNet database, we extracted
information about comment/label objects in linked data sources by using
hypernym, hyponym, holonym, meronym, region, topic and usage semantic
relations. We obtained significant results with the hypernym and topic
semantic relations: they yield words that identify data sets, and these
words can be used in automatic classification and tagging of linked data
sources. Using these words, we experimented with different classifiers
and different scoring methods, which improved classification accuracy.
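
The idea of treating every SPARQL endpoint as a separate document can be illustrated with a minimal sketch. It assumes the plain textbook tf-idf formula, not the adapted variant used in the paper, and it leaves out the WordNet expansion to semantic neighbours. The function names, endpoint names and sample label/comment strings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a label/comment string and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(endpoint_texts):
    """Score words per endpoint, treating each endpoint as one document.

    endpoint_texts maps an endpoint name to the list of rdfs:label and
    rdfs:comment strings collected from it (hypothetical input shape).
    Returns {endpoint: {word: score}} using standard tf * log(N / df).
    """
    # Term counts per endpoint (one "document" per endpoint).
    counts = {
        name: Counter(word for text in texts for word in tokenize(text))
        for name, texts in endpoint_texts.items()
    }
    n_docs = len(counts)

    # Document frequency: in how many endpoints does each word occur?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in term_counts.items()
        }
    return scores


# Hypothetical sample data standing in for queried label/comment values.
sample = {
    "endpoint_a": ["protein gene data", "gene sequence"],
    "endpoint_b": ["city river data", "city population"],
}
scores = tfidf_scores(sample)
top_word_a = max(scores["endpoint_a"], key=scores["endpoint_a"].get)
```

Words that occur in every endpoint receive a score of zero under this formula, so the highest-scoring words are those that set one endpoint apart from the others, which is the property a tagging or classification step would rely on.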