Classification of Linked Data Sources Using Semantic Scoring

YUMUŞAK, Semih; DOĞDU, Erdoğan; KODAZ, Halife

Author(s) YUMUŞAK, Semih
DOĞDU, Erdoğan
KODAZ, Halife
Publication Type Article
Publication Date 2018
Permanent URI https://hdl.handle.net/20.500.12498/1040

Linked datasets are created using Semantic Web technologies; they are usually large, and their number is growing. Query execution is therefore costly, and knowing the content of such datasets should help in targeted querying. Our aim in this paper is to classify linked datasets by their knowledge content. Earlier projects such as LOD Cloud, LODStats, and SPARQLES analyze linked data sources in terms of content, availability, and infrastructure. In these projects, linked datasets are classified and tagged principally using the VoID vocabulary. Although all linked data sources listed in these projects appear to be classified or tagged, there are only a limited number of studies on automated tagging and classification of newly arriving linked datasets. Here, we focus on automated classification of linked datasets using semantic scoring methods. We collected the SPARQL endpoints of 1,328 unique linked datasets from the Datahub, LOD Cloud, LODStats, SPARQLES, and SpEnD projects. We then queried textual descriptions of resources in these datasets using their rdfs:comment and rdfs:label property values. We analyzed these texts with document analysis techniques, treating every SPARQL endpoint as a separate document. For this purpose, we used the WordNet semantic relations library combined with an adapted term frequency-inverse document frequency (tf-idf) analysis on the words and their semantic neighbours. Using the WordNet database, we extracted information about comment/label objects in linked data sources through the hypernym, hyponym, homonym, meronym, region, topic, and usage semantic relations. We obtained significant results for the hypernym and topic semantic relations: we can find words that identify datasets, and these words can be used in automatic classification and tagging of linked data sources. Using these words, we experimented with different classifiers and different scoring methods, which resulted in better classification accuracy.
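
The idea of treating every SPARQL endpoint as one document and scoring its words together with their semantic neighbours can be illustrated with a minimal sketch. This is not the authors' adapted tf-idf, whose details the abstract does not give; it uses the textbook tf-idf formula. The names `TOY_HYPERNYMS`, `expand`, `tfidf`, and `top_terms`, the endpoint names, and the sample label/comment texts are all hypothetical, and the small dictionary merely stands in for real WordNet hypernym lookups.

```python
import math
import re
from collections import Counter

# Hypothetical toy hypernym map; an assumption standing in for WordNet lookups.
TOY_HYPERNYMS = {
    "protein": ["molecule"],
    "gene": ["sequence"],
    "river": ["stream"],
    "city": ["municipality"],
}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expand(tokens, hypernyms):
    """Append each token's hypernyms so semantic neighbours are counted too."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(hypernyms.get(tok, []))
    return expanded


def tfidf(docs):
    """Standard tf-idf. docs maps endpoint name -> token list."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(endpoint_texts, hypernyms=TOY_HYPERNYMS, k=3):
    """Return the k highest-scoring terms per endpoint (ties broken by name)."""
    docs = {
        name: expand(tokenize(" ".join(texts)), hypernyms)
        for name, texts in endpoint_texts.items()
    }
    ranked = {}
    for name, term_scores in tfidf(docs).items():
        ordered = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[name] = [term for term, _ in ordered[:k]]
    return ranked


# Hypothetical rdfs:label / rdfs:comment values, one list per endpoint.
example = {
    "bio-endpoint": ["Protein kinase", "Gene encoding a protein", "A data resource"],
    "geo-endpoint": ["River in Turkey", "City on a river", "A data resource"],
}
print(top_terms(example, k=2))
```

Words shared by every endpoint (here "data" and "resource") receive a score of zero, so only terms that distinguish one endpoint from the others rise to the top; adding hypernyms lets a more general word such as "molecule" act as a candidate tag even when it never appears in the raw text.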

Collections Faculties
Faculty of Engineering and Natural Sciences
Computer Engineering

Open Access

Title (dc.title)	Classification of Linked Data Sources Using Semantic Scoring
Publication Type (dc.type)	Article
Author(s) (dc.contributor.author)	YUMUŞAK, Semih
Author(s) (dc.contributor.author)	DOĞDU, Erdoğan
Author(s) (dc.contributor.author)	KODAZ, Halife
Citation Index (dc.source.database)	Wos
Citation Index (dc.source.database)	Scopus
Publication Date (dc.date.issued)	2018
Record Entry Date (dc.date.accessioned)	2019-07-10T08:17:12Z
Open Access Date (dc.date.available)	2019-07-10T08:17:12Z
ISSN (dc.identifier.issn)	1745-1361
Language (dc.language.iso)	en
Permanent URI (dc.identifier.uri)	https://hdl.handle.net/20.500.12498/1040

All resources on this site are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.