Semi-automated keyword assignment to variables in datasets
1 : Finnish Social Science Data Archive (FSD)
The Finnish Social Science Data Archive (FSD) has been exploring the possibility of assigning keywords to the variables and questions in its datasets. Such keywords could improve the discoverability of variables and help in building question banks. All datasets are documented using DDI Codebook and indexed at the study level with keywords selected from controlled vocabularies, but doing the same manually for more than 250,000 variables is not feasible. Automating at least part of the process is therefore necessary.
One way to tackle this task is Annif, a tool for automated subject indexing developed at the National Library of Finland. It implements several algorithms of two kinds: lexical algorithms, which match the words appearing in the document text to controlled vocabulary terms, and associative algorithms, which rely on statistical or machine learning methods that learn from manually indexed documents. Fusion approaches that combine different kinds of algorithms can also be used to improve performance.
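The difference between lexical, associative, and fusion approaches can be illustrated with a minimal Python sketch. This is a toy illustration only and does not reproduce Annif's algorithms or API. The function names, the example vocabulary, the example questions, and the simple score averaging used for fusion are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict

PUNCT = ".,;:?!()"

def tokenize(text):
    """Lowercase the text and split it into words, stripping punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def lexical_scores(text, vocabulary):
    """Lexical approach: score 1.0 for each term whose words all occur in the text."""
    words = set(tokenize(text))
    return {term: 1.0 for term in vocabulary if set(tokenize(term)) <= words}

def train_associative(indexed_docs):
    """Associative approach: count word/keyword co-occurrences in indexed documents."""
    model = defaultdict(Counter)
    for text, keywords in indexed_docs:
        for word in set(tokenize(text)):
            model[word].update(keywords)
    return model

def associative_scores(text, model):
    """Sum the co-occurrence counts for the words in the text, scaled to 0..1."""
    totals = Counter()
    for word in set(tokenize(text)):
        totals.update(model.get(word, {}))
    top = max(totals.values(), default=0)
    return {k: v / top for k, v in totals.items()} if top else {}

def fuse(*score_dicts):
    """Fusion (assumed here to be a plain average of the individual scores)."""
    terms = set().union(*score_dicts)
    return {t: sum(d.get(t, 0.0) for d in score_dicts) / len(score_dicts) for t in terms}

# Hypothetical vocabulary and manually indexed questions, invented for illustration.
vocabulary = ["trust", "political parties", "health"]
indexed_docs = [
    ("How much do you trust the parliament?", ["trust", "political institutions"]),
    ("How would you rate your health?", ["health"]),
]

question = "How much do you trust political parties?"
lexical = lexical_scores(question, vocabulary)
associative = associative_scores(question, train_associative(indexed_docs))
fused = fuse(lexical, associative)

# Print suggested keywords, highest fused score first.
for term, score in sorted(fused.items(), key=lambda item: -item[1]):
    print(f"{score:.2f}  {term}")
```

In this sketch the lexical scorer can only suggest terms that literally appear in the question text, while the associative scorer can also suggest keywords it has seen attached to similarly worded questions. The averaged scores favor keywords supported by both.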
This presentation will share the promising results of automated subject indexing of variables with Annif. Since variables have not yet been extensively indexed, study abstracts were also used to produce comparable results. How the DDI format could be used to improve the results will also be briefly discussed.