Studia Ekonomiczne. Zeszyty Naukowe Uniwersytetu Ekonomicznego w Katowicach ISSN 2083-8611 Nr 234 · 2015
Krzysztof Węcel
Uniwersytet Ekonomiczny w Poznaniu Katedra Informatyki Ekonomicznej krzysztof.wecel@ue.poznan.pl
LINKED GEODATA FOR PROFILING OF TELCO USERS
Summary: There is a growing interest in location-based profiling of users, defined as combining geo-data with anonymous on-line profiles. The profile of an entity usually consists of concepts accompanied by a weight specifying the relative importance of the given concept for making the analysed entity distinct. The proposed profiling method for telco users is a two-step approach. First, profiles of mobile tower stations (BTS) are created based on crowdsourced geographical information. Second, they are used to generalise the behaviour of a calling user, which is determined from Call Detail Records (CDR). The linked data cloud is considered as an additional knowledge source in the user modelling process.
Keywords: linked data, user profiling, linked geodata, call detail record, mobile user, telco, cdr, bts, lgd, osm.
Introduction
There is a growing interest in location-based profiling, defined as combining geo-data with anonymous on-line profiles. New methods for capturing information on where users are and how their position changes over time are constantly being developed. This information is becoming increasingly valuable for a growing number of location-based or location-aware services. Some researchers try to estimate the value of mobile data using proximity-based advertising valuation (Baccelli, Bolot, 2011). Tourists have been identified as the most rewarding target group.
Call Detail Record (CDR) is the most widely used source of mobile location data in academic research (Song et al., 2010). The recorded location is not very precise, as it is “rounded” to the co-ordinates of the nearest base transceiver station (BTS). The granularity of location affects the value of location information (Baccelli, Bolot, 2011). CDR has been found sufficient for drawing conclusions at the area level (Qu, Zhang, 2013), hence it is also sufficient for our purposes.
The linked data cloud is very often considered as an additional knowledge source in the user modelling process. Concepts defined in various ontologies can be used to characterise entities. The profile of an entity then consists of concepts, each accompanied by a weight specifying the relative importance of the given concept for making the analysed entity distinct.
In this paper we focus on the cell tower granularity of location information and annotate it with a geographical ontology derived from OpenStreetMap. The goal of the method is to provide profiles of users based on profiles of BTS stations. Therefore, the profiling process has been split into several steps. First, the information about the BTS location and its neighbourhood has to be retrieved and analysed. Then, based on this information, a summary of BTS profiles is prepared. We propose an improvement in the profiling process by leveraging TF-IDF ranking to address the uneven distribution (skewness) of categories describing mobile tower locations. In the last step we can characterise users that log in to specific BTS stations.
Section 1 presents related research. Section 2 explains our general approach to profiling of entities based on geographical context. Section 3 introduces a method for characterisation of BTS stations, from data collection, through analysis, to data aggregation. Section 4 provides a method for profiling of telco users.
1. Related research
In the literature, there are various approaches to profiling of mobile users. Some authors rely purely on telco data, i.e. data available to mobile operators. The majority of methods leverage social media, where users express their opinions and feelings, reveal their location, etc. The most sophisticated approaches add mining for generalisation of patterns and classification of users. Having access to anonymised call data, we base our method on this data.
Most methods rely on social data from services such as Twitter, Foursquare, Flickr, or Instagram, as the data is relatively easy to retrieve (an API is available). These services are widely used and generate large volumes of data. One of the challenges is how to model the use of such social (and mobile) applications by various users. It is essential to understand the semantics of the messages they post. Two trends are observed here: extraction of meaning and extraction of location.
In many studies, semantic technologies are considered in order to enrich and disambiguate information gathered from users. They are particularly useful for providing context, and geographical context is one of the most important. Abel et al. introduced a user modelling framework that utilises semantic background knowledge and uses it for point of interest (POI) specification (Abel et al., 2012). Two linked data knowledge sources are considered: GeoNames and DBpedia. They demonstrate that user modelling quality improves when LOD-based background knowledge is considered. DBpedia is unfortunately too coarse-grained for our purposes.
Instead of analysing separate check-ins, some approaches build an activity-travel profile, in which a spatial trajectory is built from mobile phone call records only. The tricky part is the classification of trajectories, where data mining methods can be applied (Görnerup, 2012). Not only do the sequences have to be derived, but in order to make sense they also have to be classified into typical activity-travel patterns. Their relative frequencies constitute an activity-travel profile (Liu et al., 2014). Our method is also based on the number of BTSes visited. However, we do not consider sequences of visits, as our experiments have shown that this would not produce meaningful results.
Trajectories, or sequences of visited locations, are not very useful unless they are confronted with the activities of the users. Liu et al. investigated to what extent behavioural routines could reveal the activities being performed at mobile phone call locations (Liu et al., 2013). The real value is in annotating locations with activity purposes, but this requires additional information. Although they have devised mathematical models to quantitatively characterize travel patterns, the motivating activities “were still in a less-explored stage” (Liu et al., 2013). Our method provides just another context for reasoning about possible activities of the users, e.g. shopping, playing tennis.
According to Cano et al. (2013), little has been done in modelling location entities. Therefore, they proposed to profile geographical areas by providing topical categorisation. Cano et al. used an additional source, linked data, to interlink information from the social stream with geographical objects (Cano et al., 2013). They introduced the LinkedPOI ontology, which uses DBpedia categories to profile geographic space, and proposed the geo-lattice Awareness Stream model as one of the ways to represent location. The process consisted of filtering, enriching, structuring and interlinking microposts from Twitter, Facebook, and TripAdvisor. Our method also models location entities, but as we do not analyse microposts we do not have to guess the correct location: it is provided in CDR.
Qu and Zhang proposed a framework and corresponding analytic methods to use User Generated Mobile Location Data (UGMLD) for Trade Area Analysis (Qu, Zhang, 2013). They defined three key processes: “identifying the activity centre of a mobile user, profiling users based on their location history, and modelling users’ preference probability.” The method is intended for the analysis of customers’ visits to business venues. However, they rejected CDR as the data source in their research as it was too coarse. Our method is able to utilise CDR in order to provide profiles of certain locations, although we cannot provide profiles of particular venues belonging to bigger chains.
When specific locations are considered, Chen et al. also presented a method for profiling businesses at specific locations, based on mining information from social media (Chen et al., 2014). They matched geo-tagged tweets against locations from Foursquare to build a profile of the mentioned businesses.
Going back to the user, Ostuni et al. presented Cinemappy, a location-based application that computes film recommendations by exploiting contextual information related to the current location of the user, leveraging information from DBpedia (Ostuni et al., 2013). Similarly, DBpedia was used by Cano et al. (2013), who proposed a semantic travel mash-up as a possible application. An approach for museums was presented in (Ruotsalo et al., 2013).
One of the obstacles in user modelling is the uneven distribution of business categories, e.g. there are many more visits to a cinema than to a second-hand shop, and there are many more shops than theatres. This was first noted by Qu and Zhang, who explained it by social motivation, not necessarily by the differences in the number of venues in various categories (Qu, Zhang, 2013). In their work the categories were very fine-grained; for example, Foursquare has a hierarchical category structure with 9 top categories and ca. 400 sub-categories. For clarity of interpretation the categories were later collapsed to 6 groups. Our method addresses this issue by applying specific methods from the information retrieval domain.
2. Approach to geographical linked data-based profiling
This section describes user profiling based on BTS characteristics derived from the geographical linked data. We follow the idea of location-based user profiling – one of the approaches is geoprofiling, a commonly used method to approximate user characteristics based on neighbourhood demographic data.
In most approaches, a venue or a place is given and the authors look for its coordinates. This is particularly important when text is analysed, especially in social media, e.g. Twitter. In order to simplify disambiguation, some portals, e.g. Foursquare and Facebook, allow so-called check-ins, where users select a precise location. We take the reverse process: starting from coordinates, we are interested in the objects nearby, thus describing the geographical context, further referred to as the location profile.
Our experiments have been carried out on anonymised data, where the only reasonable data for linking was the location of BTS towers. There was just one type of information that was widely used and could supplement our records: geographical information. There are several open data sources concerning geographical information that we could use, DBpedia and OpenStreetMap being the most prominent. Taking into account the granularity of available data and the requirement to display results on the map, we decided to base our method on OpenStreetMap and its triplified counterpart, LinkedGeoData (Auer, Lehmann, Hellmann, 2009). As crowdsourced data, it is kept relatively up to date, but without disruptive fluctuations.
LinkedGeoData (LGD) provides an ontology for classification of locations. There are ca. 1200 categories grouped into ca. 45 top-level categories. By comparison, Foursquare has ten top-level, 436 second-level, and 266 third-level categories¹. Foursquare’s maps in fact use OpenStreetMap². In our approach more general categories are advantageous, as they make interpretation of generalisation results easier. Sub-categories could be more valuable as they distinguish users better, but the solution is then less stable from the statistical point of view. This is a well-known trade-off of specificity vs. sensitivity (Fawcett, 2006). In order to best characterise the BTS stations, we have restricted our further analysis to some predefined objects (see Fig. 3).
The reasoning behind our profiling approach is presented in the following user story. A user often visits sport amenities. They are within the range of some BTS stations. Profiles of such stations contain sport amenities with a higher frequency than an average station. This is further reflected in the user profile, where sport amenities gain higher weight when the user trajectory is aggregated. Some visited venues are additionally annotated with a kind of sport, for example tennis³. Looking into calendars of sport events, we can even reason further about which kind of sport the user is interested in or whom the user is supporting (whether a team or an individual).
1 https://developer.foursquare.com/categorytree.
2 https://foursquare.com/about/osm.
3 This depends on data availability in OpenStreetMap.
3. Characteristics of BTS
3.1. Retrieval
In our experiments we have used BTS towers located in Poland, with ca. 8000 unique locations, stored in MySQL. First, the information about BTS locations was retrieved. Using a Python script, a SPARQL query was prepared for each location to retrieve the list of objects in the neighbourhood along with their categories. As a source of data for our queries we have used LinkedGeoData, which is a derivative of OpenStreetMap. Two main categories of objects are distinguished therein: nodes (just a point in GIS terminology) and ways (lines or polygons). Separate queries for nodes and ways had to be prepared because Virtuoso’s built-in distance function behaves differently for nodes and ways. As there were ca. 8000 locations, two kinds of objects, two means of capturing objects (bounding box and circle) and 3 different distances, we had to post ca. 80 thousand queries.
A sample SPARQL query is presented below:
PREFIX lgdm: <http://linkedgeodata.org/meta/>
PREFIX geom: <http://geovocab.org/geometry#>
PREFIX ogc: <http://www.opengis.net/ont/geosparql#>
SELECT DISTINCT ?class ?way WHERE {
  ?way a lgdm:Way .
  ?way a ?class .
  ?way geom:geometry [ ogc:asWKT ?geo ] .
  FILTER(bif:st_within(?geo, bif:st_point(%f, %f), %f))
}
Listing 1. Sample SPARQL query using geospatial functions
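The query preparation step can be sketched in Python as follows. This is only an illustration of filling the `%f` placeholders of Listing 1 for each BTS location, not the script used in the experiments. The function names, the `lgdm:Node` class for node queries, and the (longitude, latitude) argument order of `bif:st_point` are assumptions; the paper also notes that node and way queries differed in practice, which this single template ignores for brevity.

```python
# Template mirroring Listing 1; %s selects the meta class (assumed: Node or Way).
QUERY_TEMPLATE = """\
PREFIX lgdm: <http://linkedgeodata.org/meta/>
PREFIX geom: <http://geovocab.org/geometry#>
PREFIX ogc: <http://www.opengis.net/ont/geosparql#>
SELECT DISTINCT ?class ?obj WHERE {
  ?obj a lgdm:%s .
  ?obj a ?class .
  ?obj geom:geometry [ ogc:asWKT ?geo ] .
  FILTER(bif:st_within(?geo, bif:st_point(%f, %f), %f))
}"""


def build_query(kind, lon, lat, distance):
    """Fill the template for one BTS location.

    Assumption: bif:st_point takes (longitude, latitude), in that order.
    """
    if kind not in ("Node", "Way"):
        raise ValueError("kind must be 'Node' or 'Way'")
    return QUERY_TEMPLATE % (kind, lon, lat, distance)


def build_queries(locations, distances=(1.0,)):
    """Yield (bts_id, kind, distance, query) for every combination."""
    for bts_id, lon, lat in locations:
        for kind in ("Node", "Way"):
            for distance in distances:
                yield bts_id, kind, distance, build_query(kind, lon, lat, distance)


# Hypothetical BTS location (id, longitude, latitude) near Poznań.
queries = list(build_queries([(1, 16.9, 52.4)], distances=(0.5, 1.0)))
```

Each generated string could then be posted to a SPARQL endpoint; the transport layer is left out here on purpose.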
We have decided to use LGD’s endpoint instead of OSM’s API, as it made it possible to use Virtuoso’s built-in SPARQL functions for spatial queries. For the retrieval of nodes, it was possible to provide a detailed query, and the radius was in fact expressed in kilometres. For example, Figure 1a presents various venues located in a circle of 1 km diameter.
Retrieval of ways was more complicated. The function bif:st_within did not return the expected results for Way objects when the other parameter was a point. Several methods of obtaining satisfactory results have been tested, including generation of boxes to simulate containment (see Fig. 1b and 1c). Finally, circle overlap, after tuning the tolerance parameter to 0.01, was chosen as the method for querying neighbouring ways (Fig. 1d).
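The box-based variant mentioned above needs a bounding box around each BTS location. A minimal sketch of how such a box could be generated is given below. It assumes a flat-Earth approximation with about 111.32 km per degree of latitude, and the helper names are hypothetical; the paper does not specify how the boxes were actually computed.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def bounding_box(lon, lat, half_side_km):
    """Return (min_lon, min_lat, max_lon, max_lat) of a square box centred on a point.

    Uses a flat-Earth approximation, adequate for boxes of a few kilometres.
    """
    d_lat = half_side_km / KM_PER_DEGREE_LAT
    d_lon = half_side_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - d_lon, lat - d_lat, lon + d_lon, lat + d_lat)


def box_as_wkt(box):
    """Serialise the box as a WKT polygon (closed ring, 'lon lat' order)."""
    min_lon, min_lat, max_lon, max_lat = box
    ring = [(min_lon, min_lat), (max_lon, min_lat), (max_lon, max_lat),
            (min_lon, max_lat), (min_lon, min_lat)]
    return "POLYGON((" + ", ".join("%f %f" % p for p in ring) + "))"


# Hypothetical point near Poznań, box with a 1 km side.
box = bounding_box(16.9, 52.4, 0.5)
wkt = box_as_wkt(box)
```

At Polish latitudes one degree of longitude is noticeably shorter than one degree of latitude, which is why the longitude offset is scaled by the cosine of the latitude.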
Fig. 1. Various venues selected based on object type and distance, near Poznań International Fair: (a) nodes, circle, 1 km diameter; (b) ways, box, 1 km; (c) ways, box, 5 km; (d) ways, circle, distance 0.01

3.2. Analysis

Various aspects of location characteristics can be analysed, both qualitative and quantitative. For example, Fig. 2a shows BTS stations that have hotels in their neighbourhood. The size of the circle specifies the number of hotels. It is interesting to observe that the city most packed with hotels in Poland is Gdańsk. Fig. 2b shows a closer look.
Another aspect analysed is the number of venues of a given category per BTS location. General findings are as follows: on average there are 4.0 shops per location and 1.2 restaurants; on the other end are universities (0.04) and cinemas (0.06). Such asymmetry needs to be addressed when building the location and user profiles.
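The per-location averages quoted above can be computed with a simple aggregation. The sketch below assumes that the query results have already been grouped into a mapping from a BTS identifier to the list of category names found nearby; this data layout and the toy values are assumptions made for illustration only.

```python
from collections import Counter


def average_per_location(location_objects):
    """Average number of objects of each category per BTS location.

    location_objects: dict mapping bts_id -> list of category names
    (one list entry per object found in the neighbourhood).
    Locations without any objects still count in the denominator.
    """
    totals = Counter()
    for categories in location_objects.values():
        totals.update(categories)
    n_locations = len(location_objects)
    return {cat: count / n_locations for cat, count in totals.items()}


# Toy data, not taken from the paper.
toy = {1: ["Shop", "Shop", "Restaurant"], 2: ["Shop"], 3: []}
averages = average_per_location(toy)
```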
Fig. 2. Number of hotels located within BTS stations: (a) Poland; (b) Gdańsk

Although nodes and ways were distinguished due to technical limitations, results provided by these two types of objects have to be merged to provide a comprehensive location profile. There is a strong preference between category and GIS type, and the OpenStreetMap community has adopted certain patterns. Objects like parking or leisure areas are better represented by showing the area, hence they are more popular as ways. For example, parking is represented as a way in almost 110 thousand cases; only 3.5 thousand parking areas are marked as a node. On the other end are ATMs: over 6500 entities are represented as nodes and only 4 as a way. The same applies to tram stops. Shops are the most balanced entity: 24899 ways vs. 28871 nodes.
After experiments we concluded that it is useful to prepare characteristics of a typical BTS location, including input from both nodes and ways. The combined profile is presented in Fig. 3. On average, 22.1% of objects within a BTS are parkings and 15.3% are leisure-related.
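Merging node and way results into one location profile can be sketched as below. The function assumes that each object appears in exactly one of the two result lists, which the paper does not state explicitly; the function name and sample categories are hypothetical.

```python
from collections import Counter


def combined_profile(node_categories, way_categories):
    """Merge node and way results and return the percentage share per category.

    Assumption: every object is listed once, either as a node or as a way.
    """
    counts = Counter(node_categories) + Counter(way_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Toy data: two node objects and two way objects.
shares = combined_profile(["Shop", "ATM"], ["Parking", "Parking"])
```

Averaging such shares over all locations would give a distribution of the kind reported for the typical BTS location.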
Fig. 3. Average distribution of objects among predefined 30 categories of objects

4. Profiles of locations and users

4.1. TF-IDF-inspired method for location profiling

Simple aggregation of geographical categories assigned to places is not sufficient to correctly profile locations. Relative values are more important than absolute values. For example, as there are many more shops than libraries, the results would be biased if we had not included this correction in the characteristics of locations. In fact, we are interested in whether given users visit some kind of objects more often than an average user.
In order to alleviate the effect of the uneven distribution of categories we propose to use the TF-IDF weighting scheme known from information retrieval. TF-IDF is actually a product of two statistics: term frequency (TF) and inverse document frequency (IDF).
TF can be calculated in different ways, e.g. boolean or raw frequency. We have decided to use relative frequency (which is normalised to 1.0), expressed as follows:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of times that the term $t_i$ occurred in document $d_j$.
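Relative term frequency translates directly into code when a location plays the role of a document and a category plays the role of a term. The sketch below is an illustration under that reading; the category labels are made up for the example.

```python
from collections import Counter


def term_frequency(categories):
    """Relative frequency of each category at one location (values sum to 1.0).

    categories: list of category names, one entry per object at the location.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


# A location with two bus stops, one place of worship and one fuel station.
tf = term_frequency(["BusStop", "BusStop", "PlaceOfWorship", "Fuel"])
```

With this normalisation a location that has a single object gives that category a term frequency of 1.0, regardless of how common the category is elsewhere.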
IDF is a measure of term specificity: less frequent terms have bigger discrimination power as they can better characterise a document. It is calculated as follows:

$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

where $|D|$ is the number of documents in the corpus, and the denominator contains the number of documents containing the term $t_i$.
An entry to the IDF calculation is presented in Fig. 4. The TF-IDF factor in the case of the most popular category – Shop – is 0.766 and for the least popular category – Pitch – is 4.234. Conclusion: shops, present in almost half of the locations, are not very useful to distinguish locations. Sample results obtained from the above TF-IDF method are given in Table 1. They have been built based on the restricted list of 30 objects.

Fig. 4. Percentage of locations containing venues (nodes) of given category

Table 1. Sample profiles of BTS locations calculated with TF-IDF
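The IDF step and the resulting ranking can be sketched as follows. The natural logarithm is an assumption. It is consistent with the reported figures if they are read as IDF values: a factor of 0.766 corresponds to a category present in roughly 46% of locations ("almost half"), and 4.234 to about 1.4%. The data layout and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(locations):
    """IDF per category; each location is treated as one document.

    locations: dict mapping bts_id -> list of category names.
    Assumption: natural logarithm.
    """
    n_docs = len(locations)
    doc_freq = Counter()
    for categories in locations.values():
        doc_freq.update(set(categories))  # count each category once per location
    return {cat: math.log(n_docs / df) for cat, df in doc_freq.items()}


def tf_idf_profile(categories, idf):
    """Rank the categories of one location by relative TF times IDF."""
    counts = Counter(categories)
    total = sum(counts.values())
    weights = {cat: (n / total) * idf[cat] for cat, n in counts.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Toy corpus of four locations, not taken from the paper.
corpus = {1: ["Shop", "Shop", "Bank"], 2: ["Shop"], 3: ["Pitch"], 4: ["Shop", "Pitch"]}
idf = inverse_document_frequency(corpus)
profile = tf_idf_profile(corpus[1], idf)  # Bank outranks the more frequent Shop
```

Because the weight is a product, categories with equal term frequency at a location are ordered by their IDF, so the rarer category comes first.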
The profiles in Table 1 should be interpreted as follows: in the neighbourhood of the location id 34 there are 4 geographical objects, all of them nodes (points), including 2 bus stops, 1 place of worship, and 1 fuel station. Some locations have just one object (e.g. id 36), others have many more (e.g. id 41). In bold we have marked the category with the highest measure, thus being the most characteristic for a given location. When many categories are equally frequent (location id 32), the one with the highest IDF makes the top of the ranking. It is interesting to observe that 13 shops in location id 41 are less important than 5 monuments and 8 banks. Weights can be compared between locations, but categories in locations with a smaller number of categories get higher weights (case id 36). This is to some extent justified – such categories clearly profile the location.
Fig. 5 presents a visualisation of top classes ranked according to the TF-IDF measure. Each BTS location is annotated with the top-ranked category in the neighbourhood. The most important objects are coded as follows: red – shops, blue – historical objects, orange – schools, green – public transport. Figures for the whole of Poland (Fig. 5a) and Gdańsk (Fig. 5b) are attached.

Fig. 5. Most popular annotations of BTS locations: (a) Poland; (b) Gdańsk

4.2. User profiling

For evaluation, we have used the database with a sample of 10,000 users containing data for 3 months. From this database we have queried users with at least 10 BTS locations and then randomly selected users⁴. The calculation of the characteristics of users is rather straightforward. The profile of a user is prepared as a weighted sum of profiles of visited BTS locations, where the weight is the number of connections initiated with a given BTS. Similarly to locations, we can visualise profiles of users as a pie chart. Fig. 6 presents profiles for sample users.

⁴ Please note that this is an illustration of the method and some decisions are arbitrary. Any other period can be considered, any other set of object types can be used.

Fig. 6. Geographical Linked Data-Based profile of the sample users

For comparison of user profiles a radar chart is much more suitable, as it can better emphasize differences and similarities between users. One of the problems with this kind of chart is that the axes should have the same measure. Therefore, we have normalised the values in the charts in such a way that the user with the highest value of a given object type in the profile has value 1.0. The same users are compared in Fig. 7. Grouping of certain categories or even reducing the number of categories should further improve the readability of the chart.
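The weighted sum and the per-category normalisation for the radar chart can be sketched as below. The dictionary layouts, function names and toy values are assumptions; only the two rules themselves (weight equal to the number of connections, highest value per category scaled to 1.0) come from the text.

```python
from collections import defaultdict


def user_profile(connections, location_profiles):
    """Weighted sum of the profiles of BTS locations visited by one user.

    connections: dict bts_id -> number of connections initiated at that BTS.
    location_profiles: dict bts_id -> dict category -> weight (e.g. TF-IDF).
    """
    profile = defaultdict(float)
    for bts_id, n_connections in connections.items():
        for cat, weight in location_profiles.get(bts_id, {}).items():
            profile[cat] += n_connections * weight
    return dict(profile)


def normalise_for_radar(user_profiles):
    """Scale each category so that the user with the highest value gets 1.0."""
    maxima = defaultdict(float)
    for profile in user_profiles.values():
        for cat, value in profile.items():
            maxima[cat] = max(maxima[cat], value)
    return {
        user: {cat: (value / maxima[cat] if maxima[cat] else 0.0)
               for cat, value in profile.items()}
        for user, profile in user_profiles.items()
    }


# Toy location profiles and two hypothetical users.
locs = {"A": {"Shop": 0.5, "Bank": 1.0}, "B": {"Shop": 1.0}}
users = {"u1": user_profile({"A": 2, "B": 1}, locs),
         "u2": user_profile({"B": 4}, locs)}
radar = normalise_for_radar(users)
```

Normalising per category rather than per user keeps the axes comparable between users, at the cost of hiding the absolute magnitude of each weight.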
Fig. 7. Comparison of user profiles on radar charts – restricted to two users

Conclusions and future work

In this paper we have contributed a method for exploitation of LinkedGeoData and also a TF-IDF-based method for profiling of locations and users. The method has been applied in the particular setting of mobile telco operators. Nevertheless, the elaborated method can be used universally for profiling any business requiring a profile of the neighbourhood, e.g. for marketing purposes or revenue estimation.
So far we have applied a simple approach to aggregation to obtain the user profile. In the future we plan to use Latent Dirichlet Allocation (LDA) (Blei, Ng, Jordan, 2012). It is a generative probabilistic model where each item of a collection is modelled as a finite mixture over an underlying set of topics. This resembles our approach to profiling: users are modelled as a mixture of categories that are provided by locations. What we need to determine are the weights that allow an appropriate mixture to be prepared. The method can also be improved by considering the hierarchy of categories. As of now, LinkedGeoData contains all intermediate categories when annotating a specific venue (e.g. both shop and amenity). There is an extension of the method to hierarchical LDA (hLDA).
References
Abel F., Hauff C., Houben G.-J., Tao K. (2012), Leveraging User Modeling on the Social Web with Linked Data [in:] Web Engineering, Springer-Verlag, Berlin-Heidelberg, pp. 378-385.
Auer S., Lehmann J., Hellmann S. (2009), LinkedGeoData: Adding a Spatial Dimension to the Web of Data, ISWC 2009, Vol. 5823, Springer, Heidelberg, pp. 731-746.
Baccelli F., Bolot J. (2011), Modeling the Economic Value of the Location Data of Mobile Users, INFOCOM, IEEE, pp. 1467-1475.
Blei D.M., Ng A.Y., Jordan M.I. (2012), Latent Dirichlet Allocation, “Journal of Machine Learning Research”, Vol. 3(4-5), pp. 993-1022, doi:10.1162/jmlr.2003.3.4-5.993.
Cano A.E., Dadzie A.-S., Burel G., Ciravegna F. (2013), Topica-Profiling Locations through Social Streams. Semantic Technology, Springer-Verlag, Berlin-Heidelberg, pp. 290-305.
Chen F., Joshi D., Miura Y., Ohkuma T. (2014), Social Media-based Profiling of Business Locations, Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, Orlando, FL, pp. 1-6.
Fawcett T. (2006), An Introduction to ROC Analysis, “Pattern Recognition Letters”, Vol. 27(8), pp. 861-874, doi:10.1016/j.patrec.2005.10.010.
Görnerup O. (2012), Scalable Mining of Common Routes in Mobile Communication Network Traffic Data [in:] J. Kay, P. Lukowicz, H. Tokuda, P. Olivier, A. Krüger (eds.), “Pervasive Computing”, Vol. 7319, Springer-Verlag London, pp. 99-106, doi:10.1007/978-3-642-31205-2_7.
Liu F., Janssens D., Cui J., Wang Y., Wets G., Cools M. (2014), Building a Validation Measure for Activity-based Transportation Models Based on Mobile Phone Data, “Expert Systems with Applications”, Vol. 41(14), pp. 6174-6189, doi:10.1016/j.eswa.2014.03.054.
Liu F., Janssens D., Wets G., Cools M. (2013), Annotating Mobile Phone Location Data with Activity Purposes Using Machine Learning Algorithms, “Expert Systems with Applications”, Vol. 40(8), pp. 3299-3311, doi:10.1016/j.eswa.2012.12.100.
Ostuni V.C., Gentile G., Di Noia T., Mirizzi R., Romito D., Di Sciascio E. (2013), Mobile Movie Recommendations with Linked Data [in:] Availability, Reliability, and Security in Information Systems and HCI, Springer, Berlin-Heidelberg, pp. 400-415.
Qu Y., Zhang J. (2013), Trade Area Analysis Using User Generated Mobile Location Data, Proceedings of the 22nd International Conference on World Wide Web, Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, pp. 1053-1064, http://dl.acm.org/citation.cfm?id=2488388.2488480 (accessed: 30.08.2015).
Ruotsalo T., Haav K., Stoyanov A., Roche S., Fani E., Deliai R., Mäkelä E., Kauppinen T., Hyvönen E. (2013), SMARTMUSEUM: A Mobile Recommender System for the Web of Data. “Web Semantics: Science, Services and Agents on the World Wide Web”, Vol. 20(0), pp. 50-67, doi:10.1016/j.websem.2013.03.001.
Song C., Qu Z., Blumm N., Barabási A.-L. (2010), Limits of Predictability in Human Mobility, “Science”, Vol. 327(5968), pp. 1018-1021.
LINKED GEODATA FOR PROFILING OF TELCO USERS
Summary: There is a growing interest in geographical profiling of users, understood as combining geographical data with anonymous user profiles. The profile of an entity usually consists of geographical concepts marked with weights that reflect the relative importance of particular concepts for distinguishing users. The proposed method of profiling mobile network users has two stages. First, profiles of base transceiver stations (BTS) are created on the basis of crowdsourced geographical information. Next, these profiles are used to generalise the behaviour of a user, as determined from the analysis of the logs of their calls (CDR). The linked data cloud is used as an additional source of knowledge in the user modelling process.
Keywords: linked data, user profiling, linked geodata, call logs, mobile user, telco, cdr, bts, lgd, osm.