TELKOM
NIKA Indonesia
n
Journal of
Electrical En
gineering
Vol. 12, No. 9, September
2014, pp. 64
4
5
~ 645
3
DOI: 10.115
9
1
/telkomni
ka.
v
12i9.516
3
6445
Re
cei
v
ed
No
vem
ber 2
1
, 2013; Re
vi
sed
Apr 18, 201
4; Accept
ed Ju
ne 6, 2014
Implementation of a File System with Encryption and
De-duplication
Xiaoning Jiang, Wenhu
a Zhou, Yang Leng
Coll
eg
e of Information a
nd El
e
c
tronic Eng
i
ne
erin
g, Z
hejia
ng
Gongsha
ng U
n
iversit
y
Han
g
zho
u
, P.R.Chin
a, 057
1-2
887
77
58
Corresp
on
din
g
author, e-mai
l
: jian
g
x
ia
oni
ng
@zjgsu.
edu.cn
, vanora.zho
u
@
gmai
l.com,
fei
y
an
ge
nator
@gmai
l
.com
A
b
st
r
a
ct
W
i
th the rapi
d adv
ance
of soci
ety, esp
e
cial
ly the
d
e
vel
o
p
m
e
n
t of computer
tec
hno
log
y
,
netw
o
rk
techno
logy
an
d i
n
for
m
ati
on t
e
chn
o
l
ogy, t
her
e is
an i
n
cre
a
sin
g
de
ma
nd for
s
ystems th
at c
a
n
provi
de secur
e
data storage i
n
a cost
-effective ma
nn
er. In this pap
er, w
e
prop
ose a prot
otype file syste
m
calle
d EDF
S
(
E
ncryptio
n an
d
De-d
upl
icatio
n
F
ile Syst
e
m
), w
h
ich prov
ides
both d
a
ta sec
u
rity an
d spac
e
efficiency
in
storage system
s. We
adopt de-
duplic
ation tec
h
nology
called
CDC (Content-Defined
Chu
n
kin
g
) to c
hunk
the fi
le
a
nd c
o
mpute
its
un
iqu
e
fi
nger
p
r
int, lo
okup
diff
erent
blocks, t
hen
store
differ
ent
blocks
on
ser
v
er to
ens
ure
storag
e
effici
ency. T
h
i
n
kin
g
of sec
u
rity,
w
e
use c
onv
e
r
gent
encry
pti
on to
encrypt
data bl
ocks, w
h
ich
not only
ens
ure data sec
u
rity
but also i
m
pr
ove
the possi
bility of
de-
duplic
ati
o
n.
F
i
nally, w
e
des
cribe
a pr
ototype i
m
ple
m
enta
t
ion of E
D
F
S
b
a
sed
on F
U
SE
and
pres
ent a
n
an
alysis
of t
h
e
potenti
a
l effecti
v
eness, usi
ng r
eal d
a
ta
obta
i
n
ed fro
m
a runti
m
e d
a
tab
a
se.
Ke
y
w
ords
:
ch
unki
ng, conv
er
gent encry
pt
io
n, file system,
backu
p
Copy
right
©
2014 In
stitu
t
e o
f
Ad
van
ced
En
g
i
n
eerin
g and
Scien
ce. All
rig
h
t
s reser
ve
d
.
1. Introduc
tion
With the ri
sin
g
of informati
on-n
a
lized le
vel
of human
so
ciety, the wo
rld’
s data
ca
pacity is
gro
w
ing
at an alarmi
ng ra
te, which i
ndi
cate
s that
the era of g
r
ea
t explosion
of information
ha
s
alrea
d
y come
. Accordi
ng t
o
stati
s
tics, u
n
til no
w the
global
amo
u
n
t
of data
ha
s increa
se
d b
y
8
times from 20
05 to 2012. It is expected t
hat the
global
amount of da
ta could rea
c
h 35 quad
rilli
on
in 2020. Ho
wever, study found that up
to 60% of
these data is
redu
nda
nt an
d the traditio
nal
data comp
re
ssion [2]
can
o
n
ly eliminate
intra-file
red
u
ndan
cy. So i
n
order to
sol
v
e the proble
m
mentione
d a
b
o
ve, de-dupli
c
ation, al
so
known a
s
si
ngl
e-in
stan
ce
st
orag
e, ha
s b
een p
r
o
p
o
s
ed
as
an efficient way to best utilize the given
amount of
st
orag
e. De
-du
p
licatio
n iden
tifies comm
o
n
seq
uen
ce
s
o
f
bytes
calle
d chun
ks b
o
t
h within
a
n
d
between
file
s, an
d o
n
ly
store
s
a
sin
g
le
instan
ce
of
e
a
ch
chun
k re
gardl
ess of th
e nu
mbe
r
of times it o
c
curs. By doi
ng
so,
de-du
plicat
ion
can d
r
am
atically redu
ce th
e spa
c
e requi
red to sto
r
e a
large data
se
t.
Data se
cu
rit
y
[3] is ano
ther field of
great impo
rtance that must be taken into
con
s
id
eratio
n
in mo
de
rn
storage
sy
ste
m
s. Bu
t u
n
fo
rtunately, tra
d
itionally en
cryption a
nd
de-
dupli
c
ation a
r
e, to a great
extent, diametrically
op
po
sed to ea
ch
other. De-d
u
p
licatio
n makes
use
of d
a
ta
simila
rity to remove
red
u
n
dant info
rmat
ion in
sto
r
ag
e spa
c
e
whil
e the
goal
o
f
cryptog
r
a
phy
is quite th
e
cont
rary, which
aims
at makin
g
ci
ph
er text indisti
ngui
sha
b
le from
theoreti
c
ally
random
d
a
ta.
As a
result, the g
oal
of
a
se
cure
de-du
plicatio
n
syst
em i
s
to
en
sure
data se
cu
rity without comp
romi
sing the
spa
c
e e
ffici
en
cy achi
eved from de
-du
pe
techni
que
s.
In this
wo
rk,
we
develo
p
a
prototype
file
sy
ste
m
for secu
re
de
-du
p
lication,
whi
c
h allo
ws
data to be encrypted in
depe
ndently without invali
dating data
de-d
upli
c
atio
n. Also, we put
particula
r effo
rt in pe
rformi
ng some
phy
sical
expe
rim
ents to
analy
z
e the
pe
rformance b
ehav
ior
of the given tech
niqu
e. We
ex
amine
the relation
shi
p
s among aver
a
ge chun
k si
ze
, de-du
plicatio
n
ratio und
er dif
f
erent chun
ki
ng algo
rithm
s
, chun
k si
ze
s and data
sets.
2. Related
w
o
rk
There mainl
y
exist three approa
che
s
for e
limin
ating red
und
ant informati
on: delta
encodin
g
, de
-dupli
c
atio
n, and comp
re
ssion. Current
system
s whi
c
h aim at a
c
hieving effici
ent
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 9, September 20
14: 64
45 - 645
3
6446
stora
ge a
n
d
highly utilized network
band
widt
h u
s
ually rely upon on
e of the mentio
ned
techni
que
s in
depe
ndently or use them
in a co
mbin
ed mann
er.
Delta en
co
ding, also
kno
w
n
as
data differenci
ng, is a
way of storin
g or tr
an
smitting data in
th
e
form
of differen
c
e
s
bet
ween
seq
uential
d
a
ta. It is used in m
any
appli
c
ation
s
i
n
clu
d
ing
so
u
r
ce
control
a
nd ba
ckup [
4
].
Rsyn
c is
a uti
lity
free
software an
d network
p
r
oto
c
ol t
hat
synchro
n
i
z
e
s
files and
dire
ctori
e
s
fro
m
one lo
cation t
o
anothe
r whi
l
e minimizin
g
data
tran
sfer by
using
d
e
lta
encoding when
ap
pro
p
ri
ate
[5].
In ord
e
r t
o
i
m
prove
spa
c
e efficie
n
cy,
ba
ckup a
p
p
lication
s
al
so
take
advant
age
of
informatio
n redun
dan
cy to great
ly decrease the am
ount of data
to be backe
d up [6]. There
are
three pri
m
ary chunki
ng
strategi
es of
data de-du
plicatio
n: wh
ole-file
chun
king, fixed-si
ze
chu
n
ki
ng
an
d vari
able
-
si
ze chun
kin
g
.
Whol
e-file
ch
unki
ng t
r
eat
s ea
ch
file a
s
a
sin
g
le
blo
ck,
typically uses the blo
ck’
s h
a
sh va
l
ue a
s
its identifier.
Therefore,
if
more th
an o
ne files
ha
sh
to
the sam
e
val
ue, they are
assumed to
have iden
tical contents and will only need to be
stored
once. Farsite [7] and the Windo
ws Singl
e Instan
ce
Store [8] both p
e
rform
de-du
plicatio
n on p
e
r-
file bas
es
. A
s
for fixed-s
i
z
e
c
h
unk
i
ng, vari
eties
of pre
c
edi
ng
works exploit
it for backup
appli
c
ation
s
and la
rg
e-sca
l
e file sy
stem
s, such a
s
[9]
and [1
0]. Le
ssf
s [11]
,
a h
i
gh
pe
rforma
nce
inline data de
-dupli
c
atin
g file system for
Linux, is
be
co
ming more an
d more p
opul
ar in ente
r
pri
s
e
solutio
n
s fo
r
redu
cin
g
di
sk backu
ps an
d minimi
zi
ng
virtual ma
ch
ine sto
r
a
ge i
n
pa
rticula
r
.
In
addition, Op
e
nded
up, also
called SDF
S
[12] is a file-sy
s
tem tha
t
suppo
rts in
-line an
d bat
ch
mode
de
-du
p
lication
on
bo
th Linux
and
Wind
ows,
al
o
ng
with VM
ware
virtuali
z
e
d
environm
en
ts
.
It can
do
de
-d
uplication
pro
c
e
s
s eithe
r
l
o
cally, on
the
netwo
rk,
or in
the
clou
d
(in
c
ludi
ng Am
azon
S3). The va
riable
-
size chun
king, al
so kn
own
a
s
conte
n
t-d
e
fined
chu
n
ki
n
g
split
s files into
variable
-
len
g
th chun
ks wit
h
in a slidin
g wind
ow u
s
ing
a hash valu
e. Variable
-
le
ngth chu
n
ki
n
g
is
widely u
s
ed i
n
variou
s a
p
p
lication
doma
i
ns of redu
nd
ancy elimi
nati
on such a
s
b
a
ckup
s [13], file
system
s [14], and data t
r
ansfe
rs [1
5]. In additi
on, some wo
rks prop
osed
to adapt
the
m
o
st
optimal ch
un
king
scheme
s
ba
sed o
n
t
he inner fe
atures of the file it
self. Acco
rdi
ng to this, Liu
et
al. pro
posed
a sche
me [
16] that appl
ies di
ffe
rent
chu
n
ki
ng me
thods
on the
basi
s
of th
e
metadata
of individual file
s. Jaeh
ong
et
al. pr
o
p
o
s
ed context-a
w
a
r
e
ch
un
king which sh
ares
t
he
basi
c
ide
a
wit
h
Liu’s
work, but only use f
ile
extensio
n rathe
r
than all
file metadata [17].
Beside
s de
-d
uplication, da
ta security is
another
key
factor that must be taken into
con
s
id
eratio
n
to build
se
cu
re
system
s. Keyed
en
crypti
on ha
s b
een
put into u
s
e t
o
add
re
ss dat
a
se
cre
c
y
by
many
dist
rib
u
t
e
d sy
st
em
s,
s
u
ch a
s
O
c
ea
nStore [18] a
nd e-Va
ult [19]. Cryptogra
phic
techni
que
s
utilized
by the
system
s m
e
n
t
ioned
above
make the
a
s
sumption
th
at all in
comi
ng
data is al
rea
d
y encrypted.
Neverth
e
less, none of
th
ese system
s attempt
to
exploit
redu
nda
ncy
deletion to a
c
hieve
stora
ge efficien
cy. Some
syste
m
s ad
apt se
curity
mod
e
ls at expense of
spa
c
e ove
r
he
ad rath
er tha
n
provide
se
curity and
de
-dupli
c
ated
storag
e efficien
cy. For in
stan
ce,
PASIS [20] and POTS
HA
RDS [21] achieve long-term
security by using se
cret sharing, which
lead
s to a very high stora
g
e
overhe
ad.
3. The EDFS Scheme
3.1. Chunkin
g Module
Chu
n
ki
ng is
a method of
sca
nnin
g
a
file and spli
tting it into short elem
ent
s. Each
element i
s
ca
lled a
chu
n
k
and i
s
a u
n
it
of red
unda
ncy detection. A
s
me
ntioned
above, the
r
e
are
three main ch
unki
ng strate
gies of
data
d
e
-du
p
licat
ion:
whol
e-file
ch
unki
ng,
fixed-sized
chu
n
ki
n
g
and content
-defined
chu
n
k
ing. Whole
-
fi
le chu
n
ki
ng i
s
sim
p
le an
d fast, but it can only detect
file
dupli
c
ate sin
c
e the entire file is viewed a
s
a whol
e
blo
ck [22]. For fixed-si
ze
chu
n
kin
g
, a file
will
be brea
k into
fixed size pie
c
e
s
. Ho
weve
r, beca
u
se bo
unda
rie
s
a
r
e
cho
s
e
n
by offset, this m
e
th
od
is very
sen
s
iti
v
e to the inse
rtion an
d del
etion op
eratio
n. Inse
rting o
r
deletin
g eve
n
a si
ngle
byte
will shift all the blo
ck b
o
unda
rie
s
followin
g
the m
odified poi
nt, resulting in
entirely different
blocks. To eff
e
ctively add
ress this
pro
b
l
e
m, EDFS ad
opt the sli
d
in
g win
d
o
w
(a
byte seq
uen
ce of
a
given
len
g
t
h) chun
kin
g
algo
rithm (Figure
1
)
cal
l
ed conte
n
t-d
e
fined ch
un
king,
al
so call
ed
v
a
riable
-
size chu
n
ki
ng.
EDFS uses
a
fingerp
r
int fu
nction
whi
c
h
derive
s
from
Rabi
n’s fing
erprint alg
o
rith
m [23] to
comp
ute
fing
erp
r
ints. The
s
e
finge
rp
rints are sig
natu
r
e
s
fo
r bo
und
e
d
wi
ndo
w
of
byte stream.
The
fingerp
r
int fun
c
tion is a
s
foll
ows:
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Im
plem
entation of a File System
with En
cryptio
n
and
De-d
upli
c
atio
n (Xiaoni
ng Ji
ang)
6447
n
l
n
i
i
D
r
S
)
(mod
(
1
)
Whe
r
e
i
S
is a byte stream,
n
s
s
s
s
S
...
3
2
1
,
l
is si
ze of the cho
s
en sli
d
i
ng wind
ow,
r
and
D
are
fixed values, and
D
r
. We call this equ
ation
a “targ
e
t pattern
”.
Figure 1. Sliding Win
d
o
w
Chun
king
If the sum of
the
l
length b
y
tes
in
the sli
d
ing wind
ow divides
D
with
r
re
mainde
r, ED
FS
will ma
rk the
cu
rre
nt en
d
of the wi
ndo
w a
s
a
brea
king poi
nt. Th
e bytes
between the
current
brea
kin
g
poin
t
and the pre
v
ious brea
kin
g
point
com
p
ose a chun
k. Otherwise, the sliding
wind
ow
will be shifte
d by one byte to generate
another fing
erp
r
int and compa
r
e agai
n. As we can
see,
CDC alg
o
rith
m hold
s
a
clea
r adva
n
tage ove
r
fixed-size si
nce the bou
nd
ary point
s a
r
e
determi
ned
o
n
the b
a
si
s o
f
the co
nt
ent
of files, not t
he offset f
r
o
m
the be
ginn
ing. As a
re
sult,
insertion or deletion will only affect the specif
i
c
blocks where data i
n
serted into or deleted from.
3.2. Encr
y
p
tion Algorith
m
Traditio
nally, combi
n
ing
th
e sp
ace effici
ency
of de-d
uplication with
the se
cre
c
y
asp
e
ct
s
of encryptio
n is pro
b
lem
a
tic sin
c
e g
ene
rally, di
fferent client
s hold di
fferent key
s
, formul
a:
)
,
(
_
block
key
encrypt
text
cipher
(
2
)
Mean
s afte
r
encryption, th
e same
source blo
c
k m
a
y gene
rate different
blo
c
ks. In
this wo
rk, we
adopt convergent en
crypti
on (Fig
ure
2) to addre
s
s
the issue. Usi
ng this meth
od, client
s g
a
in
encryption ke
y by using a functio
n
to cal
c
ulate the h
a
s
h value of th
e given blo
ck.
)
),
(
(
_
block
block
hash
encrypt
text
cipher
(
3
)
Figure 2. Con
v
ergent En
cryption Modul
e
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 9, September 20
14: 64
45 - 645
3
6448
Therefore, th
e sa
me pl
aint
ext results in
identic
al ci
ph
er text reg
a
rd
less of who d
oes th
e
encryption. In addition,
EDFS utilizes XTS-AES
to encrypt
the chunk
since it provides
transpa
rent e
n
cryptio
n
. We
can
i
n
sert an
e
n
cr
yptio
n
/decryption module
into an
exi
s
ting d
a
ta
path without
cha
ngin
g
data layout or messag
e
forma
t
s of other co
mpone
nts un
der the
s
e pat
hs.
What'
s
mo
re,
Dworkin [24]
prove
s
that “In the absen
ce of authent
i
c
ation
or acce
ss co
ntrol,
XT
S-
AES provides more protection than the other
approved confi
dentia
lity-only
modes agai
nst
unauth
o
ri
zed manipul
ation
of the encrypted data.”
3.3. FUSE
FUSE (File
system in userspa
c
e
)
is a framew
ork (Figure 3)
for unix-li
ke
o
peratin
g
sy
st
em
s.
I
t
was of
f
i
ci
ally
merge
d
int
o
t
h
e main
st
rea
m
Linux ke
rnel
tree in
kernel
versio
n 2.6.1
4
.
What’
s
more
important, F
U
SE has di
stinctive feat
ures as follows: simple library API, secure
impleme
n
tation, usable
b
y
non privile
ged u
s
e
r
s
a
nd be
prove
d
very sta
b
le
over time.
Usi
ng
FUSE, non
-privile
ged
use
r
s can
create/a
c
ce
ss thei
r own
file
system
s
without
modifying ke
rnel co
de.
Thi
s
is
achieve
d
by run
n
ing
file system
cod
e
in
u
s
er
spa
c
e
while t
he
FUSE mod
u
le provid
es o
n
l
y a "bridge" t
o
the actu
al
kernel inte
rfa
c
e
s
. Wh
at’s
more, F
U
SE has
alrea
d
y inte
grated
seve
ral pro
g
ra
m
m
ing lan
gua
ges,
whi
c
h
provide
s
mo
re choi
ce
s
for
develop
ers.
Figure 3. Architecture of FUSE
4. Protot
y
p
e Sy
stem
We devel
op a prototype file
system cal
l
ed EDFS ba
sed o
n
FUS
E
. The targets are a
s
follows
:
se
curity: the ability to resist attacks at a
mode
rate inte
nsity level;
efficien
cy: de
tect an
d elimi
nate d
a
ta red
unda
nc
y
with
out re
stri
cting
to files
of fixed
set
format;
perfo
rman
ce:
better re
ad
and
write
sp
eed
com
p
a
r
e
d
to
som
e
fil
e
sy
stem
s th
at ha
s
alrea
d
y exists;
usa
b
ility: transpa
rent de
-d
uplication
an
d encryption that unde
r co
n
t
rol;
prob
ability: b
e
availa
ble
across different pl
atfo
rm
s
without
co
nsid
erin
g val
i
dity of
unde
rling file
system.
4.1. Sy
stem
Organi
zation
The
system i
s
comp
osed
of seve
ral mo
dule
s
sh
o
w
n i
n
Figu
re 4.
De-du
p
lication
con
s
i
s
ts
of ch
un
king
module, tm
p
_
rep
o
sito
ry,
map_d
b a
nd
meta_db. B
u
ffer poll
offers seve
ral
se
rvice
pro
c
ed
ures f
o
r the incomi
ng data. Each pro
c
ed
ur
e serve
s
an in
dividual byte sequ
en
ce (fi
l
e).
Furthe
r m
o
re
, chun
kin
g
m
odule
pa
rtitions the
file int
o
a n
u
mbe
r
of chu
n
ks
an
d sto
r
e th
em
in
tmp_re
p
o
s
itory. Map DB an
d Metadata
DB ate a
ll in-m
emory data
b
a
s
e
s
for ma
ppi
ng the current
block in m
e
mory to the
right pla
c
e
o
n
disk
an
d reco
rdin
g file
metadata in
RAM befo
r
e
de
-
dupli
c
ation a
nd en
cryptio
n
, respe
c
tively. Block
Man
ager
re
ceive
s
blo
c
ks fro
m
tmp_re
p
o
s
itory,
gene
rate
s the
i
r fingerp
r
int
s
and ch
ecks if they alr
eady exist in the fingerpri
n
t table. If not, deliver
the blo
ck to
com
p
re
ssio
n mod
u
le, o
t
herwi
se
,
del
ete the blo
c
k, upd
ating
metadata
DB,
blocku
sa
ge DB and fileblock DB.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Im
plem
entation of a File System
with En
cryptio
n
and
De-d
upli
c
atio
n (Xiaoni
ng Ji
ang)
6449
Figure 4. Architecture of EDFS
4.2. Data e
s
n
g
ine
In order to i
m
prove
effici
ency
of
stora
ge
a
nd
que
ry, we
create
se
co
nda
ry d
i
recto
r
ie
s
unde
r the ro
ot stora
ge di
recto
r
y, usi
n
g the ha
sh
v
a
lue of the b
l
ock a
s
the file’s na
me. F
o
r
example,
if a
blo
ck’
s h
a
sh valu
e (S
HA-256
) is
9a32
a228
319
266a
0780
e00
bb6a
9d7f6
b
f1
8e8d
cd
3a35
a
564
as h
(
b),
then
it will
be
sto
r
ed
i
n
dire
ctory ‘/dta
/
9/A’ where ‘9
’ stand
s for the first by
te of
h(b
)
and ‘A’ stands for the
se
con
d
byte of
h(b
)
.
4.3. Metad
a
ta Engine
Metadata
en
gine
con
s
i
s
ts of thre
e in
dividual file
s: blo
c
kusage.
db
,
fileblo
c
k.db a
n
d
metadata.db.
We u
s
e com
m
odity DBMS
(Berkely DB) to implement
them.
Blockusage.d
b
is re
spo
n
si
ble for de
scri
be the
attrib
ute of every block. Its Key is the
hash value of
the store
d
block and it
s V
a
lue is
stru
ctured a
s
follo
w:
typedef stru
ct
{
unsi
gne
d lon
g
long si
ze;
/* size of the
block */
unsi
gne
d lon
g
long inu
s
e;
/* usage freq
uen
cy */
} ;
Fileblo
ck.d
b
sho
w
s the
relation
ship
b
e
twee
n a
file
and
the
blo
c
ks th
at com
pose it.
Beside
s, it al
so
re
cords th
e inod
e info
rmation of the
file on u
nde
rlying file syst
em. With in
o
d
e
and bl
ockn
r, we
can
ea
sily resto
r
e the
whol
e f
ile using chun
ked
blocks. Its Value i
s
the h
a
s
h
value of the given block an
d the
key
’
s
st
ruct
u
r
e i
s
as f
o
llow:
typedef stru
ct
{
unsigned l
ong lon
g
inod
e;
/* inode of the backe
d file */
unsigned l
ong lon
g
blo
c
knr;
/* seque
nce o
f
the block a
m
ong all blo
c
ks */
} ;
Metadata.db i
s
mainly used
to store prop
erti
es of ba
cked files. It can be used to che
c
k
the corre
c
tne
ss of a ne
wly resto
r
e
d
file by comp
a
r
ing t
he re
co
rde
d
summary info
rmation. Its ke
y
is inod
e of the backe
d file and its value i
s
structu
r
ed a
s
follow:
typedef stru
ct
{
s
t
ruc
t
s
t
at s
t
buf;
/* the ‘s
tat’ s
t
ruc
t
*/
unsigned l
ong lon
g
real
_si
z
e;
/* file size after de
-du
p
licat
ion and e
n
cryption */
char filena
me[MAX_PO
SIX_FILENAME_LEN + 1]
;
/* file name */
unsigned l
ong lon
g
md5
s
um
;
/* md5
s
um of file as plaintext */
} ;
What’
s
mo
re,
we u
s
e T
o
ky
o Ca
binet, a l
i
bra
r
y of routi
nes fo
r man
a
g
ing a d
a
tab
a
se, to
impleme
n
t metadata
DB a
nd map
DB
in
RAM. By doi
ng this,
we
ca
n sto
r
e
some
most fre
que
ntly
used blocks in memory, which
will dramatically improve IO
performance and
speed up query
evaluation.
4.4. Procedu
r
e of Basic
Opera
t
ions
In this se
ction
,
we describe
the algorithm
s of three mo
stly used b
a
si
c ope
ration
function
s of EDFS over
file
s: open, read
and write.
Algorithm 1
:
edfs_
ope
n(p
a
t
h, fi)
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 9, September 20
14: 64
45 - 645
3
6450
1. procedu
re
open
(path, fi)
2. step 1: find the record
wher
e
key=pat
h in in-mem
ory DB p2i
3.
if p2i[
k
ey]=NULL
4.
the
n
go to s
t
ep 4
5.
els
e
i
←
p2i[key].in
ode
6. step 2: find the record
where
key=i in
in-mem
ory DB mt
7.
if mt
[k
ey]
≠
NU
LL
8.
the
n
go to s
t
ep 5
9. step 3: find the record
where
key=i in
DB md
10.
i
f
md
[k
ey]
≠
NU
LL
11.
th
en s
←
md[
k
ey].s
tat
12. step 4: find the file f where f.sta
t=s from root direct
ory re
cursivel
y
13. step 5: up
date DB mt, p2i and in-me
m
ory DB o2b,
set
14.
m
t[k
ey].s
tat
←
s
15.
m
t[k
ey].maxblk
++
16.
p2i[key].inode
←
i
17.
o2b[key].of
←
fi.of
f
s
e
t
18.
o2b[key].blk
←
mt
[k
ey].maxblk
17. step 6: all
o
cate a idl
e
ci
rcul
e buffer from the poll, and set
18.
tag
←
i
19. end proce
dure
Algorithm 2
: edfs
_
read(path, offs
et, fi)
1. procedu
re
edfs_
rea
d
(pa
t
h, offset, fi)
2. buf[size] <-- 0
3. step 1: find the record
where
key=pat
h in in-mem
ory DB p2i, set
4.
i <-- p
2
i[key].inode
5. step 2: find the record
where
key.inod
e=i
an
d key.o
f
=offset in in-memory DB o
2b, set
6.
blk <-- o2b[key].blk
7.
find th
e record wh
ere key.inod
e=i
and key.bl
k=blk in in-mem
ory DB wt
8.
if
wt[k
ey]=
NULL
9.
then find the reco
rd whe
r
e
key.blk=bl
k in
DB fb
10.
h <
-
- fb[k
ey].hash
11.
find the block
b where f.name=h u
nde
r root
dire
ctory recu
rsively
12.
t <
-- dec
o
de(b)
13.
t <
-- dec
o
mpress
(t)
14.
buf <
-
- t
15.
els
e
buf <--
wt[k
ey]
16.
blk_offset <-- offset - o2b[key].of
17.
find the re
co
rd wh
ere key=h
a
sh
(blk) in DB bl
k_u
s
a
g
e
18.
if
blk
_
offs
et >
blk
_
us
age[key].s
iz
e
19.
then offset <-- offset + bl
k_
usa
ge[key].size
20.
go to s
t
ep 2
21.
return buf
22. end proce
dure
Algorithm 3
: edfs
_
write(path, offs
et, fi)
1. procedu
re
edfs_
write
(
b
u
f, size, offset, fi)
2. thread 1: find the re
co
rd
where key.in
ode
=fi.inode
and key.of=of
f
set in in-me
m
ory DB
o2b, set
3.
blk
<
-- o2b[k
e
y].blk
4.
blk_offset <-- offset - o2b[key].of
5.
find t
he rec
o
rd where key.blk=
blk
in DB fb
6.
h <
-- fb[k
ey].has
h
7.
find the blo
ck b
wh
er
e f.name
=
h
under
root di
recto
r
y re
cu
rsively
8.
t <
-- decode(b)
9.
t <
-- decompres
s(t)
10.
tmp
_buf <-- t
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Im
plem
entation of a File System
with En
cryptio
n
and
De-d
upli
c
atio
n (Xiaoni
ng Ji
ang)
6451
11.
for
n <-- 0 upto b
l
k_offset
12.
d
o
write tmp_
b
u
f[n] to circul
e buffer tagg
ed fi.node
13.
write size of data
in buf to circule buffer tag
ged fi.node
14.
get
the last sto
r
e
d
che
c
ksum cs if any from in-mem
ory DB wt
15.
if
c
s
=
NULL
16.
s
e
t first bound
ary point at the begin
n
ing o
f
buffer
17.
els
e
18.
s
e
t the chun
ki
ng
point inh
e
rited from the last ch
un
king
operation
19. repeat: calculate cs
of data within
the given win
dow
20.
i
f
c
s
mat
c
hes
'target pattern'
21.
then store th
e finding blo
c
k in wt
22.
e
l
s
e
23.
if reache the
end of the bu
ffer
24.
then stote a reco
rd in in
-memo
r
y DB
tmpct <fi.inod
e, che
c
ksum
>
25.
else slidi
ng the win
d
o
w
by one byte, go to repeat
26. thread 2:
get a re
cord r from wt
27.
h' <-- ha
sh
(r.bl
ock)
28.
inuse <-- bl
k_
usage[h']
29.
if
inus
e=
0 then
30.
c
r
<
-
- compress
(r.bloc
k
)
31.
e
r
<
-
- enc
o
de(c
r)
32.
i
n
s
e
rt a rec
o
rd of r.block
in DB blk
_
us
age
33.
els
e
34.
b
l
k
_
us
age[h']++
35.
ins
e
rt a rec
o
rd <h', blk>
into DB fb
36.
update the according re
co
rd in
DB md
37. end proce
dure
5. Expermen
t
5.1. Experment Env
i
ron
m
ent
Table
1
sho
w
s the te
stin
g environme
n
t we
u
s
ed t
o
test th
e p
e
r
forma
n
ce of
EDFS vs.
Lessfs a
nd Rsync. We pe
rform comp
reh
ensive a
nalysis on vario
u
s
asp
e
ct
s of de
-dupli
c
atio
n.
Table 1. EDF
S
Performan
c
e Testing Env
i
ronm
ent
CPU
Intel(R) C
o
re
(TM
)
2 Qu
ad CPU
Q9
500 @ 2.83
GHz
Fuse 2.9.1
Tok
y
ocabinet
1.4.32
Memor
y
4GB
Berkele
y
DB
4.8
Linux Co
re
2.6.38-8
-
gene
ric
Lessfs
1.5.12
Gcc
4.5.12
Rs
y
n
c
3.0.7
We
use fou
r
sampl
e
d
a
ta
sets from
a
runtime d
a
tab
a
se
to me
asure th
e d
e
-d
u
p
licatio
n
efficien
cy of three different algorith
m
s –
Rsync.
Lessfs a
nd
EDFS. We
use the follo
wing
comm
and to
export four d
a
ta sets from t
he actual run
t
ime Mysql every other d
a
y:
m
ysqldum
p –
f
lush-l
og
s –u
root –p za
bbi
x > `date
+
’
%m
%d
’
`
.sql
It is important
to note that
about on
e gig
abit dat
a will be inserted in
to the databa
se every day,
at
the meantime
,
data of the earlie
st day wil
l
be drop
ped.
First, we
do p
a
ir wi
se
synchroni
zatio
n
u
s
ing
rsyn
c to see the diffe
rence betwee
n
sam
p
le
data
sets.
Ta
ble 2
sho
w
s
the result. We can
see th
at acco
rdi
ng
to slidi
n
g
win
dow alg
o
rith
m
adopt by
rsyn
c, only a q
u
a
r
ter data i
s
diff
erent,
which
mean
s that th
e de-dupli
c
ati
on ratio
will b
e
up to about 7
5
%.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 9, September 20
14: 64
45 - 645
3
6452
Table 2. Pair
Wise Synch
r
onization of rsync
pair total
size(by
t
e)
Sent(b
yte)
speedup
0914.sql vs. 0915.sql
4,289,154,20
7
942,146,224
4.55
0915.sql vs. 0916.sql
4,249,344,00
0
1,001,053,58
5
4.24
0916.sql vs. 0917.sql
4,300,787,82
5
1,050,324,61
8
4.09
We exami
ne
the deg
ree
o
f
duplication
of
Les
sf
s f
o
r
chu
n
k
si
ze
128K
B
,
64K
B
,
32K
B,
16KB and 8K
B resp
ectivel
y
(Table 3
)
. The re
sult
sh
ows that hardly any
dupli
c
ate blo
c
ks
were
detecte
d u
s
i
ng fixed
-
si
ze
ch
un
king
m
e
thod
ado
pt by Le
ssfs.
It’s re
asona
b
l
e be
ca
use f
o
r
databa
se
s, chang
es
ca
used by in
se
rtion a
r
e di
strib
u
ted a
c
ross
t
he whole
DB
system,
whi
c
h
means that the fixed-si
ze c
hunking blocks will be
shifted.
Table 3. De
-d
uplication of Lessfs o
n
Sample Data Se
ts
file
size(
b
y
t
e)
chunk_no dedup_no
128KB
64KB
32KB
16KB
8KB
128KB 64KB 32KB
16KB 8KB
0914.sql 417886270
8
31883
63765
127529
255058
510116
0
0
0
0
0
0915.sql 428915420
7
32724
65448
130895
261790
523579
0
1
2
4
8
0916.sql 424934400
0
32420
64840
129680
259360
518719
0
0
0
0
0
0917.sql 430078782
5
32813
65625
131250
262500
524999
0
1
2
4
8
As for EDFS, we
chose to close the
capab
ilities of compression
and encryption since
we a
r
e
aimin
g
at de
-du
p
li
cation. T
able
4 de
scrib
e
s
the de
-du
pe
result. We
ca
n co
ncl
ude
that
within
a
DB fi
le, just li
ke
fi
xed-si
ze
chu
n
kin
g
, the
r
e
are
ha
rdly a
n
y
dupli
c
ate
chun
ks.
Ho
we
ver,
there’
s an o
b
v
ious si
milarit
y
between a
pair. Two
sa
mple DB files, 4 GB each,
will occu
py only
5.5GB physi
cal disk. For t
he se
co
nd file, 2.5GB
red
unda
nt data is rem
o
ved, which m
ean
s the
de-d
up rate can be up to 6
0
%.
Table 4. De
-d
uplication of EDFS on DB
Set
file
Size(b
y
t
e)
dedup_size(b
y
te
)
dedup_ratio
chunk_no
dedup_no
0914.sql 4,178,862,70
8
4,178,862,70
8
0
4,769
0
0915.sql 4,289,154,20
7
1,690,221,71
6
2.537
4,683
2,133
0916.sql 4,249,344,00
0
1,930,648,08
9
2.2
4,512
2,010
0917.sql 4,300,787,82
5
1,667,745,43
6
2.578
4,321
2,027
5.2. Testing
Analy
s
is
In databa
se
field, EDFS i
s
not as go
od
as
Rs
yn
c on the
aspe
ct
of de-
d
upli
c
atio
n, but is
much
bette
r than
L
e
ssfs. Beside
s, Rsy
n
c uses
slidi
ng windo
w chun
king algo
rithm whi
c
h will
gene
rate
so
many sm
all b
l
ocks
call
ed f
r
agm
ent, re
sulting in too
much
metad
a
ta whi
c
h i
n
turn
severely deg
rade
pe
rform
ance of IO
and
CPU.
On the oth
e
r han
d, EDFS can
pro
v
ide
transpa
rent encryption while Rsyn
c o
n
ly transfe
r
plain text, which is very
dang
ero
u
s o
v
er
netwo
rk.
Referrin
g to
writing
an
d
readin
g
p
e
rfo
r
mance, Le
ssfs i
s
b
e
tter fo
r the
rea
s
on
t
hat CDC
chu
n
ki
ng will have to shift
bytes one by one wh
en se
arching b
oun
dary point
s. So as to read
ing,
EDFS must d
o
a DB que
ry
to find the inode of t
he bl
ock. What’
s
worse, if data
across several
blocks, accordingly, query
operation mu
st be don
e a
cou
p
le of times. Ho
weve
r, to some exte
nt,
trading
spa
c
e
for time reall
y
make
s sen
s
e.
6. Conclusio
n
In this pap
er, we pro
p
o
s
e
a prototype
file
system called EDFS
(Encryptio
n a
nd De
-
dupli
c
ation Fi
le System),
whi
c
h p
r
ovid
es b
o
th
d
a
ta
se
cu
rity and
spa
c
e
effici
ency in
sto
r
a
ge
system
s. This sch
eme d
e
m
onstrates
th
at security ca
n be co
mbine
d
with de
-dupli
c
ation in a wa
y
that provid
es a vari
ety o
f
se
curity
ch
ara
c
teri
stics.
In ad
dition,
we
de
scrib
e
several
n
e
w
techni
que
s th
at result in st
orag
e efficie
n
c
y and
se
cu
ri
ty. Accordi
ng
to testing results, The E
D
F
S
scheme
can
be ea
sily appl
ied to backu
p
/
st
orag
e environment of dat
aba
se field.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Im
plem
entation of a File System
with En
cryptio
n
and
De-d
upli
c
atio
n (Xiaoni
ng Ji
ang)
6453
Ackn
o
w
l
e
dg
ements
This work was supp
orte
d by the National Natu
ral Scien
c
e
Found
ation
of China
(612
723
06),
the Publi
c
Proje
c
t of S
c
ien
c
e
and
Tech
nolo
g
y
Plan in Z
h
e
jiang Provin
ce
(201
2C210
01
), the Public Project of Scien
c
e an
d
Techn
o
logy
Plan in Zhejiang Provin
ce
(201
3C310
42
), and the
Col
l
ege Stude
nts’ Scie
nce
an
d Technol
ogy
Innovation Activities Plan
of
Zhejian
g
Pro
v
ince (1
130
JQ421
205
8G).
Referen
ces
[1] FUSE.
Filesystem
in Users
p
ace
[EB/OL]. ht
t
p
://fuse.sourceforge.net/.
[2]
Haita
ng W
a
n
g
. Researc
h
of Graph Com
p
re
ssion i
n
Information Stora
ge.
TEL
K
OMNIKA
Indones
ia
n
Journ
a
l of Elec
trical Eng
i
ne
eri
n
g
. 201
1; 11(8)
: 4367~
43
71.
[3]
Ali H
a
ssa
n S
o
dhro, Y
e
Li, M
ada
d Al
i S
h
a
h
.
Nove
l Ke
y
S
t
orage
a
nd M
a
nag
ement
Sol
u
tion
for th
e
Securit
y
of W
i
reless S
ensor
Net
w
orks.
TELKOMNIKA
Indones
ian J
our
n
a
l of El
ectrical
Engi
ne
erin
.
201
3; 11(6): 33
83~
33
90.
[4]
F
anpe
ng Y
u
,
Sido
ng Z
h
an
g, Xin Z
h
ou. In
centive
M
e
ch
a
n
ism Al
gorith
m
for P2P F
i
l
e
S
y
stem i
n
Camp
us N
e
t
w
orks.
TEL
K
OMNIKA
Indo
n
e
sia
n
Jo
urn
a
l
of Electric
al
Eng
i
ne
erin
g
.
20
12; 1
0
(6):
147
7~
14
84.
[5]
Andre
w
T
r
idge
l
l
, Paul Mackerr
as.
The rsync algorithm
. T
R
-CS-96-
05. 19
9
6
.
[6]
B Z
hu, K
Li,
H
Patterson.
Av
oi
din
g
th
e D
i
sk B
o
ttleneck
i
n
th
e
Data
D
o
mai
n
Ded
upl
icatio
n
F
ile Syste
m
,
Proc.
F
A
ST
’08: Sixth USENI
X
Conf. F
ile
and
Storage T
e
chn
o
lo
gies. 2
008;
1-14.
[7]
JR Douceur, A Ady
a
, WJ Bo
losk
y
,
D Simon, M
T
heimer.
R
e
cl
aim
i
ng
spa
c
e
from
du
pl
ica
t
e
fi
l
e
s
i
n
a
serverless
dist
ributed file sys
tem
Pr
oce
edi
n
g
s of the
2
2
n
d
Intern
ation
a
l
Confer
enc
e o
n
Distri
bute
d
Comp
uting S
y
s
t
ems (ICDCS ’02), Vi
en
na, A
u
stria. 200
2; 6
17–
62
4.
[8]
W
J
Bolosk
y
,
S Corbin, D
Goebe
l, JR Douce
u
r.
Singl
e
instance stor
age i
n
W
i
ndo
w
s
2000.
In
Procee
din
g
s of
the 4th USENIX W
i
nd
o
w
s S
ystems S
y
mp
osi
u
m. USENIX. 2
000; 13
–2
4.
[9]
S Quinlan, S Dor
w
ard.
Venti
:
A New
Approach to Archiv
al Storag
e.
Proc. Conf. File and Storage
T
e
chnolog
ies (
F
AST
’02). 200
2; 89-10
1.
[10]
B Hon
g
, DDE
Lon
g.
Du
plic
ate Data E
l
i
m
i
n
a
t
ion i
n
a S
an
F
ile Syste
m
.
P
r
oc. 21st IEEE
/ 12th NAS
A
Goddar
d Co
nf. Mass Storage
S
y
stems a
nd T
e
chn
o
lo
gi
es (MSST
). 2004; 30
1-31
4.
[11] Lessfs.
Open source d
a
ta de-
dup
licati
on
[EB/OL]. http:
/
/
w
ww
.
l
e
ssfs.com/
w
o
rdpr
ess/.
[12] Opend
ed
up.
Get m
o
re from
your disk
[EB/OL]. http://
w
w
w
.
opendedup.org/.
[13]
LP Cox
,
CD
Murray
,
BD
Noble. Pastiche: Maki
ng Backup Cheap
and Ea
sy
. SIGOPS Operatin
g
S
y
stems Rev.,
200
2; 36(SI): 285-2
98.
[14]
B Z
hu, K
Li,
H
Patterson. Av
oi
din
g
th
e D
i
sk B
o
ttl
eneck
i
n
th
e
Data
D
o
mai
n
Ded
upl
icatio
n
F
ile S
y
stem
.
Proc. F
A
ST
’08: Sixth USENI
X
Conf. F
ile and
Storage T
e
chn
o
lo
gies. 2
008;
1-14.
[15]
JC Mog
u
l, YM
Cha
n
, T
Kell
y
.
D
e
si
gn
, Imp
l
em
en
ta
ti
on
, an
dEva
l
u
a
t
i
o
n o
f
D
u
p
l
i
c
a
t
e Tra
n
sfe
r
D
e
te
ctio
n
in http.
Proc. Symp. Net
w
o
r
ke
d S
y
stems D
e
s
i
gn Impl
ement
ation (NS
D
I ’04
)
,p. 4, 2004.
[16]
C Li
u, Y Lu, C
Shi, G Lu,
D D
u
, D W
ang. AD
MAD: Appl
icati
on-
Driv
en M
e
tadata A
w
a
r
e D
e
-Du
p
lic
ati
o
n
Archival St
ora
ge S
y
stem. Proc. Fifth IEEE
Int’l
Worksho
p
Storage
Net
w
o
r
k Architecture
and P
a
ral
l
e
l
I/Os (SNAPI ’08). 2008; 29-35.
[17]
Jaeh
on
g Min,
Dae
y
o
u
ng Y
oon, a
nd Y
o
u
j
ip W
on.
Effici
ent De
du
plic
ation T
e
chni
qu
e
s
for Moder
n
Backup Oper
at
ion, Com
puters
.
IEEE
Transac
tions.
201
1; 60
(6): 824 – 84
0.
[18]
S Rhea, P Eaton, D Geels
,
H W
eathers
poo
n, B Z
hao
, J Kubiato
w
i
cz.
Pond: the
OceanStore
prototype.
Proc
eed
ings of the
Secon
d
USENI
X
Co
nfer
e
n
ce
on F
ile a
nd Sto
r
age T
e
chnol
o
g
ies (F
AST
).
200
3; 1–1
4.
[19]
A I
y
e
ngar, R
Cah
n
, JA Gara
y
,
C J
u
tla.
D
e
sig
n
an
d i
m
p
l
e
m
e
n
tation
of
a secur
e
dist
ribute
d
dat
a
repos
itory.
In
Procee
din
g
s of
the 1
4
th IF
IP Internatio
na
l In
formation
Sec
u
rit
y
Co
nfere
n
c
e (SEC
’98).
199
8; 123
–1
35
.
[20]
GR Goods
on,
JJ W
y
l
i
e, GR
Ganger, MK
R
e
iter.
Efficie
n
t
By
z
a
ntin
e-toler
ant er
asur
e-co
ded
stora
ge.
In
Procee
din
g
s of
the 200
4 Int’l Confer
ence
on
Depe
nd
abl
e Systems a
nd Ne
t
w
ork
i
n
g
(DSN
200
4). 200
4.
[21]
MW
Storer, K
M
Green
an,
E
L
Mil
l
er, K
Vor
uga
nti.
POT
S
HARDS: s
e
cur
e
l
ong-ter
m
storag
e w
i
tho
u
t
encrypti
on.
In Procee
din
g
s of
the 200
7 USE
N
IX An
nua
l T
e
chnic
a
l Co
nfer
ence. 20
07; 14
3–1
56.
[22]
W
ANG Can, QIN Z
h
i-guan
g, PENG Jing
, W
ANG Juan.
A Novel E
n
cryption Scheme for Dat
a
Ded
upl
icatio
n System.
In Pro
c
eed
ings
of Internati
ona
l C
o
n
f
erence
on C
o
mmunicati
ons,
Circuits an
d
S
y
stems (ICC
CAS 201
0). IEEE Com
puter
Societ
y
.
201
0: 265-
269.
[23]
Miche
a
l O Ra
bin.
F
i
n
gerpr
int
i
ng
by ran
d
o
m
poly
n
o
m
i
a
ls.
Center for R
e
s
earch i
n
C
o
mp
uting T
e
chn.,
Aiken C
o
mput
ation L
a
b
o
rator
y
, Univ., 19
81.
[24]
Morris D
w
ork
i
n
.
NIST
Specia
l Publ
icatio
n 80
0-38E .NIST
Spec
i
a
l Pub
lic
ati
on. 201
0; 800:
38E.
Evaluation Warning : The document was created with Spire.PDF for Python.