Automated Keyword Extraction from “One-day” Vulnerabilities at Disclosure

Clément Elbaz (Univ Rennes, Inria, CNRS, IRISA, Rennes, France)
Louis Rilling (DGA, Rennes, France)
Christine Morin (Univ Rennes, Inria, CNRS, IRISA, Rennes, France)

Abstract—Common Vulnerabilities and Exposures (CVE) databases such as Mitre's CVE List and NIST's NVD database identify every disclosed vulnerability affecting any public software. However, during the early hours of a vulnerability disclosure, the metadata associated with these vulnerabilities is either missing, wrong, or at best sparse. This creates a challenge for robust automated analysis of new vulnerabilities. We present a new technique based on TF-IDF to assess the software products most probably affected by newly disclosed vulnerabilities, formulated as an ordered list of relevant keywords. To do so, we rely only on the human readable description of a new vulnerability, without any need for its metadata. Our evaluation results suggest real world applicability of our technique.

I. INTRODUCTION

The disclosure of a vulnerability is the most critical part of its life cycle. As a confidential zero-day, a vulnerability is a high value asset used sparingly to attack high value targets. On the other hand, well known public vulnerabilities can be mitigated using standard security practices such as applying software updates diligently, or using a signature-based intrusion detection system (IDS). Bilge et al. [1] showed that the usage of a vulnerability's exploits in the wild increases by as much as five orders of magnitude when it transitions from a zero-day to a public vulnerability. A software patch is sometimes already available at disclosure, but its adoption may not be widespread. At this early stage the vulnerability is not understood well enough to author a proper signature rule for an IDS. All these factors contribute to making the disclosure a dangerous time, since a lot of systems are vulnerable in practice. We call these newly disclosed vulnerabilities, still in the critical part of their life cycle, one-day vulnerabilities. "One-day" should not be taken literally here: a vulnerability disclosure can be 72 hours old and still be at its most threatening period.

The vulnerability disclosure process is coordinated by the Common Vulnerabilities and Exposures (CVE) system overseen by the Mitre Corporation [2]. Newly disclosed vulnerabilities are first published on the CVE List data feed managed by Mitre. They are then forwarded to other security databases, such as NIST's NVD database [3] or SCAP data feeds [4], where they will eventually be annotated by multiple security experts. These annotations include metadata such as the affected software, as described by an entry from the Common Platform Enumeration (CPE) [5], as well as a Common Vulnerability Scoring System (CVSS) score and vector [6]. NIST security experts take at least a few days to analyze and annotate a vulnerability, and often weeks (see Section III-A). It is common to find vulnerabilities that have been disclosed for several days and are still not analyzed by NVD. For example CVE-2019-9084, disclosed on the CVE List on 06/07/2019, has no NVD analysis as of 06/11/2019. This delay means that in order to reliably analyze one-day vulnerabilities, one should not rely at all on enriched metadata provided by databases such as NVD. Instead one should focus on the data available when the vulnerability is first disclosed on Mitre's CVE List, which consists of three elements only: a unique CVE identifier, a free-form human readable description, and at least one public reference [7].

The vulnerability analysis ecosystem presented above makes it expensive for organizations to analyze one-day vulnerabilities at disclosure. On the one hand, achieving real-time threat evaluation of new vulnerabilities through manual analysis requires extensive manpower, as hundreds of vulnerabilities are disclosed daily. On the other hand, there is not enough machine-readable metadata available at disclosure for automated analysis. Real-time threat analysis is therefore prohibitively expensive for most organizations, although it would benefit them, as severe vulnerabilities such as Shellshock have been massively exploited within hours of their disclosure [8]. Automating real-time threat evaluation for newly disclosed vulnerabilities would make it affordable for more organizations. This would allow cloud service providers (CSP) and information systems to react in real-time to vulnerability disclosures. Examples of automated reactions include reconfiguring security policies by elevating logging levels for critical systems, switching these systems into degraded mode, or even shutting them down while waiting for a remediation to be applied. Such a reaction service could help the CSP protect both its internal systems and its tenants (the latter constituting a potential source of revenue for the CSP).

We propose an automated system that uses the free-form descriptions of newly disclosed one-day vulnerabilities to extract the most probable affected software from the description, and can do so in near real-time (at most seconds after the disclosure). Identifying which systems are vulnerable can be achieved by extracting relevant keywords from the free-form vulnerability description and forwarding them to an alert service monitoring specific keywords related to these systems (such as names of public software used in the system).
Our system associates CVE vulnerabilities to keywords extracted from past CPE URIs to quickly point out the most probable affected software. To the best of our knowledge this is the first attempt at doing so while relying only on the free-form description of vulnerabilities, without using their metadata.

In Section II we discuss related work and the real world challenges of working with vulnerability data. In Section III we present our approach. In Section IV we evaluate the accuracy of our proposed technique. We conclude in Section V.

Fig. 1: Overview of the vulnerability description processing pipeline. The text description of the analyzed vulnerability goes through word filtering (using the available CPE URIs), producing a set of unordered keywords; TF-IDF weighting (using the available vulnerability descriptions) produces an ordered list of weighted keywords; domain-specific heuristics produce an improved ordered list; keyword list truncation yields the final ordered list of weighted keywords.
II. RELATED WORK AND OPEN PROBLEMS

Most cloud providers offer Intrusion Prevention System (IPS) or Web Application Firewall (WAF) capabilities among their commercial offerings [9] [10] [11]. However, to the best of our knowledge, the process of monitoring new vulnerabilities and adding related rules is always done manually [12]. Extracting information and insights from the CVE corpus is not a new idea. Multiple works brought meaningful insights using statistical analyses of historical vulnerabilities in the NVD database. Frei et al. [13] found a statistical correlation between the availability of exploits and patches and the number of days since disclosure. Clark et al. [14] brought to light a "honeymoon effect" where more recent software is less subject to new vulnerabilities than older software, everything else being equal. Glanz et al. [15] attempted to automatically enrich the quality of the metadata in NVD by blending the existing metadata with textual analysis of the description. However their technique still requires the availability of existing metadata for the enriched vulnerability, while ours does not. The closest work to ours is by Jacobs et al. [16], who proposed the Exploit Prediction Scoring System (EPSS). Like our work, EPSS is meant to be used at vulnerability disclosure: they try to determine a new vulnerability's probability of exploitation in the next twelve months. They use a logistic regression model trained using both public and non-public data sources that can infer probabilities for new vulnerabilities using only public data sources. Our work does not require any non-public data source.

All of these studies point out inconsistent metadata as a major difficulty when working with the corpus. Being a theoretical database authored manually by security experts, CPE cannot map perfectly to actual software binaries and packages in a production system [17], [18]. This situation creates a lot of discrepancies when doing analysis, such as associating a CVE vulnerability to incorrect or nonexistent CPE entries. Moreover, even if this situation were solved and it were possible to fully map CVE to CPE, and CPE to actual software, this mapping would still be done manually by security experts. This is however impractical when dealing with one-day vulnerabilities because the analysis arrives too late in the vulnerability life cycle. Last, all these studies except [16] considered the NVD database as static historical data to be studied retroactively. To the best of our knowledge, our technique and evaluation protocol are the first to focus on disclosure-time analysis of new vulnerabilities, only considering vulnerabilities and metadata publicly available at disclosure.

III. OUR APPROACH

In this section we present our keyword extraction pipeline. Its input is the free-form description of a new vulnerability. It is analyzed using all vulnerability descriptions and metadata available at the time of disclosure. It outputs an ordered list of keywords, where each keyword is given a weight representing its estimated relevance. As an intuitive example, Table I presents a sample of vulnerabilities, including their free-form description and the corresponding keywords extracted by our analysis technique. We consider the explainability of automated analysis a paramount quality of security systems. Therefore we deliberately chose to avoid elaborate machine learning methods (such as deep learning) when their accuracy comes at the expense of explainability. A high-level overview of the proposed vulnerability analysis pipeline is shown in Figure 1. We now describe each stage of the pipeline in more detail, starting with our choice of data sources.

A. Data Sources Considerations

The CVE and CPE corpora are linked together using a metadata field called the CPE URI, defined in NIST-IR 7695 [19]. A CPE URI is a unique reference to a specific entry in the CPE database, i.e. a specific version of a piece of software. An example of the relationships between CVE, CPE URI and CPE entries can be found in Figure 2.
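To make the CPE URI structure concrete, here is a minimal sketch of splitting a CPE 2.3 URI into named fields. The field names and the naive colon split are our simplification (real CPE 2.3 strings may contain escaped colons), not code from the paper.

```python
# Field names follow the CPE 2.3 layout of NIST-IR 7695; the naive colon
# split is a simplification (real CPE 2.3 strings may escape ':').
CPE_FIELDS = ["prefix", "cpe_version", "part", "vendor", "product",
              "version", "update", "edition", "language", "sw_edition",
              "target_sw", "target_hw", "other"]

def parse_cpe_uri(uri: str) -> dict:
    """Split a CPE 2.3 URI into named components."""
    return dict(zip(CPE_FIELDS, uri.split(":")))

fields = parse_cpe_uri("cpe:2.3:a:apache:tomcat:7.0.28:*:*:*:*:*:*:*")
print(fields["vendor"], fields["product"], fields["version"])
# -> apache tomcat 7.0.28
```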
TABLE I: A sample of keyword extraction results, using our analysis pipeline (with all heuristics enabled and a keyword list truncation target of 95% of the norm).

CVE-2016-6808 (disclosed 04/12/2017): "Buffer overflow in Apache Tomcat Connectors (mod jk) before 1.2.42."
  Extracted keywords (weight): tomcat connectors (27.54), mod jk (12.41), connectors (11.01), tomcat (8.37)

CVE-2017-0155 (disclosed 04/12/2017): "The Graphics component in the kernel in Microsoft Windows Vista SP2; Windows Server 2008 SP2 and R2 SP1; and Windows 7 SP1 allows local users to gain privileges via a crafted application, aka "Windows Graphics Elevation of Privilege Vulnerability.""
  Extracted keywords (weight): windows server 2008 (18.46), windows vista (12.30), windows 7 (12.28), graphics (12.18), server 2008 (11.84), windows server (11.74), windows (8.70), r2 (5.95)

CVE-2015-3421 (disclosed 07/21/2017): "The eshop checkout function in checkout.php in the Wordpress Eshop plugin 6.3.11 and earlier does not validate variables in the "eshopcart" HTTP cookie, which allows remote attackers to perform cross-site scripting (XSS) attacks, or a path disclosure attack via crafted variables named after target PHP variables."
  Extracted keywords (weight): eshop plugin (26.29), eshop (19.10), checkout php (12.33), wordpress (5.71)

CVE-2015-5194 (disclosed 07/21/2017): "The log config command function in ntp parser.y in ntpd in NTP before 4.2.7p42 allows remote attackers to cause a denial of service (ntpd crash) via crafted logconfig commands."
  Extracted keywords (weight): ntp (14.69), y (5.07), parser (3.50), config (3.44)

Fig. 2: The relationships between the vulnerability CVE-2018-1336 and its associated CPE URI and entries. The vulnerability, its metadata, and the CPE entries have three different publication processes. The figure shows the description of CVE-2018-1336 (published at disclosure by Mitre on 08/02/2018): "An improper handing of overflow in the UTF-8 decoder with supplementary characters can lead to an infinite loop in the decoder causing a Denial of Service. Versions Affected: Apache Tomcat 9.0.0.M9 to 9.0.7, 8.5.0 to 8.5.30, 8.0.0.RC1 to 8.0.51, and 7.0.28 to 7.0.86." Its CPE URI cpe:2.3:a:apache:tomcat:7.0.28:*:*:*:*:*:*:* (published after security analysis by NVD) references zero or one CPE entry, here the entry Apache Software Foundation Tomcat 7.0.28 (published independently by NVD).

Fig. 3: Historical rate of software names used in vulnerabilities' CPE URIs that are missing from the CPE dictionary.

It is tempting to consider these corpora as two relational tables linked together using a foreign key. However we discovered two drawbacks of this approach. The first one is that the life cycles of the two databases are very different, resulting in an ever-increasing number of "dead" CPE URI entries referenced in vulnerability metadata that do not actually exist in the CPE database. Figure 3 illustrates the problem. While both databases were mostly kept consistent from 2011 to 2016, the inconsistencies grew substantially in 2017 and 2018, to the point that in 2018, 73.8% of the software names mentioned in CPE URIs included in vulnerability metadata were not present in the CPE dictionary. From 2007 to 2018, on average 66.3% of the software names are missing. We want to emphasize that this is an important problem that, if left unchecked, will greatly decrease the relevance and real world usefulness of the CPE dictionary. A second drawback is the lack of a date-of-inclusion field for CPE entries in the CPE database. While this would have no impact in a production system, it prevents us from properly evaluating our results using a journaled view of the corpus. In order to properly simulate an analysis at disclosure time, we want to only consider CVE and CPE entries published "in the past" relative to the disclosure date of the analyzed vulnerability.

Fig. 4: When discarding the CPE dictionary we get a more robust data life cycle while retaining most of the inherent data. The figure shows the CPE URI of CVE-2018-1336, cpe:2.3:a:apache:tomcat:7.0.28:*:*:*:*:*:*:*, decomposed into its 'vendor' (apache), 'product' (tomcat), and 'version' (7.0.28) fields.

Fig. 5: Number of days between vulnerability disclosure and analysis in NVD from 2007 to 2018.

We solved both problems by discarding the CPE database completely and instead using the data embedded in the fields of the CPE URIs, as described in Figure 4. As CPE URIs are part of a CVE vulnerability's metadata, we can reuse the date-of-publication field of the vulnerability for the included CPE URIs. However, as we saw in Section I, metadata is not published at disclosure time, but authored by security experts several days after. Figure 5 shows the number of days between vulnerability disclosure and analysis publication in NVD from 2007 to 2018. Historically the median analysis duration has been zero days while the 9th decile has been two days. While this remained true until 2016 (for the median) and 2012 (for the 9th decile), there have been sharp drops in NVD analysis timeliness since then. In 2018 the median and 9th decile analysis duration reached 35 days and 63 days respectively. Therefore we decided to set a fixed metadata publication delay for all vulnerabilities, which we fixed at sixty days. This means that when a vulnerability is disclosed on day N, we consider that its metadata will be published on day N + 60. Conversely, when analyzing a vulnerability disclosed on day N, we have access to all vulnerability descriptions up to day N and to all vulnerability metadata up to day N − 60. The choice of 60 days ensures analysis conditions that are overall realistic (albeit simplified) but strictly worse than any recorded median case, and close to the worst recorded 9th decile. Therefore if our analysis technique performs well during evaluation, we can be highly confident that it will perform as well or better in the real world. Figure 6 shows a simplified example of how time impacts the data available when analyzing vulnerabilities at disclosure.
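As an illustration of this journaled view, the following sketch (our own, built around a hypothetical Vuln record type, not the paper's code) selects the data visible on a given analysis day under the fixed 60-day metadata publication delay.

```python
# A minimal sketch of the fixed metadata publication delay. The Vuln record
# type and its fields are assumptions for illustration only.
from dataclasses import dataclass, field
from datetime import date, timedelta

METADATA_DELAY = timedelta(days=60)

@dataclass
class Vuln:
    cve_id: str
    disclosed: date
    description: str
    cpe_uris: list = field(default_factory=list)  # attached later by NVD

def available_data(corpus: list, analysis_day: date):
    """Return (descriptions, metadata) visible when analyzing on analysis_day.

    Descriptions are visible from day N; metadata only from day N + 60, so
    we only trust metadata of vulnerabilities disclosed before N - 60.
    """
    descriptions = [v for v in corpus if v.disclosed <= analysis_day]
    metadata = [v for v in corpus
                if v.disclosed <= analysis_day - METADATA_DELAY]
    return descriptions, metadata
```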
TABLE II: Description and extracted keyword set for CVE-2016-10096, a vulnerability disclosed on 01/01/2017. The filtering list included all CPE URIs published between 01/01/2007 and 12/24/2016.

Description: "SQL injection vulnerability in register.php in GeniXCMS before 1.0.0 allows remote attackers to execute arbitrary SQL commands via the activation parameter."
Keyword set (in alphabetical order): activation, before, commands, genixcms, in, parameter, php, register, remote, sql, the, to, via, vulnerability

Fig. 6: In this example, when analyzing the text description of vulnerability V3 at disclosure time we have access to the text descriptions for V1 and V2 and to the metadata for V1, but not the metadata for V2 or V3.
B. Word filtering

All available CPE URIs (considering the metadata publication delay) are parsed into a keyword filter list. The fields extracted from each CPE URI are the software vendor, software product, and target software. After extraction all these fields are tokenized into individual words. When analyzing vulnerability descriptions, only words belonging to this filter list are considered; the others are discarded. This filtering is a trade-off: it vastly reduces analysis noise but may filter out some relevant information. Specifically, a new vulnerability affecting a never-seen-before software product will not have any relevant CPE URI in the available historical data, therefore the name of the product will be filtered out. However, we argue this is the right trade-off in our context, as our goal is to feed keyword alerts to a security monitoring system: it is probably more relevant to output an alert because we did not find any highly relevant keyword than to output an alert on a keyword that has never been seen before and is probably not monitored. This filtering gives us a set of keywords for every vulnerability. However this set is unordered and contains a lot of irrelevant keywords, as illustrated in Table II. As we can see, the filtering list includes very common words such as "before" that are not related to the present vulnerability ("before" was added to the filtering list through vulnerability CVE-2011-5107, affecting the Wordpress plugin Alert Before You Post). It is clear that the mere presence of a keyword in a vulnerability description is not enough to assess its relevance for the given vulnerability.
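A minimal sketch of how this filtering stage could look; the tokenization rule and the CPE field indices are our assumptions, not the paper's code.

```python
import re

def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumeric characters."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def build_filter_list(cpe_uris: list) -> set:
    """Vocabulary built from the vendor, product, and target software fields."""
    vocabulary = set()
    for uri in cpe_uris:
        parts = uri.split(":")
        if len(parts) < 13:  # skip malformed URIs
            continue
        # CPE 2.3 layout: cpe:2.3:part:vendor:product:version:update:
        #                 edition:language:sw_edition:target_sw:target_hw:other
        for field in (parts[3], parts[4], parts[10]):
            if field and field != "*":
                vocabulary.update(tokenize(field))
    return vocabulary

def filter_description(description: str, vocabulary: set) -> set:
    """Unordered keyword set: description words present in the vocabulary."""
    return {w for w in tokenize(description) if w in vocabulary}
```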
TABLE III: Keyword order and weight for CVE-2016-10096, after TF-IDF weighting (left) and heuristics (right). The TF-IDF corpus included all vulnerabilities disclosed between 01/01/2007 and 01/01/2017. The capitalization heuristic doubled the scores of "genixcms" and "sql" (spelled "GeniXCMS" and "SQL" in the description).

After TF-IDF weighting          After heuristics
Keyword        Weight           Keyword        Weight
genixcms       6.36             genixcms       12.71
activation     5.24             sql            5.46
register       3.40             activation     5.24
sql            2.73             register       3.40
commands       1.62             commands       1.62
parameter      1.22             parameter      1.22
php            1.19             php            1.19
vulnerability  0.57             vulnerability  0.57
before         0.56             before         0.56
the            0.21             the            0.21
in             0.18             in             0.18
remote         0.18             remote         0.18
via            0.12             via            0.12
to             0.01             to             0.01

C. TF-IDF Weighting

Instead of treating the presence of a keyword in a vulnerability as a binary event, we weight each keyword by the term frequency-inverse document frequency (TF-IDF) [20] value of the word, in the context of the CVE corpus. TF-IDF is a numerical statistic reflecting the importance of a word to a document, in the context of a corpus. In our context, we consider the set of keywords extracted from a CVE description as an individual document, and the set of these sets as a corpus. TF-IDF is formally defined in Equation 1:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D),   (1)

where t is a word and d is a document belonging to a corpus of documents D. TF is based on the number of occurrences of the word in the document. Several formulas have been proposed, such as using the raw count directly or treating the presence or absence of a word as a boolean event. We chose to use the logarithmic version [21] of TF, defined in Equation 2, as it better reflects the diminishing returns of repeating the same term several times:

TF(t, d) = log(1 + |t ∈ d|),   (2)

where |t ∈ d| is the number of occurrences of t in d. IDF is defined in Equation 3:

IDF(t, D) = log(|D| / |d ∈ D : t ∈ d|),   (3)

where |D| is the number of documents in the corpus, and |d ∈ D : t ∈ d| is the number of documents of the corpus containing the word t. TF-IDF therefore allows more specific words to have a bigger impact on the mapping than common words. As an intuitive example, let us consider an actual software product named IBM Tivoli Service Request Manager. The word "Tivoli" is much more specific than "Request", therefore its weight should be higher. Every keyword in the set is now weighted, which allows us to order them by relevance. The left side of Table III gives an example of such weighting.
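The weighting stage then reduces to a direct implementation of Equations 1-3. The sketch below is ours, not the paper's code; each document is the word list of one filtered vulnerability description.

```python
# A minimal sketch of logarithmic TF-IDF over the CVE corpus, where a
# document is the list of filtered words of one vulnerability description.
import math

def tf(term: str, doc: list) -> float:
    """Logarithmic term frequency of Equation 2: log(1 + raw count)."""
    return math.log(1 + doc.count(term))

def idf(term: str, corpus: list) -> float:
    """Equation 3: log(|D| / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def weight_keywords(keywords: set, doc: list, corpus: list) -> list:
    """Equation 1 applied per keyword, sorted by decreasing weight."""
    pairs = [(kw, tf(kw, doc) * idf(kw, corpus)) for kw in keywords]
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```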
D. Domain-specific Heuristics

A number of additional heuristics can be applied to the existing ordering to improve it even further. In this section we propose three of them, detailed below. We want to emphasize that an important part of our contribution is to give security experts the ability to formulate this kind of heuristic and reliably evaluate its accuracy. While these heuristics are very simple and domain-specific, we show in Section IV that each of them increases the accuracy of the analysis.

Multiple-words keywords. When a CPE URI field contains entries that are multiple words long, such as "linux kernel", this heuristic treats "linux", "kernel", and "linux kernel" as three different individual terms with individual TF-IDF values. Therefore this heuristic allows the existence of keywords that are actually multiple words long. A match on "linux kernel" is considered more relevant than two separate matches on "linux" and "kernel", so this heuristic also multiplies the score of a keyword linearly by the number of words it is made of.

Capitalized words. We observed that software names in vulnerability descriptions are often capitalized. This heuristic doubles the score of every keyword that is capitalized in the vulnerability description. The software industry has a somewhat peculiar grasp of English capitalization rules, so this heuristic is triggered by any capitalized letter in a word and not just the first one: "iPhone" or "openSUSE" are considered capitalized.

Words starting with "lib". We empirically observed that words starting with "lib" that are not "library" are rare in English but are commonly used as software names (libxml2, libssh, libpng, etc.). This heuristic doubles the score of every keyword starting with "lib" that is not "library".

The right side of Table III gives an example of how applying all three heuristics alters the weights and order of keywords. We evaluate these heuristics in Section IV.
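A sketch of how the three heuristics could be combined on top of the TF-IDF weights. This is our reading, not the paper's code; in particular, requiring every word of a multi-word keyword to appear capitalized is our assumption, as the paper does not specify this case.

```python
import re

def apply_heuristics(weighted: dict, description: str) -> dict:
    """Re-weight TF-IDF-scored keywords using the three domain heuristics."""
    tokens = re.split(r"[^A-Za-z0-9]+", description)
    # Lowercased tokens that appear with at least one capital letter
    # anywhere in the word ("iPhone" and "openSUSE" count as capitalized).
    capitalized = {t.lower() for t in tokens if t != t.lower()}
    adjusted = {}
    for keyword, weight in weighted.items():
        words = keyword.split()
        # Multiple-words heuristic: scale linearly with the word count.
        weight *= len(words)
        # Capitalization heuristic: double the score if the keyword appears
        # capitalized in the description (assumption: every word of a
        # multi-word keyword must be capitalized).
        if all(w in capitalized for w in words):
            weight *= 2
        # "lib" heuristic: double lib-prefixed words that are not "library".
        if any(w.startswith("lib") and w != "library" for w in words):
            weight *= 2
        adjusted[keyword] = weight
    return adjusted
```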
E. Keyword List Truncation

The list of keywords extracted from a vulnerability can be long. Assuming the analysis did a good job of sorting the relevant keywords first, the end of the list is empty of meaningful information. It is therefore desirable to truncate the list and only keep the beginning, as this operation retains most of the information while removing most of the noise. However it is not straightforward to decide where to cut. Keyword list lengths vary greatly (from 0 to 196 keywords in our evaluation dataset), and so does the number of relevant keywords inside them. This makes a static truncation threshold inappropriate, as it could remove too much information or retain too much noise. Instead we propose a dynamic truncation scheme based on the euclidean norm. At the truncation step of the pipeline, we view the untruncated weighted keyword list as a euclidean vector and compute its norm. We then compute a truncation budget by defining a target for the truncated norm, such as staying above 95% of the untruncated norm. Because most of the norm of the vector comes from the most relevant keywords, it is possible to cut out most irrelevant keywords while staying under budget. Table IV gives a practical example of such a truncation. We evaluate experimentally the impact of keyword list truncation in Section IV-D.

TABLE IV: Untruncated and truncated weighted keyword lists for CVE-2016-10096, with a target norm of 95% for the truncated list.

Untruncated keyword list        Truncated keyword list (target norm = 95%)
Keyword        Weight           Keyword        Weight
genixcms       12.71            genixcms       12.71
sql            5.46             sql            5.46
activation     5.24             activation     5.24
register       3.40
commands       1.62
parameter      1.22
php            1.19
vulnerability  0.57
before         0.56
the            0.21
in             0.18
remote         0.18
via            0.12
to             0.01
Norm           15.38            Norm           14.79
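The truncation step can be sketched as follows (our code, not the paper's). With the weights of Table IV it keeps exactly the top three keywords: their norm of 14.79 already exceeds 95% of the full norm of 15.38.

```python
# A minimal sketch of norm-based truncation: keep the shortest prefix of
# the ranked keyword list whose euclidean norm reaches `target` times the
# norm of the full list.
import math

def truncate(ranked: list, target: float = 0.95) -> list:
    """ranked: (keyword, weight) pairs in decreasing weight order."""
    full_norm = math.sqrt(sum(w * w for _, w in ranked))
    budget = target * full_norm
    kept, squared = [], 0.0
    for keyword, weight in ranked:
        if math.sqrt(squared) >= budget:
            break
        kept.append((keyword, weight))
        squared += weight * weight
    return kept
```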
IV. EVALUATION

A. Experimental Setup

We analyzed all 31156 CVE vulnerabilities disclosed between January 1st, 2017 and January 1st, 2019, using all 57640 CVE vulnerabilities disclosed between January 1st, 2007 and December 31st, 2016 as past historical data. This experimental setup simulates the behavior of a production system put online on January 1st, 2017, initially fed with historical information from the ten years before, which then monitored all newly disclosed vulnerabilities continuously for the next two years. Each vulnerability was analyzed using the data available on its disclosure day only, as described in Figure 6. As discussed in Section III-A, we chose a metadata publication delay of sixty days. Any non-zero metadata publication delay implies the actual metadata of a vulnerability, including its CPE URIs, is not available during analysis. We can therefore use the CPE URIs from the vulnerability metadata as a ground truth for evaluation. We propose two metrics to evaluate the quality of the ordered list of keywords: the position of the first relevant keyword, and the number of keywords necessary to reconstruct the software name. We evaluate our solution using the first metric in Section IV-B and using the second in Section IV-C.

B. Position of the First Relevant Keyword

As our goal is to identify the software affected by a vulnerability, we formally define a relevant keyword as a substring of any software name or software vendor field present in the CPE URIs included in this vulnerability's metadata. As our analysis gives us a sorted list of keywords, we can expect relevant keywords to be placed at the beginning of the list, before irrelevant ones. Therefore the position of the first relevant keyword is directly tied to the usability of the results. It should be emphasized that the number of CPE URIs included in a vulnerability's metadata can vary greatly, as well as the reasons these CPE URIs were included in the first place. Usually at least one CPE URI is included as a machine-readable summary of the affected software described in the text description of the vulnerability. However security analysts sometimes include additional CPE URIs for software absent from the text description (but still relevant for the vulnerability) to provide context after the fact. This metric sidesteps the problem of choosing the most appropriate CPE URI as an evaluation ground truth and instead focuses on extracting the relevant information actually present in the vulnerability text description.

Fig. 7: First relevant keyword position.

The experimental results for this metric are described in Figure 7, for the base TF-IDF weighting, each heuristic described in Section III-D, and all heuristics combined. We can see that in all cases at least 70% of vulnerabilities have a relevant keyword in the top three keywords of their ordered list, at least 80% of vulnerabilities have a relevant keyword in their top five keywords, and 90% of vulnerabilities have a relevant keyword in the top ten. How should these scores be interpreted? In a control trial we randomly sampled 200 vulnerabilities and asked a security expert to guess the software product(s) affected by a vulnerability from the top three keywords, without reading the vulnerability description. He gave 161 (80%) correct answers, which is very close to our metric's score (81%) in the same configuration (all heuristics combined). 3% of the vulnerabilities have no relevant keyword at all. For these vulnerabilities there is not a single common word between the description and the CPE URIs of the vulnerability, creating a plateau for the metric. We randomly sampled 20 of these vulnerabilities to find out the reason for the absence of keyword matching. In 18 cases the vulnerability disclosure concerned a software product that had never been seen before at the time. In the two other cases, the software was seen only one day and four days before, a time period within the metadata delay we analyzed in Section III-A. In all cases this leads to the CPE index not being populated with the proper software name, which is then filtered out at the keyword extraction stage. Examples of such vulnerabilities are CVE-2016-1132 (first vulnerability disclosed for the Shoplat iOS application) and CVE-2016-1198 (Photopt Android application). Regarding individual heuristics, we can see that our capitalization heuristic brings a substantial accuracy increase compared to the base TF-IDF weighting: the number of vulnerabilities with a relevant keyword at position 3 or below is increased from 76% to 86%. The multiple-words heuristic decreases the accuracy under this metric. The reason is that multiple-words keywords are aggressively pushed to the beginning of the list, most of the time in front of single-word keywords. When the software and vendor names are only one word long, they might lose one rank because of an irrelevant multiple-words keyword. However the reconstruction metric sheds a different light on this heuristic's accuracy, as discussed in the next section. The lib heuristic, while being strictly superior to the base TF-IDF weighting, provides such insignificant gains that it probably doesn't justify its maintenance cost. All heuristics combined provide a measurable improvement over the base TF-IDF weighting without heuristics.
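A sketch of how this metric could be computed; the lowercase substring test is our simplification of the paper's relevance definition, not its actual evaluation code.

```python
# A minimal sketch of the first-relevant-keyword metric: a keyword is
# relevant if it is a substring of any vendor or product field of the
# vulnerability's ground-truth CPE URIs.
def first_relevant_position(keywords: list, truth_fields: list):
    """keywords: ranked keyword strings; truth_fields: vendor/product strings.

    Returns the 1-based rank of the first relevant keyword, or None if the
    list contains no relevant keyword (the 3% plateau in Figure 7).
    """
    targets = [f.lower() for f in truth_fields]
    for rank, keyword in enumerate(keywords, start=1):
        if any(keyword.lower() in t for t in targets):
            return rank
    return None
```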
C. Number of Keywords Necessary for Software Name Reconstruction

Our second metric measures our ability to fully reconstruct a software name using the smallest number of keywords. Formally, this means finding a permutation of a subset of keywords equal to a full software name string from a CPE URI of the vulnerability, then measuring the highest keyword position in this group. As an intuitive example, if we want to reconstruct the software name "linux kernel" and our keyword list is, in order, "kernel", "overflow", "linux", and "buffer", we can reconstruct the software name using the first 3 keywords (disregarding "overflow"). This metric is strictly more difficult than the previous one, as we now want to reconstruct full strings instead of substrings, and we are focusing on the software name only and not the software vendor. However it is also more indicative of real world usefulness, as reconstructing a complete software name provides more useful information than finding a substring of it.

Fig. 8: Number of keywords necessary for name reconstruction.

The experimental results for this metric are described in Figure 8. As expected, reconstructing a full software name is more difficult than finding a relevant keyword. Using the base TF-IDF weighting, a software name can be reconstructed using the first three keywords only 42% of the time. 27% of the vulnerabilities do not have enough keywords in their description to reconstruct a software name at all. This leads to a lower plateau for the metric compared to first relevant keyword position. We sampled 20 of these vulnerabilities at random to investigate the cause of reconstruction failure. In 14 cases the affected software had never been seen before, leading to the same problem as described in Section IV-B. In the six other cases the software name is worded differently in the vulnerability description and the associated CPE URIs. As an example, CVE-2017-3814's description describes a vulnerability affecting the software Cisco Firepower System Software while the associated CPE URIs reference Cisco Firepower Management Center. One of these six cases, while technically a wording problem, can be attributed to excessive strictness in our parsing logic. Regarding individual heuristics, the multiple-words heuristic now provides the biggest improvement. This makes sense, as having multiple-words keywords provides opportunities to drastically shorten the reconstruction of multiple-words software names. For instance, in a single-word setup, the software name "linux kernel" takes at least two keywords to be reconstructed ("linux" and "kernel"), while it can be fully reconstructed with a single multiple-words keyword ("linux kernel"). The capitalization heuristic again brings a substantial improvement under this metric. This time again the lib heuristic brings a very small improvement, such that its maintenance cost is probably not justified. All heuristics combined yield the best accuracy of all configurations. Using this setup we can reconstruct the full name of an affected software product in 9 keywords or fewer for 71% of the vulnerabilities in the evaluation dataset.
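A sketch of this metric using a word-cover approximation of the paper's permutation criterion (our simplification, not the paper's evaluation code); it returns 3 on the "linux kernel" example above.

```python
# A minimal sketch of the reconstruction metric: the smallest ranked prefix
# whose keywords jointly cover every word of the software name, measured as
# the highest rank used.
def reconstruction_cost(keywords: list, software_name: str):
    """keywords: ranked keyword strings. Returns the number of keywords
    needed to cover software_name, or None if reconstruction fails."""
    needed = set(software_name.lower().split())
    for rank, keyword in enumerate(keywords, start=1):
        # A multi-word keyword such as "linux kernel" may cover several
        # words of the target name at once.
        needed -= set(keyword.lower().split())
        if not needed:
            return rank
    return None

# Example from the paper: reconstructing "linux kernel" from
# ["kernel", "overflow", "linux", "buffer"] returns 3.
```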
D. Keyword List Truncation Evaluation

In this section we evaluate two effects of the truncation step: the keyword list length reduction ratio and a possible accuracy loss due to excessive truncation. All our evaluations were done with a truncation target of preserving at least 95% of the original norm. Figure 9 shows the distribution of keyword list lengths before and after truncation. The median untruncated keyword list length is between 23 and 24 words, while the median truncated keyword list length is between 8 and 9 words. The average reduction ratio is 3.22. We can conclude that keyword truncation has a substantial effect on keyword list length and is particularly effective at bringing keyword lists down to sizes more easily readable by humans. Does this reduction have an impact on the keyword list accuracy? Figures 10 and 11 show the impact of truncation on the two accuracy metrics studied before. We can see that while the effects of truncation are negligible on most vulnerabilities, the relevant keywords of a few hard-to-analyze vulnerabilities are lost during keyword truncation. 1446 vulnerabilities (4.64%) went from having a low-ranking first relevant keyword to having no relevant keyword at all. 649 vulnerabilities (2.08%) had a software name that could be reconstructed before the truncation (albeit with difficulty) but not after. The proper trade-off between truncation and accuracy probably depends on the nature of the keyword consumers downstream. Humans might prefer shortened keyword lists, as reading a 23-word-long keyword list is probably less convenient than reading the actual vulnerability description. Meanwhile machine monitoring systems might or might not prefer untruncated lists, depending on their ability to properly handle keyword noise and detect weak signals in low-ranked keywords. In either case it is not straightforward to make good use of a relevant keyword at rank #20 or #25 when all preceding keywords have been irrelevant, which makes a good case for truncation.

Fig. 9: Distribution of keyword list lengths before and after truncation with a target of preserving at least 95% of the norm.

Fig. 10: Impact of keyword truncation on first relevant keyword position (truncated norm target = 95%).

Fig. 11: Impact of keyword truncation on number of keywords necessary for name reconstruction (truncated norm target = 95%).

E. Performance of the Analysis Pipeline

While performance was not a major concern for us at this stage, analyzing a day's worth of vulnerability historical data takes under a second on a commodity laptop with 16 GB of RAM and an Intel Core i7-7600U CPU @ 2.80 GHz, making the pipeline suitable for near real-time analysis at disclosure. Indexing ten years of historical data, which would be a one-time operation on a production system, takes between 5 and 30 seconds on the same hardware, depending on the heuristics used. The multiple-words heuristic creates more keywords to index, increasing the indexing time when it is activated. This fast turnaround enables a security expert to easily formulate a new heuristic hypothesis, quickly reindex the full historical dataset using the new heuristic, and get a prompt evaluation of how the heuristic increases or decreases the accuracy of the analysis. All the code and data used for our experiment are available at [22].
V. CONCLUSION

We introduced a method to automatically extract from a CVE vulnerability the most relevant keywords with regard to the affected software, relying only on its human readable description. Our results are promising, as a simple technique brings results that are accurate enough to be useful in the real world. As discussed in Section I, our keyword extraction technique is the first step toward automated reaction to new vulnerability disclosures.

In future work we intend to use this keyword extraction technique to build a complete threat analysis system at disclosure. The goal is to assess automatically, at disclosure time, how much of a threat a vulnerability poses to a given information system. We can consider the final weighted keyword list (truncated or not) as a euclidean vector, which makes the euclidean distance between two analyses an interesting similarity metric between two vulnerabilities. This would help assess the threat resulting from a new vulnerability by comparing it to older, annotated vulnerabilities, providing an immediate automated risk analysis mechanism for one-day vulnerabilities.

The automated risk analysis and reaction mechanisms made possible by our technique could become invaluable tools for security engineers defending cloud infrastructures and information systems against day-to-day threats.

ACKNOWLEDGMENT

Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

REFERENCES

[1] L. Bilge and T. Dumitraş, "Before We Knew It: An Empirical Study of Zero-day Attacks in the Real World," in Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS '12). New York, NY, USA: ACM, 2012, pp. 833–844.
[2] Common Vulnerabilities and Exposures (CVE). https://cve.mitre.org/.
[3] National Vulnerability Database. https://nvd.nist.gov/.
[4] Security Content Automation Protocol. https://csrc.nist.gov/projects/security-content-automation-protocol.
[5] NVD - CPE. https://nvd.nist.gov/products/cpe.
[6] Common Vulnerability Scoring System. https://www.first.org/cvss/.
[7] CVE and NVD Relationship. https://cve.mitre.org/about/cve_and_nvd_relationship.html.
[8] Cloudflare - Inside Shellshock: How hackers are using it to exploit systems. https://blog.cloudflare.com/inside-shellshock/.
[9] AWS WAF - Web Application Firewall. https://aws.amazon.com/waf/.
[10] Google Cloud Armor. https://cloud.google.com/armor/.
[11] Cloudflare Web Application Firewall. https://www.cloudflare.com/waf/.
[12] Cloudflare - Stopping SharePoint's CVE-2019-0604. https://blog.cloudflare.com/stopping-cve-2019-0604/.
[13] S. Frei, M. May, U. Fiedler, and B. Plattner, "Large-scale Vulnerability Analysis," in Proceedings of the 2006 SIGCOMM Workshop on Large-scale Attack Defense (LSAD '06). New York, NY, USA: ACM, 2006, pp. 131–138.
[14] S. Clark, S. Frei, M. Blaze, and J. Smith, "Familiarity Breeds Contempt: The Honeymoon Effect and the Role of Legacy Code in Zero-day Vulnerabilities," in Proceedings of the 26th Annual Computer Security Applications Conference (ACSAC '10). New York, NY, USA: ACM, 2010, pp. 251–260.
[15] L. Glanz, S. Schmidt, S. Wollny, and B. Hermann, "A Vulnerability's Lifetime: Enhancing Version Information in CVE Databases," in Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business (i-KNOW '15). New York, NY, USA: ACM, 2015.
[16] J. Jacobs, S. Romanosky, B. Edwards, M. Roytman, and I. Adjerid, "Exploit Prediction Scoring System (EPSS)," in Black Hat 2019, 2019. [Online]. Available: http://i.blackhat.com/USA-19/Thursday/us-19-Roytman-Predictive-Vulnerability-Scoring-System-wp.pdf
[17] A. Dulaunoy. (2016) The Myth of Software and Hardware Vulnerability Management. https://www.foo.be/2016/05/The_Myth_of_Vulnerability_Management/.
[18] L. A. B. Sanguino and R. Uetz, "Software Vulnerability Analysis Using CPE and CVE," CoRR, vol. abs/1705.05347, 2017.
[19] NIST IR 7695 - Common Platform Enumeration: Naming Specification Version 2.3. http://csrc.nist.gov/publications/nistir/ir7695/NISTIR-7695-CPE-Naming.pdf.
[20] K. S. Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, pp. 11–21, 1972.
[21] Term Frequency - Inverse Document Frequency statistics. https://jmotif.github.io/sax-vsm_site/morea/algorithm/TFIDF.html.
[22] Firres. https://gitlab.inria.fr/celbaz/firres_noms.
