
spaCy Cheat Sheet
Python For Data Science

Learn spaCy online at www.DataCamp.com

spaCy is a free, open-source library for advanced Natural Language
Processing (NLP) in Python. It's designed specifically for production use and
helps you build applications that process and "understand" large volumes
of text. Documentation: spacy.io

$ pip install spacy
>>> import spacy


> Spans

Accessing spans

Span indices are exclusive. So doc[2:4] is a span starting at token 2, up to – but not including! – token 4.

>>> doc = nlp("This is a text")
>>> span = doc[2:4]
>>> span.text
'a text'

Creating a span manually

>>> from spacy.tokens import Span        # Import the Span object
>>> doc = nlp("I live in New York")      # Create a Doc object
>>> span = Span(doc, 3, 5, label="GPE")  # Span for "New York" with label GPE (geopolitical)
>>> span.text
'New York'
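A span can also be created directly from character offsets with doc.char_span; a minimal sketch (the offsets are hand-counted for this example text, and the call returns None if they don't line up with token boundaries):

>>> doc = nlp("I live in New York")
>>> span = doc.char_span(10, 18, label="GPE")  # Characters 10 to 18 cover "New York"
>>> span.text
'New York'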

> Visualizing

If you're in a Jupyter notebook, use displacy.render. Otherwise, use
displacy.serve to start a web server and show the visualization in your browser.

>>> from spacy import displacy

Visualize dependencies

>>> doc = nlp("This is a sentence")
>>> displacy.render(doc, style="dep")

Visualize named entities

>>> doc = nlp("Larry Page founded Google")
>>> displacy.render(doc, style="ent")
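Outside a notebook, a minimal sketch of the displacy.serve variant mentioned above (it blocks and serves the visualization on a local port until you stop it):

>>> displacy.serve(doc, style="ent")  # Starts a local web server and prints the URL to open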

> Statistical models

Predict part-of-speech tags, dependency labels, named entities
and more. See here for available models: spacy.io/models

Download statistical models

$ python -m spacy download en_core_web_sm

Check that your installed models are up to date

$ python -m spacy validate

Loading statistical models

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")  # Load the installed model "en_core_web_sm"

> Linguistic features

Attributes return label IDs. For string labels, use the attributes with an underscore. For example, token.pos_ .

Part-of-speech tags (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.pos_ for token in doc]  # Coarse-grained part-of-speech tags
['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
>>> [token.tag_ for token in doc]  # Fine-grained part-of-speech tags
['DT', 'VBZ', 'DT', 'NN', '.']

Syntactic dependencies (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.dep_ for token in doc]  # Dependency labels
['nsubj', 'ROOT', 'det', 'attr', 'punct']
>>> [token.head.text for token in doc]  # Syntactic head token (governor)
['is', 'is', 'text', 'is', 'is']

Named entities (predicted by statistical model)

>>> doc = nlp("Larry Page founded Google")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # Text and label of named entity span
[('Larry Page', 'PERSON'), ('Google', 'ORG')]
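Entity spans also carry character offsets, which is useful for highlighting the original text; a small sketch, assuming the model predicts the same entities as above:

>>> [(ent.text, ent.start_char, ent.end_char) for ent in doc.ents]  # Character offsets of each entity
[('Larry Page', 0, 10), ('Google', 19, 25)]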

> Word vectors and similarity

To use word vectors, you need to install the larger models ending in md or lg, for example en_core_web_lg.

Comparing similarity

>>> doc1 = nlp("I like cats")
>>> doc2 = nlp("I like dogs")
>>> doc1.similarity(doc2)          # Compare 2 documents
>>> doc1[2].similarity(doc2[2])    # Compare 2 tokens
>>> doc1[0].similarity(doc2[1:3])  # Compare tokens and spans

Accessing word vectors

>>> doc = nlp("I like cats")
>>> doc[2].vector       # Vector as a numpy array
>>> doc[2].vector_norm  # The L2 norm of the token's vector
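To check whether a vector is actually available for a token, a quick sketch (assuming an md or lg model is loaded):

>>> doc[2].has_vector  # True if the token has a word vector attached
True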

> Documents and tokens

Processing text

Processing text with the nlp object returns a Doc object that holds all
information about the tokens, their linguistic features and their relationships.

>>> doc = nlp("This is a text")
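For many texts, processing them as a stream with nlp.pipe is usually faster than calling nlp on each one; a minimal sketch (the example texts are arbitrary):

>>> texts = ["This is a text", "This is another text"]
>>> docs = list(nlp.pipe(texts))  # Process a stream of texts and collect the Doc objects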


Accessing token attributes

>>> doc = nlp("This is a text")
>>> [token.text for token in doc]  # Token texts
['This', 'is', 'a', 'text']
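Tokens expose many more attributes than text; a small selection (not exhaustive):

>>> [token.i for token in doc]         # Index of the token within the Doc
[0, 1, 2, 3]
>>> [token.is_punct for token in doc]  # Is the token punctuation?
[False, False, False, False]
>>> [token.like_num for token in doc]  # Does the token resemble a number?
[False, False, False, False]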

> Syntax iterators

Sentences (usually needs the dependency parser)

>>> doc = nlp("This is a sentence. This is another one.")
>>> [sent.text for sent in doc.sents]  # doc.sents is a generator that yields sentence spans
['This is a sentence.', 'This is another one.']

Base noun phrases (needs the tagger and parser)

>>> doc = nlp("I have a red car")
>>> [chunk.text for chunk in doc.noun_chunks]  # doc.noun_chunks is a generator that yields spans
['I', 'a red car']

> Pipeline components

Functions that take a Doc object, modify it and return it.

Pipeline information

>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]

Custom components

# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

Components can be added first, last (default), or before or after an existing component.
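For example, a sketch of placing a component relative to an existing one (shown as an alternative to the first=True call above):

nlp.add_pipe(custom_component, before="ner")  # Insert the component right before the named entity recognizer
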
> Extension attributes

Custom attributes that are registered on the global Doc, Token and Span classes and become available as ._ .

>>> from spacy.tokens import Doc, Token, Span
>>> doc = nlp("The sky over New York is blue")

Attribute extensions (with default value)

# Register custom attribute on Token class
>>> Token.set_extension("is_color", default=False)

# Overwrite extension attribute with default value
>>> doc[6]._.is_color = True

Property extensions (with getter and setter)

# Register custom attribute on Doc class
>>> get_reversed = lambda doc: doc.text[::-1]
>>> Doc.set_extension("reversed", getter=get_reversed)

# Compute value of extension attribute with getter
>>> doc._.reversed
'eulb si kroY weN revo yks ehT'

Method extensions (callable method)

# Register custom attribute on Span class
>>> has_label = lambda span, label: span.label_ == label
>>> Span.set_extension("has_label", method=has_label)

# Compute value of extension attribute with method
>>> doc[3:5]._.has_label("GPE")
True

> Rule-based matching

Using the matcher

# Matcher is initialized with the shared vocab
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)

# Each dict represents one token and its attributes
>>> pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

# Add with ID, optional callback and pattern(s)
>>> matcher.add("CITIES", None, pattern)

# Match by calling the matcher on a Doc object
>>> doc = nlp("I live in New York")
>>> matches = matcher(doc)

# Matches are (match_id, start, end) tuples
>>> for match_id, start, end in matches:
...     # Get the matched span by slicing the Doc
...     span = doc[start:end]
...     print(span.text)
'New York'

Token patterns

# "love cats", "loving cats", "loved cats"
>>> pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]

# "10 people", "twenty people"
>>> pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]

# "book", "a cat", "the sea" (noun + optional article)
>>> pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

Operators and quantifiers

Can be added to a token dict as the "OP" key.

!   Negate pattern and match exactly 0 times
?   Make pattern optional and match 0 or 1 times
+   Require pattern to match 1 or more times
*   Allow pattern to match 0 or more times
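For example, a sketch of a pattern using the "+" operator to match one or more adjectives before a noun ("red car", "big red car"):

>>> pattern4 = [{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN"}]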

> Label explanations

>>> spacy.explain("RB")
'adverb'
>>> spacy.explain("GPE")
'Countries, cities, states'


> Glossary

Tokenization: Segmenting text into words, punctuation etc.

Lemmatization: Assigning the base forms of words, for example: "was" → "be" or "rats" → "rat".

Sentence Boundary Detection: Finding and segmenting individual sentences.

Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.

Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

Named Entity Recognition (NER): Labeling named "real-world" objects, like persons, companies or locations.

Text Classification: Assigning categories or labels to a whole document, or parts of a document.

Statistical model: Process for making predictions based on examples.

Training: Updating a statistical model with new examples.
