
spaCy Cheat Sheet
Python For Data Science

Learn spaCy online at www.DataCamp.com

spaCy is a free, open-source library for advanced Natural Language
Processing (NLP) in Python. It's designed specifically for production use and
helps you build applications that process and "understand" large volumes
of text. Documentation: spacy.io

$ pip install spacy
>>> import spacy


> Spans

Accessing spans

Span indices are exclusive. So doc[2:4] is a span starting at token 2, up to – but not including! – token 4.

>>> doc = nlp("This is a text")
>>> span = doc[2:4]
>>> span.text
'a text'

Creating a span manually

>>> from spacy.tokens import Span        # Import the Span object
>>> doc = nlp("I live in New York")      # Create a Doc object
>>> span = Span(doc, 3, 5, label="GPE")  # Span for "New York" with label GPE (geopolitical)
>>> span.text
'New York'
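A span can also be created directly from character offsets with doc.char_span; a minimal sketch (the offsets are hand-counted for this example text, and the call returns None if they don't line up with token boundaries):

>>> doc = nlp("I live in New York")
>>> span = doc.char_span(10, 18, label="GPE")  # Characters 10 to 18 cover "New York"
>>> span.text
'New York'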

> Visualizing

If you're in a Jupyter notebook, use displacy.render. Otherwise, use
displacy.serve to start a web server and show the visualization in your browser.

>>> from spacy import displacy

Visualize dependencies

>>> doc = nlp("This is a sentence")
>>> displacy.render(doc, style="dep")

Visualize named entities

>>> doc = nlp("Larry Page founded Google")
>>> displacy.render(doc, style="ent")
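Outside a notebook, a minimal sketch of the displacy.serve variant mentioned above (it blocks and serves the visualization on a local port until you stop it):

>>> displacy.serve(doc, style="ent")  # Starts a local web server and prints the URL to open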

> Statistical models

Predict part-of-speech tags, dependency labels, named entities
and more. See here for available models: spacy.io/models

Download statistical models

$ python -m spacy download en_core_web_sm

Check that your installed models are up to date

$ python -m spacy validate

Loading statistical models

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")  # Load the installed model "en_core_web_sm"

> Linguistic features

Attributes return label IDs. For string labels, use the attributes with an underscore. For example, token.pos_ .

Part-of-speech tags (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.pos_ for token in doc]  # Coarse-grained part-of-speech tags
['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
>>> [token.tag_ for token in doc]  # Fine-grained part-of-speech tags
['DT', 'VBZ', 'DT', 'NN', '.']

Syntactic dependencies (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.dep_ for token in doc]  # Dependency labels
['nsubj', 'ROOT', 'det', 'attr', 'punct']
>>> [token.head.text for token in doc]  # Syntactic head token (governor)
['is', 'is', 'text', 'is', 'is']

Named entities (predicted by statistical model)

>>> doc = nlp("Larry Page founded Google")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # Text and label of named entity span
[('Larry Page', 'PERSON'), ('Google', 'ORG')]
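Entity spans also carry character offsets, which is useful for highlighting the original text; a small sketch, assuming the model predicts the same entities as above:

>>> [(ent.text, ent.start_char, ent.end_char) for ent in doc.ents]  # Character offsets of each entity
[('Larry Page', 0, 10), ('Google', 19, 25)]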

> Word vectors and similarity

To use word vectors, you need to install the larger models ending in md or lg, for example en_core_web_lg.

Comparing similarity

>>> doc1 = nlp("I like cats")
>>> doc2 = nlp("I like dogs")
>>> doc1.similarity(doc2)          # Compare 2 documents
>>> doc1[2].similarity(doc2[2])    # Compare 2 tokens
>>> doc1[0].similarity(doc2[1:3])  # Compare tokens and spans

Accessing word vectors

>>> doc = nlp("I like cats")
>>> doc[2].vector       # Vector as a numpy array
>>> doc[2].vector_norm  # The L2 norm of the token's vector
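To check whether a vector is actually available for a token, a quick sketch (assuming an md or lg model is loaded):

>>> doc[2].has_vector  # True if the token has a word vector attached
True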

> Documents and tokens

Processing text

Processing text with the nlp object returns a Doc object that holds all
information about the tokens, their linguistic features and their relationships.

>>> doc = nlp("This is a text")
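For many texts, processing them as a stream with nlp.pipe is usually faster than calling nlp on each one; a minimal sketch (the example texts are arbitrary):

>>> texts = ["This is a text", "This is another text"]
>>> docs = list(nlp.pipe(texts))  # Process a stream of texts and collect the Doc objects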


Accessing token attributes

>>> doc = nlp("This is a text")
>>> [token.text for token in doc]  # Token texts
['This', 'is', 'a', 'text']
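Tokens expose many more attributes than text; a small selection (not exhaustive):

>>> [token.i for token in doc]         # Index of the token within the Doc
[0, 1, 2, 3]
>>> [token.is_punct for token in doc]  # Is the token punctuation?
[False, False, False, False]
>>> [token.like_num for token in doc]  # Does the token resemble a number?
[False, False, False, False]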

> Syntax iterators

Sentences (usually needs the dependency parser)

>>> doc = nlp("This is a sentence. This is another one.")
>>> [sent.text for sent in doc.sents]  # doc.sents is a generator that yields sentence spans
['This is a sentence.', 'This is another one.']

Base noun phrases (needs the tagger and parser)

>>> doc = nlp("I have a red car")
>>> [chunk.text for chunk in doc.noun_chunks]  # doc.noun_chunks is a generator that yields spans
['I', 'a red car']

> Pipeline components

Functions that take a Doc object, modify it and return it.

Pipeline information

>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]

Custom components

# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

Components can be added first, last (default), or before or after an existing component.
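For example, a sketch of placing a component relative to an existing one (shown as an alternative to the first=True call above):

nlp.add_pipe(custom_component, before="ner")  # Insert the component right before the named entity recognizer
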
> Extension attributes

Custom attributes that are registered on the global Doc, Token and Span classes and become available as ._ .

>>> from spacy.tokens import Doc, Token, Span
>>> doc = nlp("The sky over New York is blue")

Attribute extensions (with default value)

# Register custom attribute on Token class
>>> Token.set_extension("is_color", default=False)

# Overwrite extension attribute with default value
>>> doc[6]._.is_color = True

Property extensions (with getter and setter)

# Register custom attribute on Doc class
>>> get_reversed = lambda doc: doc.text[::-1]
>>> Doc.set_extension("reversed", getter=get_reversed)

# Compute value of extension attribute with getter
>>> doc._.reversed
'eulb si kroY weN revo yks ehT'

Method extensions (callable method)

# Register custom attribute on Span class
>>> has_label = lambda span, label: span.label_ == label
>>> Span.set_extension("has_label", method=has_label)

# Compute value of extension attribute with method
>>> doc[3:5]._.has_label("GPE")
True

> Rule-based matching

Using the matcher

# Matcher is initialized with the shared vocab
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)

# Each dict represents one token and its attributes
>>> pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

# Add with ID, optional callback and pattern(s)
>>> matcher.add("CITIES", None, pattern)

# Match by calling the matcher on a Doc object
>>> doc = nlp("I live in New York")
>>> matches = matcher(doc)

# Matches are (match_id, start, end) tuples
>>> for match_id, start, end in matches:
...     # Get the matched span by slicing the Doc
...     span = doc[start:end]
...     print(span.text)
'New York'

Token patterns

# "love cats", "loving cats", "loved cats"
>>> pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]

# "10 people", "twenty people"
>>> pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]

# "book", "a cat", "the sea" (noun + optional article)
>>> pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

Operators and quantifiers

Can be added to a token dict as the "OP" key.

!   Negate pattern and match exactly 0 times
?   Make pattern optional and match 0 or 1 times
+   Require pattern to match 1 or more times
*   Allow pattern to match 0 or more times
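For example, a sketch of a pattern using the "+" operator to match one or more adjectives before a noun ("red car", "big red car"):

>>> pattern4 = [{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN"}]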

> Label explanations

>>> spacy.explain("RB")
'adverb'
>>> spacy.explain("GPE")
'Countries, cities, states'


> Glossary

Tokenization: Segmenting text into words, punctuation etc.

Lemmatization: Assigning the base forms of words, for example: "was" → "be" or "rats" → "rat".

Sentence Boundary Detection: Finding and segmenting individual sentences.

Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.

Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

Named Entity Recognition (NER): Labeling named "real-world" objects, like persons, companies or locations.

Text Classification: Assigning categories or labels to a whole document, or parts of a document.

Statistical model: Process for making predictions based on examples.

Training: Updating a statistical model with new examples.
