Sentiment analysis fastai tutorial
A quick recap of the sentiment analysis (text) tutorial from fastai. The original material can be found on the fastai website.
Packages and data
We begin by importing all the required packages from the fastai text module.
from fastai.text.all import *
In fastai, each module has an all script which allows us to import everything that is necessary from that module, in the knowledge that it is safe to do so.
We will be using the IMDB dataset to fine-tune a sentiment analysis model, so let's download it now.
path = untar_data(URLs.IMDB)
path.ls()
We see train and test folders, so let's check what is inside both of those.
(path/'train').ls(), (path/'test').ls()
Both have subfolders containing positive and negative comments (and also some Bow-related files, which are precomputed bag-of-words features that ship with the original dataset). This is a standard structure for datasets, and fastai has a built-in method to import the files using the folder names as labels.
So we create a DataLoaders (which is just a collection of DataLoader objects), reusing the path we downloaded above:
dls = TextDataLoaders.from_folder(path, valid='test')
Let's check out a batch of these reviews
dls.show_batch()
Note that a fair amount of preprocessing has already been done on the reviews, and some extra tokens have been inserted:
- xxbos indicates the beginning of a review
- xxmaj indicates that the following word should be capitalized
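To see these rules in action, we can run a raw string through fastai's default tokenizer ourselves (a minimal sketch; the example string and the xxup token note are my own additions, not from the original tutorial):
tkn = Tokenizer(WordTokenizer())
tkn("This movie was AMAZING!")
This should give tokens along the lines of ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'xxup', 'amazing', '!'], where xxup marks a word that was written in all caps.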
Next we can define a learner which is suitable for text classification:
learner = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
Note that we are using the AWD_LSTM architecture, which stands for ASGD Weight-Dropped LSTM. The AWD part means that dropout is applied to the recurrent (hidden-to-hidden) weight matrices themselves (a form of DropConnect), with the original paper training via Averaged Stochastic Gradient Descent; the LSTM part is a Long Short-Term Memory network, which is designed to capture long-range dependencies in the text (more notes on LSTM architecture coming soon).
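To make the weight-dropped idea concrete, here is a minimal PyTorch sketch of my own (fastai's actual implementation lives in its WeightDropout module): instead of dropping activations, we zero out elements of the recurrent weight matrix before each forward pass.
import torch.nn as nn
import torch.nn.functional as F
lstm = nn.LSTM(input_size=10, hidden_size=20)
raw_w = lstm.weight_hh_l0.data.clone()  # keep a copy of the raw recurrent weights
lstm.weight_hh_l0.data = F.dropout(raw_w, p=0.5, training=True)  # drop weights, not activations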
The drop_mult parameter just scales all of the dropout probabilities in the model by a single multiplier. For the metrics that we will be tracking, we just take accuracy.
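For a rough picture of how that scaling works (my own sketch of the idea, not a snippet from the tutorial): the default classifier config stores its dropout probabilities under keys ending in _p, and drop_mult multiplies each of them, so drop_mult=0.5 halves every dropout in the model.
config = awd_lstm_clas_config.copy()  # default AWD_LSTM classifier settings
scaled = {k: (v * 0.5 if k.endswith('_p') else v) for k, v in config.items()}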
Now we can fine-tune our model for a couple of epochs (fine_tune first trains the new classification head for one epoch with the pretrained body frozen, then unfreezes everything for the remaining epochs):
learner.fine_tune(2, 1e-2)
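With training done, we can try the classifier on a new review (this call appears in the original fastai tutorial):
learner.predict("I really liked that movie!")
This returns the predicted label ('pos' or 'neg'), the index of that label, and the probabilities the model assigned to each class.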